mirror of https://github.com/rspeer/wordfreq.git (synced 2024-12-23 09:21:37 +00:00)
wordfreq_builder: Document the extract_reddit pipeline
Former-commit-id: 88626aafee
parent 046ca4cda3
commit 8d09b68d37
@@ -103,5 +103,12 @@ rule freqs2cB
rule cat
  command = cat $in > $out

# A pipeline that extracts text from Reddit comments:
# - Unzip the input files
# - Select the body of comments, but only those whose Reddit score is positive
#   (skipping the downvoted ones)
# - Skip deleted comments
# - Replace HTML escapes
rule extract_reddit
  command = bunzip2 -c $in | $JQ -r 'select(.score > 0) | .body' | fgrep -v '[deleted]' | sed 's/&gt;/>/g' | sed 's/&lt;/</g' | sed 's/&amp;/\&/g' > $out
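
For readers less familiar with jq and sed, the steps of the extract_reddit rule can be sketched in Python. This is only an illustration of what the shell pipeline appears to do, not code from wordfreq: the function names `extract_reddit_lines` and `extract_reddit_file` are hypothetical, and the sketch assumes each input line is one JSON object with `score` and `body` fields. Like the sed calls, it unescapes only `&gt;`, `&lt;` and `&amp;`, in that order.

```python
import bz2
import json


def extract_reddit_lines(raw_lines):
    """Yield output text lines from an iterable of JSON-encoded comments.

    Hypothetical helper that mirrors the shell pipeline step by step.
    """
    for raw in raw_lines:
        if not raw.strip():
            continue  # ignore blank input lines
        comment = json.loads(raw)
        # jq: select(.score > 0) keeps only positively scored comments.
        if not (comment.get("score") or 0) > 0:
            continue
        # jq -r prints the body raw, so a multi-line body becomes several lines.
        for line in str(comment.get("body", "")).split("\n"):
            # fgrep -v '[deleted]' drops every line containing that literal string.
            if "[deleted]" in line:
                continue
            # Same order as the sed calls: "&amp;" is handled last, so
            # "&amp;lt;" ends up as "&lt;" instead of "<".
            line = line.replace("&gt;", ">").replace("&lt;", "<")
            yield line.replace("&amp;", "&")


def extract_reddit_file(in_path, out_path):
    """Read a bzip2-compressed file of JSON lines and write the extracted text."""
    with bz2.open(in_path, "rt", encoding="utf-8") as infile, \
            open(out_path, "w", encoding="utf-8") as outfile:
        for line in extract_reddit_lines(infile):
            outfile.write(line + "\n")
```

Unescaping `&amp;` last matters: doing it first would turn a literal `&amp;lt;` into `&lt;` and then into `<`, decoding the text twice.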