From 88626aafee87ccfacd45366a2d67ec4647bef2aa Mon Sep 17 00:00:00 2001
From: Rob Speer
Date: Thu, 2 Jun 2016 15:19:25 -0400
Subject: [PATCH] wordfreq_builder: Document the extract_reddit pipeline

---
 wordfreq_builder/rules.ninja | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/wordfreq_builder/rules.ninja b/wordfreq_builder/rules.ninja
index b43ebeb..3f4277b 100644
--- a/wordfreq_builder/rules.ninja
+++ b/wordfreq_builder/rules.ninja
@@ -103,5 +103,12 @@ rule freqs2cB
 rule cat
   command = cat $in > $out
 
+# A pipeline that extracts text from Reddit comments:
+# - Unzip the input files
+# - Select the body of comments, but only those whose Reddit score is positive
+#   (skipping the downvoted ones)
+# - Skip deleted comments
+# - Replace HTML escapes
 rule extract_reddit
   command = bunzip2 -c $in | $JQ -r 'select(.score > 0) | .body' | fgrep -v '[deleted]' | sed 's/&gt;/>/g' | sed 's/&lt;/</g' > $out
+
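
For reviewers who want to check the documented steps without running the shell pipeline, here is a rough Python sketch of the same four stages. It is not part of the patch and not how wordfreq_builder does it. The function names `extract_reddit` and `extract_reddit_file` are made up for illustration, and the input is assumed to be bzip2-compressed JSON lines with `score` and `body` fields, as the jq filter implies. It also swaps in `html.unescape`, which decodes every HTML entity, whereas the rule decodes specific escapes with sed.

```python
import bz2
import html
import json
from typing import Iterable, Iterator


def extract_reddit(lines: Iterable[str]) -> Iterator[str]:
    """Yield cleaned text lines from JSON-lines Reddit comment records.

    Hypothetical helper: an approximation of the extract_reddit rule,
    not a drop-in replacement for it.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        comment = json.loads(line)

        # Keep only comments with a positive score, like select(.score > 0).
        score = comment.get("score")
        if not isinstance(score, (int, float)) or score <= 0:
            continue

        # Assumption: records without a string body are skipped.
        body = comment.get("body")
        if not isinstance(body, str):
            continue

        # The shell pipeline filters line by line, so a multi-line body
        # is split here and each line is checked on its own.
        for text_line in body.split("\n"):
            if "[deleted]" in text_line:
                continue
            # Broader than the sed chain: decodes all HTML entities.
            yield html.unescape(text_line)


def extract_reddit_file(in_path: str, out_path: str) -> None:
    """Read a .bz2 file of JSON lines and write the extracted text."""
    with bz2.open(in_path, "rt", encoding="utf-8") as infile, \
            open(out_path, "w", encoding="utf-8") as outfile:
        for text in extract_reddit(infile):
            outfile.write(text + "\n")
```

With placeholder file names, a call would look like `extract_reddit_file("comments.json.bz2", "comments.txt")`. As in the `fgrep -v` stage, any line that contains `[deleted]` is dropped, not only bodies that consist of that marker alone.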