# This defines the rules on how to build parts of the wordfreq lists, using the
# Ninja build system:
#
# http://martine.github.io/ninja/manual.html
#
# Ninja is available in the 'ninja-build' Ubuntu package. It's like make with
# better parallelism and the ability for build steps to produce multiple
# outputs. The tradeoff is that its rule syntax isn't full of magic for
# expanding wildcards and finding dependencies, so in general you have to
# write the dependencies using a script.
#
# This file will become the header of the larger build.ninja file, which also
# contains the programmatically-defined dependency graph.
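#
# The generated part consists of ordinary Ninja build statements that connect
# inputs to outputs through the rules below. As a purely hypothetical
# illustration (the real statements come from wordfreq_builder.cli.build_deps),
# one such statement could look like:
#
#   build data/example/output.txt: some_rule data/example/input.gz
#     lang = en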

# Variables
JQ = lib/jq-linux64

# How to build the build.ninja file itself. (Use the Makefile to get it the
# first time.)
rule build_deps
  command = python -m wordfreq_builder.cli.build_deps $in > $out

# Splits the single file $in into $slices parts, whose names will be
# $prefix plus a two-digit numeric suffix.
rule split
  command = mkdir -p $$(dirname $prefix) && split -d -n r/$slices $in $prefix
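
# For example (hypothetical values): a build statement that sets
#   prefix = data/slices/twitter/tweets.part
#   slices = 2
# would produce data/slices/twitter/tweets.part00 and
# data/slices/twitter/tweets.part01.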

# wiki2text is a tool I wrote using Nim 0.11, which extracts plain text from
# Wikipedia dumps obtained from dumps.wikimedia.org. The code is at
# https://github.com/rspeer/wiki2text.
rule wiki2text
  command = bunzip2 -c $in | wiki2text > $out
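
# A generated build statement for this rule might look like the following
# (paths are hypothetical):
#
#   build data/extracted/wikipedia/wikipedia_en.txt: wiki2text data/raw-input/wikipedia/enwiki-pages-articles.xml.bz2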

# To tokenize Japanese, we run it through MeCab and take the first column.
rule tokenize_japanese
  command = mecab -b 1048576 < $in | cut -f 1 | grep -v "EOS" > $out
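
# MeCab writes one token per line, with the surface form in the first
# tab-separated field and feature annotations after it, plus an EOS line at
# the end of each sentence. For a made-up sentence the output resembles:
#
#   これ<TAB>名詞,...
#   は<TAB>助詞,...
#   EOS
#
# so 'cut -f 1' keeps only the surface forms and the grep drops the EOS
# markers.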

# Process Chinese by converting all Traditional Chinese characters to
# Simplified equivalents -- not because that's a good way to get readable
# text, but because that's how we're going to look them up.
rule simplify_chinese
  command = python -m wordfreq_builder.cli.simplify_chinese < $in > $out
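
# For example, the Traditional form 這個 would be stored and looked up as its
# Simplified equivalent 这个.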

# Tokenizing text from Twitter requires us to language-detect and tokenize
# in the same step.
rule tokenize_twitter
  command = mkdir -p $$(dirname $prefix) && python -m wordfreq_builder.cli.tokenize_twitter $in $prefix
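
# Because the language of each tweet isn't known until detection time, a
# generated build statement for this rule lists one output file per language,
# all written in a single step based on $prefix -- the kind of
# multiple-output build step mentioned at the top of this file. (The exact
# output names are an implementation detail of the tokenize_twitter script.)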

rule tokenize_reddit
  command = mkdir -p $$(dirname $prefix) && python -m wordfreq_builder.cli.tokenize_reddit $in $prefix

# To convert the Leeds corpus, look for space-separated lines that start with
# an integer and a decimal. The integer is the rank, which we discard. The
# decimal is the frequency, and the remaining text is the term. Use sed -n
# with /p to output only lines where the match was successful.
#
# Grep out the term "EOS", an indication that Leeds used MeCab and didn't
# strip out the EOS lines.
rule convert_leeds
  command = sed -rn 's/([0-9]+) ([0-9.]+) (.*)/\3,\2/p' < $in | grep -v 'EOS,' > $out
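
# For illustration, a made-up input line such as
#
#   1 27967.42 the
#
# comes out as "the,27967.42": the rank is dropped, the term comes first, and
# the frequency follows it.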

# To convert the OpenSubtitles frequency data, simply replace spaces with
# commas.
rule convert_opensubtitles
  command = tr ' ' ',' < $in > $out
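
# For example, a made-up line "you 1297468" becomes "you,1297468".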

# To convert SUBTLEX, we take the 1st and Nth columns, strip the header,
# run it through ftfy, convert tabs to commas and spurious CSV formatting to
# spaces, and remove lines with unfixable half-mojibake.
rule convert_subtlex
  command = cut -f $textcol,$freqcol $in | tail -n +$startrow | ftfy | tr ' ",' ', ' | grep -v 'â,' > $out
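
# Since the column layout differs between the SUBTLEX files, each generated
# build statement fills in these variables; with purely hypothetical values:
#
#   build data/intermediate/subtlex-en.csv: convert_subtlex data/source-lists/subtlex/subtlex-en.txt
#     textcol = 1
#     freqcol = 6
#     startrow = 2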

rule convert_jieba
  command = cut -d ' ' -f 1,2 $in | grep -v '[,"]' | tr ' ' ',' > $out

rule counts_to_jieba
  command = python -m wordfreq_builder.cli.counts_to_jieba $in $out

# Convert and clean up the Google Books Syntactic N-grams data. Concatenate all
# the input files, keep only the single words and their counts, and only keep
# lines with counts of 100 or more.
#
# (These will still be repeated as the word appears in different grammatical
# roles, information the source data provides but that we're discarding. The
# source data was already filtered to only show words in roles with at least
# two-digit counts of occurrences.)
rule convert_google_syntactic_ngrams
  command = zcat $in | cut -f 1,3 | grep -v '[,"]' | sed -rn 's/(.*)\s(...+)/\1,\2/p' > $out
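
# The '(...+)' at the end of the sed pattern requires at least three
# characters after the tab separator, which is what enforces the count cutoff
# of 100 or more. A made-up line "according<TAB>91837" becomes
# "according,91837", while a line whose count has only two digits is dropped.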

rule count
  command = python -m wordfreq_builder.cli.count_tokens $in $out

rule count_langtagged
  command = python -m wordfreq_builder.cli.count_tokens_langtagged $in $out -l $language

rule merge
  command = python -m wordfreq_builder.cli.merge_freqs -o $out -c $cutoff -l $lang $in

rule merge_counts
  command = python -m wordfreq_builder.cli.merge_counts -o $out -c $cutoff $in

rule freqs2cB
  command = python -m wordfreq_builder.cli.freqs_to_cB $in $out -b $buckets

rule cat
  command = cat $in > $out

# A pipeline that extracts text from Reddit comments:
# - Unzip the input files
# - Select the body of comments, but only those whose Reddit score is positive
#   (skipping the downvoted ones)
# - Skip deleted comments
# - Replace HTML escapes
rule extract_reddit
  command = bunzip2 -c $in | $JQ -r 'select(.score > 0) | .body' | fgrep -v '[deleted]' | sed 's/&gt;/>/g' | sed 's/&lt;/</g' | sed 's/&amp;/\&/g' > $out
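
# For illustration, a made-up comment line such as
#
#   {"score": 4, "body": "this &amp; that &gt; everything else"}
#
# passes the score filter and comes out as:
#
#   this & that > everything else
#
# while comments with a non-positive score and bodies reading '[deleted]' are
# dropped.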