wordfreq/wordfreq_builder/rules.ninja

# This defines the rules on how to build parts of the wordfreq lists, using the
# Ninja build system:
#
#   http://martine.github.io/ninja/manual.html
#
# Ninja is available in the 'ninja-build' Ubuntu package. It's like make with
# better parallelism and the ability for build steps to produce multiple
# outputs. The tradeoff is that its rule syntax isn't full of magic for
# expanding wildcards and finding dependencies, so in general you have to
# write the dependencies using a script.
#
# This file will become the header of the larger build.ninja file, which also
# contains the programmatically defined dependency graph.

# Variables
DATA = ./data

# How to build the build.ninja file itself. (Use the Makefile to get it the
# first time.)
rule build_deps
  command = python -m wordfreq_builder.cli.build_deps $in > $out

# Splits the single file $in into $slices parts, whose names will be
# $prefix plus a two-digit numeric suffix.
rule split
  command = mkdir -p $$(dirname $prefix) && split -d -n r/$slices $in $prefix
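
# A minimal sketch of what the split command does, assuming GNU coreutils
# `split` (the `-n r/N` round-robin mode is a GNU extension). The temporary
# directory and the `part.` prefix are made up for illustration. Note that
# `$$` in a ninja command is an escaped `$`, so the shell itself sees
# `$(dirname ...)`.

```shell
# Hypothetical demo values: neither the directory nor the prefix comes from
# the real build.
workdir=$(mktemp -d)
prefix="$workdir/slices/part."
printf 'a\nb\nc\nd\ne\n' > "$workdir/input.txt"

# Same steps as the rule, with $slices = 2.
mkdir -p "$(dirname "$prefix")" && split -d -n r/2 "$workdir/input.txt" "$prefix"

# Round-robin distribution sends alternating lines to each slice.
first_slice=$(cat "${prefix}00")
second_slice=$(cat "${prefix}01")
printf 'slice 00:\n%s\nslice 01:\n%s\n' "$first_slice" "$second_slice"
rm -rf "$workdir"
```

# With five input lines and two slices, slice 00 receives lines 1, 3 and 5,
# and slice 01 receives lines 2 and 4.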

# wiki2text is a tool I wrote using Nim 0.11, which extracts plain text from
# Wikipedia dumps obtained from dumps.wikimedia.org.  The code is at
# https://github.com/rspeer/wiki2text.
rule wiki2text
  command = mkdir -p $$(dirname $out) && bunzip2 -c $in | wiki2text > $out

# The wiki2tokens rule is the same as the wiki2text rule, but uses the -t
# flag to tell the Nim code to output one token per line (according to its
# language-agnostic tokenizer, which splits on punctuation and whitespace in
# basically the same way as wordfreq).
#
# The fact that this uses a language-agnostic tokenizer means it should not
# be applied to Chinese or Japanese.
rule wiki2tokens
  command = mkdir -p $$(dirname $out) && bunzip2 -c $in | wiki2text -t > $out

# To tokenize Japanese, we run it through MeCab and take the first column.
# We don't have a plan for tokenizing Chinese yet.
rule tokenize_japanese
  command = mkdir -p $$(dirname $out) && mecab -b 1048576 < $in | cut -f 1 | grep -v "EOS" > $out
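
# The post-processing half of this pipeline can be tried without MeCab
# installed. The input below is a hand-written stand-in for MeCab's default
# output (surface form, a tab, then features, with an EOS line after each
# sentence). The feature strings are placeholders, not real MeCab output, and
# the function name is hypothetical.

```shell
# Keep only the surface form (column 1) and drop the sentence-end markers.
mecab_first_column() {
  cut -f 1 | grep -v "EOS"
}

# Simulated MeCab output for one short sentence.
japanese_tokens=$(printf '猫\tFEATURES\nが\tFEATURES\n好き\tFEATURES\nEOS\n' | mecab_first_column)
printf '%s\n' "$japanese_tokens"
```

# Because `grep -v "EOS"` drops every line that contains that string, a
# genuine token containing "EOS" would be discarded as well.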

# Tokenizing text from Twitter generally requires us to use a more powerful
# tokenizer than the language-agnostic one.
#
# Our usual build process does not use this step. It just assumes it's already
# done, because it takes a very long time. This is what the 'data/intermediate'
# directory contains.
rule tokenize_twitter
  command = mkdir -p $$(dirname $prefix) && python -m wordfreq_builder.cli.tokenize_twitter $in $prefix

# To convert the Leeds corpus, look for space-separated lines that start with
# an integer and a decimal. The integer is the rank, which we discard. The
# decimal is the frequency, and the remaining text is the term. Use sed -n
# with /p to output only lines where the match was successful.
#
# Grep out the term "EOS", an indication that Leeds used MeCab and didn't
# strip out the EOS lines.
rule convert_leeds
  command = mkdir -p $$(dirname $out) && sed -rn 's/([0-9]+) ([0-9.]+) (.*)/\3,\2/p' < $in | grep -v 'EOS,' > $out
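
# A small sketch of the Leeds conversion on made-up input, assuming GNU sed
# (for `-r`). The sample ranks, frequencies and the function name are
# invented for illustration. They are not taken from the Leeds corpus.

```shell
# Same sed and grep stages as the rule, reading stdin and writing stdout.
convert_leeds() {
  sed -rn 's/([0-9]+) ([0-9.]+) (.*)/\3,\2/p' | grep -v 'EOS,'
}

# One header line (no match, so not printed), two ordinary entries, and one
# leftover EOS entry that the grep stage removes.
leeds_sample=$(printf '%s\n' 'rank freq word' '1 50.0 the' '2 30.0 EOS' '3 25.0 of' | convert_leeds)
printf '%s\n' "$leeds_sample"
```

# Each surviving line has the form term,frequency, with the rank discarded.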

# To convert the OpenSubtitles frequency data, simply replace spaces with
# commas.
rule convert_opensubtitles
  command = mkdir -p $$(dirname $out) && tr ' ' ',' < $in > $out

# Convert and clean up the Google Books Syntactic N-grams data. Concatenate all
# the input files, keep only the single words and their counts, and only keep
# lines with counts of 100 or more.
#
# (These will still be repeated as the word appears in different grammatical
# roles, information that the source data provides that we're discarding. The
# source data was already filtered to only show words in roles with at least
# two-digit counts of occurrences.)
rule convert_google_syntactic_ngrams
  command = mkdir -p $$(dirname $out) && zcat $in | cut -f 1,3 | grep -v '[,"]' | sed -rn 's/(.*)\s(...+)/\1,\2/p' > $out
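
# A sketch of the filtering stages on hand-written rows, assuming GNU sed
# (for `-r` and `\s`). The rows only mimic the layout this rule relies on
# (word in column 1, count in column 3, tab-separated). The middle column is
# a placeholder, the function name is hypothetical, and the `zcat` step is
# skipped because the sample is not compressed.

```shell
# Same cut, grep and sed stages as the rule, reading stdin.
filter_ngrams() {
  cut -f 1,3 | grep -v '[,"]' | sed -rn 's/(.*)\s(...+)/\1,\2/p'
}

# 'cat' has a two-character count and 'a,b' contains a comma, so both are
# dropped. 'dog' appears twice and stays repeated in the output.
ngram_sample=$(printf 'dog\tX\t1234\ncat\tX\t42\na,b\tX\t500\ndog\tX\t150\n' | filter_ngrams)
printf '%s\n' "$ngram_sample"
```

# The `(...+)` group requires at least three characters after the
# whitespace, which is how the rule keeps only counts of 100 or more.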

rule count
  command = mkdir -p $$(dirname $out) && python -m wordfreq_builder.cli.count_tokens $in $out

rule merge
  command = mkdir -p $$(dirname $out) && python -m wordfreq_builder.cli.combine_lists -o $out $in

rule freqs2dB
  command = mkdir -p $$(dirname $out) && python -m wordfreq_builder.cli.freqs_to_dB $in $out

rule cat
  command = cat $in > $out