Merge pull request #18 from LuminosoInsight/add-builder
Add wordfreq_builder as a sub-directory to wordfreq
Former-commit-id: e02f41076d
This commit is contained in: commit eebaa071fe
wordfreq_builder/.gitignore (vendored, new file, 12 lines)
@@ -0,0 +1,12 @@
*.pyc
__pycache__
.coverage
.idea
dist
*.egg-info
build
_build
build.ninja
data
.ninja_deps
.ninja_log
wordfreq_builder/Makefile (new file, 8 lines)
@@ -0,0 +1,8 @@
PYTHON = python

all: build.ninja

# build the Ninja file that will take over the build process
build.ninja: rules.ninja wordfreq_builder/ninja.py wordfreq_builder/config.py wordfreq_builder.egg-info/PKG-INFO
	$(PYTHON) -m wordfreq_builder.cli.build_deps rules.ninja > build.ninja
wordfreq_builder/README.md (new file, 148 lines)
@@ -0,0 +1,148 @@
# wordfreq\_builder

This package builds the data files for [wordfreq](https://github.com/LuminosoInsight/wordfreq).

It requires a fair amount of external input data (42 GB of it, as of this
writing), which unfortunately we don't have a plan for how to distribute
outside of Luminoso yet.

The data can be publicly obtained in various ways, so here we'll at least
document where it comes from. We hope to come up with a process that's more
reproducible eventually.

The good news is that you don't need to be able to run this process to use
wordfreq. The built results are already in the `wordfreq/data` directory.

## How to build it

Set up your external hard disk, your networked file system, or whatever thing
you have that's got a couple hundred GB of space free. Let's suppose the
directory of it that you want to use is called `/ext/data`.

Get the input data. At Luminoso, this is available in the directory
`/nfs/broadway/data/wordfreq_builder`. The sections below explain where the
data comes from.

Copy the input data:

    cp -rv /nfs/broadway/data/wordfreq_builder /ext/data/

Make a symbolic link so that `data/` in this directory points to
your copy of the input data:

    ln -s /ext/data/wordfreq_builder data

Install the Ninja build system:

    sudo apt-get install ninja-build

We need to build a Ninja build file using the Python code in
`wordfreq_builder/ninja.py`. We could do this with Ninja, but... you see the
chicken-and-egg problem, don't you. So this is the one thing the Makefile
knows how to do.

    make

Start the build, and find something else to do for a few hours:

    ninja -v

You can copy the results into wordfreq with this command (this directory is
assumed to be a subdirectory of your wordfreq repo, so `../wordfreq/data/` is
wordfreq's data directory):

    cp data/dist/*.msgpack.gz ../wordfreq/data/

## The Ninja build process

Ninja is a lot like Make, except with one big {drawback|advantage}: instead of
writing bizarre expressions in an idiosyncratic language to let Make calculate
which files depend on which other files...

...you just tell Ninja which files depend on which other files.

The Ninja documentation suggests using your favorite scripting language to
create the dependency list, so that's what we've done in `ninja.py`.

Dependencies in Ninja refer to build rules. These do need to be written by hand
in Ninja's own format, but the task is simpler. In this project, the build
rules are defined in `rules.ninja`. They'll be concatenated with the
Python-generated dependency definitions to form the complete build file,
`build.ninja`, which is the default file that Ninja looks at when you run
`ninja`.

So a lot of the interesting work in this package is done in `rules.ninja`.
This file defines shorthand names for long commands. As a simple example,
the rule named `format_twitter` applies the command

    python -m wordfreq_builder.cli.format_twitter $in $out

to the dependency file `$in` and the output file `$out`.

The specific rules are described by the comments in `rules.ninja`.
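
For illustration, here is a rough Python sketch of the idea (the real code is in
`wordfreq_builder/ninja.py`, shown later in this changeset; the rule name and the
file paths below are made up for the example):

    # Hypothetical sketch: turn a rule name plus input and output files
    # into one Ninja "build" statement.
    def build_statement(rule, inputs, outputs):
        return 'build {}: {} {}'.format(' '.join(outputs), rule, ' '.join(inputs))

    print(build_statement('format_twitter',
                          ['data/intermediate/twitter/tweets.en.txt'],
                          ['data/generated/twitter/tweets.en.formatted.txt']))
    # build data/generated/twitter/tweets.en.formatted.txt: format_twitter data/intermediate/twitter/tweets.en.txt

Ninja then rebuilds the output whenever the input changes, using the command
attached to the rule.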

## Data sources

### Leeds Internet Corpus

Also known as the "Web as Corpus" project, this is a University of Leeds
project that collected wordlists in assorted languages by crawling the Web.
The results are messy, but they're something. We've been using them for quite
a while.

These files can be downloaded from the [Leeds corpus page][leeds].

The original files are in `data/source-lists/leeds`, and they're processed
by the `convert_leeds` rule in `rules.ninja`.

[leeds]: http://corpus.leeds.ac.uk/list.html

### Twitter

The file `data/raw-input/twitter/all-2014.txt` contains about 72 million tweets
collected by the `ftfy.streamtester` package in 2014.

It's not possible to distribute the text of tweets. However, this process could
be reproduced by running `ftfy.streamtester`, part of the [ftfy][] package, for
a couple of weeks.

[ftfy]: https://github.com/LuminosoInsight/python-ftfy

### Google Books

We use English word frequencies from [Google Books Syntactic Ngrams][gbsn].
We pretty much ignore the syntactic information, and only use this version
because it's cleaner. The data comes in the form of 99 gzipped text files in
`data/raw-input/google-books`.

[gbsn]: http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html

### OpenSubtitles

[Some guy](https://invokeit.wordpress.com/frequency-word-lists/) made word
frequency lists out of the subtitle text on OpenSubtitles. This data was
used to make Wiktionary word frequency lists at one point, but it's been
updated significantly since the version Wiktionary got.

The wordlists are in `data/source-lists/opensubtitles`.

In order to fit into the wordfreq pipeline, we renamed lists with different
variants of the same language code, to distinguish them fully according to
BCP 47, and then concatenated the variants into a single list, as follows:

* `zh_tw.txt` was renamed to `zh-Hant.txt`
* `zh_cn.txt` was renamed to `zh-Hans.txt`
* `zh.txt` was renamed to `zh-Hani.txt`
* `zh-Hant.txt`, `zh-Hans.txt`, and `zh-Hani.txt` were concatenated into `zh.txt`
* `pt.txt` was renamed to `pt-PT.txt`
* `pt_br.txt` was renamed to `pt-BR.txt`
* `pt-BR.txt` and `pt-PT.txt` were concatenated into `pt.txt`

We also edited the English data to re-add "'t" to words that had obviously lost
it, such as "didn" in place of "didn't". We applied this only to words that
became much less common in the process, so this wordlist no longer represents
the words "don" and "won", as we assume most of their frequency comes from
"don't" and "won't". Words that turned into similarly common words, however,
were left alone: the list doesn't represent "can't" because that word was left
as "can".
wordfreq_builder/build.png (new binary file, 3.3 MiB; not shown)
wordfreq_builder/rules.ninja (new file, 80 lines)
@@ -0,0 +1,80 @@
# This defines the rules on how to build parts of the wordfreq lists, using the
# Ninja build system:
#
# http://martine.github.io/ninja/manual.html
#
# Ninja is available in the 'ninja-build' Ubuntu package. It's like make with
# better parallelism and the ability for build steps to produce multiple
# outputs. The tradeoff is that its rule syntax isn't full of magic for
# expanding wildcards and finding dependencies, so in general you have to
# write the dependencies using a script.
#
# This file will become the header of the larger build.ninja file, which also
# contains the programmatically-defined dependency graph.

# Variables
DATA = ./data

# How to build the build.ninja file itself. (Use the Makefile to get it the
# first time.)
rule build_deps
  command = python -m wordfreq_builder.cli.build_deps $in > $out

# Splits the single file $in into $slices parts, whose names will be
# $prefix plus a two-digit numeric suffix.
rule split
  command = mkdir -p $$(dirname $prefix) && split -d -n r/$slices $in $prefix

# wiki2text is a tool I wrote using Nim 0.11, which extracts plain text from
# Wikipedia dumps obtained from dumps.wikimedia.org. The code is at
# https://github.com/rspeer/wiki2text.
rule wiki2text
  command = mkdir -p $$(dirname $out) && bunzip2 -c $in | wiki2text > $out

# To tokenize Japanese, we run it through Mecab and take the first column.
# We don't have a plan for tokenizing Chinese yet.
rule tokenize_japanese
  command = mkdir -p $$(dirname $out) && mecab -b 1048576 < $in | cut -f 1 | grep -v "EOS" > $out

# Tokenizing text from Twitter requires us to language-detect and tokenize
# in the same step.
rule tokenize_twitter
  command = mkdir -p $$(dirname $prefix) && python -m wordfreq_builder.cli.tokenize_twitter $in $prefix

# To convert the Leeds corpus, look for space-separated lines that start with
# an integer and a decimal. The integer is the rank, which we discard. The
# decimal is the frequency, and the remaining text is the term. Use sed -n
# with /p to output only lines where the match was successful.
#
# Grep out the term "EOS", an indication that Leeds used MeCab and didn't
# strip out the EOS lines.
rule convert_leeds
  command = mkdir -p $$(dirname $out) && sed -rn 's/([0-9]+) ([0-9.]+) (.*)/\3,\2/p' < $in | grep -v 'EOS,' > $out

# To convert the OpenSubtitles frequency data, simply replace spaces with
# commas.
rule convert_opensubtitles
  command = mkdir -p $$(dirname $out) && tr ' ' ',' < $in > $out

# Convert and clean up the Google Books Syntactic N-grams data. Concatenate all
# the input files, keep only the single words and their counts, and only keep
# lines with counts of 100 or more.
#
# (These will still be repeated as the word appears in different grammatical
# roles, information that the source data provides that we're discarding. The
# source data was already filtered to only show words in roles with at least
# two-digit counts of occurrences.)
rule convert_google_syntactic_ngrams
  command = mkdir -p $$(dirname $out) && zcat $in | cut -f 1,3 | grep -v '[,"]' | sed -rn 's/(.*)\s(...+)/\1,\2/p' > $out

rule count
  command = mkdir -p $$(dirname $out) && python -m wordfreq_builder.cli.count_tokens $in $out

rule merge
  command = mkdir -p $$(dirname $out) && python -m wordfreq_builder.cli.combine_lists -o $out $in

rule freqs2cB
  command = mkdir -p $$(dirname $out) && python -m wordfreq_builder.cli.freqs_to_cB $in $out

rule cat
  command = cat $in > $out
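
As an aside, the `convert_leeds` command above can be pictured in Python; this is
only an illustration of the sed/grep pipeline, with an invented input line:

    import re

    # A Leeds line looks like "rank frequency term"; keep "term,frequency",
    # and drop lines whose term is the stray MeCab marker "EOS".
    line = '42 123.45 example'
    match = re.match(r'([0-9]+) ([0-9.]+) (.*)', line)
    if match and match.group(3) != 'EOS':
        print('{},{}'.format(match.group(3), match.group(2)))
    # prints: example,123.45
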
wordfreq_builder/setup.py (new executable file, 20 lines)
@@ -0,0 +1,20 @@
from setuptools import setup

setup(
    name="wordfreq_builder",
    version='0.1',
    maintainer='Luminoso Technologies, Inc.',
    maintainer_email='info@luminoso.com',
    url='http://github.com/LuminosoInsight/wordfreq_builder',
    platforms=["any"],
    description="Turns raw data into word frequency lists",
    packages=['wordfreq_builder'],
    install_requires=['msgpack-python', 'pycld2'],
    entry_points={
        'console_scripts': [
            'wordfreq-pretokenize-twitter = wordfreq_builder.cli.pretokenize_twitter:main',
            'wordfreq-format-twitter = wordfreq_builder.cli.format_twitter:main',
            'wordfreq-build-deps = wordfreq_builder.cli.build_deps:main'
        ]
    }
)
wordfreq_builder/wordfreq_builder/__init__.py (new empty file)
wordfreq_builder/wordfreq_builder/cli/__init__.py (new empty file)
wordfreq_builder/wordfreq_builder/cli/build_deps.py (new file, 15 lines)
@@ -0,0 +1,15 @@
from wordfreq_builder.ninja import make_ninja_deps
import argparse


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('in_filename', help='filename of rules file')
    args = parser.parse_args()

    # Make the complete ninja file and write it to standard out
    make_ninja_deps(args.in_filename)


if __name__ == '__main__':
    main()
wordfreq_builder/wordfreq_builder/cli/combine_lists.py (new file, 19 lines)
@@ -0,0 +1,19 @@
from wordfreq_builder.word_counts import read_freqs, merge_freqs, write_wordlist
import argparse


def merge_lists(input_names, output_name):
    freq_dicts = []
    for input_name in input_names:
        freq_dicts.append(read_freqs(input_name, cutoff=2))
    merged = merge_freqs(freq_dicts)
    write_wordlist(merged, output_name)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-o', '--output', help='filename to write the output to', default='combined-counts.csv')
    parser.add_argument('inputs', help='names of input files to merge', nargs='+')
    args = parser.parse_args()
    merge_lists(args.inputs, args.output)
wordfreq_builder/wordfreq_builder/cli/count_tokens.py (new file, 16 lines)
@@ -0,0 +1,16 @@
from wordfreq_builder.word_counts import count_tokens, write_wordlist
import argparse


def handle_counts(filename_in, filename_out):
    counts = count_tokens(filename_in)
    write_wordlist(counts, filename_out)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('filename_in', help='name of input file containing tokens')
    parser.add_argument('filename_out', help='name of output file')
    args = parser.parse_args()
    handle_counts(args.filename_in, args.filename_out)
wordfreq_builder/wordfreq_builder/cli/freqs_to_cB.py (new file, 11 lines)
@@ -0,0 +1,11 @@
from wordfreq_builder.word_counts import freqs_to_cBpack
import argparse


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('filename_in', help='name of input file containing tokens')
    parser.add_argument('filename_out', help='name of output file')
    args = parser.parse_args()
    freqs_to_cBpack(args.filename_in, args.filename_out)
wordfreq_builder/wordfreq_builder/cli/tokenize_twitter.py (new file, 19 lines)
@@ -0,0 +1,19 @@
from wordfreq_builder.tokenizers import cld2_surface_tokenizer, tokenize_file
import argparse


def tokenize_twitter(in_filename, out_prefix):
    tokenize_file(in_filename, out_prefix,
                  tokenizer=cld2_surface_tokenizer)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('filename', help='filename of input file containing one tweet per line')
    parser.add_argument('outprefix', help='prefix of output filenames')
    args = parser.parse_args()
    tokenize_twitter(args.filename, args.outprefix)


if __name__ == '__main__':
    main()
wordfreq_builder/wordfreq_builder/config.py (new file, 87 lines)
@@ -0,0 +1,87 @@
import os

CONFIG = {
    'version': '1.0b',
    # data_dir is a relative or absolute path to where the wordlist data
    # is stored
    'data_dir': 'data',
    'sources': {
        # A list of language codes (possibly un-standardized) that we'll
        # look up in filenames for these various data sources.
        'twitter': [
            'ar', 'de', 'en', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
            'pt', 'ru',
            # can be added later: 'th', 'tr'
        ],
        'wikipedia': [
            'ar', 'de', 'en', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
            'pt', 'ru'
            # many more can be added
        ],
        'opensubtitles': [
            # All languages where the most common word in OpenSubtitles
            # appears at least 5000 times
            'ar', 'bg', 'bs', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es', 'et',
            'fa', 'fi', 'fr', 'he', 'hr', 'hu', 'id', 'is', 'it', 'lt', 'lv',
            'mk', 'ms', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sq',
            'sr', 'sv', 'tr', 'uk', 'zh'
        ],
        'leeds': [
            'ar', 'de', 'el', 'en', 'es', 'fr', 'it', 'ja', 'pt', 'ru', 'zh'
        ],
        'google-books': [
            'en',
            # Using the 2012 data, we could get French, German, Italian,
            # Russian, Spanish, and (Simplified) Chinese.
        ]
    },
    'wordlist_paths': {
        'twitter': 'generated/twitter/tweets-2014.{lang}.{ext}',
        'wikipedia': 'generated/wikipedia/wikipedia_{lang}.{ext}',
        'opensubtitles': 'generated/opensubtitles/opensubtitles_{lang}.{ext}',
        'leeds': 'generated/leeds/leeds_internet_{lang}.{ext}',
        'google-books': 'generated/google-books/google_books_{lang}.{ext}',
        'combined': 'generated/combined/combined_{lang}.{ext}',
        'combined-dist': 'dist/combined_{lang}.{ext}',
        'twitter-dist': 'dist/twitter_{lang}.{ext}'
    },
    'min_sources': 2
}


def data_filename(filename):
    """
    Convert a relative filename to a path inside the configured data_dir.
    """
    return os.path.join(CONFIG['data_dir'], filename)


def wordlist_filename(source, language, extension='txt'):
    """
    Get the path where a particular built wordlist should go, parameterized by
    its language and its file extension.
    """
    path = CONFIG['wordlist_paths'][source].format(
        lang=language, ext=extension
    )
    return data_filename(path)


def source_names(language):
    """
    Get the names of data sources that supply data for the given language.
    """
    return sorted(key for key in CONFIG['sources']
                  if language in CONFIG['sources'][key])


def all_languages():
    """
    Get all languages that should have their data built, which is those that
    are supported by at least `min_sources` sources.
    """
    languages = set()
    for langlist in CONFIG['sources'].values():
        languages |= set(langlist)
    return [lang for lang in sorted(languages)
            if len(source_names(lang)) >= CONFIG['min_sources']]
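
As a quick illustration of the helpers above (the values follow directly from the
CONFIG dictionary; this snippet is not part of the build):

    from wordfreq_builder.config import wordlist_filename, source_names

    # A wordlist_paths template filled in with the language and extension,
    # then joined onto data_dir ('data'):
    print(wordlist_filename('twitter', 'en', 'counts.txt'))
    # data/generated/twitter/tweets-2014.en.counts.txt

    # English appears in every source list, so all five sources qualify:
    print(source_names('en'))
    # ['google-books', 'leeds', 'opensubtitles', 'twitter', 'wikipedia']
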
wordfreq_builder/wordfreq_builder/ninja.py (new file, 231 lines)
@@ -0,0 +1,231 @@
from wordfreq_builder.config import (
    CONFIG, data_filename, wordlist_filename, all_languages, source_names
)
import sys
import pathlib

HEADER = """# This file is automatically generated. Do not edit it.
# You can regenerate it using the 'wordfreq-build-deps' command.
"""
TMPDIR = data_filename('tmp')


# Set this to True to rebuild the Twitter tokenization (which takes days)
TOKENIZE_TWITTER = True


def add_dep(lines, rule, input, output, extra=None, params=None):
    if isinstance(output, list):
        output = ' '.join(output)
    if isinstance(input, list):
        input = ' '.join(input)
    if extra:
        if isinstance(extra, list):
            extra = ' '.join(extra)
        extrastr = ' | ' + extra
    else:
        extrastr = ''
    build_rule = "build {output}: {rule} {input}{extra}".format(
        output=output, rule=rule, input=input, extra=extrastr
    )
    lines.append(build_rule)
    if params:
        for key, val in params.items():
            lines.append(" {key} = {val}".format(key=key, val=val))
    lines.append("")


def make_ninja_deps(rules_filename, out=sys.stdout):
    """
    Output a complete Ninja file describing how to build the wordfreq data.
    """
    print(HEADER, file=out)
    # Copy in the rules section
    with open(rules_filename, encoding='utf-8') as rulesfile:
        print(rulesfile.read(), file=out)

    lines = []
    # The first dependency is to make sure the build file is up to date.
    add_dep(lines, 'build_deps', 'rules.ninja', 'build.ninja',
            extra='wordfreq_builder/ninja.py')

    if TOKENIZE_TWITTER:
        lines.extend(
            twitter_deps(
                data_filename('raw-input/twitter/all-2014.txt'),
                slice_prefix=data_filename('slices/twitter/tweets-2014'),
                combined_prefix=data_filename('generated/twitter/tweets-2014'),
                slices=40,
                languages=CONFIG['sources']['twitter']
            )
        )
    lines.extend(
        wikipedia_deps(
            data_filename('raw-input/wikipedia'),
            CONFIG['sources']['wikipedia']
        )
    )
    lines.extend(
        google_books_deps(
            data_filename('raw-input/google-books')
        )
    )
    lines.extend(
        leeds_deps(
            data_filename('source-lists/leeds'),
            CONFIG['sources']['leeds']
        )
    )
    lines.extend(
        opensubtitles_deps(
            data_filename('source-lists/opensubtitles'),
            CONFIG['sources']['opensubtitles']
        )
    )
    lines.extend(combine_lists(all_languages()))

    print('\n'.join(lines), file=out)


def wikipedia_deps(dirname_in, languages):
    lines = []
    path_in = pathlib.Path(dirname_in)
    for language in languages:
        # Find the most recent file for this language
        # Skip over files that do not exist
        input_file = max(path_in.glob(
            '{}wiki*.bz2'.format(language)
        ))
        plain_text_file = wordlist_filename('wikipedia', language, 'txt')
        count_file = wordlist_filename('wikipedia', language, 'counts.txt')

        add_dep(lines, 'wiki2text', input_file, plain_text_file)
        if language == 'ja':
            mecab_token_file = wordlist_filename('wikipedia', language, 'mecab-tokens.txt')
            add_dep(lines, 'tokenize_japanese', plain_text_file, mecab_token_file)
            add_dep(lines, 'count', mecab_token_file, count_file)
        else:
            add_dep(lines, 'count', plain_text_file, count_file)

    return lines


def google_books_deps(dirname_in):
    # Get English data from the split-up files of the Google Syntactic N-grams
    # 2013 corpus.
    lines = []

    # Yes, the files are numbered 00 through 98 of 99. This is not an
    # off-by-one error. Not on my part, anyway.
    input_files = [
        '{}/nodes.{:>02d}-of-99.gz'.format(dirname_in, i)
        for i in range(99)
    ]
    output_file = wordlist_filename('google-books', 'en', 'counts.txt')
    add_dep(lines, 'convert_google_syntactic_ngrams', input_files, output_file)
    return lines


def twitter_deps(input_filename, slice_prefix,
                 combined_prefix, slices, languages):
    lines = []

    slice_files = ['{prefix}.part{num:0>2d}'.format(prefix=slice_prefix, num=num)
                   for num in range(slices)]
    # split the input into slices
    add_dep(lines,
            'split', input_filename, slice_files,
            params={'prefix': '{}.part'.format(slice_prefix),
                    'slices': slices})

    for slicenum in range(slices):
        slice_file = slice_files[slicenum]
        language_outputs = [
            '{prefix}.{lang}.txt'.format(prefix=slice_file, lang=language)
            for language in languages
        ]
        add_dep(lines, 'tokenize_twitter', slice_file, language_outputs,
                params={'prefix': slice_file})

    for language in languages:
        combined_output = wordlist_filename('twitter', language, 'tokens.txt')

        language_inputs = [
            '{prefix}.{lang}.txt'.format(prefix=slice_files[slicenum], lang=language)
            for slicenum in range(slices)
        ]

        add_dep(lines, 'cat', language_inputs, combined_output)

        count_file = wordlist_filename('twitter', language, 'counts.txt')

        if language == 'ja':
            mecab_token_file = wordlist_filename('twitter', language, 'mecab-tokens.txt')
            add_dep(lines, 'tokenize_japanese', combined_output, mecab_token_file)
            add_dep(lines, 'count', mecab_token_file, count_file, extra='wordfreq_builder/tokenizers.py')
        else:
            add_dep(lines, 'count', combined_output, count_file, extra='wordfreq_builder/tokenizers.py')

    return lines


def leeds_deps(dirname_in, languages):
    lines = []
    for language in languages:
        input_file = '{prefix}/internet-{lang}-forms.num'.format(
            prefix=dirname_in, lang=language
        )
        reformatted_file = wordlist_filename('leeds', language, 'counts.txt')
        add_dep(lines, 'convert_leeds', input_file, reformatted_file)

    return lines


def opensubtitles_deps(dirname_in, languages):
    lines = []
    for language in languages:
        input_file = '{prefix}/{lang}.txt'.format(
            prefix=dirname_in, lang=language
        )
        reformatted_file = wordlist_filename('opensubtitles', language, 'counts.txt')
        add_dep(lines, 'convert_opensubtitles', input_file, reformatted_file)

    return lines


def combine_lists(languages):
    lines = []
    for language in languages:
        sources = source_names(language)
        input_files = [
            wordlist_filename(source, language, 'counts.txt')
            for source in sources
        ]
        output_file = wordlist_filename('combined', language)
        add_dep(lines, 'merge', input_files, output_file,
                extra='wordfreq_builder/word_counts.py')

        output_cBpack = wordlist_filename('combined-dist', language, 'msgpack.gz')
        add_dep(lines, 'freqs2cB', output_file, output_cBpack,
                extra='wordfreq_builder/word_counts.py')

        lines.append('default {}'.format(output_cBpack))

        # Write standalone lists for Twitter frequency
        if language in CONFIG['sources']['twitter']:
            input_file = wordlist_filename('twitter', language, 'counts.txt')
            output_cBpack = wordlist_filename('twitter-dist', language, 'msgpack.gz')
            add_dep(lines, 'freqs2cB', input_file, output_cBpack,
                    extra='wordfreq_builder/word_counts.py')

            lines.append('default {}'.format(output_cBpack))

    return lines


def main():
    make_ninja_deps('rules.ninja')


if __name__ == '__main__':
    main()
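
For reference, this is roughly what `add_dep` appends for a step that carries
per-build parameters, such as the `split` step above (only 2 slices and shortened
paths here, purely for the example):

    lines = []
    add_dep(lines, 'split', 'data/raw-input/twitter/all-2014.txt',
            ['data/slices/twitter/tweets-2014.part00',
             'data/slices/twitter/tweets-2014.part01'],
            params={'prefix': 'data/slices/twitter/tweets-2014.part',
                    'slices': 2})
    print('\n'.join(lines))
    # build data/slices/twitter/tweets-2014.part00 data/slices/twitter/tweets-2014.part01: split data/raw-input/twitter/all-2014.txt
    #  prefix = data/slices/twitter/tweets-2014.part
    #  slices = 2
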
wordfreq_builder/wordfreq_builder/ninja2dot.py (new file, 29 lines)
@@ -0,0 +1,29 @@
import sys


def ninja_to_dot():
    def last_component(path):
        return path.split('/')[-1]

    print("digraph G {")
    print('rankdir="LR";')
    for line in sys.stdin:
        line = line.rstrip()
        parts = line.split(' ')
        if parts[0] == 'build':
            # the output file is the first argument; strip off the colon that
            # comes from ninja syntax
            outfile = last_component(parts[1][:-1])
            operation = parts[2]
            infiles = [last_component(part) for part in parts[3:]]
            for infile in infiles:
                if infile == '|':
                    # external dependencies start here; let's not graph those
                    break
                print('"%s" -> "%s" [label="%s"]' % (infile, outfile, operation))
    print("}")


if __name__ == '__main__':
    ninja_to_dot()
wordfreq_builder/wordfreq_builder/tests/test_tokenizer.py (new file, 51 lines)
@@ -0,0 +1,51 @@
from wordfreq_builder.tokenizers import cld2_surface_tokenizer, cld2_detect_language
from nose.tools import eq_


def test_tokenizer_1():
    text = '"This is a test," she said, "and I\'ll bet y\'all $3.50 that it won\'t fail."'
    tokens = [
        'this', 'is', 'a', 'test', 'she', 'said',
        'and', "i'll", 'bet', "y'all", '3', '50', 'that',
        'it', "won't", 'fail',
    ]
    result = cld2_surface_tokenizer(text)
    eq_(result[1], tokens)
    eq_(result[0], 'en')

def test_tokenizer_2():
    text = "i use punctuation informally...see?like this."
    tokens = [
        'i', 'use', 'punctuation', 'informally', 'see',
        'like', 'this'
    ]
    result = cld2_surface_tokenizer(text)
    eq_(result[1], tokens)
    eq_(result[0], 'en')

def test_tokenizer_3():
    text = "@ExampleHandle This parser removes twitter handles!"
    tokens = ['this', 'parser', 'removes', 'twitter', 'handles']
    result = cld2_surface_tokenizer(text)
    eq_(result[1], tokens)
    eq_(result[0], 'en')

def test_tokenizer_4():
    text = "This is a really boring example tco http://t.co/n15ASlkase"
    tokens = ['this', 'is', 'a', 'really', 'boring', 'example', 'tco']
    result = cld2_surface_tokenizer(text)
    eq_(result[1], tokens)
    eq_(result[0], 'en')


def test_language_recognizer_1():
    text = "Il est le meilleur livre que je ai jamais lu"
    result = cld2_detect_language(text)
    eq_(result, 'fr')

def test_language_recognizer_2():
    text = """A nuvem de Oort, também chamada de nuvem de Öpik-Oort,
    é uma nuvem esférica de planetesimais voláteis que se acredita
    localizar-se a cerca de 50 000 UA, ou quase um ano-luz, do Sol."""
    result = cld2_detect_language(text)
    eq_(result, 'pt')
wordfreq_builder/wordfreq_builder/tokenizers.py (new file, 115 lines)
@@ -0,0 +1,115 @@
from html.entities import name2codepoint
from wordfreq import tokenize, TOKEN_RE, NON_PUNCT_RANGE
import re
import pycld2

CLD2_BAD_CHAR_RANGE = "".join([
    '[',
    '\x00-\x08',
    '\x0b',
    '\x0e-\x1f',
    '\x7f-\x9f',
    '\ud800-\udfff',
    '\ufdd0-\ufdef'] +
    [chr(65534+65536*x+y) for x in range(17) for y in range(2)] +
    [']'])
CLD2_BAD_CHARS_RE = re.compile(CLD2_BAD_CHAR_RANGE)

TWITTER_HANDLE_RE = re.compile('@{0}+'.format(NON_PUNCT_RANGE))
TCO_RE = re.compile('http(?:s)?://t.co/[a-zA-Z0-9]+'.format(NON_PUNCT_RANGE))


def cld2_surface_tokenizer(text):
    """
    Uses CLD2 to detect the language and wordfreq tokenizer to create tokens
    """
    text = remove_handles_and_urls(text)
    lang = cld2_detect_language(text)
    tokens = tokenize(text, lang)
    return lang, tokens

def cld2_detect_language(text):
    """
    Uses CLD2 to detect the language
    """
    text = CLD2_BAD_CHARS_RE.sub('', text)
    return pycld2.detect(text)[2][0][1]

def remove_handles_and_urls(text):
    text = fix_entities(text)
    text = TWITTER_HANDLE_RE.sub('', text)
    text = TCO_RE.sub('', text)
    return text

def last_tab(line):
    """
    Read lines by keeping only the last tab-separated value.
    """
    return line.split('\t')[-1].strip()

def lowercase_text_filter(token):
    """
    If this looks like a token that we want to count, return it, lowercased.
    If not, filter it out by returning None.
    """
    if TOKEN_RE.search(token):
        return token.lower()
    else:
        return None

def tokenize_file(in_filename, out_prefix, tokenizer, line_reader=last_tab):
    """
    Process a file by running it through the given tokenizer, sorting the
    results by the language of each line, and inserting newlines
    to mark the token boundaries.
    """
    out_files = {}
    with open(in_filename, encoding='utf-8') as in_file:
        for line in in_file:
            text = line_reader(line)
            language, tokens = tokenizer(text)
            if language != 'un':
                tokenized = '\n'.join(tokens)
                out_filename = '%s.%s.txt' % (out_prefix, language)
                if out_filename in out_files:
                    out_file = out_files[out_filename]
                else:
                    out_file = open(out_filename, 'w', encoding='utf-8')
                    out_files[out_filename] = out_file
                print(tokenized, file=out_file)
    for out_file in out_files.values():
        out_file.close()

ENTITY_RE = re.compile(r'& ?(amp|quot|lt|gt) ?;')

def fix_entities(text):
    """
    Fix the few HTML entities that Twitter uses -- even if they've
    already been tokenized.
    """
    def replace_entity(match):
        return chr(name2codepoint[match.group(1)])
    return ENTITY_RE.sub(replace_entity, text)

def monolingual_tokenize_file(in_filename, out_filename, language,
                              tokenizer, line_reader=last_tab,
                              sample_proportion=1):
    """
    Process a file by running it through the given tokenizer, only keeping
    lines of the language we're asking for, and inserting newlines
    to mark the token boundaries.

    `line_reader` is applied to each line before it is given to the tokenizer.

    Only the first line out of every `sample_proportion` lines is run through
    the tokenizer.
    """
    with open(in_filename, encoding='utf-8', errors='replace') as in_file:
        with open(out_filename, 'w', encoding='utf-8') as out_file:
            for i, line in enumerate(in_file):
                if i % sample_proportion == 0:
                    text = line_reader(line)
                    tokens, line_language = tokenizer(text)
                    if line_language == language:
                        for token in tokens:
                            print(token, file=out_file)
wordfreq_builder/wordfreq_builder/word_counts.py (new file, 120 lines)
@@ -0,0 +1,120 @@
from wordfreq import simple_tokenize
from collections import defaultdict
from operator import itemgetter
from ftfy import fix_text
import math
import csv
import msgpack
import gzip


def count_tokens(filename):
    """
    Count tokens that appear in a file, running each line through our
    simple tokenizer.

    Unicode errors in the input data will become token boundaries.
    """
    counts = defaultdict(int)
    with open(filename, encoding='utf-8', errors='replace') as infile:
        for line in infile:
            for token in simple_tokenize(line.strip()):
                counts[token] += 1
    return counts


def read_freqs(filename, cutoff=0):
    """
    Read words and their frequencies from a CSV file.

    Only words with a frequency of at least `cutoff` are returned.

    If `cutoff` is greater than 0, the csv file must be sorted by frequency
    in descending order.
    """
    raw_counts = defaultdict(float)
    total = 0.
    with open(filename, encoding='utf-8', newline='') as infile:
        reader = csv.reader(infile)
        for key, strval in reader:
            val = float(strval)
            if val < cutoff:
                break
            for token in simple_tokenize(key):
                token = fix_text(token)
                total += val
                # Use += so that, if we give the reader concatenated files with
                # duplicates, it does the right thing
                raw_counts[token] += val

    freqs = {key: raw_count / total
             for (key, raw_count) in raw_counts.items()}
    return freqs


def freqs_to_cBpack(in_filename, out_filename, cutoff=-600):
    """
    Convert a csv file of words and their frequencies to a file in the
    idiosyncratic 'cBpack' format.

    Only words with a frequency of at least `cutoff` centibels will be
    written to the new file.
    """
    freq_cutoff = 10 ** (cutoff / 100.)
    freqs = read_freqs(in_filename, freq_cutoff)
    cBpack = []
    for token, freq in freqs.items():
        cB = round(math.log10(freq) * 100)
        if cB >= cutoff:
            neg_cB = -cB
            while neg_cB >= len(cBpack):
                cBpack.append([])
            cBpack[neg_cB].append(token)

    for sublist in cBpack:
        sublist.sort()

    # Write a "header" consisting of a dictionary at the start of the file
    cBpack_data = [{'format': 'cB', 'version': 1}] + cBpack

    with gzip.open(out_filename, 'wb') as outfile:
        msgpack.dump(cBpack_data, outfile)


def merge_freqs(freq_dicts):
    """
    Merge multiple dictionaries of frequencies, representing each word with
    the word's average frequency over all sources.
    """
    vocab = set()
    for freq_dict in freq_dicts:
        vocab |= set(freq_dict)

    merged = defaultdict(float)
    N = len(freq_dicts)
    for term in vocab:
        term_total = 0.
        for freq_dict in freq_dicts:
            term_total += freq_dict.get(term, 0.)
        merged[term] = term_total / N

    return merged


def write_wordlist(freqs, filename, cutoff=1e-8):
    """
    Write a dictionary of either raw counts or frequencies to a file of
    comma-separated values.

    Keep the CSV format simple by explicitly skipping words containing
    commas or quotation marks. We don't believe we want those in our tokens
    anyway.
    """
    with open(filename, 'w', encoding='utf-8', newline='\n') as outfile:
        writer = csv.writer(outfile)
        items = sorted(freqs.items(), key=itemgetter(1), reverse=True)
        for word, freq in items:
            if freq < cutoff:
                break
            if not ('"' in word or ',' in word):
                writer.writerow([word, str(freq)])
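
To make the 'cBpack' layout concrete: a word with frequency 0.001 has
log10(0.001) = -3, i.e. -300 centibels, so it is stored in bucket 300 of the
list. Below is a hedged sketch of reading such a file back; the filename is
hypothetical, and `raw=False` assumes a reasonably recent msgpack:

    import gzip
    import msgpack

    # Read a file written by freqs_to_cBpack and show the most frequent bucket.
    with gzip.open('combined_en.msgpack.gz', 'rb') as infile:
        data = msgpack.load(infile, raw=False)

    header, buckets = data[0], data[1:]
    assert header == {'format': 'cB', 'version': 1}
    for neg_cB, words in enumerate(buckets):
        if words:
            # Words in bucket n have a frequency of roughly 10 ** (-n / 100).
            print(neg_cB, words[:5])
            break
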