remove wordfreq_builder (obsoleted by exquisite-corpus)

Rob Speer 2017-01-04 17:45:53 -05:00
parent b3e5d1c9e9
commit 6171b3d066
25 changed files with 0 additions and 1567 deletions


@ -1,12 +0,0 @@
*.pyc
__pycache__
.coverage
.idea
dist
*.egg-info
build
_build
build.ninja
data
.ninja_deps
.ninja_log


@ -1,8 +0,0 @@
PYTHON = python
all: build.ninja
# build the Ninja file that will take over the build process
build.ninja: rules.ninja wordfreq_builder/ninja.py wordfreq_builder/config.py wordfreq_builder.egg-info/PKG-INFO
$(PYTHON) -m wordfreq_builder.cli.build_deps rules.ninja > build.ninja


@ -1,194 +0,0 @@
# wordfreq\_builder
This package builds the data files for [wordfreq](https://github.com/LuminosoInsight/wordfreq).
It requires a fair amount of external input data (42 GB of it, as of this
writing), which we unfortunately don't yet have a plan for distributing
outside of Luminoso.
The data can be publicly obtained in various ways, so here we'll at least
document where it comes from. We hope to come up with a process that's more
reproducible eventually.
The good news is that you don't need to be able to run this process to use
wordfreq. The built results are already in the `wordfreq/data` directory.
## How to build it
Set up your external hard disk, your networked file system, or whatever thing
you have that's got a couple hundred GB of space free. Let's suppose the
directory of it that you want to use is called `/ext/data`.
Get the input data. At Luminoso, this is available in the directory
`/nfs/broadway/data/wordfreq_builder`. The sections below explain where the
data comes from.
Copy the input data:
cp -rv /nfs/broadway/data/wordfreq_builder /ext/data/
Make a symbolic link so that `data/` in this directory points to
your copy of the input data:
ln -s /ext/data/wordfreq_builder data
Install the Ninja build system:
sudo apt-get install ninja-build
We need to build a Ninja build file using the Python code in
`wordfreq_builder/ninja.py`. We could do this with Ninja, but... you see the
chicken-and-egg problem, don't you. So this is the one thing the Makefile
knows how to do.
make
Start the build, and find something else to do for a few hours:
ninja -v
You can copy the results into wordfreq with this command:
cp data/dist/*.msgpack.gz ../wordfreq/data/
## The Ninja build process
Ninja is a lot like Make, except for one big difference, which you can see as
a drawback or an advantage: instead of writing bizarre expressions in an
idiosyncratic language to let Make calculate which files depend on which other
files...
...you just tell Ninja which files depend on which other files.
The Ninja documentation suggests using your favorite scripting language to
create the dependency list, so that's what we've done in `ninja.py`.
Dependencies in Ninja refer to build rules. These do need to be written by hand
in Ninja's own format, but the task is simpler. In this project, the build
rules are defined in `rules.ninja`. They'll be concatenated with the
Python-generated dependency definitions to form the complete build file,
`build.ninja`, which is the default file that Ninja looks at when you run
`ninja`.
So a lot of the interesting work in this package is done in `rules.ninja`.
This file defines shorthand names for long commands. As a simple example,
the rule named `format_twitter` applies the command
python -m wordfreq_builder.cli.format_twitter $in $out
to the dependency file `$in` and the output file `$out`.
The specific rules are described by the comments in `rules.ninja`.
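To make this concrete, here's a minimal, hypothetical sketch of how such a
`build` statement can be generated from Python (simplified from the `add_dep`
function in `wordfreq_builder/ninja.py`; the filenames here are made up):

    def build_statement(rule, inputs, output, params=None):
        """Return the lines of one Ninja 'build' statement."""
        lines = ['build {}: {} {}'.format(output, rule, ' '.join(inputs))]
        # Rule variables (such as $prefix or $slices) go on indented lines
        # under the build statement in Ninja syntax.
        for key, val in (params or {}).items():
            lines.append('  {} = {}'.format(key, val))
        return lines

    print('\n'.join(build_statement(
        'format_twitter',
        ['data/slices/twitter/tweets-2014.part00'],
        'data/generated/twitter/tweets-2014.part00.en.txt',
    )))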
## Data sources
### Leeds Internet Corpus
Also known as the "Web as Corpus" project, this is a University of Leeds
project that collected wordlists in assorted languages by crawling the Web.
The results are messy, but they're something. We've been using them for quite
a while.
These files can be downloaded from the [Leeds corpus page][leeds].
The original files are in `data/source-lists/leeds`, and they're processed
by the `convert_leeds` rule in `rules.ninja`.
[leeds]: http://corpus.leeds.ac.uk/list.html
### Twitter
The file `data/raw-input/twitter/all-2014.txt` contains about 72 million tweets
collected by the `ftfy.streamtester` package in 2014.
We are not allowed to distribute the text of tweets. However, this process could
be reproduced by running `ftfy.streamtester`, part of the [ftfy][] package, for
a couple of weeks.
[ftfy]: https://github.com/LuminosoInsight/python-ftfy
### Google Books
We use English word frequencies from [Google Books Syntactic Ngrams][gbsn].
We pretty much ignore the syntactic information, and only use this version
because it's cleaner. The data comes in the form of 99 gzipped text files in
`data/raw-input/google-books`.
[gbsn]: http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html
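As a rough illustration (the real build step is the
`convert_google_syntactic_ngrams` rule in `rules.ninja`), the conversion
amounts to something like the following, assuming the word is in column 1 and
its total count in column 3:

    import gzip

    def google_books_counts(path):
        """Yield (word, count) pairs for single words with a count of 100+."""
        with gzip.open(path, 'rt', encoding='utf-8') as infile:
            for line in infile:
                fields = line.rstrip('\n').split('\t')
                word, count = fields[0], int(fields[2])
                # Skip multi-word entries, rare entries, and anything that
                # would confuse the simple CSV output used downstream.
                if ' ' in word or count < 100 or ',' in word or '"' in word:
                    continue
                yield word, count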
### Wikipedia
Another source we use is the full text of Wikipedia in various languages. This
text can be difficult to extract efficiently, and for this purpose we use a
custom tool written in Nim 0.11, called [wiki2text][]. To build the Wikipedia
data, you need to separately install Nim and wiki2text.
The input data files are the XML dumps that can be found on the [Wikimedia
backup index][wikidumps]. For example, to get the latest French data, go to
https://dumps.wikimedia.org/frwiki/latest and look for the filename of the form
`*.pages-articles.xml.bz2`. If this file isn't there, look for an older dump
where it is. You'll need to download such a file for each language that's
configured for Wikipedia in `wordfreq_builder/config.py`.
[wiki2text]: https://github.com/rspeer/wiki2text
[wikidumps]: https://dumps.wikimedia.org/backup-index.html
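The build picks the most recent matching dump for each language. Here's a
small sketch of that selection step, mirroring `wikipedia_deps()` in
`wordfreq_builder/ninja.py` (date-stamped dump filenames sort chronologically,
so the maximum glob match is the newest dump):

    import pathlib

    def latest_wikipedia_dump(data_dir, language):
        dumps = pathlib.Path(data_dir).glob('{}wiki*.bz2'.format(language))
        return max(dumps)

    print(latest_wikipedia_dump('data/raw-input/wikipedia', 'fr'))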
### OpenSubtitles
[Hermit Dave](https://invokeit.wordpress.com/frequency-word-lists/) made word
frequency lists out of the subtitle text on OpenSubtitles. This data was
used to make Wiktionary word frequency lists at one point, but it's been
updated significantly since the version Wiktionary got.
The wordlists are in `data/source-lists/opensubtitles`.
In order to fit into the wordfreq pipeline, we renamed lists with different
variants of the same language code, distinguishing them fully according to
BCP 47. Then we concatenated the different variants into a single list, as
follows (a sketch of this step appears after the list):
* `zh_tw.txt` was renamed to `zh-Hant.txt`
* `zh_cn.txt` was renamed to `zh-Hans.txt`
* `zh.txt` was renamed to `zh-Hani.txt`
* `zh-Hant.txt`, `zh-Hans.txt`, and `zh-Hani.txt` were concatenated into `zh.txt`
* `pt.txt` was renamed to `pt-PT.txt`
* `pt_br.txt` was renamed to `pt-BR.txt`
* `pt-BR.txt` and `pt-PT.txt` were concatenated into `pt.txt`
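Here's a rough sketch of that one-time renaming and concatenation step (done
by hand in `data/source-lists/opensubtitles`; it is not part of the Ninja
build):

    import shutil

    renames = {
        'zh_tw.txt': 'zh-Hant.txt',
        'zh_cn.txt': 'zh-Hans.txt',
        'zh.txt': 'zh-Hani.txt',
        'pt.txt': 'pt-PT.txt',
        'pt_br.txt': 'pt-BR.txt',
    }
    concatenations = {
        'zh.txt': ['zh-Hant.txt', 'zh-Hans.txt', 'zh-Hani.txt'],
        'pt.txt': ['pt-BR.txt', 'pt-PT.txt'],
    }

    # Rename first, so the new combined zh.txt and pt.txt don't collide with
    # the original files of those names.
    for old_name, new_name in renames.items():
        shutil.move(old_name, new_name)
    for combined, parts in concatenations.items():
        with open(combined, 'w', encoding='utf-8') as outfile:
            for part in parts:
                with open(part, encoding='utf-8') as infile:
                    shutil.copyfileobj(infile, outfile)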
We also edited the English data to re-add "'t" to words that had obviously
lost it, such as "didn" in place of "didn't". We applied this correction to
words that became much less common words once the "'t" was stripped. As a
result, this wordlist no longer represents the words "don" and "won", because
we assume most of their observed frequency actually comes from "don't" and
"won't". Words that turned into similarly common words, however, were left
alone: this list doesn't represent "can't", because that word was left as
"can".
### SUBTLEX
Marc Brysbaert gave us permission by e-mail to use the SUBTLEX word lists in
wordfreq and derived works without the "academic use" restriction, under the
following reasonable conditions:
- Wordfreq and code derived from it must credit the SUBTLEX authors.
(See the citations in the top-level `README.md` file.)
- It must remain clear that SUBTLEX is freely available data.
`data/source-lists/subtlex` contains the following files:
- `subtlex.de.txt`, which was downloaded as [SUBTLEX-DE raw file.xlsx][subtlex-de],
and exported from Excel format to tab-separated UTF-8 using LibreOffice
- `subtlex.en-US.txt`, which was downloaded as [subtlexus5.zip][subtlex-us],
extracted, and converted from ISO-8859-1 to UTF-8
- `subtlex.en-GB.txt`, which was downloaded as
[SUBTLEX-UK\_all.xlsx][subtlex-uk], and exported from Excel format to
tab-separated UTF-8 using LibreOffice
- `subtlex.nl.txt`, which was downloaded as
[SUBTLEX-NL.cd-above2.txt.zip][subtlex-nl] and extracted
- `subtlex.zh.txt`, which was downloaded as
[subtlexch131210.zip][subtlex-ch] and extracted
[subtlex-de]: http://crr.ugent.be/SUBTLEX-DE/SUBTLEX-DE%20raw%20file.xlsx
[subtlex-us]: http://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus/subtlexus5.zip
[subtlex-uk]: http://crr.ugent.be/papers/SUBTLEX-UK_all.xlsx
[subtlex-nl]: http://crr.ugent.be/subtlex-nl/SUBTLEX-NL.cd-above2.txt.zip
[subtlex-ch]: http://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexch/subtlexch131210.zip

Binary file not shown (a 1.9 MiB image).
Binary file not shown.


@ -1,117 +0,0 @@
# This defines the rules on how to build parts of the wordfreq lists, using the
# Ninja build system:
#
# http://martine.github.io/ninja/manual.html
#
# Ninja is available in the 'ninja-build' Ubuntu package. It's like make with
# better parallelism and the ability for build steps to produce multiple
# outputs. The tradeoff is that its rule syntax isn't full of magic for
# expanding wildcards and finding dependencies, so in general you have to
# write the dependencies using a script.
#
# This file will become the header of the larger build.ninja file, which also
# contains the programmatically-defined dependency graph.
# Variables
JQ = lib/jq-linux64
# How to build the build.ninja file itself. (Use the Makefile to get it the
# first time.)
rule build_deps
command = python -m wordfreq_builder.cli.build_deps $in > $out
# Splits the single file $in into $slices parts, whose names will be
# $prefix plus a two-digit numeric suffix.
rule split
command = mkdir -p $$(dirname $prefix) && split -d -n r/$slices $in $prefix
# wiki2text is a tool I wrote using Nim 0.11, which extracts plain text from
# Wikipedia dumps obtained from dumps.wikimedia.org. The code is at
# https://github.com/rspeer/wiki2text.
rule wiki2text
command = bunzip2 -c $in | wiki2text > $out
# To tokenize Japanese, we run it through Mecab and take the first column.
rule tokenize_japanese
command = mecab -b 1048576 < $in | cut -f 1 | grep -v "EOS" > $out
# Process Chinese by converting all Traditional Chinese characters to
# Simplified equivalents -- not because that's a good way to get readable
# text, but because that's how we're going to look them up.
rule simplify_chinese
command = python -m wordfreq_builder.cli.simplify_chinese < $in > $out
# Tokenizing text from Twitter requires us to language-detect and tokenize
# in the same step.
rule tokenize_twitter
command = mkdir -p $$(dirname $prefix) && python -m wordfreq_builder.cli.tokenize_twitter $in $prefix
rule tokenize_reddit
command = mkdir -p $$(dirname $prefix) && python -m wordfreq_builder.cli.tokenize_reddit $in $prefix
# To convert the Leeds corpus, look for space-separated lines that start with
# an integer and a decimal. The integer is the rank, which we discard. The
# decimal is the frequency, and the remaining text is the term. Use sed -n
# with /p to output only lines where the match was successful.
#
# Grep out the term "EOS", an indication that Leeds used MeCab and didn't
# strip out the EOS lines.
rule convert_leeds
command = sed -rn 's/([0-9]+) ([0-9.]+) (.*)/\3,\2/p' < $in | grep -v 'EOS,' > $out
# To convert the OpenSubtitles frequency data, simply replace spaces with
# commas.
rule convert_opensubtitles
command = tr ' ' ',' < $in > $out
# To convert SUBTLEX, we take the 1st and Nth columns, strip the header,
# run it through ftfy, convert tabs to commas and spurious CSV formatting to
# spaces, and remove lines with unfixable half-mojibake.
rule convert_subtlex
command = cut -f $textcol,$freqcol $in | tail -n +$startrow | ftfy | tr ' ",' ', ' | grep -v 'â,' > $out
rule convert_jieba
command = cut -d ' ' -f 1,2 $in | grep -v '[,"]' | tr ' ' ',' > $out
rule counts_to_jieba
command = python -m wordfreq_builder.cli.counts_to_jieba $in $out
# Convert and clean up the Google Books Syntactic N-grams data. Concatenate all
# the input files, keep only the single words and their counts, and only keep
# lines with counts of 100 or more.
#
# (These will still be repeated as the word appears in different grammatical
# roles, information that the source data provides but that we discard. The
# source data was already filtered to only show words in roles with at least
# two-digit counts of occurrences.)
rule convert_google_syntactic_ngrams
command = zcat $in | cut -f 1,3 | grep -v '[,"]' | sed -rn 's/(.*)\s(...+)/\1,\2/p' > $out
rule count
command = python -m wordfreq_builder.cli.count_tokens $in $out
rule count_langtagged
command = python -m wordfreq_builder.cli.count_tokens_langtagged $in $out -l $language
rule merge
command = python -m wordfreq_builder.cli.merge_freqs -o $out -c $cutoff -l $lang $in
rule merge_counts
command = python -m wordfreq_builder.cli.merge_counts -o $out -c $cutoff $in
rule freqs2cB
command = python -m wordfreq_builder.cli.freqs_to_cB $in $out -b $buckets
rule cat
command = cat $in > $out
# A pipeline that extracts text from Reddit comments:
# - Unzip the input files
# - Select the body of comments, but only those whose Reddit score is positive
# (skipping the downvoted ones)
# - Skip deleted comments
# - Replace HTML escapes
rule extract_reddit
command = bunzip2 -c $in | $JQ -r 'select(.score > 0) | .body' | fgrep -v '[deleted]' | sed 's/&gt;/>/g' | sed 's/&lt;/</g' | sed 's/&amp;/\&/g' > $out


@ -1,13 +0,0 @@
from setuptools import setup
setup(
name="wordfreq_builder",
version='0.2',
maintainer='Luminoso Technologies, Inc.',
maintainer_email='info@luminoso.com',
url='http://github.com/LuminosoInsight/wordfreq_builder',
platforms=["any"],
description="Turns raw data into word frequency lists",
packages=['wordfreq_builder'],
install_requires=['msgpack-python', 'pycld2', 'langcodes']
)


@ -1,51 +0,0 @@
from wordfreq_builder.tokenizers import cld2_surface_tokenizer, cld2_detect_language
from nose.tools import eq_
def test_tokenizer_1():
text = '"This is a test," she said, "and I\'ll bet y\'all $3.50 that it won\'t fail."'
tokens = [
'this', 'is', 'a', 'test', 'she', 'said',
'and', "i'll", 'bet', "y", "all", '3.50', 'that',
'it', "won't", 'fail',
]
result = cld2_surface_tokenizer(text)
eq_(result[1], tokens)
eq_(result[0], 'en')
def test_tokenizer_2():
text = "i use punctuation informally...see?like this."
tokens = [
'i', 'use', 'punctuation', 'informally', 'see',
'like', 'this'
]
result = cld2_surface_tokenizer(text)
eq_(result[1], tokens)
eq_(result[0], 'en')
def test_tokenizer_3():
text = "@ExampleHandle This parser removes twitter handles!"
tokens = ['this', 'parser', 'removes', 'twitter', 'handles']
result = cld2_surface_tokenizer(text)
eq_(result[1], tokens)
eq_(result[0], 'en')
def test_tokenizer_4():
text = "This is a really boring example tco http://t.co/n15ASlkase"
tokens = ['this', 'is', 'a', 'really', 'boring', 'example', 'tco']
result = cld2_surface_tokenizer(text)
eq_(result[1], tokens)
eq_(result[0], 'en')
def test_language_recognizer_1():
text = "Il est le meilleur livre que je ai jamais lu"
result = cld2_detect_language(text)
eq_(result, 'fr')
def test_language_recognizer_2():
text = """A nuvem de Oort, também chamada de nuvem de Öpik-Oort,
é uma nuvem esférica de planetesimais voláteis que se acredita
localizar-se a cerca de 50 000 UA, ou quase um ano-luz, do Sol."""
result = cld2_detect_language(text)
eq_(result, 'pt')


@ -1,20 +0,0 @@
from wordfreq_builder.word_counts import URL_RE
from nose.tools import eq_
def check_url(url):
match = URL_RE.match(url)
assert match
eq_(match.span(), (0, len(url)))
def test_url_re():
# URLs like this are all over the Arabic Wikipedia. Here's one with the
# student ID blanked out.
yield check_url, 'http://www.ju.edu.jo/alumnicard/0000000.aspx'
yield check_url, 'https://example.com/űnicode.html'
yield check_url, 'http://☃.net'
assert not URL_RE.match('ftp://127.0.0.1')


@ -1,15 +0,0 @@
from wordfreq_builder.ninja import make_ninja_deps
import argparse
def main():
parser = argparse.ArgumentParser()
parser.add_argument('in_filename', help='filename of rules file')
args = parser.parse_args()
# Make the complete ninja file and write it to standard out
make_ninja_deps(args.in_filename)
if __name__ == '__main__':
main()


@ -1,15 +0,0 @@
from wordfreq_builder.word_counts import count_tokens, write_wordlist
import argparse
def handle_counts(filename_in, filename_out):
counts = count_tokens(filename_in)
write_wordlist(counts, filename_out)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('filename_in', help='name of input file containing tokens')
parser.add_argument('filename_out', help='name of output file')
args = parser.parse_args()
handle_counts(args.filename_in, args.filename_out)


@ -1,21 +0,0 @@
"""
Count tokens of text in a particular language, taking input from a
tab-separated file whose first column is a language code. Lines in all
languages except the specified one will be skipped.
"""
from wordfreq_builder.word_counts import count_tokens_langtagged, write_wordlist
import argparse
def handle_counts(filename_in, filename_out, lang):
counts = count_tokens_langtagged(filename_in, lang)
write_wordlist(counts, filename_out)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('filename_in', help='name of input file containing tokens')
parser.add_argument('filename_out', help='name of output file')
parser.add_argument('-l', '--language', help='language tag to filter lines for')
args = parser.parse_args()
handle_counts(args.filename_in, args.filename_out, args.language)


@ -1,15 +0,0 @@
from wordfreq_builder.word_counts import read_values, write_jieba
import argparse
def handle_counts(filename_in, filename_out):
freqs, total = read_values(filename_in, cutoff=1e-6)
write_jieba(freqs, filename_out)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('filename_in', help='name of input wordlist')
parser.add_argument('filename_out', help='name of output Jieba-compatible wordlist')
args = parser.parse_args()
handle_counts(args.filename_in, args.filename_out)


@ -1,14 +0,0 @@
from wordfreq_builder.word_counts import freqs_to_cBpack
import argparse
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('filename_in', help='name of input file containing tokens')
parser.add_argument('filename_out', help='name of output file')
parser.add_argument('-b', '--buckets', type=int, default=600,
help='Number of centibel buckets to include (default 600). '
'Increasing this number creates a longer wordlist with '
'rarer words.')
args = parser.parse_args()
freqs_to_cBpack(args.filename_in, args.filename_out, cutoff=-(args.buckets))


@ -1,25 +0,0 @@
from wordfreq_builder.word_counts import read_values, merge_counts, write_wordlist
import argparse
def merge_lists(input_names, output_name, cutoff=0, max_words=1000000):
count_dicts = []
for input_name in input_names:
values, total = read_values(input_name, cutoff=cutoff, max_words=max_words)
count_dicts.append(values)
merged = merge_counts(count_dicts)
write_wordlist(merged, output_name)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('-o', '--output', default='combined-counts.csv',
help='filename to write the output to')
parser.add_argument('-c', '--cutoff', type=int, default=0,
help='minimum count to read from an input file')
parser.add_argument('-m', '--max-words', type=int, default=1000000,
help='maximum number of words to read from each list')
parser.add_argument('inputs', nargs='+',
help='names of input files to merge')
args = parser.parse_args()
merge_lists(args.inputs, args.output, cutoff=args.cutoff, max_words=args.max_words)


@ -1,31 +0,0 @@
from wordfreq_builder.word_counts import read_freqs, merge_freqs, write_wordlist
import argparse
def merge_lists(input_names, output_name, cutoff, lang):
freq_dicts = []
# Don't use Chinese tokenization while building wordlists, as that would
# create a circular dependency.
if lang == 'zh':
lang = None
for input_name in input_names:
freq_dicts.append(read_freqs(input_name, cutoff=cutoff, lang=lang))
merged = merge_freqs(freq_dicts)
write_wordlist(merged, output_name)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('-o', '--output', default='combined-freqs.csv',
help='filename to write the output to')
parser.add_argument('-c', '--cutoff', type=int, default=2,
help='stop after seeing a count below this')
parser.add_argument('-l', '--language', default=None,
help='language code for which language the words are in')
parser.add_argument('inputs', nargs='+',
help='names of input files to merge')
args = parser.parse_args()
merge_lists(args.inputs, args.output, args.cutoff, args.language)


@ -1,11 +0,0 @@
from wordfreq.chinese import simplify_chinese
import sys
def main():
for line in sys.stdin:
sys.stdout.write(simplify_chinese(line))
if __name__ == '__main__':
main()


@ -1,18 +0,0 @@
from wordfreq_builder.tokenizers import cld2_surface_tokenizer, tokenize_by_language
import argparse
def reddit_tokenizer(text):
return cld2_surface_tokenizer(text, mode='reddit')
def main():
parser = argparse.ArgumentParser()
parser.add_argument('filename', help='filename of input file containing one comment per line')
parser.add_argument('outprefix', help='prefix of output filenames')
args = parser.parse_args()
tokenize_by_language(args.filename, args.outprefix, tokenizer=reddit_tokenizer)
if __name__ == '__main__':
main()


@ -1,14 +0,0 @@
from wordfreq_builder.tokenizers import cld2_surface_tokenizer, tokenize_by_language
import argparse
def main():
parser = argparse.ArgumentParser()
parser.add_argument('filename', help='filename of input file containing one tweet per line')
parser.add_argument('outprefix', help='prefix of output filenames')
args = parser.parse_args()
tokenize_by_language(args.filename, args.outprefix, tokenizer=cld2_surface_tokenizer)
if __name__ == '__main__':
main()


@ -1,131 +0,0 @@
import os
CONFIG = {
# data_dir is a relative or absolute path to where the wordlist data
# is stored
'data_dir': 'data',
'sources': {
# A list of language codes that we'll look up in filenames for these
# various data sources.
#
# Consider adding:
# 'th' when we get tokenization for it
# 'tl' with one more data source
# 'el' if we can filter out kaomoji
'twitter': [
'ar', 'ca', 'de', 'en', 'es', 'fr', 'he', 'hi', 'id', 'it',
'ja', 'ko', 'ms', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr'
],
# Languages with large Wikipedias. (Languages whose Wikipedia dump is
# at least 200 MB of .xml.bz2 are included. Some widely-spoken
# languages with 100 MB are also included, specifically Malay and
# Hindi.)
'wikipedia': [
'ar', 'ca', 'de', 'el', 'en', 'es', 'fr', 'he', 'hi', 'id', 'it',
'ja', 'ko', 'ms', 'nb', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr', 'zh',
'bg', 'da', 'fi', 'hu', 'ro', 'uk'
],
'opensubtitles': [
# This list includes languages where the most common word in
# OpenSubtitles appears at least 5000 times. However, we exclude
# languages where SUBTLEX has apparently done a better job,
# specifically German and Chinese.
'ar', 'bg', 'bs', 'ca', 'cs', 'da', 'el', 'en', 'es', 'et',
'fa', 'fi', 'fr', 'he', 'hr', 'hu', 'id', 'is', 'it', 'lt', 'lv',
'mk', 'ms', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sq',
'sr', 'sv', 'tr', 'uk'
],
'leeds': [
'ar', 'de', 'el', 'en', 'es', 'fr', 'it', 'ja', 'pt', 'ru', 'zh'
],
'google-books': [
'en',
# Using the 2012 data, we could get French, German, Italian,
# Russian, Spanish, and (Simplified) Chinese.
],
'subtlex-en': ['en'],
'subtlex-other': ['de', 'nl', 'zh'],
'jieba': ['zh'],
# About 99.2% of Reddit is in English. There are pockets of
# conversation in other languages, some of which may not be
# representative enough for learning general word frequencies.
#
# However, there seem to be Spanish subreddits that are general enough
# (including /r/es and /r/mexico).
'reddit': ['en', 'es'],
# Well-represented languages in the Common Crawl
# It's possible we could add 'uk' to the list, needs more checking
'commoncrawl': [
'ar', 'bg', 'cs', 'da', 'de', 'el', 'es', 'fa', 'fi', 'fr',
'he', 'hi', 'hu', 'id', 'it', 'ja', 'ko', 'ms', 'nb', 'nl',
'pl', 'pt', 'ro', 'ru', 'sk', 'sv', 'ta', 'tr', 'vi', 'zh'
],
},
# Subtlex languages that need to be pre-processed
'wordlist_paths': {
'twitter': 'generated/twitter/tweets-2014.{lang}.{ext}',
'wikipedia': 'generated/wikipedia/wikipedia_{lang}.{ext}',
'opensubtitles': 'generated/opensubtitles/opensubtitles_{lang}.{ext}',
'leeds': 'generated/leeds/leeds_internet_{lang}.{ext}',
'google-books': 'generated/google-books/google_books_{lang}.{ext}',
'commoncrawl': 'generated/commoncrawl/commoncrawl_{lang}.{ext}',
'subtlex-en': 'generated/subtlex/subtlex_{lang}.{ext}',
'subtlex-other': 'generated/subtlex/subtlex_{lang}.{ext}',
'jieba': 'generated/jieba/jieba_{lang}.{ext}',
'reddit': 'generated/reddit/reddit_{lang}.{ext}',
'combined': 'generated/combined/combined_{lang}.{ext}',
'combined-dist': 'dist/combined_{lang}.{ext}',
'combined-dist-large': 'dist/large_{lang}.{ext}',
'twitter-dist': 'dist/twitter_{lang}.{ext}',
'jieba-dist': 'dist/jieba_{lang}.{ext}'
},
'min_sources': 3,
'big-lists': ['en', 'fr', 'es', 'pt', 'de', 'ar', 'it', 'nl', 'ru'],
# When dealing with language tags that come straight from cld2, we need
# to un-standardize a few of them
'cld2-language-aliases': {
'nb': 'no',
'he': 'iw',
'jw': 'jv'
}
}
def data_filename(filename):
"""
Convert a relative filename to a path inside the configured data_dir.
"""
return os.path.join(CONFIG['data_dir'], filename)
def wordlist_filename(source, language, extension='txt'):
"""
Get the path where a particular built wordlist should go, parameterized by
its language and its file extension.
"""
path = CONFIG['wordlist_paths'][source].format(
lang=language, ext=extension
)
return data_filename(path)
def source_names(language):
"""
Get the names of data sources that supply data for the given language.
"""
return sorted(key for key in CONFIG['sources']
if language in CONFIG['sources'][key])
def all_languages():
"""
Get all languages that should have their data built, which is those that
are supported by at least `min_sources` sources.
"""
languages = set()
for langlist in CONFIG['sources'].values():
languages |= set(langlist)
return [lang for lang in sorted(languages)
if len(source_names(lang)) >= CONFIG['min_sources']]


@ -1,421 +0,0 @@
from wordfreq_builder.config import (
CONFIG, data_filename, wordlist_filename, all_languages, source_names
)
import sys
import pathlib
import itertools
from collections import defaultdict
HEADER = """# This file is automatically generated. Do not edit it.
# You can change its behavior by editing wordfreq_builder/ninja.py,
# and regenerate it by running 'make'.
"""
TMPDIR = data_filename('tmp')
def add_dep(lines, rule, input, output, extra=None, params=None):
if isinstance(output, list):
output = ' '.join(output)
if isinstance(input, list):
input = ' '.join(input)
if extra:
if isinstance(extra, list):
extra = ' '.join(extra)
extrastr = ' | ' + extra
else:
extrastr = ''
build_rule = "build {output}: {rule} {input}{extra}".format(
output=output, rule=rule, input=input, extra=extrastr
)
lines.append(build_rule)
if params:
for key, val in params.items():
lines.append(" {key} = {val}".format(key=key, val=val))
lines.append("")
def make_ninja_deps(rules_filename, out=sys.stdout):
"""
Output a complete Ninja file describing how to build the wordfreq data.
"""
print(HEADER, file=out)
# Copy in the rules section
with open(rules_filename, encoding='utf-8') as rulesfile:
print(rulesfile.read(), file=out)
lines = []
# The first dependency is to make sure the build file is up to date.
add_dep(lines, 'build_deps', 'rules.ninja', 'build.ninja',
extra='wordfreq_builder/ninja.py')
lines.extend(itertools.chain(
twitter_deps(
data_filename('raw-input/twitter/all-2014.txt'),
slice_prefix=data_filename('slices/twitter/tweets-2014'),
combined_prefix=data_filename('generated/twitter/tweets-2014'),
slices=40,
languages=CONFIG['sources']['twitter']
),
wikipedia_deps(
data_filename('raw-input/wikipedia'),
CONFIG['sources']['wikipedia']
),
google_books_deps(
data_filename('raw-input/google-books')
),
leeds_deps(
data_filename('source-lists/leeds'),
CONFIG['sources']['leeds']
),
opensubtitles_deps(
data_filename('source-lists/opensubtitles'),
CONFIG['sources']['opensubtitles']
),
subtlex_en_deps(
data_filename('source-lists/subtlex'),
CONFIG['sources']['subtlex-en']
),
subtlex_other_deps(
data_filename('source-lists/subtlex'),
CONFIG['sources']['subtlex-other']
),
reddit_deps(
data_filename('raw-input/reddit'),
CONFIG['sources']['reddit']
),
jieba_deps(
data_filename('source-lists/jieba'),
CONFIG['sources']['jieba']
),
commoncrawl_deps(
data_filename('raw-input/commoncrawl'),
CONFIG['sources']['commoncrawl']
),
combine_lists(all_languages())
))
print('\n'.join(lines), file=out)
def wikipedia_deps(dirname_in, languages):
lines = []
path_in = pathlib.Path(dirname_in)
for language in languages:
# Find the most recent file for this language
input_file = max(path_in.glob('{}wiki*.bz2'.format(language)))
plain_text_file = wordlist_filename('wikipedia', language, 'txt')
count_file = wordlist_filename('wikipedia', language, 'counts.txt')
add_dep(lines, 'wiki2text', input_file, plain_text_file)
if language == 'ja':
mecab_token_file = wordlist_filename(
'wikipedia', language, 'mecab-tokens.txt'
)
add_dep(
lines, 'tokenize_japanese', plain_text_file, mecab_token_file
)
add_dep(lines, 'count', mecab_token_file, count_file)
else:
add_dep(lines, 'count', plain_text_file, count_file)
return lines
def commoncrawl_deps(dirname_in, languages):
lines = []
for language in languages:
if language in CONFIG['cld2-language-aliases']:
language_alias = CONFIG['cld2-language-aliases'][language]
else:
language_alias = language
input_file = dirname_in + '/{}.txt.gz'.format(language_alias)
count_file = wordlist_filename('commoncrawl', language, 'counts.txt')
add_dep(lines, 'count_langtagged', input_file, count_file, params={'language': language_alias})
return lines
def google_books_deps(dirname_in):
# Get English data from the split-up files of the Google Syntactic N-grams
# 2013 corpus.
lines = []
# Yes, the files are numbered 00 through 98 of 99. This is not an
# off-by-one error. Not on my part, anyway.
input_files = [
'{}/nodes.{:>02d}-of-99.gz'.format(dirname_in, i)
for i in range(99)
]
output_file = wordlist_filename('google-books', 'en', 'counts.txt')
add_dep(lines, 'convert_google_syntactic_ngrams', input_files, output_file)
return lines
def twitter_deps(input_filename, slice_prefix, combined_prefix, slices,
languages):
lines = []
slice_files = ['{prefix}.part{num:0>2d}'.format(prefix=slice_prefix,
num=num)
for num in range(slices)]
# split the input into slices
add_dep(lines, 'split', input_filename, slice_files,
params={'prefix': '{}.part'.format(slice_prefix),
'slices': slices})
for slicenum in range(slices):
slice_file = slice_files[slicenum]
language_outputs = [
'{prefix}.{lang}.txt'.format(prefix=slice_file, lang=language)
for language in languages
]
add_dep(lines, 'tokenize_twitter', slice_file, language_outputs,
params={'prefix': slice_file},
extra='wordfreq_builder/tokenizers.py')
for language in languages:
combined_output = wordlist_filename('twitter', language, 'tokens.txt')
language_inputs = [
'{prefix}.{lang}.txt'.format(
prefix=slice_files[slicenum], lang=language
)
for slicenum in range(slices)
]
add_dep(lines, 'cat', language_inputs, combined_output)
count_file = wordlist_filename('twitter', language, 'counts.txt')
if language == 'ja':
mecab_token_file = wordlist_filename(
'twitter', language, 'mecab-tokens.txt')
add_dep(
lines, 'tokenize_japanese', combined_output, mecab_token_file)
combined_output = mecab_token_file
add_dep(lines, 'count', combined_output, count_file,
extra='wordfreq_builder/tokenizers.py')
return lines
def leeds_deps(dirname_in, languages):
lines = []
for language in languages:
input_file = '{prefix}/internet-{lang}-forms.num'.format(
prefix=dirname_in, lang=language
)
if language == 'zh':
step2_file = wordlist_filename('leeds', 'zh-Hans', 'converted.txt')
add_dep(lines, 'simplify_chinese', input_file, step2_file)
else:
step2_file = input_file
reformatted_file = wordlist_filename('leeds', language, 'counts.txt')
add_dep(lines, 'convert_leeds', step2_file, reformatted_file)
return lines
def opensubtitles_deps(dirname_in, languages):
lines = []
for language in languages:
input_file = '{prefix}/{lang}.txt'.format(
prefix=dirname_in, lang=language
)
if language == 'zh':
step2_file = wordlist_filename('opensubtitles', 'zh-Hans', 'converted.txt')
add_dep(lines, 'simplify_chinese', input_file, step2_file)
else:
step2_file = input_file
reformatted_file = wordlist_filename(
'opensubtitles', language, 'counts.txt'
)
add_dep(lines, 'convert_opensubtitles', step2_file, reformatted_file)
return lines
def jieba_deps(dirname_in, languages):
lines = []
# Because there's Chinese-specific handling here, the valid options for
# 'languages' are [] and ['zh']. Make sure it's one of those.
if not languages:
return lines
assert languages == ['zh']
input_file = '{prefix}/dict.txt.big'.format(prefix=dirname_in)
transformed_file = wordlist_filename(
'jieba', 'zh-Hans', 'converted.txt'
)
reformatted_file = wordlist_filename(
'jieba', 'zh', 'counts.txt'
)
add_dep(lines, 'simplify_chinese', input_file, transformed_file)
add_dep(lines, 'convert_jieba', transformed_file, reformatted_file)
return lines
def reddit_deps(dirname_in, languages):
lines = []
path_in = pathlib.Path(dirname_in)
slices = {}
counts_by_language = defaultdict(list)
# Extract text from the Reddit comment dumps, and write them to
# .txt.gz files
for filepath in path_in.glob('*/*.bz2'):
base = filepath.stem
transformed_file = wordlist_filename('reddit', base + '.all', 'txt')
slices[base] = transformed_file
add_dep(lines, 'extract_reddit', str(filepath), transformed_file)
for base in sorted(slices):
transformed_file = slices[base]
language_outputs = []
for language in languages:
filename = wordlist_filename('reddit', base + '.' + language, 'txt')
language_outputs.append(filename)
count_filename = wordlist_filename('reddit', base + '.' + language, 'counts.txt')
add_dep(lines, 'count', filename, count_filename)
counts_by_language[language].append(count_filename)
# find the prefix by constructing a filename, then stripping off
# '.xx.txt' from the end
prefix = wordlist_filename('reddit', base + '.xx', 'txt')[:-7]
add_dep(lines, 'tokenize_reddit', transformed_file, language_outputs,
params={'prefix': prefix},
extra='wordfreq_builder/tokenizers.py')
for language in languages:
output_file = wordlist_filename('reddit', language, 'counts.txt')
add_dep(
lines, 'merge_counts', counts_by_language[language], output_file,
params={'cutoff': 3}
)
return lines
# Which columns of the SUBTLEX data files do the word and its frequency appear
# in?
SUBTLEX_COLUMN_MAP = {
'de': (1, 3),
'el': (2, 3),
'en': (1, 2),
'nl': (1, 2),
'zh': (1, 5)
}
def subtlex_en_deps(dirname_in, languages):
lines = []
# Either subtlex_en is turned off, or it's just in English
if not languages:
return lines
assert languages == ['en']
regions = ['en-US', 'en-GB']
processed_files = []
for region in regions:
input_file = '{prefix}/subtlex.{region}.txt'.format(
prefix=dirname_in, region=region
)
textcol, freqcol = SUBTLEX_COLUMN_MAP['en']
processed_file = wordlist_filename('subtlex-en', region, 'processed.txt')
processed_files.append(processed_file)
add_dep(
lines, 'convert_subtlex', input_file, processed_file,
params={'textcol': textcol, 'freqcol': freqcol, 'startrow': 2}
)
output_file = wordlist_filename('subtlex-en', 'en', 'counts.txt')
add_dep(
lines, 'merge_counts', processed_files, output_file,
params={'cutoff': 0}
)
return lines
def subtlex_other_deps(dirname_in, languages):
lines = []
for language in languages:
input_file = '{prefix}/subtlex.{lang}.txt'.format(
prefix=dirname_in, lang=language
)
processed_file = wordlist_filename('subtlex-other', language, 'processed.txt')
output_file = wordlist_filename('subtlex-other', language, 'counts.txt')
textcol, freqcol = SUBTLEX_COLUMN_MAP[language]
if language == 'zh':
step2_file = wordlist_filename('subtlex-other', 'zh-Hans', 'converted.txt')
add_dep(lines, 'simplify_chinese', input_file, step2_file)
else:
step2_file = input_file
# Skip one header line by setting 'startrow' to 2 (because tail is 1-based).
# I hope we don't need to configure this by language anymore.
add_dep(
lines, 'convert_subtlex', step2_file, processed_file,
params={'textcol': textcol, 'freqcol': freqcol, 'startrow': 2}
)
add_dep(
lines, 'merge_counts', processed_file, output_file,
params={'cutoff': 0}
)
return lines
def combine_lists(languages):
lines = []
for language in languages:
sources = source_names(language)
input_files = [
wordlist_filename(source, language, 'counts.txt')
for source in sources
]
output_file = wordlist_filename('combined', language)
add_dep(lines, 'merge', input_files, output_file,
extra='wordfreq_builder/word_counts.py',
params={'cutoff': 2, 'lang': language})
output_cBpack = wordlist_filename(
'combined-dist', language, 'msgpack.gz'
)
output_cBpack_big = wordlist_filename(
'combined-dist-large', language, 'msgpack.gz'
)
add_dep(lines, 'freqs2cB', output_file, output_cBpack,
extra='wordfreq_builder/word_counts.py',
params={'lang': language, 'buckets': 600})
add_dep(lines, 'freqs2cB', output_file, output_cBpack_big,
extra='wordfreq_builder/word_counts.py',
params={'lang': language, 'buckets': 800})
lines.append('default {}'.format(output_cBpack))
if language in CONFIG['big-lists']:
lines.append('default {}'.format(output_cBpack_big))
# Write standalone lists for Twitter frequency
if language in CONFIG['sources']['twitter']:
input_file = wordlist_filename('twitter', language, 'counts.txt')
output_cBpack = wordlist_filename(
'twitter-dist', language, 'msgpack.gz')
add_dep(lines, 'freqs2cB', input_file, output_cBpack,
extra='wordfreq_builder/word_counts.py',
params={'lang': language, 'buckets': 600})
lines.append('default {}'.format(output_cBpack))
# Write a Jieba-compatible frequency file for Chinese tokenization
chinese_combined = wordlist_filename('combined', 'zh')
jieba_output = wordlist_filename('jieba-dist', 'zh')
add_dep(lines, 'counts_to_jieba', chinese_combined, jieba_output,
extra=['wordfreq_builder/word_counts.py', 'wordfreq_builder/cli/counts_to_jieba.py'])
lines.append('default {}'.format(jieba_output))
return lines
def main():
make_ninja_deps('rules.ninja')
if __name__ == '__main__':
main()


@ -1,132 +0,0 @@
from wordfreq import tokenize
from ftfy.fixes import unescape_html
import regex
import pycld2
import langcodes
CLD2_BAD_CHAR_RANGE = "[%s]" % "".join(
[
'\x00-\x08',
'\x0b',
'\x0e-\x1f',
'\x7f-\x9f',
'\ud800-\udfff',
'\ufdd0-\ufdef',
'\N{HANGUL FILLER}',
'\N{HANGUL CHOSEONG FILLER}',
'\N{HANGUL JUNGSEONG FILLER}',
'<>'
] +
[chr(65534+65536*x+y) for x in range(17) for y in range(2)]
)
CLD2_BAD_CHARS_RE = regex.compile(CLD2_BAD_CHAR_RANGE)
TWITTER_HANDLE_RE = regex.compile(r'@[\S--\p{punct}]+')
TCO_RE = regex.compile('http(?:s)?://t.co/[a-zA-Z0-9]+')
URL_RE = regex.compile(r'http(?:s)?://[^) ]*')
MARKDOWN_URL_RESIDUE_RE = regex.compile(r'\]\(\)')
# Low-frequency languages tend to be detected incorrectly by cld2. The
# following list contains languages that appear in our data with reasonable
# frequency and seem to usually be detected *correctly*. These are the
# languages we'll keep in the Reddit and Twitter results.
#
# This list is larger than the list that wordfreq ultimately generates, so we
# can look here as a source of future data.
KEEP_THESE_LANGUAGES = {
'af', 'ar', 'bs', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es', 'et', 'fi',
'fr', 'gl', 'he', 'hi', 'hr', 'hu', 'id', 'is', 'it', 'ja', 'ko', 'lv',
'ms', 'nl', 'nn', 'no', 'pl', 'pt', 'ro', 'ru', 'sr', 'sv', 'sw', 'tl',
'tr', 'uk', 'vi'
}
# Semi-frequent languages that are excluded by the above:
#
# - Chinese, not because it's detected incorrectly, but because we can't
# handle it until we already have word frequencies
# - Thai (seems to be detected whenever someone uses Thai characters in
# an emoticon)
# - Welsh (which is detected for "ohmygodohmygodohmygod")
# - Turkmen (detected for ASCII art)
# - Irish Gaelic (detected for Cthulhu-related text)
# - Kannada (looks of disapproval)
# - Lao, Tamil, Xhosa, Slovak (various emoticons and Internet memes)
# - Breton (the word "memes" itself)
def cld2_surface_tokenizer(text, mode='twitter'):
"""
Uses CLD2 to detect the language and wordfreq tokenizer to create tokens.
The `mode` can be 'twitter' or 'reddit', which slightly changes the
pre-processing of the text.
"""
text = unescape_html(text)
if mode == 'twitter':
text = TWITTER_HANDLE_RE.sub('', text)
text = TCO_RE.sub('', text)
elif mode == 'reddit':
text = URL_RE.sub('', text)
text = MARKDOWN_URL_RESIDUE_RE.sub(']', text)
lang = cld2_detect_language(text)
# If the detected language isn't in our pretty generous list of languages,
# return no tokens.
if lang not in KEEP_THESE_LANGUAGES:
return 'xx', []
# cld2's accuracy seems to improve dramatically with at least 50
# bytes of input, so throw away non-English below this length.
if len(text.encode('utf-8')) < 50 and lang != 'en':
return 'xx', []
tokens = tokenize(text, lang)
return lang, tokens
def cld2_detect_language(text):
"""
Uses CLD2 to detect the language.
"""
# Format of pycld2.detect:
# (Confident in result: bool,
# Number of bytes of text: Int,
# Triples of detected languages in order of certainty:
# (Language name: str,
# Language code: str
# Percent of text in this language: float
# Confidence score: float))
text = CLD2_BAD_CHARS_RE.sub('', text)
lang = pycld2.detect(text)[2][0][1]
# Normalize the language code: 'iw' becomes 'he', and 'zh-Hant'
# becomes 'zh'
code = langcodes.get(lang).language
return code
def tokenize_by_language(in_filename, out_prefix, tokenizer):
"""
Process a file by running it through a given tokenizer.
Produces output files that are separated by language, with spaces
between the tokens.
"""
out_files = {
language: open('%s.%s.txt' % (out_prefix, language), 'w', encoding='utf-8')
for language in KEEP_THESE_LANGUAGES
}
with open(in_filename, encoding='utf-8') as in_file:
for line in in_file:
text = line.split('\t')[-1].strip()
language, tokens = tokenizer(text)
if language in KEEP_THESE_LANGUAGES:
out_file = out_files[language]
tokenized = ' '.join(tokens)
print(tokenized, file=out_file)
for out_file in out_files.values():
out_file.close()


@ -1,289 +0,0 @@
from wordfreq import simple_tokenize, tokenize
from collections import defaultdict
from operator import itemgetter
from ftfy import fix_text
import statistics
import math
import csv
import msgpack
import gzip
import unicodedata
import regex
# Match common cases of URLs: the schema http:// or https:// followed by
# non-whitespace characters.
URL_RE = regex.compile(r'https?://(?:\S)+')
HAN_RE = regex.compile(r'[\p{Script=Han}]+')
def count_tokens(filename):
"""
Count tokens that appear in a file, running each line through our
simple tokenizer.
URLs will be skipped, and Unicode errors will become separate tokens
containing '�'.
"""
counts = defaultdict(int)
if filename.endswith('gz'):
infile = gzip.open(filename, 'rt', encoding='utf-8', errors='replace')
else:
infile = open(filename, encoding='utf-8', errors='replace')
for line in infile:
line = URL_RE.sub('', line.strip())
for token in simple_tokenize(line):
counts[token] += 1
infile.close()
return counts
def count_tokens_langtagged(filename, lang):
"""
Count tokens that appear in an already language-tagged file, in which each
line begins with a language code followed by a tab.
"""
counts = defaultdict(int)
if filename.endswith('gz'):
infile = gzip.open(filename, 'rt', encoding='utf-8', errors='replace')
else:
infile = open(filename, encoding='utf-8', errors='replace')
for line in infile:
if '\t' not in line:
continue
line_lang, text = line.split('\t', 1)
if line_lang == lang:
tokens = tokenize(text.strip(), lang)
for token in tokens:
counts[token] += 1
infile.close()
return counts
def read_values(filename, cutoff=0, max_words=1e8, lang=None):
"""
Read words and their frequency or count values from a CSV file. Returns
a dictionary of values and the total of all values.
Only words with a value greater than or equal to `cutoff` are returned.
In addition, only up to `max_words` words are read.
If `cutoff` is greater than 0 or `max_words` is smaller than the list,
the csv file must be sorted by value in descending order, so that the
most frequent words are kept.
If `lang` is given, it will apply language-specific tokenization to the
words that it reads.
"""
values = defaultdict(float)
total = 0.
with open(filename, encoding='utf-8', newline='') as infile:
for key, strval in csv.reader(infile):
val = float(strval)
key = fix_text(key)
if val < cutoff or len(values) >= max_words:
break
tokens = tokenize(key, lang) if lang is not None else simple_tokenize(key)
for token in tokens:
# Use += so that, if we give the reader concatenated files with
# duplicates, it does the right thing
values[token] += val
total += val
return values, total
def read_freqs(filename, cutoff=0, lang=None):
"""
Read words and their frequencies from a CSV file, normalizing the
frequencies to add up to 1.
Only words with a frequency greater than or equal to `cutoff` are returned.
If `cutoff` is greater than 0, the csv file must be sorted by frequency
in descending order.
If lang is given, read_freqs will apply language specific preprocessing
operations.
"""
values, total = read_values(filename, cutoff, lang=lang)
for word in values:
values[word] /= total
if lang == 'en':
values = correct_apostrophe_trimming(values)
return values
def freqs_to_cBpack(in_filename, out_filename, cutoff=-600):
"""
Convert a csv file of words and their frequencies to a file in the
idiosyncratic 'cBpack' format.
Only words with a frequency greater than `cutoff` centibels will be
written to the new file.
This cutoff should not be stacked with a cutoff in `read_freqs`; doing
so would skew the resulting frequencies.
"""
freqs = read_freqs(in_filename, cutoff=0, lang=None)
cBpack = []
for token, freq in freqs.items():
cB = round(math.log10(freq) * 100)
if cB <= cutoff:
continue
neg_cB = -cB
while neg_cB >= len(cBpack):
cBpack.append([])
cBpack[neg_cB].append(token)
for sublist in cBpack:
sublist.sort()
# Write a "header" consisting of a dictionary at the start of the file
cBpack_data = [{'format': 'cB', 'version': 1}] + cBpack
with gzip.open(out_filename, 'wb') as outfile:
msgpack.dump(cBpack_data, outfile)
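# Illustrative sketch only, not part of the original module: one way to read a
# cBpack file back. The file is a gzipped msgpack list whose first element is
# a header dictionary; list element N+1 holds the words whose frequency rounds
# to -N centibels. This reuses the gzip and msgpack imports at the top of this
# module, and 'raw=False' assumes msgpack-python >= 0.5.
def read_cBpack(filename):
    with gzip.open(filename, 'rb') as infile:
        data = msgpack.load(infile, raw=False)
    header, buckets = data[0], data[1:]
    assert header == {'format': 'cB', 'version': 1}
    freqs = {}
    for neg_cB, words in enumerate(buckets):
        for word in words:
            freqs[word] = 10 ** (-neg_cB / 100)
    return freqs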
def merge_counts(count_dicts):
"""
Merge multiple dictionaries of counts by adding their entries.
"""
merged = defaultdict(int)
for count_dict in count_dicts:
for term, count in count_dict.items():
merged[term] += count
return merged
def merge_freqs(freq_dicts):
"""
Merge multiple dictionaries of frequencies, representing each word with
the median of the word's frequency over all sources.
"""
vocab = set()
for freq_dict in freq_dicts:
vocab.update(freq_dict)
merged = defaultdict(float)
N = len(freq_dicts)
for term in vocab:
freqs = []
missing_values = 0
for freq_dict in freq_dicts:
freq = freq_dict.get(term, 0.)
if freq < 1e-8:
# Usually we trust the median of the wordlists, but when at
# least 2 wordlists say a word exists and the rest say it
# doesn't, we kind of want to listen to the two that have
# information about the word. The word might be a word that's
# inconsistently accounted for, such as an emoji or a word
# containing an apostrophe.
#
# So, once we see at least 2 values that are very low or
# missing, we ignore further low values in the median. A word
# that appears in 2 sources gets a reasonable frequency, while
# a word that appears in 1 source still gets dropped.
missing_values += 1
if missing_values > 2:
continue
freqs.append(0.)
else:
freqs.append(freq)
if freqs:
median = statistics.median(freqs)
if median > 0.:
merged[term] = median
total = sum(merged.values())
# Normalize the merged values so that they add up to 0.99 (based on
# a rough estimate that 1% of tokens will be out-of-vocabulary in a
# wordlist of this size).
for term in merged:
merged[term] = merged[term] / total * 0.99
return merged
def write_wordlist(freqs, filename, cutoff=1e-8):
"""
Write a dictionary of either raw counts or frequencies to a file of
comma-separated values.
Keep the CSV format simple by explicitly skipping words containing
commas or quotation marks. We don't believe we want those in our tokens
anyway.
"""
with open(filename, 'w', encoding='utf-8', newline='\n') as outfile:
writer = csv.writer(outfile)
items = sorted(freqs.items(), key=itemgetter(1), reverse=True)
for word, freq in items:
if freq < cutoff:
break
if not ('"' in word or ',' in word):
writer.writerow([word, str(freq)])
def write_jieba(freqs, filename):
"""
Write a dictionary of frequencies in a format that can be used for Jieba
tokenization of Chinese.
"""
with open(filename, 'w', encoding='utf-8', newline='\n') as outfile:
items = sorted(freqs.items(), key=lambda item: (-item[1], item[0]))
for word, freq in items:
if HAN_RE.search(word):
# Only store this word as a token if it contains at least one
# Han character.
fake_count = round(freq * 1e9)
print('%s %d' % (word, fake_count), file=outfile)
# APOSTROPHE_TRIMMED_PROB represents the probability that this word has had
# "'t" removed from it, based on counts from Twitter, for which we have
# accurate token counts because we use our own tokenizer.
APOSTROPHE_TRIMMED_PROB = {
'don': 0.99,
'didn': 1.,
'can': 0.35,
'won': 0.74,
'isn': 1.,
'wasn': 1.,
'wouldn': 1.,
'doesn': 1.,
'couldn': 1.,
'ain': 0.99,
'aren': 1.,
'shouldn': 1.,
'haven': 0.96,
'weren': 1.,
'hadn': 1.,
'hasn': 1.,
'mustn': 1.,
'needn': 1.,
}
def correct_apostrophe_trimming(freqs):
"""
If what we got was an English wordlist that has been tokenized with
apostrophes as token boundaries, as indicated by the frequencies of the
words "wouldn" and "couldn", then correct the spurious tokens we get by
adding "'t" in about the proportion we expect to see in the wordlist.
We could also adjust the frequency of "t", but then we would be favoring
the token "s" over it, as "'s" leaves behind no indication when it's been
removed.
"""
if (freqs.get('wouldn', 0) > 1e-6 and freqs.get('couldn', 0) > 1e-6):
for trim_word, trim_prob in APOSTROPHE_TRIMMED_PROB.items():
if trim_word in freqs:
freq = freqs[trim_word]
freqs[trim_word] = freq * (1 - trim_prob)
freqs[trim_word + "'t"] = freq * trim_prob
return freqs