Merge pull request #34 from LuminosoInsight/big-list

wordfreq 1.4: some bigger wordlists, better use of language detection

Former-commit-id: e7b34fb655
This commit is contained in:
Andrew Lin 2016-05-11 16:27:51 -04:00
commit 7a55e0ed86
52 changed files with 291 additions and 122 deletions

README.md
View File

@@ -39,11 +39,18 @@ For example:
## Usage
wordfreq provides access to estimates of the frequency with which a word is
used, in 18 languages (see *Supported languages* below). It loads
efficiently-packed data structures that contain all words that appear at least
once per million words.
used, in 18 languages (see *Supported languages* below).
The most useful function is:
It provides three kinds of pre-built wordlists:
- `'combined'` lists, containing words that appear at least once per
million words, averaged across all data sources.
- `'twitter'` lists, containing words that appear at least once per
million words on Twitter alone.
- `'large'` lists, containing words that appear at least once per 100
million words, averaged across all data sources.
The most straightforward function is:
word_frequency(word, lang, wordlist='combined', minimum=0.0)
@@ -64,7 +71,37 @@ frequencies by a million (1e6) to get more readable numbers:
>>> word_frequency('café', 'fr') * 1e6
77.62471166286912
The parameters are:
`zipf_frequency` is a variation on `word_frequency` that aims to return the
word frequency on a human-friendly logarithmic scale. The Zipf scale was
proposed by Marc Brysbaert, who created the SUBTLEX lists. The Zipf frequency
of a word is the base-10 logarithm of the number of times it appears per
billion words. A word with Zipf value 6 appears once per thousand words, for
example, and a word with Zipf value 3 appears once per million words.
Reasonable Zipf values are between 0 and 8, but because of the cutoffs
described above, the minimum Zipf value appearing in these lists is 1.0 for the
'large' wordlists and 3.0 for all others. We use 0 as the default Zipf value
for words that do not appear in the given wordlist, even though a value of 0
would nominally mean one occurrence per billion words.
>>> zipf_frequency('the', 'en')
7.59
>>> zipf_frequency('word', 'en')
5.34
>>> zipf_frequency('frequency', 'en')
4.44
>>> zipf_frequency('zipf', 'en')
0.0
>>> zipf_frequency('zipf', 'en', wordlist='large')
1.42
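The two scales are directly related: a Zipf value is the base-10 logarithm of
the frequency per billion words, rounded to two decimal places to match
wordfreq's internal precision. As a sketch, you can recover it from
`word_frequency` yourself:
>>> import math
>>> round(math.log10(word_frequency('word', 'en') * 1e9), 2)
5.34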
The parameters to `word_frequency` and `zipf_frequency` are:
- `word`: a Unicode string containing the word to look up. Ideally the word
is a single token according to our tokenizer, but if not, there is still
@@ -73,21 +110,18 @@ The parameters are:
- `lang`: the BCP 47 or ISO 639 code of the language to use, such as 'en'.
- `wordlist`: which set of word frequencies to use. Current options are
'combined', which combines up to five different sources, and
'twitter', which returns frequencies observed on Twitter alone.
'combined', 'twitter', and 'large'.
- `minimum`: If the word is not in the list or has a frequency lower than
`minimum`, return `minimum` instead. In some applications, you'll want
to set `minimum=1e-6` to avoid a discontinuity where the list ends, because
a frequency of 1e-6 (1 per million) is the threshold for being included in
the list at all.
`minimum`, return `minimum` instead. You may want to set this to the minimum
value contained in the wordlist, to avoid a discontinuity where the wordlist
ends.
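For example, the cutoffs described above are once per million words (1e-6) for
the 'combined' and 'twitter' lists and once per 100 million words (1e-8) for
the 'large' lists, so matching values of `minimum` would look like this (a
sketch):
word_frequency('owl', 'en', minimum=1e-6)
word_frequency('owl', 'en', wordlist='large', minimum=1e-8)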
Other functions:
`tokenize(text, lang)` splits text in the given language into words, in the same
way that the words in wordfreq's data were counted in the first place. See
*Tokenization*. Tokenizing Japanese requires the optional dependency `mecab-python3`
to be installed.
*Tokenization*.
`top_n_list(lang, n, wordlist='combined')` returns the most common *n* words in
the list, in descending frequency order.
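For example (a sketch; the exact words and their order depend on the packaged
data, so outputs aren't shown):
top_n_list('en', 100)                    # the 100 most common English words
top_n_list('en', 100, wordlist='large')  # the same, from the larger list
tokenize('New York is a place', 'en')    # the tokens wordfreq would count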
@@ -133,6 +167,7 @@ The sources (and the abbreviations we'll use for them) are:
- **OpenSub**: Data derived from OpenSubtitles but not from SUBTLEX
- **Twitter**: Messages sampled from Twitter's public stream
- **Wpedia**: The full text of Wikipedia in 2015
- **Reddit**: The corpus of Reddit comments through May 2015
- **Other**: We get additional English frequencies from Google Books Syntactic
Ngrams 2013, and Chinese frequencies from the frequency dictionary that
comes with the Jieba tokenizer.
@@ -140,33 +175,37 @@ The sources (and the abbreviations we'll use for them) are:
The following 17 languages are well-supported, with reasonable tokenization and
at least 3 different sources of word frequencies:
Language Code SUBTLEX OpenSub LeedsIC Twitter Wpedia Other
Language Code SUBTLEX OpenSub LeedsIC Twitter Wpedia Reddit Other
──────────────────┼─────────────────────────────────────────────────────
Arabic ar │ - Yes Yes Yes Yes -
German de │ Yes - Yes Yes[1] Yes -
Greek el │ - Yes Yes Yes Yes -
English en │ Yes Yes Yes Yes Yes Google Books
Spanish es │ - Yes Yes Yes Yes -
French fr │ - Yes Yes Yes Yes -
Indonesian id │ - Yes - Yes Yes -
Italian it │ - Yes Yes Yes Yes -
Japanese ja │ - - Yes Yes Yes -
Malay ms │ - Yes - Yes Yes -
Dutch nl │ Yes Yes - Yes Yes -
Polish pl │ - Yes - Yes Yes -
Portuguese pt │ - Yes Yes Yes Yes -
Russian ru │ - Yes Yes Yes Yes -
Swedish sv │ - Yes - Yes Yes -
Turkish tr │ - Yes - Yes Yes -
Chinese zh │ Yes - Yes - - Jieba
Arabic ar │ - Yes Yes Yes Yes - -
German de │ Yes - Yes Yes[1] Yes - -
Greek el │ - Yes Yes Yes Yes - -
English en │ Yes Yes Yes Yes Yes Yes Google Books
Spanish es │ - Yes Yes Yes Yes - -
French fr │ - Yes Yes Yes Yes - -
Indonesian id │ - Yes - Yes Yes - -
Italian it │ - Yes Yes Yes Yes - -
Japanese ja │ - - Yes Yes Yes - -
Malay ms │ - Yes - Yes Yes - -
Dutch nl │ Yes Yes - Yes Yes - -
Polish pl │ - Yes - Yes Yes - -
Portuguese pt │ - Yes Yes Yes Yes - -
Russian ru │ - Yes Yes Yes Yes - -
Swedish sv │ - Yes - Yes Yes - -
Turkish tr │ - Yes - Yes Yes - -
Chinese zh │ Yes - Yes - - - Jieba
Additionally, Korean is marginally supported. You can look up frequencies in
it, but we have too few data sources for it so far:
it, but it will be insufficiently tokenized into words, and we have too few
data sources for it so far:
Language Code SUBTLEX OpenSub LeedsIC Twitter Wpedia
──────────────────┼───────────────────────────────────────
Korean ko │ - - - Yes Yes
Language Code SUBTLEX OpenSub LeedsIC Twitter Wpedia Reddit
──────────────────┼───────────────────────────────────────────────
Korean ko │ - - - Yes Yes -
The 'large' wordlists are available in English, German, Spanish, French, and
Portuguese.
[1] We've counted the frequencies from tweets in German, such as they are, but
you should be aware that German is not a frequently-used language on Twitter.
@@ -179,7 +218,8 @@ wordfreq uses the Python package `regex`, which is a more advanced
implementation of regular expressions than the standard library, to
separate text into tokens that can be counted consistently. `regex`
produces tokens that follow the recommendations in [Unicode
Annex #29, Text Segmentation][uax29].
Annex #29, Text Segmentation][uax29], including the optional rule that
splits words between apostrophes and vowels.
There are language-specific exceptions:
@@ -199,10 +239,10 @@ Because tokenization in the real world is far from consistent, wordfreq will
also try to deal gracefully when you query it with texts that actually break
into multiple tokens:
>>> word_frequency('New York', 'en')
0.0002315934248950231
>>> word_frequency('北京地铁', 'zh') # "Beijing Subway"
3.2187603965715087e-06
>>> zipf_frequency('New York', 'en')
5.31
>>> zipf_frequency('北京地铁', 'zh') # "Beijing Subway"
3.51
The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be. In Chinese,
@@ -216,8 +256,8 @@ frequencies, because that would assume they are statistically unrelated. So if
you give it an uncommon combination of tokens, it will hugely over-estimate
their frequency:
>>> word_frequency('owl-flavored', 'en')
1.3557098723512335e-06
>>> zipf_frequency('owl-flavored', 'en')
3.18
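The arithmetic behind both behaviors is easy to see. For two token frequencies
f1 and f2, half of the harmonic mean is (f1 * f2) / (f1 + f2), which stays
within a factor of two of the rarer token's frequency rather than dropping to
the product f1 * f2 that statistical independence would predict. A sketch
(wordfreq's internal implementation may differ in detail):
def half_harmonic_mean(f1, f2):
    # Half of the harmonic mean 2 * f1 * f2 / (f1 + f2).
    return (f1 * f2) / (f1 + f2)
# Two tokens that each occur once per million words combine to an estimated
# once per two million words (5e-07), not the once-per-trillion (1e-12) that
# independence would give:
half_harmonic_mean(1e-6, 1e-6)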
## License

View File

@@ -34,7 +34,7 @@ if sys.version_info < (3, 4):
setup(
name="wordfreq",
version='1.3',
version='1.4',
maintainer='Luminoso Technologies, Inc.',
maintainer_email='info@luminoso.com',
url='http://github.com/LuminosoInsight/wordfreq/',

View File

@@ -8,6 +8,7 @@ import itertools
import pathlib
import random
import logging
import math
logger = logging.getLogger(__name__)
@@ -146,6 +147,42 @@ def cB_to_freq(cB):
return 10 ** (cB / 100)
def cB_to_zipf(cB):
"""
Convert a word frequency from centibels to the Zipf scale
(see `zipf_to_freq`).
The Zipf scale is related to centibels, the logarithmic unit that wordfreq
uses internally, because the Zipf unit is simply the bel, with a different
zero point. To convert centibels to Zipf, add 900 and divide by 100.
"""
return (cB + 900) / 100
def zipf_to_freq(zipf):
"""
Convert a word frequency from the Zipf scale to a proportion between 0 and
1.
The Zipf scale is a logarithmic frequency scale proposed by Marc Brysbaert,
who compiled the SUBTLEX data. The goal of the Zipf scale is to map
reasonable word frequencies to understandable, small positive numbers.
A word rates as x on the Zipf scale when it occurs 10**x times per billion
words. For example, a word that occurs once per million words is at 3.0 on
the Zipf scale.
"""
return 10 ** zipf / 1e9
def freq_to_zipf(freq):
"""
Convert a word frequency from a proportion between 0 and 1 to the
Zipf scale (see `zipf_to_freq`).
"""
return math.log(freq, 10) + 9
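# A quick worked example of how these scales line up (the values follow
# directly from the formulas above): a frequency of 1e-6, i.e. once per
# million words, is -600 centibels and 3.0 on the Zipf scale.
#
#     cB_to_freq(-600)    -> 1e-06
#     cB_to_zipf(-600)    -> 3.0
#     zipf_to_freq(3.0)   -> 1e-06
#     freq_to_zipf(1e-6)  -> 3.0   (up to floating-point rounding)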
@lru_cache(maxsize=None)
def get_frequency_dict(lang, wordlist='combined', match_cutoff=30):
"""
@@ -202,6 +239,7 @@ def _word_frequency(word, lang, wordlist, minimum):
return max(freq, minimum)
def word_frequency(word, lang, wordlist='combined', minimum=0.):
"""
Get the frequency of `word` in the language with code `lang`, from the
@@ -240,6 +278,33 @@ def word_frequency(word, lang, wordlist='combined', minimum=0.):
return _wf_cache[args]
def zipf_frequency(word, lang, wordlist='combined', minimum=0.):
"""
Get the frequency of `word`, in the language with code `lang`, on the Zipf
scale.
The Zipf scale is a logarithmic frequency scale proposed by Marc Brysbaert,
who compiled the SUBTLEX data. The goal of the Zipf scale is to map
reasonable word frequencies to understandable, small positive numbers.
A word rates as x on the Zipf scale when it occurs 10**x times per billion
words. For example, a word that occurs once per million words is at 3.0 on
the Zipf scale.
Zipf values for reasonable words are between 0 and 8. The value this
function returns will always be at least as large as `minimum`, even for a
word that never appears. The default minimum is 0, representing words
that appear once per billion words or less.
wordfreq internally quantizes its frequencies to centibels, which are
1/100 of a Zipf unit. The output of `zipf_frequency` will be rounded to
the nearest hundredth to match this quantization.
"""
freq_min = zipf_to_freq(minimum)
freq = word_frequency(word, lang, wordlist, freq_min)
return round(freq_to_zipf(freq), 2)
@lru_cache(maxsize=100)
def top_n_list(lang, n, wordlist='combined', ascii_only=False):
"""

40 changed binary files are not shown.

View File

@@ -46,6 +46,9 @@ rule simplify_chinese
rule tokenize_twitter
command = mkdir -p $$(dirname $prefix) && python -m wordfreq_builder.cli.tokenize_twitter $in $prefix
rule tokenize_reddit
command = mkdir -p $$(dirname $prefix) && python -m wordfreq_builder.cli.tokenize_reddit $in $prefix
# To convert the Leeds corpus, look for space-separated lines that start with
# an integer and a decimal. The integer is the rank, which we discard. The
# decimal is the frequency, and the remaining text is the term. Use sed -n
@@ -95,10 +98,10 @@ rule merge_counts
command = python -m wordfreq_builder.cli.merge_counts -o $out -c $cutoff $in
rule freqs2cB
command = python -m wordfreq_builder.cli.freqs_to_cB $in $out
command = python -m wordfreq_builder.cli.freqs_to_cB $in $out -b $buckets
rule cat
command = cat $in > $out
rule extract_reddit
command = bunzip2 -c $in | $JQ -r '.body' | fgrep -v '[deleted]' | sed 's/&gt;/>/g' | sed 's/&lt;/</g' | sed 's/&amp;/\&/g' | gzip -c > $out
command = bunzip2 -c $in | $JQ -r 'select(.score > 0) | .body' | fgrep -v '[deleted]' | sed 's/&gt;/>/g' | sed 's/&lt;/</g' | sed 's/&amp;/\&/g' > $out

View File

@@ -2,12 +2,12 @@ from setuptools import setup
setup(
name="wordfreq_builder",
version='0.1',
version='0.2',
maintainer='Luminoso Technologies, Inc.',
maintainer_email='info@luminoso.com',
url='http://github.com/LuminosoInsight/wordfreq_builder',
platforms=["any"],
description="Turns raw data into word frequency lists",
packages=['wordfreq_builder'],
install_requires=['msgpack-python', 'pycld2']
install_requires=['msgpack-python', 'pycld2', 'langcodes']
)

View File

@@ -6,5 +6,9 @@ if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('filename_in', help='name of input file containing tokens')
parser.add_argument('filename_out', help='name of output file')
parser.add_argument('-b', '--buckets', type=int, default=600,
help='Number of centibel buckets to include (default 600). '
'Increasing this number creates a longer wordlist with '
'rarer words.')
args = parser.parse_args()
freqs_to_cBpack(args.filename_in, args.filename_out)
freqs_to_cBpack(args.filename_in, args.filename_out, cutoff=-(args.buckets))
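# For reference, the bucket count maps directly onto the frequency cutoffs
# described in wordfreq's README: the cutoff passed to freqs_to_cBpack is
# -buckets centibels, and
#     10 ** (-600 / 100)  -> 1e-06  (the 'combined' and 'twitter' threshold)
#     10 ** (-800 / 100)  -> 1e-08  (the 'large' threshold used by the build)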

View File

@@ -2,10 +2,10 @@ from wordfreq_builder.word_counts import read_values, merge_counts, write_wordli
import argparse
def merge_lists(input_names, output_name, cutoff=0):
def merge_lists(input_names, output_name, cutoff=0, max_words=1000000):
count_dicts = []
for input_name in input_names:
values, total = read_values(input_name, cutoff=cutoff, max_size=1000000)
values, total = read_values(input_name, cutoff=cutoff, max_words=max_words)
count_dicts.append(values)
merged = merge_counts(count_dicts)
write_wordlist(merged, output_name)
@@ -17,8 +17,9 @@ if __name__ == '__main__':
help='filename to write the output to')
parser.add_argument('-c', '--cutoff', type=int, default=0,
help='minimum count to read from an input file')
parser.add_argument('-m', '--max-words', type=int, default=1000000,
help='maximum number of words to read from each list')
parser.add_argument('inputs', nargs='+',
help='names of input files to merge')
args = parser.parse_args()
merge_lists(args.inputs, args.output, cutoff=args.cutoff)
merge_lists(args.inputs, args.output, cutoff=args.cutoff, max_words=args.max_words)

View File

@@ -1,13 +1,17 @@
from wordfreq_builder.tokenizers import cld2_reddit_tokenizer, tokenize_by_language
from wordfreq_builder.tokenizers import cld2_surface_tokenizer, tokenize_by_language
import argparse
def reddit_tokenizer(text):
return cld2_surface_tokenizer(text, mode='reddit')
def main():
parser = argparse.ArgumentParser()
parser.add_argument('filename', help='filename of input file containing one comment per line')
parser.add_argument('outprefix', help='prefix of output filenames')
args = parser.parse_args()
tokenize_by_language(args.filename, args.outprefix, tokenizer=cld2_reddit_tokenizer)
tokenize_by_language(args.filename, args.outprefix, tokenizer=reddit_tokenizer)
if __name__ == '__main__':

View File

@@ -41,7 +41,11 @@ CONFIG = {
'subtlex-en': ['en'],
'subtlex-other': ['de', 'nl', 'zh'],
'jieba': ['zh'],
'reddit': ['en'],
# About 99.2% of Reddit is in English. There are pockets of
# conversation in other languages, but we're concerned that they're not
# representative enough for learning general word frequencies.
'reddit': ['en']
},
# Subtlex languages that need to be pre-processed
'wordlist_paths': {
@@ -56,10 +60,12 @@ CONFIG = {
'reddit': 'generated/reddit/reddit_{lang}.{ext}',
'combined': 'generated/combined/combined_{lang}.{ext}',
'combined-dist': 'dist/combined_{lang}.{ext}',
'combined-dist-large': 'dist/large_{lang}.{ext}',
'twitter-dist': 'dist/twitter_{lang}.{ext}',
'jieba-dist': 'dist/jieba_{lang}.{ext}'
},
'min_sources': 2
'min_sources': 2,
'big-lists': ['en', 'fr', 'es', 'pt', 'de']
}

View File

@@ -4,6 +4,8 @@ from wordfreq_builder.config import (
import sys
import pathlib
import itertools
from collections import defaultdict
HEADER = """# This file is automatically generated. Do not edit it.
# You can change its behavior by editing wordfreq_builder/ninja.py,
@@ -155,14 +157,12 @@ def twitter_deps(input_filename, slice_prefix, combined_prefix, slices,
for language in languages:
combined_output = wordlist_filename('twitter', language, 'tokens.txt')
language_inputs = [
'{prefix}.{lang}.txt'.format(
prefix=slice_files[slicenum], lang=language
)
for slicenum in range(slices)
]
add_dep(lines, 'cat', language_inputs, combined_output)
count_file = wordlist_filename('twitter', language, 'counts.txt')
@@ -238,23 +238,40 @@ def jieba_deps(dirname_in, languages):
def reddit_deps(dirname_in, languages):
lines = []
if not languages:
return lines
assert languages == ['en']
processed_files = []
path_in = pathlib.Path(dirname_in)
for filepath in path_in.glob('*/*.bz2'):
base = filepath.name[:-4]
transformed_file = wordlist_filename('reddit', 'en', base + '.txt.gz')
add_dep(lines, 'extract_reddit', str(filepath), transformed_file)
count_file = wordlist_filename('reddit', 'en', base + '.counts.txt')
add_dep(lines, 'count', transformed_file, count_file)
processed_files.append(count_file)
slices = {}
counts_by_language = defaultdict(list)
output_file = wordlist_filename('reddit', 'en', 'counts.txt')
# Extract text from the Reddit comment dumps, and write the results to
# .txt files
for filepath in path_in.glob('*/*.bz2'):
base = filepath.stem
transformed_file = wordlist_filename('reddit', base + '.all', 'txt')
slices[base] = transformed_file
add_dep(lines, 'extract_reddit', str(filepath), transformed_file)
for base in sorted(slices):
transformed_file = slices[base]
language_outputs = []
for language in languages:
filename = wordlist_filename('reddit', base + '.' + language, 'txt')
language_outputs.append(filename)
count_filename = wordlist_filename('reddit', base + '.' + language, 'counts.txt')
add_dep(lines, 'count', filename, count_filename)
counts_by_language[language].append(count_filename)
# find the prefix by constructing a filename, then stripping off
# '.xx.txt' from the end
prefix = wordlist_filename('reddit', base + '.xx', 'txt')[:-7]
add_dep(lines, 'tokenize_reddit', transformed_file, language_outputs,
params={'prefix': prefix},
extra='wordfreq_builder/tokenizers.py')
for language in languages:
output_file = wordlist_filename('reddit', language, 'counts.txt')
add_dep(
lines, 'merge_counts', processed_files, output_file,
lines, 'merge_counts', counts_by_language[language], output_file,
params={'cutoff': 3}
)
return lines
@@ -345,11 +362,19 @@ def combine_lists(languages):
output_cBpack = wordlist_filename(
'combined-dist', language, 'msgpack.gz'
)
output_cBpack_big = wordlist_filename(
'combined-dist-large', language, 'msgpack.gz'
)
add_dep(lines, 'freqs2cB', output_file, output_cBpack,
extra='wordfreq_builder/word_counts.py',
params={'lang': language})
params={'lang': language, 'buckets': 600})
add_dep(lines, 'freqs2cB', output_file, output_cBpack_big,
extra='wordfreq_builder/word_counts.py',
params={'lang': language, 'buckets': 800})
lines.append('default {}'.format(output_cBpack))
if language in CONFIG['big-lists']:
lines.append('default {}'.format(output_cBpack_big))
# Write standalone lists for Twitter frequency
if language in CONFIG['sources']['twitter']:
@@ -358,7 +383,7 @@ def combine_lists(languages):
'twitter-dist', language, 'msgpack.gz')
add_dep(lines, 'freqs2cB', input_file, output_cBpack,
extra='wordfreq_builder/word_counts.py',
params={'lang': language})
params={'lang': language, 'buckets': 600})
lines.append('default {}'.format(output_cBpack))

View File

@@ -2,6 +2,7 @@ from wordfreq import tokenize
from ftfy.fixes import unescape_html
import regex
import pycld2
import langcodes
CLD2_BAD_CHAR_RANGE = "[%s]" % "".join(
[
@@ -26,48 +27,63 @@ URL_RE = regex.compile(r'http(?:s)?://[^) ]*')
MARKDOWN_URL_RESIDUE_RE = regex.compile(r'\]\(\)')
def cld2_surface_tokenizer(text):
"""
Uses CLD2 to detect the language and wordfreq tokenizer to create tokens.
"""
text = unescape_html(text)
text = TWITTER_HANDLE_RE.sub('', text)
text = TCO_RE.sub('', text)
# Low-frequency languages tend to be detected incorrectly by cld2. The
# following are languages that appear in our data with reasonable frequency
# and seem to usually be detected *correctly*. These are
# the languages we'll keep in the Reddit and Twitter results.
#
# This list is larger than the list that wordfreq ultimately generates, so we
# can look here as a source of future data.
lang = cld2_detect_language(text)
# Don't allow tokenization in Chinese when language-detecting, because
# the Chinese tokenizer may not be built yet
if lang == 'zh':
lang = 'en'
tokens = tokenize(text, lang)
return lang, tokens
# Low-frequency languages tend to be detected incorrectly. Keep a limited
# list of languages we're allowed to use here.
KEEP_THESE_LANGUAGES = {
'ar', 'de', 'el', 'en', 'es', 'fr', 'hr', 'id', 'it', 'ja', 'ko', 'ms',
'nl', 'pl', 'pt', 'ro', 'ru', 'sv'
'af', 'ar', 'bs', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es', 'et', 'fi',
'fr', 'gl', 'he', 'hi', 'hr', 'hu', 'id', 'is', 'it', 'ja', 'ko', 'lv',
'ms', 'nl', 'nn', 'no', 'pl', 'pt', 'ro', 'ru', 'sr', 'sv', 'sw', 'tl',
'tr', 'uk', 'vi'
}
# Semi-frequent languages that are excluded by the above:
#
# - Chinese, not because it's detected incorrectly, but because we can't
# handle it until we already have word frequencies
# - Thai (seems to be detected whenever someone uses Thai characters in
# an emoticon)
# - Welsh (which is detected for "ohmygodohmygodohmygod")
# - Turkmen (detected for ASCII art)
# - Irish Gaelic (detected for Cthulhu-related text)
# - Kannada (looks of disapproval)
# - Lao, Tamil, Xhosa, Slovak (various emoticons and Internet memes)
# - Breton (the word "memes" itself)
def cld2_reddit_tokenizer(text):
def cld2_surface_tokenizer(text, mode='twitter'):
"""
A language-detecting tokenizer with special cases for handling text from
Reddit.
Uses CLD2 to detect the language and wordfreq tokenizer to create tokens.
The `mode` can be 'twitter' or 'reddit', which slightly changes the
pre-processing of the text.
"""
text = unescape_html(text)
if mode == 'twitter':
text = TWITTER_HANDLE_RE.sub('', text)
text = TCO_RE.sub('', text)
elif mode == 'reddit':
text = URL_RE.sub('', text)
text = MARKDOWN_URL_RESIDUE_RE.sub(']', text)
lang = cld2_detect_language(text)
if lang not in KEEP_THESE_LANGUAGES:
# Reddit is 99.9% English, so if we detected a rare language, it's
# much more likely that it's actually English.
lang = 'en'
tokens = tokenize(text, lang, include_punctuation=True)
# If the detected language isn't in our pretty generous list of languages,
# return no tokens.
if lang not in KEEP_THESE_LANGUAGES:
return 'xx', []
# cld2's accuracy seems to improve dramatically with at least 50
# bytes of input, so throw away non-English below this length.
if len(text.encode('utf-8')) < 50 and lang != 'en':
return 'xx', []
tokens = tokenize(text, lang)
return lang, tokens
@@ -85,7 +101,12 @@ def cld2_detect_language(text):
# Confidence score: float))
text = CLD2_BAD_CHARS_RE.sub('', text)
return pycld2.detect(text)[2][0][1]
lang = pycld2.detect(text)[2][0][1]
# Normalize the language code: 'iw' becomes 'he', and 'zh-Hant'
# becomes 'zh'
code = langcodes.get(lang).language
return code
def tokenize_by_language(in_filename, out_prefix, tokenizer):
@@ -95,19 +116,17 @@ def tokenize_by_language(in_filename, out_prefix, tokenizer):
Produces output files that are separated by language, with spaces
between the tokens.
"""
out_files = {}
out_files = {
language: open('%s.%s.txt' % (out_prefix, language), 'w', encoding='utf-8')
for language in KEEP_THESE_LANGUAGES
}
with open(in_filename, encoding='utf-8') as in_file:
for line in in_file:
text = line.split('\t')[-1].strip()
language, tokens = tokenizer(text)
if language != 'un':
if language in KEEP_THESE_LANGUAGES:
out_file = out_files[language]
tokenized = ' '.join(tokens)
out_filename = '%s.%s.txt' % (out_prefix, language)
if out_filename in out_files:
out_file = out_files[out_filename]
else:
out_file = open(out_filename, 'w', encoding='utf-8')
out_files[out_filename] = out_file
print(tokenized, file=out_file)
for out_file in out_files.values():
out_file.close()

View File

@@ -36,15 +36,17 @@ def count_tokens(filename):
return counts
def read_values(filename, cutoff=0, max_size=1e8, lang=None):
def read_values(filename, cutoff=0, max_words=1e8, lang=None):
"""
Read words and their frequency or count values from a CSV file. Returns
a dictionary of values and the total of all values.
Only words with a value greater than or equal to `cutoff` are returned.
In addition, only up to `max_words` words are read.
If `cutoff` is greater than 0, the csv file must be sorted by value
in descending order.
If `cutoff` is greater than 0, or `max_words` is smaller than the number of
words in the file, the CSV file must be sorted by value in descending order,
so that the most frequent words are kept.
If `lang` is given, it will apply language-specific tokenization to the
words that it reads.
@@ -55,7 +57,7 @@ def read_values(filename, cutoff=0, max_size=1e8, lang=None):
for key, strval in csv.reader(infile):
val = float(strval)
key = fix_text(key)
if val < cutoff or len(values) >= max_size:
if val < cutoff or len(values) >= max_words:
break
tokens = tokenize(key, lang) if lang is not None else simple_tokenize(key)
for token in tokens: