Mirror of https://github.com/rspeer/wordfreq.git (synced 2024-12-23 17:31:41 +00:00)
Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5. Things done in this update include:
* Include the Common Crawl as a source, and support 11 more languages
* New frequency-merging strategy: use the median across sources rather than the mean
* New sources: Chinese from Wikipedia (mostly Traditional), and a big Dutch list
* Remove low-quality sources, namely Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is dropped as an available language, and Greek is no longer a 'large' language.
* Add Korean tokenization, and include MeCab files in the data
* Remove marks from more languages
* Deal with commas and cedillas in Turkish and Romanian
Former-commit-id: e6a8f028e3
Parent: a0893af82e
Commit: 9758c69ff0
README.md (114 lines changed)
@@ -60,16 +60,16 @@ frequencies by a million (1e6) to get more readable numbers:

     >>> from wordfreq import word_frequency
     >>> word_frequency('cafe', 'en') * 1e6
-    14.45439770745928
+    12.88249551693135

     >>> word_frequency('café', 'en') * 1e6
-    4.7863009232263805
+    3.3884415613920273

     >>> word_frequency('cafe', 'fr') * 1e6
-    2.0417379446695274
+    2.6302679918953817

     >>> word_frequency('café', 'fr') * 1e6
-    77.62471166286912
+    87.09635899560814

 `zipf_frequency` is a variation on `word_frequency` that aims to return the
@@ -85,20 +85,21 @@ described above, the minimum Zipf value appearing in these lists is 1.0 for the
 for words that do not appear in the given wordlist, although it should mean
 one occurrence per billion words.

     >>> from wordfreq import zipf_frequency
     >>> zipf_frequency('the', 'en')
-    7.59
+    7.67

     >>> zipf_frequency('word', 'en')
-    5.34
+    5.39

     >>> zipf_frequency('frequency', 'en')
-    4.44
+    4.19

     >>> zipf_frequency('zipf', 'en')
     0.0

     >>> zipf_frequency('zipf', 'en', wordlist='large')
-    1.42
+    1.65

 The parameters to `word_frequency` and `zipf_frequency` are:
@@ -128,10 +129,10 @@ the list, in descending frequency order.

     >>> from wordfreq import top_n_list
     >>> top_n_list('en', 10)
-    ['the', 'of', 'to', 'in', 'and', 'a', 'i', 'you', 'is', 'it']
+    ['the', 'i', 'to', 'a', 'and', 'of', 'you', 'in', 'that', 'is']

     >>> top_n_list('es', 10)
-    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'no', 'los', 'es']
+    ['de', 'que', 'la', 'y', 'a', 'en', 'el', 'no', 'los', 'es']

 `iter_wordlist(lang, wordlist='combined')` iterates through all the words in a
 wordlist, in descending frequency order.
@@ -168,48 +169,56 @@ The sources (and the abbreviations we'll use for them) are:

 - **Twitter**: Messages sampled from Twitter's public stream
 - **Wpedia**: The full text of Wikipedia in 2015
 - **Reddit**: The corpus of Reddit comments through May 2015
+- **CCrawl**: Text extracted from the Common Crawl and language-detected with cld2
 - **Other**: We get additional English frequencies from Google Books Syntactic
   Ngrams 2013, and Chinese frequencies from the frequency dictionary that
   comes with the Jieba tokenizer.

-The following 17 languages are well-supported, with reasonable tokenization and
-at least 3 different sources of word frequencies:
+The following 27 languages are supported, with reasonable tokenization and at
+least 3 different sources of word frequencies:

-    Language   Code  SUBTLEX OpenSub LeedsIC Twitter Wpedia Reddit Other
-    ──────────────────┼─────────────────────────────────────────────────────
-    Arabic     ar    │ -       Yes     Yes     Yes     Yes    -      -
-    German     de    │ Yes     -       Yes     Yes[1]  Yes    -      -
-    Greek      el    │ -       Yes     Yes     Yes     Yes    -      -
-    English    en    │ Yes     Yes     Yes     Yes     Yes    Yes    Google Books
-    Spanish    es    │ -       Yes     Yes     Yes     Yes    -      -
-    French     fr    │ -       Yes     Yes     Yes     Yes    -      -
-    Indonesian id    │ -       Yes     -       Yes     Yes    -      -
-    Italian    it    │ -       Yes     Yes     Yes     Yes    -      -
-    Japanese   ja    │ -       -       Yes     Yes     Yes    -      -
-    Malay      ms    │ -       Yes     -       Yes     Yes    -      -
-    Dutch      nl    │ Yes     Yes     -       Yes     Yes    -      -
-    Polish     pl    │ -       Yes     -       Yes     Yes    -      -
-    Portuguese pt    │ -       Yes     Yes     Yes     Yes    -      -
-    Russian    ru    │ -       Yes     Yes     Yes     Yes    -      -
-    Swedish    sv    │ -       Yes     -       Yes     Yes    -      -
-    Turkish    tr    │ -       Yes     -       Yes     Yes    -      -
-    Chinese    zh    │ Yes     -       Yes     -       -      -      Jieba
+    Language    Code   Sources Large? │ SUBTLEX OpenSub LeedsIC Twitter Wpedia CCrawl Reddit Other
+    ───────────────────────────────────┼─────────────────────────────────────────────────────────────
+    Arabic      ar     5       Yes    │ -       Yes     Yes     Yes     Yes    Yes    -      -
+    Bulgarian   bg     3       -      │ -       Yes     -       -       Yes    Yes    -      -
+    Catalan     ca     3       -      │ -       Yes     -       Yes     Yes    -      -      -
+    Danish      da     3       -      │ -       Yes     -       -       Yes    Yes    -      -
+    German      de     5       Yes    │ Yes     -       Yes     Yes     Yes    Yes    -      -
+    Greek       el     4       -      │ -       Yes     Yes     -       Yes    Yes    -      -
+    English     en     7       Yes    │ Yes     Yes     Yes     Yes     Yes    -      Yes    Google Books
+    Spanish     es     6       Yes    │ -       Yes     Yes     Yes     Yes    Yes    Yes    -
+    Finnish     fi     3       -      │ -       Yes     -       -       Yes    Yes    -      -
+    French      fr     5       Yes    │ -       Yes     Yes     Yes     Yes    Yes    -      -
+    Hebrew      he     4       -      │ -       Yes     -       Yes     Yes    Yes    -      -
+    Hindi       hi     3       -      │ -       -       -       Yes     Yes    Yes    -      -
+    Hungarian   hu     3       -      │ -       Yes     -       -       Yes    Yes    -      -
+    Indonesian  id     4       -      │ -       Yes     -       Yes     Yes    Yes    -      -
+    Italian     it     5       Yes    │ -       Yes     Yes     Yes     Yes    Yes    -      -
+    Japanese    ja     4       -      │ -       -       Yes     Yes     Yes    Yes    -      -
+    Korean      ko     3       -      │ -       -       -       Yes     Yes    Yes    -      -
+    Malay       ms     4       -      │ -       Yes     -       Yes     Yes    Yes    -      -
+    Norwegian   nb[1]  3       -      │ -       Yes     -       -       Yes    Yes    -      -
+    Dutch       nl     5       Yes    │ Yes     Yes     -       Yes     Yes    Yes    -      -
+    Polish      pl     4       -      │ -       Yes     -       Yes     Yes    Yes    -      -
+    Portuguese  pt     5       Yes    │ -       Yes     Yes     Yes     Yes    Yes    -      -
+    Romanian    ro     3       -      │ -       Yes     -       -       Yes    Yes    -      -
+    Russian     ru     5       Yes    │ -       Yes     Yes     Yes     Yes    Yes    -      -
+    Swedish     sv     4       -      │ -       Yes     -       Yes     Yes    Yes    -      -
+    Turkish     tr     4       -      │ -       Yes     -       Yes     Yes    Yes    -      -
+    Chinese     zh[2]  5       -      │ Yes     -       Yes     -       Yes    Yes    -      Jieba

-Additionally, Korean is marginally supported. You can look up frequencies in
-it, but it will be insufficiently tokenized into words, and we have too few
-data sources for it so far:
-
-    Language   Code  SUBTLEX OpenSub LeedsIC Twitter Wpedia Reddit
-    ──────────────────┼───────────────────────────────────────────────
-    Korean     ko    │ -       -       -       Yes     Yes    -
-
-The 'large' wordlists are available in English, German, Spanish, French, and
-Portuguese.
-
-[1] We've counted the frequencies from tweets in German, such as they are, but
-you should be aware that German is not a frequently-used language on Twitter.
-Germans just don't tweet that much.
+[1] The Norwegian text we have is specifically written in Norwegian Bokmål, so
+we give it the language code 'nb'. We would use 'nn' for Nynorsk, but there
+isn't enough data to include it in wordfreq.
+
+[2] This data represents text written in both Simplified and Traditional
+Chinese. (SUBTLEX is mostly Simplified, while Wikipedia is mostly Traditional.)
+The characters are mapped to one another so they can use the same word
+frequency list.
+
+Some languages provide 'large' wordlists, including words with a Zipf frequency
+between 1.0 and 3.0. These are available in 9 languages that are covered by
+enough data sources.


 ## Tokenization
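As an aside, the set of supported languages can also be inspected at runtime. A small sketch, not from the repository, assuming a wordlist name can be passed the same way the tests later in this diff pass 'twitter':

    from wordfreq import available_languages

    # Languages that have a 'combined' wordlist (the 27 in the table above):
    print(sorted(available_languages()))

    # Languages that have a 'large' wordlist (the ones marked "Yes" under Large?):
    print(sorted(available_languages('large')))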
@@ -223,10 +232,13 @@ splits words between apostrophes and vowels.

 There are language-specific exceptions:

-- In Arabic, it additionally normalizes ligatures and removes combining marks.
-- In Japanese, instead of using the regex library, it uses the external library
-  `mecab-python3`. This is an optional dependency of wordfreq, and compiling
-  it requires the `libmecab-dev` system package to be installed.
+- In Arabic and Hebrew, it additionally normalizes ligatures and removes
+  combining marks.
+
+- In Japanese and Korean, instead of using the regex library, it uses the
+  external library `mecab-python3`. This is an optional dependency of wordfreq,
+  and compiling it requires the `libmecab-dev` system package to be installed.

 - In Chinese, it uses the external Python library `jieba`, another optional
   dependency.
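A quick illustration of the optional-dependency paths described above (this snippet is not from the repository, and its output depends on the installed MeCab dictionary and jieba version):

    from wordfreq import tokenize

    # Uses mecab-python3, an optional dependency, for Japanese:
    tokenize('おはようございます', 'ja')

    # Uses jieba, another optional dependency, for Chinese:
    tokenize('北京地铁', 'zh')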
@@ -240,9 +252,9 @@ also try to deal gracefully when you query it with texts that actually break
 into multiple tokens:

     >>> zipf_frequency('New York', 'en')
-    5.31
+    5.07
     >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.51
+    3.58

 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be. In Chinese,
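For reference, the half-harmonic-mean combination mentioned above can be sketched as follows (an illustration, not the repository's exact implementation); for two tokens it is half of their harmonic mean, so a phrase is always estimated as rarer than its rarest token:

    def half_harmonic_mean(f1, f2):
        # Equivalent to 1 / (1/f1 + 1/f2).
        return (f1 * f2) / (f1 + f2)

    half_harmonic_mean(1e-4, 1e-6)   # ~9.9e-7, slightly below the rarer frequency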
@@ -257,7 +269,7 @@ you give it an uncommon combination of tokens, it will hugely over-estimate
 their frequency:

     >>> zipf_frequency('owl-flavored', 'en')
-    3.18
+    3.19


 ## License
setup.py (2 lines changed)
@@ -34,7 +34,7 @@ if sys.version_info < (3, 4):

 setup(
     name="wordfreq",
-    version='1.4.2',
+    version='1.5',
     maintainer='Luminoso Technologies, Inc.',
     maintainer_email='info@luminoso.com',
     url='http://github.com/LuminosoInsight/wordfreq/',
@@ -19,23 +19,43 @@ def test_freq_examples():

 def test_languages():
     # Make sure the number of available languages doesn't decrease
     avail = available_languages()
-    assert_greater(len(avail), 15)
+    assert_greater(len(avail), 26)
+
+    avail_twitter = available_languages('twitter')
+    assert_greater(len(avail_twitter), 15)

     # Look up a word representing laughter in each language, and make sure
-    # it has a non-zero frequency.
-    for lang in avail:
-        if lang in {'zh', 'ja'}:
+    # it has a non-zero frequency in the informal 'twitter' list.
+    for lang in avail_twitter:
+        if lang == 'zh' or lang == 'ja':
             text = '笑'
+        elif lang == 'ko':
+            text = 'ᄏᄏᄏ'
+        elif lang == 'ar':
+            text = 'ههههه'
+        elif lang == 'ca' or lang == 'es':
+            text = 'jaja'
+        elif lang in {'de', 'nb', 'sv', 'da'}:
+            text = 'haha'
+        elif lang == 'pt':
+            text = 'kkkk'
+        elif lang == 'he':
+            text = 'חחח'
+        elif lang == 'ru':
+            text = 'лол'
+        elif lang == 'bg':
+            text = 'хаха'
+        elif lang == 'ro':
+            text = 'haha'
+        elif lang == 'el':
+            text = 'χαχα'
         else:
             text = 'lol'
-        assert_greater(word_frequency(text, lang), 0)
+        assert_greater(word_frequency(text, lang, wordlist='twitter'), 0, (text, lang))

         # Make up a weirdly verbose language code and make sure
         # we still get it
         new_lang_code = '%s-001-x-fake-extension' % lang.upper()
-        assert_greater(word_frequency(text, new_lang_code), 0, (text, new_lang_code))
+        assert_greater(word_frequency(text, new_lang_code, wordlist='twitter'), 0, (text, new_lang_code))


 def test_twitter():

@@ -62,7 +82,7 @@ def test_most_common_words():
         """
         return top_n_list(lang, 1)[0]

-    eq_(get_most_common('ar'), 'في')
+    eq_(get_most_common('ar'), 'من')
     eq_(get_most_common('de'), 'die')
     eq_(get_most_common('en'), 'the')
     eq_(get_most_common('es'), 'de')

@@ -144,12 +164,12 @@ def test_not_really_random():
     # This not only tests random_ascii_words, it makes sure we didn't end
     # up with 'eos' as a very common Japanese word
     eq_(random_ascii_words(nwords=4, lang='ja', bits_per_word=0),
-        'rt rt rt rt')
+        '1 1 1 1')


 @raises(ValueError)
 def test_not_enough_ascii():
-    random_ascii_words(lang='zh')
+    random_ascii_words(lang='zh', bits_per_word=14)


 def test_arabic():

@@ -199,3 +219,10 @@ def test_other_languages():
     # Remove vowel points in Hebrew
     eq_(tokenize('דֻּגְמָה', 'he'), ['דגמה'])
+
+    # Deal with commas, cedillas, and I's in Turkish
+    eq_(tokenize('kișinin', 'tr'), ['kişinin'])
+    eq_(tokenize('KİȘİNİN', 'tr'), ['kişinin'])
+
+    # Deal with cedillas that should be commas-below in Romanian
+    eq_(tokenize('acelaşi', 'ro'), ['același'])
+    eq_(tokenize('ACELAŞI', 'ro'), ['același'])
@@ -282,15 +282,15 @@ def zipf_frequency(word, lang, wordlist='combined', minimum=0.):
     """
     Get the frequency of `word`, in the language with code `lang`, on the Zipf
     scale.

     The Zipf scale is a logarithmic frequency scale proposed by Marc Brysbaert,
     who compiled the SUBTLEX data. The goal of the Zipf scale is to map
     reasonable word frequencies to understandable, small positive numbers.

     A word rates as x on the Zipf scale when it occurs 10**x times per billion
     words. For example, a word that occurs once per million words is at 3.0 on
     the Zipf scale.

     Zipf values for reasonable words are between 0 and 8. The value this
     function returns will always be at least as large as `minimum`, even for a
     word that never appears. The default minimum is 0, representing words
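To make the scale concrete: a Zipf value is just a shifted base-10 logarithm of the proportional frequency that `word_frequency` returns. A minimal sketch (the helper name here is made up for illustration, not part of wordfreq's API):

    import math

    def zipf_from_frequency(freq):
        # A word at Zipf value x occurs 10**x times per billion words,
        # so x = log10(freq per billion) = log10(freq) + 9.
        return math.log10(freq) + 9.0

    # A word that occurs once per million words (frequency 1e-6) is at 3.0:
    assert round(zipf_from_frequency(1e-6), 2) == 3.0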
Binary data files changed (contents not shown). New wordlist files added in this commit:

    wordfreq/data/combined_bg.msgpack.gz, combined_ca, combined_da, combined_fi,
    combined_he, combined_hi, combined_hu, combined_nb, combined_ro
    wordfreq/data/large_ar.msgpack.gz, large_it, large_nl, large_ru
    wordfreq/data/twitter_ca.msgpack.gz, twitter_he, twitter_hi

Many existing combined_*, large_*, and twitter_* data files were also modified
(binary diffs not shown), and one large text-file diff was suppressed because of
its size.
@@ -23,6 +23,9 @@ def mecab_tokenize(text, lang):
         raise ValueError("Can't run MeCab on language %r" % lang)
     analyzer = MECAB_ANALYZERS[lang]
     text = unicodedata.normalize('NFKC', text.strip())
+    analyzed = analyzer.parse(text)
+    if not analyzed:
+        return []
     return [line.split('\t')[0]
-            for line in analyzer.parse(text).split('\n')
+            for line in analyzed.split('\n')
             if line != '' and line != 'EOS']
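For context, MeCab's default output is one token per line in the form `surface<TAB>features`, terminated by an `EOS` line; the comprehension above keeps only the surface forms. An illustrative sketch with a made-up sample string:

    sample = '今日\t名詞,...\nは\t助詞,...\nEOS\n'
    tokens = [line.split('\t')[0]
              for line in sample.split('\n')
              if line != '' and line != 'EOS']
    # tokens == ['今日', 'は']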
@@ -116,11 +116,26 @@ def simple_tokenize(text, include_punctuation=False):
 def turkish_tokenize(text, include_punctuation=False):
     """
     Like `simple_tokenize`, but modifies i's so that they case-fold correctly
-    in Turkish.
+    in Turkish, and modifies 'comma-below' characters to use cedillas.
     """
     text = unicodedata.normalize('NFC', text).replace('İ', 'i').replace('I', 'ı')
     token_expr = TOKEN_RE_WITH_PUNCTUATION if include_punctuation else TOKEN_RE
-    return [token.strip("'").casefold() for token in token_expr.findall(text)]
+    return [
+        commas_to_cedillas(token.strip("'").casefold())
+        for token in token_expr.findall(text)
+    ]
+
+
+def romanian_tokenize(text, include_punctuation=False):
+    """
+    Like `simple_tokenize`, but modifies the letters ş and ţ (with cedillas)
+    to use commas-below instead.
+    """
+    token_expr = TOKEN_RE_WITH_PUNCTUATION if include_punctuation else TOKEN_RE
+    return [
+        cedillas_to_commas(token.strip("'").casefold())
+        for token in token_expr.findall(text)
+    ]


 def tokenize_mecab_language(text, lang, include_punctuation=False):

@@ -161,6 +176,34 @@ def remove_marks(text):
     return MARK_RE.sub('', text)


+def commas_to_cedillas(text):
+    """
+    Convert s and t with commas (ș and ț) to cedillas (ş and ţ), which is
+    preferred in Turkish.
+    """
+    return text.replace(
+        '\N{LATIN SMALL LETTER S WITH COMMA BELOW}',
+        '\N{LATIN SMALL LETTER S WITH CEDILLA}'
+    ).replace(
+        '\N{LATIN SMALL LETTER T WITH COMMA BELOW}',
+        '\N{LATIN SMALL LETTER T WITH CEDILLA}'
+    )
+
+
+def cedillas_to_commas(text):
+    """
+    Convert s and t with cedillas (ş and ţ) to commas (ș and ț), which is
+    preferred in Romanian.
+    """
+    return text.replace(
+        '\N{LATIN SMALL LETTER S WITH CEDILLA}',
+        '\N{LATIN SMALL LETTER S WITH COMMA BELOW}'
+    ).replace(
+        '\N{LATIN SMALL LETTER T WITH CEDILLA}',
+        '\N{LATIN SMALL LETTER T WITH COMMA BELOW}'
+    )
+
+
 def tokenize(text, lang, include_punctuation=False, external_wordlist=False):
     """
     Tokenize this text in a way that's relatively simple but appropriate for

@@ -263,6 +306,8 @@ def tokenize(text, lang, include_punctuation=False, external_wordlist=False):
         return chinese_tokenize(text, include_punctuation, external_wordlist)
     elif lang == 'tr':
         return turkish_tokenize(text, include_punctuation)
+    elif lang == 'ro':
+        return romanian_tokenize(text, include_punctuation)
     elif lang in {'ar', 'bal', 'fa', 'ku', 'ps', 'sd', 'tk', 'ug', 'ur', 'he', 'yi'}:
         # Abjad languages
         text = remove_marks(unicodedata.normalize('NFKC', text))
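The effect of the new Turkish and Romanian code paths, as exercised by the tests earlier in this diff:

    from wordfreq import tokenize

    tokenize('KİȘİNİN', 'tr')   # ['kişinin']: İ case-folds to i, comma-below becomes cedilla
    tokenize('ACELAŞI', 'ro')   # ['același']: cedilla becomes comma-below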
@@ -91,6 +91,9 @@ rule convert_google_syntactic_ngrams
 rule count
   command = python -m wordfreq_builder.cli.count_tokens $in $out

+rule count_langtagged
+  command = python -m wordfreq_builder.cli.count_tokens_langtagged $in $out -l $language
+
 rule merge
   command = python -m wordfreq_builder.cli.merge_freqs -o $out -c $cutoff -l $lang $in
@@ -0,0 +1,21 @@
+"""
+Count tokens of text in a particular language, taking input from a
+tab-separated file whose first column is a language code. Lines in all
+languages except the specified one will be skipped.
+"""
+from wordfreq_builder.word_counts import count_tokens_langtagged, write_wordlist
+import argparse
+
+
+def handle_counts(filename_in, filename_out, lang):
+    counts = count_tokens_langtagged(filename_in, lang)
+    write_wordlist(counts, filename_out)
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('filename_in', help='name of input file containing tokens')
+    parser.add_argument('filename_out', help='name of output file')
+    parser.add_argument('-l', '--language', help='language tag to filter lines for')
+    args = parser.parse_args()
+    handle_counts(args.filename_in, args.filename_out, args.language)
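For illustration, the new script can also be driven from Python (the file names here are hypothetical; the input format is the tab-separated one described in the docstring above):

    from wordfreq_builder.cli.count_tokens_langtagged import handle_counts

    # Each input line looks like:  de<TAB>some German text
    # Only lines tagged 'de' are tokenized and counted; the counts are then
    # written out as comma-separated values by write_wordlist().
    handle_counts('raw-input/commoncrawl/de.txt.gz', 'commoncrawl_de.counts.txt', 'de')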
@@ -10,15 +10,20 @@ CONFIG = {
         #
         # Consider adding:
         # 'th' when we get tokenization for it
-        # 'hi' when we stop messing up its tokenization
         # 'tl' with one more data source
+        # 'el' if we can filter out kaomoji
         'twitter': [
-            'ar', 'de', 'el', 'en', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
-            'pl', 'pt', 'ru', 'sv', 'tr'
+            'ar', 'ca', 'de', 'en', 'es', 'fr', 'he', 'hi', 'id', 'it',
+            'ja', 'ko', 'ms', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr'
         ],
         # Languages with large Wikipedias. (Languages whose Wikipedia dump is
         # at least 200 MB of .xml.bz2 are included. Some widely-spoken
         # languages with 100 MB are also included, specifically Malay and
         # Hindi.)
         'wikipedia': [
-            'ar', 'de', 'en', 'el', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
-            'pl', 'pt', 'ru', 'sv', 'tr'
+            'ar', 'ca', 'de', 'el', 'en', 'es', 'fr', 'he', 'hi', 'id', 'it',
+            'ja', 'ko', 'ms', 'nb', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr', 'zh',
+            'bg', 'da', 'fi', 'hu', 'ro', 'uk'
         ],
         'opensubtitles': [
             # This list includes languages where the most common word in

@@ -43,9 +48,20 @@
         'jieba': ['zh'],

         # About 99.2% of Reddit is in English. There are pockets of
-        # conversation in other languages, but we're concerned that they're not
+        # conversation in other languages, some of which may not be
         # representative enough for learning general word frequencies.
-        'reddit': ['en']
+        #
+        # However, there seem to be Spanish subreddits that are general enough
+        # (including /r/es and /r/mexico).
+        'reddit': ['en', 'es'],
+
+        # Well-represented languages in the Common Crawl
+        # It's possible we could add 'uk' to the list, needs more checking
+        'commoncrawl': [
+            'ar', 'bg', 'cs', 'da', 'de', 'el', 'es', 'fa', 'fi', 'fr',
+            'he', 'hi', 'hu', 'id', 'it', 'ja', 'ko', 'ms', 'nb', 'nl',
+            'pl', 'pt', 'ro', 'ru', 'sk', 'sv', 'ta', 'tr', 'vi', 'zh'
+        ],
     },
     # Subtlex languages that need to be pre-processed
     'wordlist_paths': {

@@ -54,6 +70,7 @@
         'opensubtitles': 'generated/opensubtitles/opensubtitles_{lang}.{ext}',
         'leeds': 'generated/leeds/leeds_internet_{lang}.{ext}',
         'google-books': 'generated/google-books/google_books_{lang}.{ext}',
+        'commoncrawl': 'generated/commoncrawl/commoncrawl_{lang}.{ext}',
         'subtlex-en': 'generated/subtlex/subtlex_{lang}.{ext}',
         'subtlex-other': 'generated/subtlex/subtlex_{lang}.{ext}',
         'jieba': 'generated/jieba/jieba_{lang}.{ext}',

@@ -64,8 +81,15 @@
         'twitter-dist': 'dist/twitter_{lang}.{ext}',
         'jieba-dist': 'dist/jieba_{lang}.{ext}'
     },
-    'min_sources': 2,
-    'big-lists': ['en', 'fr', 'es', 'pt', 'de']
+    'min_sources': 3,
+    'big-lists': ['en', 'fr', 'es', 'pt', 'de', 'ar', 'it', 'nl', 'ru'],
+
+    # When dealing with language tags that come straight from cld2, we need
+    # to un-standardize a few of them
+    'cld2-language-aliases': {
+        'nb': 'no',
+        'he': 'iw',
+        'jw': 'jv'
+    }
 }
@@ -87,6 +87,10 @@ def make_ninja_deps(rules_filename, out=sys.stdout):
             data_filename('source-lists/jieba'),
             CONFIG['sources']['jieba']
         ),
+        commoncrawl_deps(
+            data_filename('raw-input/commoncrawl'),
+            CONFIG['sources']['commoncrawl']
+        ),
         combine_lists(all_languages())
     ))

@@ -117,6 +121,19 @@ def wikipedia_deps(dirname_in, languages):
     return lines


+def commoncrawl_deps(dirname_in, languages):
+    lines = []
+    for language in languages:
+        if language in CONFIG['cld2-language-aliases']:
+            language_alias = CONFIG['cld2-language-aliases'][language]
+        else:
+            language_alias = language
+        input_file = dirname_in + '/{}.txt.gz'.format(language_alias)
+        count_file = wordlist_filename('commoncrawl', language, 'counts.txt')
+        add_dep(lines, 'count_langtagged', input_file, count_file, params={'language': language_alias})
+    return lines
+
+
 def google_books_deps(dirname_in):
     # Get English data from the split-up files of the Google Syntactic N-grams
     # 2013 corpus.
@@ -2,10 +2,12 @@ from wordfreq import simple_tokenize, tokenize
 from collections import defaultdict
 from operator import itemgetter
 from ftfy import fix_text
+import statistics
 import math
 import csv
 import msgpack
 import gzip
 import unicodedata
+import regex


@@ -36,6 +38,28 @@ def count_tokens(filename):
     return counts


+def count_tokens_langtagged(filename, lang):
+    """
+    Count tokens that appear in an already language-tagged file, in which each
+    line begins with a language code followed by a tab.
+    """
+    counts = defaultdict(int)
+    if filename.endswith('gz'):
+        infile = gzip.open(filename, 'rt', encoding='utf-8', errors='replace')
+    else:
+        infile = open(filename, encoding='utf-8', errors='replace')
+    for line in infile:
+        if '\t' not in line:
+            continue
+        line_lang, text = line.split('\t', 1)
+        if line_lang == lang:
+            tokens = tokenize(text.strip(), lang)
+            for token in tokens:
+                counts[token] += 1
+    infile.close()
+    return counts
+
+
 def read_values(filename, cutoff=0, max_words=1e8, lang=None):
     """
     Read words and their frequency or count values from a CSV file. Returns
@@ -137,7 +161,7 @@ def merge_counts(count_dicts):
 def merge_freqs(freq_dicts):
     """
     Merge multiple dictionaries of frequencies, representing each word with
-    the word's average frequency over all sources.
+    the median of the word's frequency over all sources.
     """
     vocab = set()
     for freq_dict in freq_dicts:

@@ -146,15 +170,45 @@
     merged = defaultdict(float)
     N = len(freq_dicts)
     for term in vocab:
-        term_total = 0.
+        freqs = []
+        missing_values = 0
         for freq_dict in freq_dicts:
-            term_total += freq_dict.get(term, 0.)
-        merged[term] = term_total / N
+            freq = freq_dict.get(term, 0.)
+            if freq < 1e-8:
+                # Usually we trust the median of the wordlists, but when at
+                # least 2 wordlists say a word exists and the rest say it
+                # doesn't, we kind of want to listen to the two that have
+                # information about the word. The word might be a word that's
+                # inconsistently accounted for, such as an emoji or a word
+                # containing an apostrophe.
+                #
+                # So, once we see at least 2 values that are very low or
+                # missing, we ignore further low values in the median. A word
+                # that appears in 2 sources gets a reasonable frequency, while
+                # a word that appears in 1 source still gets dropped.
+                missing_values += 1
+                if missing_values > 2:
+                    continue
+
+            freqs.append(freq)
+
+        if freqs:
+            median = statistics.median(freqs)
+            if median > 0.:
+                merged[term] = median
+
+    total = sum(merged.values())
+
+    # Normalize the merged values so that they add up to 0.99 (based on
+    # a rough estimate that 1% of tokens will be out-of-vocabulary in a
+    # wordlist of this size).
+    for term in merged:
+        merged[term] = merged[term] / total * 0.99
+
     return merged


-def write_wordlist(freqs, filename, cutoff=1e-8):
+def write_wordlist(freqs, filename, cutoff=1e-9):
     """
     Write a dictionary of either raw counts or frequencies to a file of
     comma-separated values.
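To see why the cap of two missing values matters, here is a worked example with made-up numbers. A word that appears in only two of five sources would get a median of 0 (and be dropped) if every missing source counted, but keeps a reasonable frequency once the extra zeros are ignored:

    from statistics import median

    freqs_seen = [2e-5, 3e-5, 0.0, 0.0, 0.0]   # present in 2 of 5 sources

    median(freqs_seen)                         # 0.0   -> the word would be dropped

    capped = [2e-5, 3e-5, 0.0, 0.0]            # only the first 2 zeros are kept
    median(capped)                             # 1e-05 -> a reasonable merged frequency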
@@ -226,7 +280,6 @@ def correct_apostrophe_trimming(freqs):
     removed.
     """
     if (freqs.get('wouldn', 0) > 1e-6 and freqs.get('couldn', 0) > 1e-6):
-        print("Applying apostrophe trimming")
         for trim_word, trim_prob in APOSTROPHE_TRIMMED_PROB.items():
             if trim_word in freqs:
                 freq = freqs[trim_word]