Merge pull request #27 from LuminosoInsight/chinese-and-more

Improve Chinese, Greek, English; add Turkish, Polish, Swedish
Andrew Lin 2015-09-24 13:25:21 -04:00
commit 710eaabbe1
56 changed files with 36546 additions and 102 deletions

View File

@ -26,7 +26,7 @@ install them on Ubuntu:
## Usage
wordfreq provides access to estimates of the frequency with which a word is
used, in 16 languages (see *Supported languages* below). It loads
used, in 18 languages (see *Supported languages* below). It loads
efficiently-packed data structures that contain all words that appear at least
once per million words.
@ -111,45 +111,49 @@ limiting the selection to words that can be typed in ASCII.
## Sources and supported languages
We compiled word frequencies from five different sources, providing us examples
of word usage on different topics at different levels of formality. The sources
(and the abbreviations we'll use for them) are:
We compiled word frequencies from seven different sources, providing us
examples of word usage on different topics at different levels of formality.
The sources (and the abbreviations we'll use for them) are:
- **GBooks**: Google Books Ngrams 2013
- **LeedsIC**: The Leeds Internet Corpus
- **OpenSub**: OpenSubtitles
- **SUBTLEX**: The SUBTLEX word frequency lists
- **OpenSub**: Data derived from OpenSubtitles but not from SUBTLEX
- **Twitter**: Messages sampled from Twitter's public stream
- **Wikipedia**: The full text of Wikipedia in 2015
- **Wpedia**: The full text of Wikipedia in 2015
- **Other**: We get additional English frequencies from Google Books Syntactic
Ngrams 2013, and Chinese frequencies from the frequency dictionary that
comes with the Jieba tokenizer.
The following 14 languages are well-supported, with reasonable tokenization and
The following 17 languages are well-supported, with reasonable tokenization and
at least 3 different sources of word frequencies:
Language    Code    GBooks  SUBTLEX LeedsIC OpenSub Twitter Wikipedia
──────────────────┼──────────────────────────────────────────────────
Arabic      ar    │ -       -       Yes     Yes     Yes     Yes
German      de    │ -       Yes     Yes     -       Yes[1]  Yes
Greek       el    │ -       -       Yes     Yes     Yes     Yes
English     en    │ Yes     Yes     Yes     Yes     Yes     Yes
Spanish     es    │ -       -       Yes     Yes     Yes     Yes
French      fr    │ -       -       Yes     Yes     Yes     Yes
Indonesian  id    │ -       -       -       Yes     Yes     Yes
Italian     it    │ -       -       Yes     Yes     Yes     Yes
Japanese    ja    │ -       -       Yes     -       Yes     Yes
Malay       ms    │ -       -       -       Yes     Yes     Yes
Dutch       nl    │ -       Yes     -       Yes     Yes     Yes
Portuguese  pt    │ -       -       Yes     Yes     Yes     Yes
Russian     ru    │ -       -       Yes     Yes     Yes     Yes
Turkish     tr    │ -       -       -       Yes     Yes     Yes
Language    Code    SUBTLEX OpenSub LeedsIC Twitter Wpedia  Other
──────────────────┼─────────────────────────────────────────────────────
Arabic      ar    │ -       Yes     Yes     Yes     Yes     -
German      de    │ Yes     -       Yes     Yes[1]  Yes     -
Greek       el    │ -       Yes     Yes     Yes     Yes     -
English     en    │ Yes     Yes     Yes     Yes     Yes     Google Books
Spanish     es    │ -       Yes     Yes     Yes     Yes     -
French      fr    │ -       Yes     Yes     Yes     Yes     -
Indonesian  id    │ -       Yes     -       Yes     Yes     -
Italian     it    │ -       Yes     Yes     Yes     Yes     -
Japanese    ja    │ -       -       Yes     Yes     Yes     -
Malay       ms    │ -       Yes     -       Yes     Yes     -
Dutch       nl    │ Yes     Yes     -       Yes     Yes     -
Polish      pl    │ -       Yes     -       Yes     Yes     -
Portuguese  pt    │ -       Yes     Yes     Yes     Yes     -
Russian     ru    │ -       Yes     Yes     Yes     Yes     -
Swedish     sv    │ -       Yes     -       Yes     Yes     -
Turkish     tr    │ -       Yes     -       Yes     Yes     -
Chinese     zh    │ Yes     -       Yes     -       -       Jieba
These languages are only marginally supported so far. We have too few data
sources so far in Korean (feel free to suggest some), and we are lacking
tokenization support for Chinese.
Language    Code    GBooks  SUBTLEX LeedsIC OpenSub Twitter Wikipedia
──────────────────┼──────────────────────────────────────────────────
Korean      ko    │ -       -       -       -       Yes     Yes
Chinese     zh    │ -       Yes     Yes     Yes     -       -
Additionally, Korean is marginally supported. You can look up frequencies in
it, but we have too few data sources for it so far:
Language    Code    SUBTLEX OpenSub LeedsIC Twitter Wpedia
──────────────────┼───────────────────────────────────────
Korean      ko    │ -       -       -       Yes     Yes
[1] We've counted the frequencies from tweets in German, such as they are, but
you should be aware that German is not a frequently-used language on Twitter.
@ -170,7 +174,8 @@ There are language-specific exceptions:
- In Japanese, instead of using the regex library, it uses the external library
`mecab-python3`. This is an optional dependency of wordfreq, and compiling
it requires the `libmecab-dev` system package to be installed.
- It does not yet attempt to tokenize Chinese ideograms.
- In Chinese, it uses the external Python library `jieba`, another optional
dependency.
[uax29]: http://unicode.org/reports/tr29/
@ -182,10 +187,14 @@ also try to deal gracefully when you query it with texts that actually break
into multiple tokens:
>>> word_frequency('New York', 'en')
0.0002632772081925718
0.0002315934248950231
>>> word_frequency('北京地铁', 'zh') # "Beijing Subway"
3.2187603965715087e-06
The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be.
to provide an estimate of what their combined frequency would be. In languages
written without spaces, there is also a penalty to the word frequency for each
word break that must be inferred.
This implicitly assumes that you're asking about words that frequently appear
together. It's not multiplying the frequencies, because that would assume they
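As a rough sketch of how that combination behaves (illustration only, not wordfreq's API; the helper name and the input frequencies are made up, and the factor of 10 is the INFERRED_SPACE_FACTOR defined in wordfreq/__init__.py later in this diff):

def combined_frequency(token_freqs, inferred_breaks=0):
    # Half-harmonic mean: the reciprocal of the sum of reciprocals.
    freq = 1.0 / sum(1.0 / f for f in token_freqs)
    # Penalize each word break that had to be inferred, by a factor of 10 each.
    return freq / (10.0 ** inferred_breaks)

# Two tokens of equal frequency f combine to f / 2; with one inferred break,
# as in the Chinese example above, the estimate becomes f / 20.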
@ -223,14 +232,14 @@ sources:
- Wikipedia, the free encyclopedia (http://www.wikipedia.org)
It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK, and
SUBTLEX-CH, created by Marc Brysbaert et al. and available at
It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK,
SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al.
(see citations below) and available at
http://crr.ugent.be/programs-data/subtitle-frequencies.
I (Rob Speer) have
obtained permission by e-mail from Marc Brysbaert to distribute these wordlists
in wordfreq, to be used for any purpose, not just for academic use, under these
conditions:
I (Rob Speer) have obtained permission by e-mail from Marc Brysbaert to
distribute these wordlists in wordfreq, to be used for any purpose, not just
for academic use, under these conditions:
- Wordfreq and code derived from it must credit the SUBTLEX authors.
- It must remain clear that SUBTLEX is freely available data.
@ -254,6 +263,11 @@ Twitter; it does not display or republish any Twitter content.
(2015). The word frequency effect. Experimental Psychology.
http://econtent.hogrefe.com/doi/abs/10.1027/1618-3169/a000123?journalCode=zea
- Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A.
(2011). The word frequency effect: A review of recent developments and
implications for the choice of frequency estimates in German. Experimental
Psychology, 58, 412-424.
- Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character
frequencies based on film subtitles. PLoS One, 5(6), e10729.
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729
@ -277,4 +291,3 @@ Twitter; it does not display or republish any Twitter content.
SUBTLEX-UK: A new and improved word frequency database for British English.
The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.
http://www.tandfonline.com/doi/pdf/10.1080/17470218.2013.850521

View File

@ -0,0 +1,50 @@
"""
Generate a msgpack file, _chinese_mapping.msgpack.gz, that maps Traditional
Chinese characters to their Simplified Chinese equivalents.
This is meant to be a normalization of text, somewhat like case-folding -- not
an actual translator, a task for which this method would be unsuitable. We
store word frequencies using Simplified Chinese characters so that, in the
large number of cases where a Traditional Chinese word has an obvious
Simplified Chinese mapping, we can get a frequency for it that's the same in
Simplified and Traditional Chinese.
Generating this mapping requires the external Chinese conversion tool OpenCC.
"""
import unicodedata
import itertools
import os
import msgpack
import gzip
def make_hanzi_table(filename):
with open(filename, 'w', encoding='utf-8') as out:
for codept in itertools.chain(range(0x3400, 0xa000), range(0xf900, 0xfb00), range(0x20000, 0x30000)):
char = chr(codept)
if unicodedata.category(char) != 'Cn':
print('%5X\t%s' % (codept, char), file=out)
def make_hanzi_converter(table_in, msgpack_out):
table = {}
with open(table_in, encoding='utf-8') as infile:
for line in infile:
hexcode, char = line.rstrip('\n').split('\t')
codept = int(hexcode, 16)
assert len(char) == 1
if chr(codept) != char:
table[codept] = char
with gzip.open(msgpack_out, 'wb') as outfile:
msgpack.dump(table, outfile, encoding='utf-8')
def build():
make_hanzi_table('/tmp/han_in.txt')
os.system('opencc -c zht2zhs.ini < /tmp/han_in.txt > /tmp/han_out.txt')
make_hanzi_converter('/tmp/han_out.txt', '_chinese_mapping.msgpack.gz')
if __name__ == '__main__':
build()
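Once built, the table is simply a dict mapping Traditional codepoints to Simplified characters, so it can be applied with str.translate, as wordfreq/chinese.py does later in this diff. A minimal sketch of reading it back, mirroring the load call used in wordfreq/chinese.py (expected result shown for illustration):

import gzip
import msgpack

# Load the codepoint-to-character table and apply it as a character mapping.
table = msgpack.load(gzip.open('_chinese_mapping.msgpack.gz', 'rb'), encoding='utf-8')
print('漢字'.translate(table))   # expected output: 汉字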

View File

@ -33,7 +33,7 @@ if sys.version_info < (3, 4):
setup(
name="wordfreq",
version='1.1',
version='1.2',
maintainer='Luminoso Technologies, Inc.',
maintainer_email='info@luminoso.com',
url='http://github.com/LuminosoInsight/wordfreq/',
@ -50,8 +50,11 @@ setup(
# turn, it depends on libmecab-dev being installed on the system. It's not
# listed under 'install_requires' because wordfreq should be usable in
# other languages without it.
#
# Similarly, jieba is required for Chinese word frequencies.
extras_require={
'mecab': 'mecab-python3'
'mecab': 'mecab-python3',
'jieba': 'jieba'
},
tests_require=['mecab-python3'],
tests_require=['mecab-python3', 'jieba'],
)

View File

@ -162,8 +162,8 @@ def test_ar():
def test_ideographic_fallback():
# Try tokenizing Chinese text -- it should remain stuck together.
eq_(tokenize('中国文字', 'zh'), ['中国文字'])
# Try tokenizing Chinese text as English -- it should remain stuck together.
eq_(tokenize('中国文字', 'en'), ['中国文字'])
# When Japanese is tagged with the wrong language, it will be split
# at script boundaries.

tests/test_chinese.py (new file, 47 lines)
View File

@ -0,0 +1,47 @@
from nose.tools import eq_, assert_almost_equal, assert_greater
from wordfreq import tokenize, word_frequency


def test_tokens():
    # Let's test on some Chinese text that has unusual combinations of
    # syllables, because it is about an American vice-president.
    #
    # (He was the Chinese Wikipedia's featured article of the day when I
    # wrote this test.)
    hobart = '加勒特·霍巴特'  # Garret Hobart, or "jiā lè tè huò bā tè".

    # He was the sixth American vice president to die in office.
    fact_simplified = '他是历史上第六位在任期内去世的美国副总统。'
    fact_traditional = '他是歷史上第六位在任期內去世的美國副總統。'

    # His name breaks into five pieces, with the only piece staying together
    # being the one that means 'Bart'. The dot is not included as a token.
    eq_(
        tokenize(hobart, 'zh'),
        ['加', '勒', '特', '霍', '巴特']
    )

    eq_(
        tokenize(fact_simplified, 'zh'),
        [
            # he / is / in history / #6 / counter for people
            '他', '是', '历史上', '第六', '位',
            # during / term of office / in / die
            '在', '任期', '内', '去世',
            # of / U.S. / deputy / president
            '的', '美国', '副', '总统'
        ]
    )

    # You match the same tokens if you look it up in Traditional Chinese.
    eq_(tokenize(fact_simplified, 'zh'), tokenize(fact_traditional, 'zh'))
    assert_greater(word_frequency(fact_traditional, 'zh'), 0)


def test_combination():
    xiexie_freq = word_frequency('谢谢', 'zh')   # "Thanks"
    assert_almost_equal(
        word_frequency('谢谢谢谢', 'zh'),
        xiexie_freq / 20
    )
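The factor of 20 in that test is the combination rule at work. A worked check (illustration only, using a placeholder frequency), assuming the factor-of-10 penalty defined as INFERRED_SPACE_FACTOR in wordfreq/__init__.py:

f = 1e-4                                   # placeholder frequency for '谢谢'
half_harmonic = 1.0 / (1.0 / f + 1.0 / f)  # two equal tokens combine to f / 2
penalized = half_harmonic / 10.0           # one inferred word break divides by 10
assert abs(penalized - f / 20) < 1e-18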

View File

@ -15,6 +15,19 @@ logger = logging.getLogger(__name__)
CACHE_SIZE = 100000
DATA_PATH = pathlib.Path(resource_filename('wordfreq', 'data'))
# Chinese and Japanese are written without spaces. In Chinese, in particular,
# we have to infer word boundaries from the frequencies of the words they
# would create. When this happens, we should adjust the resulting frequency
# to avoid creating a bias toward improbable word combinations.
INFERRED_SPACE_LANGUAGES = {'zh'}
# We'll divide the frequency by 10 for each token boundary that was inferred.
# (We determined the factor of 10 empirically by looking at words in the
# Chinese wordlist that weren't common enough to be identified by the
# tokenizer. These words would get split into multiple tokens, and their
# inferred frequency would be on average 9.77 times higher than their actual
# frequency.)
INFERRED_SPACE_FACTOR = 10.0
# simple_tokenize is imported so that other things can import it from here.
# Suppress the pyflakes warning.
@ -80,10 +93,11 @@ def available_languages(wordlist='combined'):
"""
available = {}
for path in DATA_PATH.glob('*.msgpack.gz'):
list_name = path.name.split('.')[0]
name, lang = list_name.split('_')
if name == wordlist:
available[lang] = str(path)
if not path.name.startswith('_'):
list_name = path.name.split('.')[0]
name, lang = list_name.split('_')
if name == wordlist:
available[lang] = str(path)
return available
@ -181,7 +195,12 @@ def _word_frequency(word, lang, wordlist, minimum):
return minimum
one_over_result += 1.0 / freqs[token]
return max(1.0 / one_over_result, minimum)
freq = 1.0 / one_over_result
if lang in INFERRED_SPACE_LANGUAGES:
freq /= INFERRED_SPACE_FACTOR ** (len(tokens) - 1)
return max(freq, minimum)
def word_frequency(word, lang, wordlist='combined', minimum=0.):
"""

wordfreq/chinese.py (new file, 20 lines)
View File

@ -0,0 +1,20 @@
from pkg_resources import resource_filename
import jieba
import msgpack
import gzip

DICT_FILENAME = resource_filename('wordfreq', 'data/jieba_zh.txt')
SIMP_MAP_FILENAME = resource_filename('wordfreq', 'data/_chinese_mapping.msgpack.gz')
SIMPLIFIED_MAP = msgpack.load(gzip.open(SIMP_MAP_FILENAME), encoding='utf-8')
jieba_tokenizer = None


def simplify_chinese(text):
    return text.translate(SIMPLIFIED_MAP).casefold()


def jieba_tokenize(text):
    global jieba_tokenizer
    if jieba_tokenizer is None:
        jieba_tokenizer = jieba.Tokenizer(dictionary=DICT_FILENAME)
    return jieba_tokenizer.lcut(simplify_chinese(text), HMM=False)
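A quick interactive sketch of how these two helpers are expected to behave (illustrative output; the exact tokens depend on the bundled Jieba dictionary):

>>> from wordfreq.chinese import simplify_chinese, jieba_tokenize
>>> simplify_chinese('漢字')    # Traditional characters map to their Simplified forms
'汉字'
>>> jieba_tokenize('谢谢你')    # "thank you"; expected to split into two words
['谢谢', '你']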

Binary files not shown (19 files)

wordfreq/data/jieba_zh.txt (new file, 36124 lines)

File diff suppressed because it is too large.

Binary files not shown (17 files)

View File

@ -1,5 +1,6 @@
import regex
import unicodedata
from pkg_resources import resource_filename
TOKEN_RE = regex.compile(r"""
@ -87,6 +88,7 @@ def remove_arabic_marks(text):
mecab_tokenize = None
jieba_tokenize = None
def tokenize(text, lang):
"""
Tokenize this text in a way that's relatively simple but appropriate for
@ -115,8 +117,17 @@ def tokenize(text, lang):
if lang == 'ja':
global mecab_tokenize
if mecab_tokenize is None:
from wordfreq.mecab import mecab_tokenize
return mecab_tokenize(text)
from wordfreq.japanese import mecab_tokenize
tokens = mecab_tokenize(text)
return [token.casefold() for token in tokens if TOKEN_RE.match(token)]
if lang == 'zh':
global jieba_tokenize
if jieba_tokenize is None:
from wordfreq.chinese import jieba_tokenize
tokens = jieba_tokenize(text)
return [token.casefold() for token in tokens if TOKEN_RE.match(token)]
if lang == 'tr':
return turkish_tokenize(text)

Binary file not shown.

Image changed (1.9 MiB before, 1.9 MiB after)

View File

@ -32,10 +32,15 @@ rule wiki2text
command = bunzip2 -c $in | wiki2text > $out
# To tokenize Japanese, we run it through Mecab and take the first column.
# We don't have a plan for tokenizing Chinese yet.
rule tokenize_japanese
command = mecab -b 1048576 < $in | cut -f 1 | grep -v "EOS" > $out
# Process Chinese by converting all Traditional Chinese characters to
# Simplified equivalents -- not because that's a good way to get readable
# text, but because that's how we're going to look them up.
rule simplify_chinese
command = python -m wordfreq_builder.cli.simplify_chinese < $in > $out
# Tokenizing text from Twitter requires us to language-detect and tokenize
# in the same step.
rule tokenize_twitter
@ -62,6 +67,13 @@ rule convert_opensubtitles
rule convert_subtlex
command = cut -f $textcol,$freqcol $in | tail -n +$startrow | ftfy | tr ' ",' ', ' | grep -v 'â,' > $out
rule convert_jieba
command = cut -d ' ' -f 1,2 $in | grep -v '[,"]' | tr ' ' ',' > $out
rule counts_to_jieba
command = python -m wordfreq_builder.cli.counts_to_jieba $in $out
# Convert and clean up the Google Books Syntactic N-grams data. Concatenate all
# the input files, keep only the single words and their counts, and only keep
# lines with counts of 100 or more.
@ -77,13 +89,13 @@ rule count
command = python -m wordfreq_builder.cli.count_tokens $in $out
rule merge
command = python -m wordfreq_builder.cli.merge_freqs -o $out -c $cutoff $in
command = python -m wordfreq_builder.cli.merge_freqs -o $out -c $cutoff -l $lang $in
rule merge_counts
command = python -m wordfreq_builder.cli.merge_counts -o $out $in
rule freqs2cB
command = python -m wordfreq_builder.cli.freqs_to_cB $lang $in $out
command = python -m wordfreq_builder.cli.freqs_to_cB $in $out
rule cat
command = cat $in > $out

View File

@ -0,0 +1,15 @@
from wordfreq_builder.word_counts import read_values, write_jieba
import argparse
def handle_counts(filename_in, filename_out):
freqs, total = read_values(filename_in, cutoff=1e-6)
write_jieba(freqs, filename_out)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('filename_in', help='name of input wordlist')
parser.add_argument('filename_out', help='name of output Jieba-compatible wordlist')
args = parser.parse_args()
handle_counts(args.filename_in, args.filename_out)

View File

@ -4,8 +4,7 @@ import argparse
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('language', help='language of the input file')
parser.add_argument('filename_in', help='name of input file containing tokens')
parser.add_argument('filename_out', help='name of output file')
args = parser.parse_args()
freqs_to_cBpack(args.filename_in, args.filename_out, lang=args.language)
freqs_to_cBpack(args.filename_in, args.filename_out)

View File

@ -2,10 +2,16 @@ from wordfreq_builder.word_counts import read_freqs, merge_freqs, write_wordlist
import argparse
def merge_lists(input_names, output_name, cutoff):
def merge_lists(input_names, output_name, cutoff, lang):
freq_dicts = []
# Don't use Chinese tokenization while building wordlists, as that would
# create a circular dependency.
if lang == 'zh':
lang = None
for input_name in input_names:
freq_dicts.append(read_freqs(input_name, cutoff=cutoff))
freq_dicts.append(read_freqs(input_name, cutoff=cutoff, lang=lang))
merged = merge_freqs(freq_dicts)
write_wordlist(merged, output_name)
@ -14,7 +20,8 @@ if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('-o', '--output', help='filename to write the output to', default='combined-freqs.csv')
parser.add_argument('-c', '--cutoff', type=int, help='stop after seeing a count below this', default=2)
parser.add_argument('-l', '--language', help='language code for which language the words are in', default=None)
parser.add_argument('inputs', help='names of input files to merge', nargs='+')
args = parser.parse_args()
merge_lists(args.inputs, args.output, args.cutoff)
merge_lists(args.inputs, args.output, args.cutoff, args.language)

View File

@ -0,0 +1,11 @@
from wordfreq.chinese import simplify_chinese
import sys
def main():
for line in sys.stdin:
sys.stdout.write(simplify_chinese(line))
if __name__ == '__main__':
main()

View File

@ -1,35 +1,34 @@
import os
CONFIG = {
'version': '1.0b',
# data_dir is a relative or absolute path to where the wordlist data
# is stored
'data_dir': 'data',
'sources': {
# A list of language codes (possibly un-standardized) that we'll
# look up in filenames for these various data sources.
# A list of language codes that we'll look up in filenames for these
# various data sources.
#
# Consider adding:
# 'th' when we get tokenization for it
# 'hi' when we stop messing up its tokenization
# 'tl' because it's probably ready right now
# 'pl' because we have 3 sources for it
# 'tl' with one more data source
'twitter': [
'ar', 'de', 'el', 'en', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
'pt', 'ru', 'tr'
'pl', 'pt', 'ru', 'sv', 'tr'
],
'wikipedia': [
'ar', 'de', 'en', 'el', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
'pt', 'ru', 'tr'
'pl', 'pt', 'ru', 'sv', 'tr'
],
'opensubtitles': [
# This list includes languages where the most common word in
# OpenSubtitles appears at least 5000 times. However, we exclude
# German, where SUBTLEX has done better processing of the same data.
# languages where SUBTLEX has apparently done a better job,
# specifically German and Chinese.
'ar', 'bg', 'bs', 'ca', 'cs', 'da', 'el', 'en', 'es', 'et',
'fa', 'fi', 'fr', 'he', 'hr', 'hu', 'id', 'is', 'it', 'lt', 'lv',
'mk', 'ms', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sq',
'sr', 'sv', 'tr', 'uk', 'zh'
'sr', 'sv', 'tr', 'uk'
],
'leeds': [
'ar', 'de', 'el', 'en', 'es', 'fr', 'it', 'ja', 'pt', 'ru', 'zh'
@ -41,6 +40,7 @@ CONFIG = {
],
'subtlex-en': ['en'],
'subtlex-other': ['de', 'nl', 'zh'],
'jieba': ['zh']
},
# Subtlex languages that need to be pre-processed
'wordlist_paths': {
@ -51,9 +51,11 @@ CONFIG = {
'google-books': 'generated/google-books/google_books_{lang}.{ext}',
'subtlex-en': 'generated/subtlex/subtlex_{lang}.{ext}',
'subtlex-other': 'generated/subtlex/subtlex_{lang}.{ext}',
'jieba': 'generated/jieba/jieba_{lang}.{ext}',
'combined': 'generated/combined/combined_{lang}.{ext}',
'combined-dist': 'dist/combined_{lang}.{ext}',
'twitter-dist': 'dist/twitter_{lang}.{ext}'
'twitter-dist': 'dist/twitter_{lang}.{ext}',
'jieba-dist': 'dist/jieba_{lang}.{ext}'
},
'min_sources': 2
}

View File

@ -3,6 +3,7 @@ from wordfreq_builder.config import (
)
import sys
import pathlib
import itertools
HEADER = """# This file is automatically generated. Do not edit it.
# You can change its behavior by editing wordfreq_builder/ninja.py,
@ -45,51 +46,43 @@ def make_ninja_deps(rules_filename, out=sys.stdout):
# The first dependency is to make sure the build file is up to date.
add_dep(lines, 'build_deps', 'rules.ninja', 'build.ninja',
extra='wordfreq_builder/ninja.py')
lines.extend(
lines.extend(itertools.chain(
twitter_deps(
data_filename('raw-input/twitter/all-2014.txt'),
slice_prefix=data_filename('slices/twitter/tweets-2014'),
combined_prefix=data_filename('generated/twitter/tweets-2014'),
slices=40,
languages=CONFIG['sources']['twitter']
)
)
lines.extend(
),
wikipedia_deps(
data_filename('raw-input/wikipedia'),
CONFIG['sources']['wikipedia']
)
)
lines.extend(
),
google_books_deps(
data_filename('raw-input/google-books')
)
)
lines.extend(
),
leeds_deps(
data_filename('source-lists/leeds'),
CONFIG['sources']['leeds']
)
)
lines.extend(
),
opensubtitles_deps(
data_filename('source-lists/opensubtitles'),
CONFIG['sources']['opensubtitles']
)
)
lines.extend(
),
subtlex_en_deps(
data_filename('source-lists/subtlex'),
CONFIG['sources']['subtlex-en']
)
)
lines.extend(
),
subtlex_other_deps(
data_filename('source-lists/subtlex'),
CONFIG['sources']['subtlex-other']
)
)
lines.extend(combine_lists(all_languages()))
),
jieba_deps(
data_filename('source-lists/jieba'),
CONFIG['sources']['jieba']
),
combine_lists(all_languages())
))
print('\n'.join(lines), file=out)
@ -189,8 +182,14 @@ def leeds_deps(dirname_in, languages):
input_file = '{prefix}/internet-{lang}-forms.num'.format(
prefix=dirname_in, lang=language
)
if language == 'zh':
step2_file = wordlist_filename('leeds', 'zh-Hans', 'converted.txt')
add_dep(lines, 'simplify_chinese', input_file, step2_file)
else:
step2_file = input_file
reformatted_file = wordlist_filename('leeds', language, 'counts.txt')
add_dep(lines, 'convert_leeds', input_file, reformatted_file)
add_dep(lines, 'convert_leeds', step2_file, reformatted_file)
return lines
@ -201,14 +200,38 @@ def opensubtitles_deps(dirname_in, languages):
input_file = '{prefix}/{lang}.txt'.format(
prefix=dirname_in, lang=language
)
if language == 'zh':
step2_file = wordlist_filename('opensubtitles', 'zh-Hans', 'converted.txt')
add_dep(lines, 'simplify_chinese', input_file, step2_file)
else:
step2_file = input_file
reformatted_file = wordlist_filename(
'opensubtitles', language, 'counts.txt'
)
add_dep(lines, 'convert_opensubtitles', input_file, reformatted_file)
add_dep(lines, 'convert_opensubtitles', step2_file, reformatted_file)
return lines
def jieba_deps(dirname_in, languages):
lines = []
# Because there's Chinese-specific handling here, the valid options for
# 'languages' are [] and ['zh']. Make sure it's one of those.
if not languages:
return lines
assert languages == ['zh']
input_file = '{prefix}/dict.txt.big'.format(prefix=dirname_in)
transformed_file = wordlist_filename(
'jieba', 'zh-Hans', 'converted.txt'
)
reformatted_file = wordlist_filename(
'jieba', 'zh', 'counts.txt'
)
add_dep(lines, 'simplify_chinese', input_file, transformed_file)
add_dep(lines, 'convert_jieba', transformed_file, reformatted_file)
return lines
# Which columns of the SUBTLEX data files do the word and its frequency appear
# in?
SUBTLEX_COLUMN_MAP = {
@ -222,6 +245,9 @@ SUBTLEX_COLUMN_MAP = {
def subtlex_en_deps(dirname_in, languages):
lines = []
# Either subtlex_en is turned off, or it's just in English
if not languages:
return lines
assert languages == ['en']
regions = ['en-US', 'en-GB']
processed_files = []
@ -253,10 +279,16 @@ def subtlex_other_deps(dirname_in, languages):
output_file = wordlist_filename('subtlex-other', language, 'counts.txt')
textcol, freqcol = SUBTLEX_COLUMN_MAP[language]
if language == 'zh':
step2_file = wordlist_filename('subtlex-other', 'zh-Hans', 'converted.txt')
add_dep(lines, 'simplify_chinese', input_file, step2_file)
else:
step2_file = input_file
# Skip one header line by setting 'startrow' to 2 (because tail is 1-based).
# I hope we don't need to configure this by language anymore.
add_dep(
lines, 'convert_subtlex', input_file, processed_file,
lines, 'convert_subtlex', step2_file, processed_file,
params={'textcol': textcol, 'freqcol': freqcol, 'startrow': 2}
)
add_dep(
@ -276,10 +308,11 @@ def combine_lists(languages):
output_file = wordlist_filename('combined', language)
add_dep(lines, 'merge', input_files, output_file,
extra='wordfreq_builder/word_counts.py',
params={'cutoff': 2})
params={'cutoff': 2, 'lang': language})
output_cBpack = wordlist_filename(
'combined-dist', language, 'msgpack.gz')
'combined-dist', language, 'msgpack.gz'
)
add_dep(lines, 'freqs2cB', output_file, output_cBpack,
extra='wordfreq_builder/word_counts.py',
params={'lang': language})
@ -297,6 +330,12 @@ def combine_lists(languages):
lines.append('default {}'.format(output_cBpack))
# Write a Jieba-compatible frequency file for Chinese tokenization
chinese_combined = wordlist_filename('combined', 'zh')
jieba_output = wordlist_filename('jieba-dist', 'zh')
add_dep(lines, 'counts_to_jieba', chinese_combined, jieba_output,
extra=['wordfreq_builder/word_counts.py', 'wordfreq_builder/cli/counts_to_jieba.py'])
lines.append('default {}'.format(jieba_output))
return lines

View File

@ -32,6 +32,12 @@ def cld2_surface_tokenizer(text):
text = TWITTER_HANDLE_RE.sub('', text)
text = TCO_RE.sub('', text)
lang = cld2_detect_language(text)
# Don't allow tokenization in Chinese when language-detecting, because
# the Chinese tokenizer may not be built yet
if lang == 'zh':
lang = 'en'
tokens = tokenize(text, lang)
return lang, tokens

View File

@ -12,6 +12,7 @@ import regex
# Match common cases of URLs: the schema http:// or https:// followed by
# non-whitespace characters.
URL_RE = regex.compile(r'https?://(?:\S)+')
HAN_RE = regex.compile(r'[\p{Script=Han}]+')
def count_tokens(filename):
@ -42,8 +43,8 @@ def read_values(filename, cutoff=0, lang=None):
If `cutoff` is greater than 0, the csv file must be sorted by value
in descending order.
If lang is given, it will apply language specific preprocessing
operations.
If `lang` is given, it will apply language-specific tokenization to the
words that it reads.
"""
values = defaultdict(float)
total = 0.
@ -79,10 +80,13 @@ def read_freqs(filename, cutoff=0, lang=None):
for word in values:
values[word] /= total
if lang == 'en':
values = correct_apostrophe_trimming(values)
return values
def freqs_to_cBpack(in_filename, out_filename, cutoff=-600, lang=None):
def freqs_to_cBpack(in_filename, out_filename, cutoff=-600):
"""
Convert a csv file of words and their frequencies to a file in the
idiosyncratic 'cBpack' format.
@ -93,7 +97,7 @@ def freqs_to_cBpack(in_filename, out_filename, cutoff=-600, lang=None):
This cutoff should not be stacked with a cutoff in `read_freqs`; doing
so would skew the resulting frequencies.
"""
freqs = read_freqs(in_filename, cutoff=0, lang=lang)
freqs = read_freqs(in_filename, cutoff=0, lang=None)
cBpack = []
for token, freq in freqs.items():
cB = round(math.log10(freq) * 100)
@ -162,3 +166,65 @@ def write_wordlist(freqs, filename, cutoff=1e-8):
break
if not ('"' in word or ',' in word):
writer.writerow([word, str(freq)])
def write_jieba(freqs, filename):
"""
Write a dictionary of frequencies in a format that can be used for Jieba
tokenization of Chinese.
"""
with open(filename, 'w', encoding='utf-8', newline='\n') as outfile:
items = sorted(freqs.items(), key=lambda item: (-item[1], item[0]))
for word, freq in items:
if HAN_RE.search(word):
# Only store this word as a token if it contains at least one
# Han character.
fake_count = round(freq * 1e9)
print('%s %d' % (word, fake_count), file=outfile)
# APOSTROPHE_TRIMMED_PROB represents the probability that this word has had
# "'t" removed from it, based on counts from Twitter, for which we have
# accurate token counts from our own tokenizer.
APOSTROPHE_TRIMMED_PROB = {
'don': 0.99,
'didn': 1.,
'can': 0.35,
'won': 0.74,
'isn': 1.,
'wasn': 1.,
'wouldn': 1.,
'doesn': 1.,
'couldn': 1.,
'ain': 0.99,
'aren': 1.,
'shouldn': 1.,
'haven': 0.96,
'weren': 1.,
'hadn': 1.,
'hasn': 1.,
'mustn': 1.,
'needn': 1.,
}
def correct_apostrophe_trimming(freqs):
"""
If what we got was an English wordlist that has been tokenized with
apostrophes as token boundaries, as indicated by the frequencies of the
words "wouldn" and "couldn", then correct the spurious tokens we get by
adding "'t" in about the proportion we expect to see in the wordlist.
We could also adjust the frequency of "t", but then we would be favoring
the token "s" over it, as "'s" leaves behind no indication when it's been
removed.
"""
if (freqs.get('wouldn', 0) > 1e-6 and freqs.get('couldn', 0) > 1e-6):
print("Applying apostrophe trimming")
for trim_word, trim_prob in APOSTROPHE_TRIMMED_PROB.items():
if trim_word in freqs:
freq = freqs[trim_word]
freqs[trim_word] = freq * (1 - trim_prob)
freqs[trim_word + "'t"] = freq * trim_prob
return freqs
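To make the effect concrete, here is a small worked example with made-up frequencies (not taken from any real wordlist):

from wordfreq_builder.word_counts import correct_apostrophe_trimming

freqs = {'wouldn': 2e-5, 'couldn': 1e-5, 'don': 1e-3}
fixed = correct_apostrophe_trimming(freqs)
# 'wouldn' has trim probability 1.0, so all of its mass moves to "wouldn't":
#   fixed['wouldn'] == 0.0, fixed["wouldn't"] == 2e-5
# 'don' has trim probability 0.99, so it keeps about 1% of its frequency:
#   fixed['don'] is about 1e-5, fixed["don't"] is about 9.9e-4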