Mirror of https://github.com/rspeer/wordfreq.git (synced 2024-12-23 09:21:37 +00:00)

Merge pull request #55 from LuminosoInsight/version2

Version 2, with standalone text pre-processing

commit 18f176dbf6
CHANGELOG.md (63 lines changed)
@@ -1,3 +1,66 @@
+## Version 2.0 (2018-03-14)
+
+The big change in this version is that text preprocessing, tokenization, and
+postprocessing to look up words in a list are separate steps.
+
+If all you need is preprocessing to make text more consistent, use
+`wordfreq.preprocess.preprocess_text(text, lang)`. If you need preprocessing
+and tokenization, use `wordfreq.tokenize(text, lang)` as before. If you need
+all three steps, use the new function `wordfreq.lossy_tokenize(text, lang)`.
+
+As a breaking change, this means that the `tokenize` function no longer has
+the `combine_numbers` option, because that's a postprocessing step. For
+the same behavior, use `lossy_tokenize`, which always combines numbers.
+
+Similarly, `tokenize` will no longer replace Chinese characters with their
+Simplified Chinese version, while `lossy_tokenize` will.
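As an illustration of the three levels (this sketch is not part of the changelog text itself; the outputs shown are the ones asserted by this version's test suite, later in this diff):

    >>> from wordfreq.preprocess import preprocess_text
    >>> from wordfreq import tokenize, lossy_tokenize

    >>> # Preprocessing only: normalize the text without splitting it
    >>> preprocess_text('КИТАБХАНА', 'az')    # Azerbaijani Cyrillic -> Latin
    'kitabxana'

    >>> # Preprocessing + tokenization
    >>> tokenize('"715 - CRΣΣKS" by Bon Iver', 'en')
    ['715', 'crσσks', 'by', 'bon', 'iver']

    >>> # Preprocessing + tokenization + lossy postprocessing (numbers combined)
    >>> lossy_tokenize('"715 - CRΣΣKS" by Bon Iver', 'en')
    ['000', 'crσσks', 'by', 'bon', 'iver']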
+
+Other changes:
+
+- There's a new default wordlist for each language, called "best". This
+  chooses the "large" wordlist for that language, or if that list doesn't
+  exist, it falls back on "small".
+
+- The wordlist formerly named "combined" (this name made sense long ago)
+  is now named "small". "combined" remains as a deprecated alias.
+
+- The "twitter" wordlist has been removed. If you need to compare word
+  frequencies from individual sources, you can work with the separate files in
+  [exquisite-corpus][].
+
+- Tokenizing Chinese will preserve the original characters, no matter whether
+  they are Simplified or Traditional, instead of replacing them all with
+  Simplified characters.
+
+- Different languages require different processing steps, and the decisions
+  about what these steps are now appear in the `wordfreq.language_info` module,
+  replacing a bunch of scattered and inconsistent `if` statements.
+
+- Tokenizing CJK languages while preserving punctuation now has a less confusing
+  implementation.
+
+- The preprocessing step can transliterate Azerbaijani, although we don't yet
+  have wordlists in this language. This is similar to how the tokenizer
+  supports many more languages than the ones with wordlists, making future
+  wordlists possible.
+
+- Speaking of that, the tokenizer will log a warning (once) if you ask to tokenize
+  text written in a script we can't tokenize (such as Thai).
+
+- New source data from [exquisite-corpus][] includes OPUS OpenSubtitles 2018.
+
+Nitty gritty dependency changes:
+
+- Updated the regex dependency to 2018.02.21. (We would love suggestions on
+  how to coexist with other libraries that use other versions of `regex`,
+  without a `>=` requirement that could introduce unexpected data-altering
+  changes.)
+
+- We now depend on `msgpack`, the new name for `msgpack-python`.
+
+[exquisite-corpus]: https://github.com/LuminosoInsight/exquisite-corpus
+
+
 ## Version 1.7.0 (2017-08-25)

 - Tokenization will always keep Unicode graphemes together, including
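To make the "best" fallback concrete, here is a quick illustration; the numbers are the ones that appear in the updated README later in this diff ('zipf' is only in the 'large' English list, which 'best' picks up):

    >>> from wordfreq import zipf_frequency
    >>> zipf_frequency('zipf', 'en')                     # 'best' -> 'large' for English
    1.32
    >>> zipf_frequency('zipf', 'en', wordlist='small')   # not in the 'small' list
    0.0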
README.md (53 lines changed)
@@ -7,7 +7,7 @@ Author: Robyn Speer
## Installation

wordfreq requires Python 3 and depends on a few other Python modules
-(msgpack, langcodes, and ftfy). You can install it and its dependencies
+(msgpack, langcodes, and regex). You can install it and its dependencies
in the usual way, either by getting it from pip:

    pip3 install wordfreq
@@ -23,20 +23,21 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.

## Usage

wordfreq provides access to estimates of the frequency with which a word is
-used, in 27 languages (see *Supported languages* below).
+used, in 35 languages (see *Supported languages* below).

-It provides three kinds of pre-built wordlists:
+It provides both 'small' and 'large' wordlists:

-- `'combined'` lists, containing words that appear at least once per
-  million words, averaged across all data sources.
-- `'twitter'` lists, containing words that appear at least once per
-  million words on Twitter alone.
-- `'large'` lists, containing words that appear at least once per 100
-  million words, averaged across all data sources.
+- The 'small' lists take up very little memory and cover words that appear at
+  least once per million words.
+- The 'large' lists cover words that appear at least once per 100 million
+  words.

-The most straightforward function is:
+The default list is 'best', which uses 'large' if it's available for the
+language, and 'small' otherwise.

-    word_frequency(word, lang, wordlist='combined', minimum=0.0)
+The most straightforward function for looking up frequencies is:
+
+    word_frequency(word, lang, wordlist='best', minimum=0.0)

This function looks up a word's frequency in the given language, returning its
frequency as a decimal between 0 and 1. In these examples, we'll multiply the
@@ -47,10 +48,10 @@ frequencies by a million (1e6) to get more readable numbers:

    11.748975549395302

    >>> word_frequency('café', 'en') * 1e6
-    3.981071705534969
+    3.890451449942805

    >>> word_frequency('cafe', 'fr') * 1e6
-    1.4125375446227555
+    1.4454397707459279

    >>> word_frequency('café', 'fr') * 1e6
    53.70317963702532
@@ -65,25 +66,25 @@ example, and a word with Zipf value 3 appears once per million words.

Reasonable Zipf values are between 0 and 8, but because of the cutoffs
described above, the minimum Zipf value appearing in these lists is 1.0 for the
-'large' wordlists and 3.0 for all others. We use 0 as the default Zipf value
+'large' wordlists and 3.0 for 'small'. We use 0 as the default Zipf value
for words that do not appear in the given wordlist, although it should mean
one occurrence per billion words.

    >>> from wordfreq import zipf_frequency
    >>> zipf_frequency('the', 'en')
-    7.75
+    7.77

    >>> zipf_frequency('word', 'en')
    5.32

    >>> zipf_frequency('frequency', 'en')
-    4.36
+    4.38

    >>> zipf_frequency('zipf', 'en')
-    0.0
+    1.32

-    >>> zipf_frequency('zipf', 'en', wordlist='large')
-    1.28
+    >>> zipf_frequency('zipf', 'en', wordlist='small')
+    0.0

The parameters to `word_frequency` and `zipf_frequency` are:
@@ -95,7 +96,7 @@ The parameters to `word_frequency` and `zipf_frequency` are:

- `lang`: the BCP 47 or ISO 639 code of the language to use, such as 'en'.

- `wordlist`: which set of word frequencies to use. Current options are
-  'combined', 'twitter', and 'large'.
+  'small', 'large', and 'best'.

- `minimum`: If the word is not in the list or has a frequency lower than
  `minimum`, return `minimum` instead. You may want to set this to the minimum
@@ -108,7 +109,7 @@ Other functions:

way that the words in wordfreq's data were counted in the first place. See
*Tokenization*.

-`top_n_list(lang, n, wordlist='combined')` returns the most common *n* words in
+`top_n_list(lang, n, wordlist='best')` returns the most common *n* words in
the list, in descending frequency order.

    >>> from wordfreq import top_n_list
@@ -118,18 +119,18 @@ the list, in descending frequency order.

    >>> top_n_list('es', 10)
    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']

-`iter_wordlist(lang, wordlist='combined')` iterates through all the words in a
+`iter_wordlist(lang, wordlist='best')` iterates through all the words in a
wordlist, in descending frequency order.

-`get_frequency_dict(lang, wordlist='combined')` returns all the frequencies in
+`get_frequency_dict(lang, wordlist='best')` returns all the frequencies in
a wordlist as a dictionary, for cases where you'll want to look up a lot of
words and don't need the wrapper that `word_frequency` provides.

-`supported_languages(wordlist='combined')` returns a dictionary whose keys are
+`supported_languages(wordlist='best')` returns a dictionary whose keys are
language codes, and whose values are the data file that will be loaded to
provide the requested wordlist in each language.

-`random_words(lang='en', wordlist='combined', nwords=5, bits_per_word=12)`
+`random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12)`
returns a selection of random words, separated by spaces. `bits_per_word=n`
will select each random word from 2^n words.
@@ -256,7 +257,7 @@ into multiple tokens:

    >>> zipf_frequency('New York', 'en')
    5.35
    >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.55
+    3.54

The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be. In Chinese,
setup.cfg (new file, 5 lines)
@@ -0,0 +1,5 @@
+[nosetests]
+verbosity=2
+with-doctest=1
+with-coverage=0
+cover-package=wordfreq
setup.py (4 lines changed)
@@ -28,7 +28,7 @@ README_contents = open(os.path.join(current_dir, 'README.md'),
                       encoding='utf-8').read()
doclines = README_contents.split("\n")
dependencies = [
-    'ftfy >= 5', 'msgpack', 'langcodes >= 1.4', 'regex == 2017.07.28'
+    'msgpack', 'langcodes >= 1.4.1', 'regex == 2018.02.21'
]
if sys.version_info < (3, 4):
    dependencies.append('pathlib')
@@ -36,7 +36,7 @@ if sys.version_info < (3, 4):

setup(
    name="wordfreq",
-    version='1.7.0',
+    version='2.0',
    maintainer='Luminoso Technologies, Inc.',
    maintainer_email='info@luminoso.com',
    url='http://github.com/LuminosoInsight/wordfreq/',
(diff continues in another file; filename not shown in this capture)
@@ -1,9 +1,9 @@
from wordfreq import (
    word_frequency, available_languages, cB_to_freq,
-    top_n_list, random_words, random_ascii_words, tokenize
+    top_n_list, random_words, random_ascii_words, tokenize, lossy_tokenize
)
from nose.tools import (
-    eq_, assert_almost_equal, assert_greater, raises
+    eq_, assert_almost_equal, assert_greater, raises, assert_not_equal
)
@@ -15,35 +15,29 @@ def test_freq_examples():
    assert_greater(word_frequency('de', 'es'),
                   word_frequency('the', 'es'))

+    # We get word frequencies from the 'large' list when available
+    assert_greater(word_frequency('infrequency', 'en'), 0.)

-# To test the reasonableness of the Twitter list, we want to look up a
-# common word representing laughter in each language. The default for
-# languages not listed here is 'haha'.
-LAUGHTER_WORDS = {
-    'en': 'lol',
-    'hi': 'lol',
-    'cs': 'lol',
-    'ru': 'лол',
-    'zh': '笑',
-    'ja': '笑',
-    'ar': 'ﻪﻬﻬﻬﻫ',
-    'fa': 'خخخخ',
-    'ca': 'jaja',
-    'es': 'jaja',
-    'fr': 'ptdr',
-    'pt': 'kkkk',
-    'he': 'חחח',
-    'bg': 'ахаха',
-    'uk': 'хаха',
-    'bn': 'হা হা',
-    'mk': 'хаха'
-}

def test_languages():
-    # Make sure the number of available languages doesn't decrease
+    # Make sure we get all the languages when looking for the default
+    # 'best' wordlist
    avail = available_languages()
-    assert_greater(len(avail), 26)
+    assert_greater(len(avail), 32)
+
+    # 'small' covers the same languages, but with some different lists
+    avail_small = available_languages('small')
+    eq_(len(avail_small), len(avail))
+    assert_not_equal(avail_small, avail)
+
+    # 'combined' is the same as 'small'
+    avail_old_name = available_languages('combined')
+    eq_(avail_old_name, avail_small)
+
+    # 'large' covers fewer languages
+    avail_large = available_languages('large')
+    assert_greater(len(avail_large), 12)
+    assert_greater(len(avail), len(avail_large))

    # Look up the digit '2' in the main word list for each language
    for lang in avail:
@@ -55,17 +49,6 @@ def test_languages():
        assert_greater(word_frequency('2', new_lang_code), 0, new_lang_code)


-def test_twitter():
-    avail = available_languages('twitter')
-    assert_greater(len(avail), 15)
-
-    for lang in avail:
-        assert_greater(word_frequency('rt', lang, 'twitter'),
-                       word_frequency('rt', lang, 'combined'))
-        text = LAUGHTER_WORDS.get(lang, 'haha')
-        assert_greater(word_frequency(text, lang, wordlist='twitter'), 0, (text, lang))
-
-
def test_minimums():
    eq_(word_frequency('esquivalience', 'en'), 0)
    eq_(word_frequency('esquivalience', 'en', minimum=1e-6), 1e-6)
@@ -164,13 +147,13 @@ def test_casefolding():
def test_number_smashing():
    eq_(tokenize('"715 - CRΣΣKS" by Bon Iver', 'en'),
        ['715', 'crσσks', 'by', 'bon', 'iver'])
-    eq_(tokenize('"715 - CRΣΣKS" by Bon Iver', 'en', combine_numbers=True),
+    eq_(lossy_tokenize('"715 - CRΣΣKS" by Bon Iver', 'en'),
        ['000', 'crσσks', 'by', 'bon', 'iver'])
-    eq_(tokenize('"715 - CRΣΣKS" by Bon Iver', 'en', combine_numbers=True, include_punctuation=True),
+    eq_(lossy_tokenize('"715 - CRΣΣKS" by Bon Iver', 'en', include_punctuation=True),
        ['"', '000', '-', 'crσσks', '"', 'by', 'bon', 'iver'])
-    eq_(tokenize('1', 'en', combine_numbers=True), ['1'])
-    eq_(tokenize('3.14', 'en', combine_numbers=True), ['0.00'])
-    eq_(tokenize('24601', 'en', combine_numbers=True), ['00000'])
+    eq_(lossy_tokenize('1', 'en'), ['1'])
+    eq_(lossy_tokenize('3.14', 'en'), ['0.00'])
+    eq_(lossy_tokenize('24601', 'en'), ['00000'])
    eq_(word_frequency('24601', 'en'), word_frequency('90210', 'en'))
@@ -231,6 +214,7 @@ def test_ideographic_fallback():
        ['ひらがな', 'カタカナ', 'romaji']
    )


def test_other_languages():
    # Test that we leave Thai letters stuck together. If we had better Thai support,
    # we would actually split this into a three-word phrase.
(diff continues in another file; filename not shown in this capture)
@@ -55,10 +55,19 @@ def test_tokens():
        ]
    )

-    # You match the same tokens if you look it up in Traditional Chinese.
-    eq_(tokenize(fact_simplified, 'zh'), tokenize(fact_traditional, 'zh'))
+    # Check that Traditional Chinese works at all
    assert_greater(word_frequency(fact_traditional, 'zh'), 0)

+    # You get the same token lengths if you look it up in Traditional Chinese,
+    # but the words are different
+    simp_tokens = tokenize(fact_simplified, 'zh', include_punctuation=True)
+    trad_tokens = tokenize(fact_traditional, 'zh', include_punctuation=True)
+    eq_(''.join(simp_tokens), fact_simplified)
+    eq_(''.join(trad_tokens), fact_traditional)
+    simp_lengths = [len(token) for token in simp_tokens]
+    trad_lengths = [len(token) for token in trad_tokens]
+    eq_(simp_lengths, trad_lengths)


def test_combination():
    xiexie_freq = word_frequency('谢谢', 'zh')  # "Thanks"
@@ -83,5 +92,3 @@ def test_alternate_codes():
    # Separate codes for Mandarin and Cantonese
    eq_(tokenize('谢谢谢谢', 'cmn'), tokens)
    eq_(tokenize('谢谢谢谢', 'yue'), tokens)
-
-
(diff continues in another file; filename not shown in this capture)
@@ -1,5 +1,6 @@
from nose.tools import eq_
from wordfreq import tokenize
+from wordfreq.preprocess import preprocess_text


def test_transliteration():
@@ -10,6 +11,21 @@ def test_transliteration():
    eq_(tokenize("Pa, ima tu mnogo stvari koje ne shvataš.", 'sr'),
        ['pa', 'ima', 'tu', 'mnogo', 'stvari', 'koje', 'ne', 'shvataš'])

+    # I don't have examples of complete sentences in Azerbaijani that are
+    # naturally in Cyrillic, because it turns out everyone writes Azerbaijani
+    # in Latin letters on the Internet, _except_ sometimes for Wiktionary.
+    # So here are some individual words.
+
+    # 'library' in Azerbaijani Cyrillic
+    eq_(preprocess_text('китабхана', 'az'), 'kitabxana')
+    eq_(preprocess_text('КИТАБХАНА', 'az'), 'kitabxana')
+    eq_(preprocess_text('KİTABXANA', 'az'), 'kitabxana')
+
+    # 'scream' in Azerbaijani Cyrillic
+    eq_(preprocess_text('бағырты', 'az'), 'bağırtı')
+    eq_(preprocess_text('БАҒЫРТЫ', 'az'), 'bağırtı')
+    eq_(preprocess_text('BAĞIRTI', 'az'), 'bağırtı')


def test_actually_russian():
    # This looks mostly like Serbian, but was probably actually Russian.
(diff continues in another file; filename not shown in this capture)
@@ -1,4 +1,3 @@
-from wordfreq.tokens import tokenize, simple_tokenize
from pkg_resources import resource_filename
from functools import lru_cache
import langcodes
@@ -10,18 +9,15 @@ import random
import logging
import math

+from .tokens import tokenize, simple_tokenize, lossy_tokenize
+from .language_info import get_language_info
+
logger = logging.getLogger(__name__)

CACHE_SIZE = 100000
DATA_PATH = pathlib.Path(resource_filename('wordfreq', 'data'))

-# Chinese and Japanese are written without spaces. In Chinese, in particular,
-# we have to infer word boundaries from the frequencies of the words they
-# would create. When this happens, we should adjust the resulting frequency
-# to avoid creating a bias toward improbable word combinations.
-INFERRED_SPACE_LANGUAGES = {'zh'}
-
# We'll divide the frequency by 10 for each token boundary that was inferred.
# (We determined the factor of 10 empirically by looking at words in the
# Chinese wordlist that weren't common enough to be identified by the
@@ -30,8 +26,9 @@ INFERRED_SPACE_LANGUAGES = {'zh'}
# frequency.)
INFERRED_SPACE_FACTOR = 10.0

-# simple_tokenize is imported so that other things can import it from here.
-# Suppress the pyflakes warning.
+# tokenize and simple_tokenize are imported so that other things can import
+# them from here. Suppress the pyflakes warning.
+tokenize = tokenize
simple_tokenize = simple_tokenize
@@ -87,11 +84,21 @@ def read_cBpack(filename):
    return data[1:]


-def available_languages(wordlist='combined'):
+def available_languages(wordlist='best'):
    """
-    List the languages (as language-code strings) that the wordlist of a given
-    name is available in.
+    Given a wordlist name, return a dictionary of language codes to filenames,
+    representing all the languages in which that wordlist is available.
    """
+    if wordlist == 'best':
+        available = available_languages('small')
+        available.update(available_languages('large'))
+        return available
+    elif wordlist == 'combined':
+        logger.warning(
+            "The 'combined' wordlists have been renamed to 'small'."
+        )
+        wordlist = 'small'
+
    available = {}
    for path in DATA_PATH.glob('*.msgpack.gz'):
        if not path.name.startswith('_'):
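As a usage sketch (not part of the diff itself), the relationships this new code establishes, and which test_languages earlier in this diff checks, look like this; exact language counts are omitted because they depend on the shipped data:

    >>> from wordfreq import available_languages

    >>> # 'best' covers the same languages as 'small', preferring 'large' files
    >>> avail = available_languages()        # same as available_languages('best')
    >>> sorted(avail) == sorted(available_languages('small'))
    True
    >>> len(available_languages('large')) < len(avail)
    True

    >>> # The old name 'combined' still works, with a deprecation warning
    >>> available_languages('combined') == available_languages('small')
    True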
@@ -103,7 +110,7 @@ def available_languages(wordlist='combined'):


@lru_cache(maxsize=None)
-def get_frequency_list(lang, wordlist='combined', match_cutoff=30):
+def get_frequency_list(lang, wordlist='best', match_cutoff=30):
    """
    Read the raw data from a wordlist file, returning it as a list of
    lists. (See `read_cBpack` for what this represents.)
@@ -117,7 +124,8 @@ def get_frequency_list(lang, wordlist='combined', match_cutoff=30):
    best, score = langcodes.best_match(lang, list(available),
                                       min_score=match_cutoff)
    if score == 0:
-        raise LookupError("No wordlist available for language %r" % lang)
+        raise LookupError("No wordlist %r available for language %r"
+                          % (wordlist, lang))

    if best != lang:
        logger.warning(
|
|||||||
|
|
||||||
|
|
||||||
@lru_cache(maxsize=None)
|
@lru_cache(maxsize=None)
|
||||||
def get_frequency_dict(lang, wordlist='combined', match_cutoff=30):
|
def get_frequency_dict(lang, wordlist='best', match_cutoff=30):
|
||||||
"""
|
"""
|
||||||
Get a word frequency list as a dictionary, mapping tokens to
|
Get a word frequency list as a dictionary, mapping tokens to
|
||||||
frequencies as floating-point probabilities.
|
frequencies as floating-point probabilities.
|
||||||
@@ -198,7 +206,7 @@ def get_frequency_dict(lang, wordlist='combined', match_cutoff=30):
    return freqs


-def iter_wordlist(lang, wordlist='combined'):
+def iter_wordlist(lang, wordlist='best'):
    """
    Yield the words in a wordlist in approximate descending order of
    frequency.
@@ -215,8 +223,9 @@ def iter_wordlist(lang, wordlist='combined'):
# it takes to look up frequencies from scratch, so something faster is needed.
_wf_cache = {}


def _word_frequency(word, lang, wordlist, minimum):
-    tokens = tokenize(word, lang, combine_numbers=True)
+    tokens = lossy_tokenize(word, lang)
    if not tokens:
        return minimum
@@ -234,39 +243,31 @@ def _word_frequency(word, lang, wordlist, minimum):

    freq = 1.0 / one_over_result

-    if lang in INFERRED_SPACE_LANGUAGES:
+    if get_language_info(lang)['tokenizer'] == 'jieba':
+        # If we used the Jieba tokenizer, we could tokenize anything to match
+        # our wordlist, even nonsense. To counteract this, we multiply by a
+        # probability for each word break that was inferred.
        freq /= INFERRED_SPACE_FACTOR ** (len(tokens) - 1)

    return max(freq, minimum)
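A rough worked example of that adjustment (hypothetical numbers; this assumes, as the surrounding code suggests, that `one_over_result` accumulates the reciprocal of each token's frequency):

    # Suppose a Chinese query is split by Jieba into two tokens with
    # frequencies 1e-3 and 1e-4.
    one_over_result = 1 / 1e-3 + 1 / 1e-4   # 11000.0
    freq = 1.0 / one_over_result            # ~9.09e-5, the half-harmonic mean
    freq /= 10.0 ** (2 - 1)                 # one inferred boundary: ~9.09e-6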
-def word_frequency(word, lang, wordlist='combined', minimum=0.):
+def word_frequency(word, lang, wordlist='best', minimum=0.):
    """
    Get the frequency of `word` in the language with code `lang`, from the
-    specified `wordlist`. The default wordlist is 'combined', built from
-    whichever of these five sources have sufficient data for the language:
-
-    - Full text of Wikipedia
-    - A sample of 72 million tweets collected from Twitter in 2014,
-      divided roughly into languages using automatic language detection
-    - Frequencies extracted from OpenSubtitles
-    - The Leeds Internet Corpus
-    - Google Books Syntactic Ngrams 2013
-
-    Another available wordlist is 'twitter', which uses only the data from
-    Twitter.
-
-    Words that we believe occur at least once per million tokens, based on
-    the average of these lists, will appear in the word frequency list.
+    specified `wordlist`.
+
+    These wordlists can be specified:
+
+    - 'large': a wordlist built from at least 5 sources, containing word
+      frequencies of 10^-8 and higher
+    - 'small': a wordlist built from at least 3 sources, containing word
+      frequencies of 10^-6 and higher
+    - 'best': uses 'large' if available, and 'small' otherwise

    The value returned will always be at least as large as `minimum`.
-
-    If a word decomposes into multiple tokens, we'll return a smoothed estimate
-    of the word frequency that is no greater than the frequency of any of its
-    individual tokens.
-
-    It should be noted that the current tokenizer does not support
-    multi-word Chinese phrases.
+    You could set this value to 10^-8, for example, to return 10^-8 for
+    unknown words in the 'large' list instead of 0, avoiding a discontinuity.
    """
    args = (word, lang, wordlist, minimum)
    try:
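For illustration, the `minimum` behavior described in the new docstring works like this (the values are the ones asserted by test_minimums earlier in this diff):

    from wordfreq import word_frequency

    word_frequency('esquivalience', 'en')                 # 0: not in any wordlist
    word_frequency('esquivalience', 'en', minimum=1e-6)   # 1e-6: the floor is returned instead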
@@ -278,7 +279,7 @@ def word_frequency(word, lang, wordlist='combined', minimum=0.):
        return _wf_cache[args]


-def zipf_frequency(word, lang, wordlist='combined', minimum=0.):
+def zipf_frequency(word, lang, wordlist='best', minimum=0.):
    """
    Get the frequency of `word`, in the language with code `lang`, on the Zipf
    scale.
@@ -306,7 +307,7 @@ def zipf_frequency(word, lang, wordlist='combined', minimum=0.):


@lru_cache(maxsize=100)
-def top_n_list(lang, n, wordlist='combined', ascii_only=False):
+def top_n_list(lang, n, wordlist='best', ascii_only=False):
    """
    Return a frequency list of length `n` in descending order of frequency.
    This list contains words from `wordlist`, of the given language.
@@ -321,7 +322,7 @@ def top_n_list(lang, n, wordlist='combined', ascii_only=False):
    return results


-def random_words(lang='en', wordlist='combined', nwords=5, bits_per_word=12,
+def random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12,
                 ascii_only=False):
    """
    Returns a string of random, space separated words.
@@ -346,7 +347,7 @@ def random_words(lang='en', wordlist='combined', nwords=5, bits_per_word=12,
    return ' '.join([random.choice(choices) for i in range(nwords)])


-def random_ascii_words(lang='en', wordlist='combined', nwords=5,
+def random_ascii_words(lang='en', wordlist='best', nwords=5,
                       bits_per_word=12):
    """
    Returns a string of random, space separated, ASCII words.
(diff continues in another file; filename not shown in this capture)
@@ -49,4 +49,11 @@ def jieba_tokenize(text, external_wordlist=False):
    else:
        if jieba_tokenizer is None:
            jieba_tokenizer = jieba.Tokenizer(dictionary=DICT_FILENAME)
-        return jieba_tokenizer.lcut(simplify_chinese(text), HMM=False)
+
+        # Tokenize the Simplified Chinese version of the text, but return
+        # those spans from the original text, even if it's in Traditional
+        # Chinese
+        tokens = []
+        for _token, start, end in jieba_tokenizer.tokenize(simplify_chinese(text), HMM=False):
+            tokens.append(text[start:end])
+        return tokens
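A brief sketch of the effect (the segmentation shown is hypothetical; the invariant that the tokens join back into the original string is what the Chinese tests earlier in this diff assert):

    # The Simplified conversion is used only to find token boundaries; the
    # returned tokens are slices of the original text, so Traditional
    # characters are no longer rewritten as Simplified.
    #
    #   jieba_tokenize('谢谢你')  ->  ['谢谢', '你']    (hypothetical split)
    #   jieba_tokenize('謝謝你')  ->  ['謝謝', '你']    (Traditional preserved)
    #
    # In both cases ''.join(tokens) == the original input.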
The remaining changes are binary files, which the diff viewer does not display.
New 'small' wordlist files were added for these languages:

    wordfreq/data/small_{ar,bg,bn,ca,cs,da,de,el,en,es,fa,fi,fr,he,hi,hu,id,it,
    ja,ko,mk,ms,nb,nl,pl,pt,ro,ru,sh,sv,tr,uk,zh}.msgpack.gz

A number of other binary files changed (contents not shown), one file's diff
was suppressed because it was too large, and some files were not shown because
too many files have changed in this diff.