Merge pull request #55 from LuminosoInsight/version2

Version 2, with standalone text pre-processing
Lance Nathan 2018-03-15 14:26:49 -04:00 committed by GitHub
commit 18f176dbf6
123 changed files with 26075 additions and 25716 deletions

View File

@ -1,3 +1,66 @@
## Version 2.0 (2018-03-14)
The big change in this version is that text preprocessing, tokenization, and
postprocessing to look up words in a list are separate steps.
If all you need is preprocessing to make text more consistent, use
`wordfreq.preprocess.preprocess_text(text, lang)`. If you need preprocessing
and tokenization, use `wordfreq.tokenize(text, lang)` as before. If you need
all three steps, use the new function `wordfreq.lossy_tokenize(text, lang)`.
As a breaking change, this means that the `tokenize` function no longer has
the `combine_numbers` option, because that's a postprocessing step. For
the same behavior, use `lossy_tokenize`, which always combines numbers.
Similarly, `tokenize` will no longer replace Chinese characters with their
Simplified Chinese version, while `lossy_tokenize` will.
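For example, here is a minimal sketch of the three levels (the `tokenize` and
`lossy_tokenize` outputs come from this release's test suite; the
`preprocess_text` result assumes that English preprocessing simply case-folds here):

    from wordfreq import tokenize, lossy_tokenize
    from wordfreq.preprocess import preprocess_text

    # Step 1 only: preprocessing (normalization such as case-folding)
    preprocess_text('CAFÉ', 'en')                    # 'café' (assumed)

    # Steps 1-2: preprocessing plus tokenization
    tokenize('"715 - CRΣΣKS" by Bon Iver', 'en')
    # ['715', 'crσσks', 'by', 'bon', 'iver']

    # Steps 1-3: adds the lossy postprocessing, e.g. smashing digits to '0'
    lossy_tokenize('"715 - CRΣΣKS" by Bon Iver', 'en')
    # ['000', 'crσσks', 'by', 'bon', 'iver']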
Other changes:
- There's a new default wordlist for each language, called "best". This
chooses the "large" wordlist for that language, or if that list doesn't
exist, it falls back on "small".
- The wordlist formerly named "combined" (this name made sense long ago)
is now named "small". "combined" remains as a deprecated alias.
- The "twitter" wordlist has been removed. If you need to compare word
frequencies from individual sources, you can work with the separate files in
[exquisite-corpus][].
- Tokenizing Chinese will preserve the original characters, no matter whether
they are Simplified or Traditional, instead of replacing them all with
Simplified characters.
- Different languages require different processing steps, and the decisions
about which steps each language needs now live in the `wordfreq.language_info`
module, replacing a bunch of scattered and inconsistent `if` statements.
- Tokenizing CJK languages while preserving punctuation now has a less confusing
implementation.
- The preprocessing step can transliterate Azerbaijani, although we don't yet
have wordlists in this language. This is similar to how the tokenizer
supports many more languages than the ones with wordlists, making future
wordlists possible.
- Speaking of that, the tokenizer will log a warning (once) if you ask to tokenize
text written in a script we can't tokenize (such as Thai).
- New source data from [exquisite-corpus][] includes OPUS OpenSubtitles 2018.
Nitty-gritty dependency changes:
- Updated the regex dependency to 2018.02.21. (We would love suggestions on
how to coexist with other libraries that use other versions of `regex`,
without a `>=` requirement that could introduce unexpected data-altering
changes.)
- We now depend on `msgpack`, the new name for `msgpack-python`.
[exquisite-corpus]: https://github.com/LuminosoInsight/exquisite-corpus
## Version 1.7.0 (2017-08-25)
- Tokenization will always keep Unicode graphemes together, including

View File

@ -7,7 +7,7 @@ Author: Robyn Speer
## Installation
wordfreq requires Python 3 and depends on a few other Python modules
-(msgpack, langcodes, and ftfy). You can install it and its dependencies
+(msgpack, langcodes, and regex). You can install it and its dependencies
in the usual way, either by getting it from pip:
    pip3 install wordfreq
@ -23,20 +23,21 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
## Usage
wordfreq provides access to estimates of the frequency with which a word is
-used, in 27 languages (see *Supported languages* below).
+used, in 35 languages (see *Supported languages* below).
-It provides three kinds of pre-built wordlists:
-- `'combined'` lists, containing words that appear at least once per
-  million words, averaged across all data sources.
-- `'twitter'` lists, containing words that appear at least once per
-  million words on Twitter alone.
-- `'large'` lists, containing words that appear at least once per 100
-  million words, averaged across all data sources.
-The most straightforward function is:
-    word_frequency(word, lang, wordlist='combined', minimum=0.0)
+It provides both 'small' and 'large' wordlists:
+- The 'small' lists take up very little memory and cover words that appear at
+  least once per million words.
+- The 'large' lists cover words that appear at least once per 100 million
+  words.
+The default list is 'best', which uses 'large' if it's available for the
+language, and 'small' otherwise.
+The most straightforward function for looking up frequencies is:
+    word_frequency(word, lang, wordlist='best', minimum=0.0)
This function looks up a word's frequency in the given language, returning its
frequency as a decimal between 0 and 1. In these examples, we'll multiply the
@ -47,10 +48,10 @@ frequencies by a million (1e6) to get more readable numbers:
11.748975549395302
>>> word_frequency('café', 'en') * 1e6
-3.981071705534969
+3.890451449942805
>>> word_frequency('cafe', 'fr') * 1e6
-1.4125375446227555
+1.4454397707459279
>>> word_frequency('café', 'fr') * 1e6
53.70317963702532
@ -65,25 +66,25 @@ example, and a word with Zipf value 3 appears once per million words.
Reasonable Zipf values are between 0 and 8, but because of the cutoffs
described above, the minimum Zipf value appearing in these lists is 1.0 for the
-'large' wordlists and 3.0 for all others. We use 0 as the default Zipf value
+'large' wordlists and 3.0 for 'small'. We use 0 as the default Zipf value
for words that do not appear in the given wordlist, although it should mean
one occurrence per billion words.
>>> from wordfreq import zipf_frequency
>>> zipf_frequency('the', 'en')
-7.75
+7.77
>>> zipf_frequency('word', 'en')
5.32
>>> zipf_frequency('frequency', 'en')
-4.36
+4.38
>>> zipf_frequency('zipf', 'en')
-0.0
+1.32
->>> zipf_frequency('zipf', 'en', wordlist='large')
-1.28
+>>> zipf_frequency('zipf', 'en', wordlist='small')
+0.0
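The Zipf scale is just a shifted base-10 logarithm of the frequency. A minimal
sketch of that relationship (`zipf_from_frequency` is a hypothetical helper
mirroring what `freq_to_zipf` in wordfreq presumably computes, given that Zipf 0
corresponds to one occurrence per billion words):

    import math

    def zipf_from_frequency(freq):
        # Zipf value = log10 of the number of occurrences per billion words,
        # so a frequency of 1e-6 (once per million words) maps to 3.0.
        return math.log10(freq * 1e9)

    round(zipf_from_frequency(1e-6), 2)   # 3.0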
The parameters to `word_frequency` and `zipf_frequency` are:
@ -95,7 +96,7 @@ The parameters to `word_frequency` and `zipf_frequency` are:
- `lang`: the BCP 47 or ISO 639 code of the language to use, such as 'en'.
- `wordlist`: which set of word frequencies to use. Current options are
-  'combined', 'twitter', and 'large'.
+  'small', 'large', and 'best'.
- `minimum`: If the word is not in the list or has a frequency lower than
  `minimum`, return `minimum` instead. You may want to set this to the minimum
@ -108,7 +109,7 @@ Other functions:
way that the words in wordfreq's data were counted in the first place. See
*Tokenization*.
-`top_n_list(lang, n, wordlist='combined')` returns the most common *n* words in
+`top_n_list(lang, n, wordlist='best')` returns the most common *n* words in
the list, in descending frequency order.
>>> from wordfreq import top_n_list
@ -118,18 +119,18 @@ the list, in descending frequency order.
>>> top_n_list('es', 10)
['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']
-`iter_wordlist(lang, wordlist='combined')` iterates through all the words in a
+`iter_wordlist(lang, wordlist='best')` iterates through all the words in a
wordlist, in descending frequency order.
-`get_frequency_dict(lang, wordlist='combined')` returns all the frequencies in
+`get_frequency_dict(lang, wordlist='best')` returns all the frequencies in
a wordlist as a dictionary, for cases where you'll want to look up a lot of
words and don't need the wrapper that `word_frequency` provides.
-`supported_languages(wordlist='combined')` returns a dictionary whose keys are
+`supported_languages(wordlist='best')` returns a dictionary whose keys are
language codes, and whose values are the data file that will be loaded to
provide the requested wordlist in each language.
-`random_words(lang='en', wordlist='combined', nwords=5, bits_per_word=12)`
+`random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12)`
returns a selection of random words, separated by spaces. `bits_per_word=n`
will select each random word from 2^n words.
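As a minimal sketch of what `bits_per_word` implies (the output is random by
construction, so any specific words shown would be illustrative only):

    from wordfreq import random_words

    # With bits_per_word=12, each of the 5 words is drawn from the
    # top 2^12 = 4096 words of the 'best' English wordlist.
    print(random_words(lang='en', bits_per_word=12))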
@ -256,7 +257,7 @@ into multiple tokens:
>>> zipf_frequency('New York', 'en')
5.35
>>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-3.55
+3.54
The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be. In Chinese,
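A minimal sketch of that combination (`combine_frequencies` is a hypothetical
helper; the actual logic lives in `_word_frequency`, and the factor-of-10
penalty per inferred token boundary applies only to languages tokenized with
Jieba, as shown later in this diff):

    def combine_frequencies(token_freqs, inferred_boundaries=0):
        # Half-harmonic-mean: for two tokens with frequencies f1 and f2,
        # this yields 1 / (1/f1 + 1/f2), half of their harmonic mean.
        one_over_result = sum(1.0 / f for f in token_freqs)
        freq = 1.0 / one_over_result
        # Divide by 10 for each token boundary the tokenizer had to infer.
        return freq / (10.0 ** inferred_boundaries)

    combine_frequencies([1e-4, 1e-4])   # 5e-05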

setup.cfg (new file, 5 additions)
View File

@ -0,0 +1,5 @@
[nosetests]
verbosity=2
with-doctest=1
with-coverage=0
cover-package=wordfreq

View File

@ -28,7 +28,7 @@ README_contents = open(os.path.join(current_dir, 'README.md'),
                        encoding='utf-8').read()
doclines = README_contents.split("\n")
dependencies = [
-    'ftfy >= 5', 'msgpack', 'langcodes >= 1.4', 'regex == 2017.07.28'
+    'msgpack', 'langcodes >= 1.4.1', 'regex == 2018.02.21'
]
if sys.version_info < (3, 4):
    dependencies.append('pathlib')
@ -36,7 +36,7 @@ if sys.version_info < (3, 4):
setup(
    name="wordfreq",
-    version='1.7.0',
+    version='2.0',
    maintainer='Luminoso Technologies, Inc.',
    maintainer_email='info@luminoso.com',
    url='http://github.com/LuminosoInsight/wordfreq/',

View File

@ -1,9 +1,9 @@
from wordfreq import (
    word_frequency, available_languages, cB_to_freq,
-    top_n_list, random_words, random_ascii_words, tokenize
+    top_n_list, random_words, random_ascii_words, tokenize, lossy_tokenize
)
from nose.tools import (
-    eq_, assert_almost_equal, assert_greater, raises
+    eq_, assert_almost_equal, assert_greater, raises, assert_not_equal
)
@ -15,35 +15,29 @@ def test_freq_examples():
    assert_greater(word_frequency('de', 'es'),
                   word_frequency('the', 'es'))
+    # We get word frequencies from the 'large' list when available
+    assert_greater(word_frequency('infrequency', 'en'), 0.)
-# To test the reasonableness of the Twitter list, we want to look up a
-# common word representing laughter in each language. The default for
-# languages not listed here is 'haha'.
-LAUGHTER_WORDS = {
-    'en': 'lol',
-    'hi': 'lol',
-    'cs': 'lol',
-    'ru': 'лол',
-    'zh': '',
-    'ja': '',
-    'ar': '',
-    'fa': 'خخخخ',
-    'ca': 'jaja',
-    'es': 'jaja',
-    'fr': 'ptdr',
-    'pt': 'kkkk',
-    'he': 'חחח',
-    'bg': 'ахаха',
-    'uk': 'хаха',
-    'bn': 'হা হা',
-    'mk': 'хаха'
-}
def test_languages():
-    # Make sure the number of available languages doesn't decrease
+    # Make sure we get all the languages when looking for the default
+    # 'best' wordlist
    avail = available_languages()
-    assert_greater(len(avail), 26)
+    assert_greater(len(avail), 32)
+    # 'small' covers the same languages, but with some different lists
+    avail_small = available_languages('small')
+    eq_(len(avail_small), len(avail))
+    assert_not_equal(avail_small, avail)
+    # 'combined' is the same as 'small'
+    avail_old_name = available_languages('combined')
+    eq_(avail_old_name, avail_small)
+    # 'large' covers fewer languages
+    avail_large = available_languages('large')
+    assert_greater(len(avail_large), 12)
+    assert_greater(len(avail), len(avail_large))
    # Look up the digit '2' in the main word list for each language
    for lang in avail:
@ -55,17 +49,6 @@ def test_languages():
        assert_greater(word_frequency('2', new_lang_code), 0, new_lang_code)
-def test_twitter():
-    avail = available_languages('twitter')
-    assert_greater(len(avail), 15)
-    for lang in avail:
-        assert_greater(word_frequency('rt', lang, 'twitter'),
-                       word_frequency('rt', lang, 'combined'))
-        text = LAUGHTER_WORDS.get(lang, 'haha')
-        assert_greater(word_frequency(text, lang, wordlist='twitter'), 0, (text, lang))
def test_minimums():
    eq_(word_frequency('esquivalience', 'en'), 0)
    eq_(word_frequency('esquivalience', 'en', minimum=1e-6), 1e-6)
@ -164,13 +147,13 @@ def test_casefolding():
def test_number_smashing():
    eq_(tokenize('"715 - CRΣΣKS" by Bon Iver', 'en'),
        ['715', 'crσσks', 'by', 'bon', 'iver'])
-    eq_(tokenize('"715 - CRΣΣKS" by Bon Iver', 'en', combine_numbers=True),
+    eq_(lossy_tokenize('"715 - CRΣΣKS" by Bon Iver', 'en'),
        ['000', 'crσσks', 'by', 'bon', 'iver'])
-    eq_(tokenize('"715 - CRΣΣKS" by Bon Iver', 'en', combine_numbers=True, include_punctuation=True),
+    eq_(lossy_tokenize('"715 - CRΣΣKS" by Bon Iver', 'en', include_punctuation=True),
        ['"', '000', '-', 'crσσks', '"', 'by', 'bon', 'iver'])
-    eq_(tokenize('1', 'en', combine_numbers=True), ['1'])
-    eq_(tokenize('3.14', 'en', combine_numbers=True), ['0.00'])
-    eq_(tokenize('24601', 'en', combine_numbers=True), ['00000'])
+    eq_(lossy_tokenize('1', 'en'), ['1'])
+    eq_(lossy_tokenize('3.14', 'en'), ['0.00'])
+    eq_(lossy_tokenize('24601', 'en'), ['00000'])
    eq_(word_frequency('24601', 'en'), word_frequency('90210', 'en'))
@ -231,6 +214,7 @@ def test_ideographic_fallback():
        ['ひらがな', 'カタカナ', 'romaji']
    )
def test_other_languages():
    # Test that we leave Thai letters stuck together. If we had better Thai support,
    # we would actually split this into a three-word phrase.

View File

@ -55,10 +55,19 @@ def test_tokens():
        ]
    )
-    # You match the same tokens if you look it up in Traditional Chinese.
-    eq_(tokenize(fact_simplified, 'zh'), tokenize(fact_traditional, 'zh'))
+    # Check that Traditional Chinese works at all
    assert_greater(word_frequency(fact_traditional, 'zh'), 0)
+    # You get the same token lengths if you look it up in Traditional Chinese,
+    # but the words are different
+    simp_tokens = tokenize(fact_simplified, 'zh', include_punctuation=True)
+    trad_tokens = tokenize(fact_traditional, 'zh', include_punctuation=True)
+    eq_(''.join(simp_tokens), fact_simplified)
+    eq_(''.join(trad_tokens), fact_traditional)
+    simp_lengths = [len(token) for token in simp_tokens]
+    trad_lengths = [len(token) for token in trad_tokens]
+    eq_(simp_lengths, trad_lengths)
def test_combination():
    xiexie_freq = word_frequency('谢谢', 'zh')   # "Thanks"
@ -83,5 +92,3 @@ def test_alternate_codes():
    # Separate codes for Mandarin and Cantonese
    eq_(tokenize('谢谢谢谢', 'cmn'), tokens)
    eq_(tokenize('谢谢谢谢', 'yue'), tokens)

View File

@ -1,5 +1,6 @@
from nose.tools import eq_
from wordfreq import tokenize
+from wordfreq.preprocess import preprocess_text
def test_transliteration():
@ -10,6 +11,21 @@ def test_transliteration():
eq_(tokenize("Pa, ima tu mnogo stvari koje ne shvataš.", 'sr'), eq_(tokenize("Pa, ima tu mnogo stvari koje ne shvataš.", 'sr'),
['pa', 'ima', 'tu', 'mnogo', 'stvari', 'koje', 'ne', 'shvataš']) ['pa', 'ima', 'tu', 'mnogo', 'stvari', 'koje', 'ne', 'shvataš'])
# I don't have examples of complete sentences in Azerbaijani that are
# naturally in Cyrillic, because it turns out everyone writes Azerbaijani
# in Latin letters on the Internet, _except_ sometimes for Wiktionary.
# So here are some individual words.
# 'library' in Azerbaijani Cyrillic
eq_(preprocess_text('китабхана', 'az'), 'kitabxana')
eq_(preprocess_text('КИТАБХАНА', 'az'), 'kitabxana')
eq_(preprocess_text('KİTABXANA', 'az'), 'kitabxana')
# 'scream' in Azerbaijani Cyrillic
eq_(preprocess_text('бағырты', 'az'), 'bağırtı')
eq_(preprocess_text('БАҒЫРТЫ', 'az'), 'bağırtı')
eq_(preprocess_text('BAĞIRTI', 'az'), 'bağırtı')
def test_actually_russian():
    # This looks mostly like Serbian, but was probably actually Russian.

View File

@ -1,4 +1,3 @@
-from wordfreq.tokens import tokenize, simple_tokenize
from pkg_resources import resource_filename
from functools import lru_cache
import langcodes
@ -10,18 +9,15 @@ import random
import logging
import math
+from .tokens import tokenize, simple_tokenize, lossy_tokenize
+from .language_info import get_language_info
logger = logging.getLogger(__name__)
CACHE_SIZE = 100000
DATA_PATH = pathlib.Path(resource_filename('wordfreq', 'data'))
-# Chinese and Japanese are written without spaces. In Chinese, in particular,
-# we have to infer word boundaries from the frequencies of the words they
-# would create. When this happens, we should adjust the resulting frequency
-# to avoid creating a bias toward improbable word combinations.
-INFERRED_SPACE_LANGUAGES = {'zh'}
# We'll divide the frequency by 10 for each token boundary that was inferred.
# (We determined the factor of 10 empirically by looking at words in the
# Chinese wordlist that weren't common enough to be identified by the
@ -30,8 +26,9 @@ INFERRED_SPACE_LANGUAGES = {'zh'}
# frequency.)
INFERRED_SPACE_FACTOR = 10.0
-# simple_tokenize is imported so that other things can import it from here.
-# Suppress the pyflakes warning.
+# tokenize and simple_tokenize are imported so that other things can import
+# them from here. Suppress the pyflakes warning.
+tokenize = tokenize
simple_tokenize = simple_tokenize
@ -87,11 +84,21 @@ def read_cBpack(filename):
    return data[1:]
-def available_languages(wordlist='combined'):
+def available_languages(wordlist='best'):
    """
-    List the languages (as language-code strings) that the wordlist of a given
-    name is available in.
+    Given a wordlist name, return a dictionary of language codes to filenames,
+    representing all the languages in which that wordlist is available.
    """
+    if wordlist == 'best':
+        available = available_languages('small')
+        available.update(available_languages('large'))
+        return available
+    elif wordlist == 'combined':
+        logger.warning(
+            "The 'combined' wordlists have been renamed to 'small'."
+        )
+        wordlist = 'small'
    available = {}
    for path in DATA_PATH.glob('*.msgpack.gz'):
        if not path.name.startswith('_'):
@ -103,7 +110,7 @@ def available_languages(wordlist='combined'):
@lru_cache(maxsize=None)
-def get_frequency_list(lang, wordlist='combined', match_cutoff=30):
+def get_frequency_list(lang, wordlist='best', match_cutoff=30):
    """
    Read the raw data from a wordlist file, returning it as a list of
    lists. (See `read_cBpack` for what this represents.)
@ -117,7 +124,8 @@ def get_frequency_list(lang, wordlist='combined', match_cutoff=30):
    best, score = langcodes.best_match(lang, list(available),
                                       min_score=match_cutoff)
    if score == 0:
-        raise LookupError("No wordlist available for language %r" % lang)
+        raise LookupError("No wordlist %r available for language %r"
+                          % (wordlist, lang))
    if best != lang:
        logger.warning(
@ -184,7 +192,7 @@ def freq_to_zipf(freq):
@lru_cache(maxsize=None)
-def get_frequency_dict(lang, wordlist='combined', match_cutoff=30):
+def get_frequency_dict(lang, wordlist='best', match_cutoff=30):
    """
    Get a word frequency list as a dictionary, mapping tokens to
    frequencies as floating-point probabilities.
@ -198,7 +206,7 @@ def get_frequency_dict(lang, wordlist='combined', match_cutoff=30):
    return freqs
-def iter_wordlist(lang, wordlist='combined'):
+def iter_wordlist(lang, wordlist='best'):
    """
    Yield the words in a wordlist in approximate descending order of
    frequency.
@ -215,8 +223,9 @@ def iter_wordlist(lang, wordlist='combined'):
# it takes to look up frequencies from scratch, so something faster is needed.
_wf_cache = {}
def _word_frequency(word, lang, wordlist, minimum):
-    tokens = tokenize(word, lang, combine_numbers=True)
+    tokens = lossy_tokenize(word, lang)
    if not tokens:
        return minimum
@ -234,39 +243,31 @@ def _word_frequency(word, lang, wordlist, minimum):
    freq = 1.0 / one_over_result
-    if lang in INFERRED_SPACE_LANGUAGES:
+    if get_language_info(lang)['tokenizer'] == 'jieba':
+        # If we used the Jieba tokenizer, we could tokenize anything to match
+        # our wordlist, even nonsense. To counteract this, we multiply by a
+        # probability for each word break that was inferred.
        freq /= INFERRED_SPACE_FACTOR ** (len(tokens) - 1)
    return max(freq, minimum)
-def word_frequency(word, lang, wordlist='combined', minimum=0.):
+def word_frequency(word, lang, wordlist='best', minimum=0.):
    """
    Get the frequency of `word` in the language with code `lang`, from the
-    specified `wordlist`. The default wordlist is 'combined', built from
-    whichever of these five sources have sufficient data for the language:
-    - Full text of Wikipedia
-    - A sample of 72 million tweets collected from Twitter in 2014,
-      divided roughly into languages using automatic language detection
-    - Frequencies extracted from OpenSubtitles
-    - The Leeds Internet Corpus
-    - Google Books Syntactic Ngrams 2013
-    Another available wordlist is 'twitter', which uses only the data from
-    Twitter.
-    Words that we believe occur at least once per million tokens, based on
-    the average of these lists, will appear in the word frequency list.
+    specified `wordlist`.
+    These wordlists can be specified:
+    - 'large': a wordlist built from at least 5 sources, containing word
+      frequencies of 10^-8 and higher
+    - 'small': a wordlist built from at least 3 sources, containing word
+      frequencies of 10^-6 and higher
+    - 'best': uses 'large' if available, and 'small' otherwise
    The value returned will always be at least as large as `minimum`.
-    If a word decomposes into multiple tokens, we'll return a smoothed estimate
-    of the word frequency that is no greater than the frequency of any of its
-    individual tokens.
-    It should be noted that the current tokenizer does not support
-    multi-word Chinese phrases.
+    You could set this value to 10^-8, for example, to return 10^-8 for
+    unknown words in the 'large' list instead of 0, avoiding a discontinuity.
    """
    args = (word, lang, wordlist, minimum)
    try:
@ -278,7 +279,7 @@ def word_frequency(word, lang, wordlist='combined', minimum=0.):
    return _wf_cache[args]
-def zipf_frequency(word, lang, wordlist='combined', minimum=0.):
+def zipf_frequency(word, lang, wordlist='best', minimum=0.):
    """
    Get the frequency of `word`, in the language with code `lang`, on the Zipf
    scale.
@ -306,7 +307,7 @@ def zipf_frequency(word, lang, wordlist='combined', minimum=0.):
@lru_cache(maxsize=100)
-def top_n_list(lang, n, wordlist='combined', ascii_only=False):
+def top_n_list(lang, n, wordlist='best', ascii_only=False):
    """
    Return a frequency list of length `n` in descending order of frequency.
    This list contains words from `wordlist`, of the given language.
@ -321,7 +322,7 @@ def top_n_list(lang, n, wordlist='combined', ascii_only=False):
    return results
-def random_words(lang='en', wordlist='combined', nwords=5, bits_per_word=12,
+def random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12,
                 ascii_only=False):
    """
    Returns a string of random, space separated words.
@ -346,7 +347,7 @@ def random_words(lang='en', wordlist='combined', nwords=5, bits_per_word=12,
    return ' '.join([random.choice(choices) for i in range(nwords)])
-def random_ascii_words(lang='en', wordlist='combined', nwords=5,
+def random_ascii_words(lang='en', wordlist='best', nwords=5,
                       bits_per_word=12):
    """
    Returns a string of random, space separated, ASCII words.

View File

@ -49,4 +49,11 @@ def jieba_tokenize(text, external_wordlist=False):
    else:
        if jieba_tokenizer is None:
            jieba_tokenizer = jieba.Tokenizer(dictionary=DICT_FILENAME)
-        return jieba_tokenizer.lcut(simplify_chinese(text), HMM=False)
+        # Tokenize the Simplified Chinese version of the text, but return
+        # those spans from the original text, even if it's in Traditional
+        # Chinese
+        tokens = []
+        for _token, start, end in jieba_tokenizer.tokenize(simplify_chinese(text), HMM=False):
+            tokens.append(text[start:end])
+        return tokens
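At the usage level, the effect is that the returned tokens are slices of the
original string, so Traditional Chinese text keeps its original characters. A
minimal sketch based on what the test suite in this PR asserts for its own
example sentence (the specific string below is an assumed illustration):

    from wordfreq import tokenize

    trad = '謝謝謝謝'   # "thanks, thanks" in Traditional characters (assumed example)
    tokens = tokenize(trad, 'zh')
    # Word boundaries are found on the Simplified form, but the tokens
    # returned are spans of the original text.
    assert ''.join(tokens) == trad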

Binary files not shown. (The remaining changed files are binary data files; one
additional file diff was suppressed because it is too large, and some files were
not shown because too many files have changed in this diff.)