Merge pull request #55 from LuminosoInsight/version2
Version 2, with standalone text pre-processing
Commit 18f176dbf6
CHANGELOG.md (63 lines changed)
@@ -1,3 +1,66 @@
+## Version 2.0 (2018-03-14)
+
+The big change in this version is that text preprocessing, tokenization, and
+postprocessing to look up words in a list are separate steps.
+
+If all you need is preprocessing to make text more consistent, use
+`wordfreq.preprocess.preprocess_text(text, lang)`. If you need preprocessing
+and tokenization, use `wordfreq.tokenize(text, lang)` as before. If you need
+all three steps, use the new function `wordfreq.lossy_tokenize(text, lang)`.
+
+As a breaking change, this means that the `tokenize` function no longer has
+the `combine_numbers` option, because that's a postprocessing step. For
+the same behavior, use `lossy_tokenize`, which always combines numbers.
+
+Similarly, `tokenize` will no longer replace Chinese characters with their
+Simplified Chinese versions, while `lossy_tokenize` will.
+
+Other changes:
+
+- There's a new default wordlist for each language, called "best". This
+  chooses the "large" wordlist for that language, or if that list doesn't
+  exist, it falls back on "small".
+
+- The wordlist formerly named "combined" (this name made sense long ago)
+  is now named "small". "combined" remains as a deprecated alias.
+
+- The "twitter" wordlist has been removed. If you need to compare word
+  frequencies from individual sources, you can work with the separate files
+  in [exquisite-corpus][].
+
+- Tokenizing Chinese will preserve the original characters, no matter whether
+  they are Simplified or Traditional, instead of replacing them all with
+  Simplified characters.
+
+- Different languages require different processing steps, and the decisions
+  about what these steps are now appear in the `wordfreq.language_info`
+  module, replacing a bunch of scattered and inconsistent `if` statements.
+
+- Tokenizing CJK languages while preserving punctuation now has a less
+  confusing implementation.
+
+- The preprocessing step can transliterate Azerbaijani, although we don't yet
+  have wordlists in this language. This is similar to how the tokenizer
+  supports many more languages than the ones with wordlists, making future
+  wordlists possible.
+
+- Speaking of that, the tokenizer will log a warning (once) if you ask it to
+  tokenize text written in a script we can't tokenize (such as Thai).
+
+- New source data from [exquisite-corpus][] includes OPUS OpenSubtitles 2018.
+
+Nitty-gritty dependency changes:
+
+- Updated the regex dependency to 2018.02.21. (We would love suggestions on
+  how to coexist with other libraries that use other versions of `regex`,
+  without a `>=` requirement that could introduce unexpected data-altering
+  changes.)
+
+- We now depend on `msgpack`, the new name for `msgpack-python`.
+
+[exquisite-corpus]: https://github.com/LuminosoInsight/exquisite-corpus
+
+
 ## Version 1.7.0 (2017-08-25)
 
 - Tokenization will always keep Unicode graphemes together, including
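A minimal sketch of the three-step pipeline described above. The function names come from the changelog itself, and the expected outputs are copied from this commit's test suite.

```python
from wordfreq import tokenize, lossy_tokenize
from wordfreq.preprocess import preprocess_text

# Step 1, preprocessing only: make the text consistent. Here, Azerbaijani
# Cyrillic is case-folded and transliterated to Latin ('library').
assert preprocess_text('КИТАБХАНА', 'az') == 'kitabxana'

# Steps 1-2, preprocessing and tokenization: digits are kept as-is.
assert tokenize('"715 - CRΣΣKS" by Bon Iver', 'en') == \
    ['715', 'crσσks', 'by', 'bon', 'iver']

# Steps 1-3, lossy tokenization: multi-digit numbers are combined into a
# placeholder, matching how words were counted in the wordlists.
assert lossy_tokenize('"715 - CRΣΣKS" by Bon Iver', 'en') == \
    ['000', 'crσσks', 'by', 'bon', 'iver']
```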
README.md (53 lines changed)

@@ -7,7 +7,7 @@ Author: Robyn Speer
 ## Installation
 
 wordfreq requires Python 3 and depends on a few other Python modules
-(msgpack, langcodes, and ftfy). You can install it and its dependencies
+(msgpack, langcodes, and regex). You can install it and its dependencies
 in the usual way, either by getting it from pip:
 
     pip3 install wordfreq
@@ -23,20 +23,21 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
 ## Usage
 
 wordfreq provides access to estimates of the frequency with which a word is
-used, in 27 languages (see *Supported languages* below).
+used, in 35 languages (see *Supported languages* below).
 
-It provides three kinds of pre-built wordlists:
+It provides both 'small' and 'large' wordlists:
 
-- `'combined'` lists, containing words that appear at least once per
-  million words, averaged across all data sources.
-- `'twitter'` lists, containing words that appear at least once per
-  million words on Twitter alone.
-- `'large'` lists, containing words that appear at least once per 100
-  million words, averaged across all data sources.
+- The 'small' lists take up very little memory and cover words that appear at
+  least once per million words.
+- The 'large' lists cover words that appear at least once per 100 million
+  words.
 
-The most straightforward function is:
+The default list is 'best', which uses 'large' if it's available for the
+language, and 'small' otherwise.
 
-    word_frequency(word, lang, wordlist='combined', minimum=0.0)
+The most straightforward function for looking up frequencies is:
+
+    word_frequency(word, lang, wordlist='best', minimum=0.0)
 
 This function looks up a word's frequency in the given language, returning its
 frequency as a decimal between 0 and 1. In these examples, we'll multiply the
@@ -47,10 +48,10 @@ frequencies by a million (1e6) to get more readable numbers:
     11.748975549395302
 
     >>> word_frequency('café', 'en') * 1e6
-    3.981071705534969
+    3.890451449942805
 
     >>> word_frequency('cafe', 'fr') * 1e6
-    1.4125375446227555
+    1.4454397707459279
 
     >>> word_frequency('café', 'fr') * 1e6
     53.70317963702532
@@ -65,25 +66,25 @@ example, and a word with Zipf value 3 appears once per million words.
 
 Reasonable Zipf values are between 0 and 8, but because of the cutoffs
 described above, the minimum Zipf value appearing in these lists is 1.0 for the
-'large' wordlists and 3.0 for all others. We use 0 as the default Zipf value
+'large' wordlists and 3.0 for 'small'. We use 0 as the default Zipf value
 for words that do not appear in the given wordlist, although it should mean
 one occurrence per billion words.
 
     >>> from wordfreq import zipf_frequency
     >>> zipf_frequency('the', 'en')
-    7.75
+    7.77
 
     >>> zipf_frequency('word', 'en')
     5.32
 
    >>> zipf_frequency('frequency', 'en')
-    4.36
+    4.38
 
     >>> zipf_frequency('zipf', 'en')
-    0.0
+    1.32
 
-    >>> zipf_frequency('zipf', 'en', wordlist='large')
-    1.28
+    >>> zipf_frequency('zipf', 'en', wordlist='small')
+    0.0
 
 
 The parameters to `word_frequency` and `zipf_frequency` are:
@@ -95,7 +96,7 @@ The parameters to `word_frequency` and `zipf_frequency` are:
 - `lang`: the BCP 47 or ISO 639 code of the language to use, such as 'en'.
 
 - `wordlist`: which set of word frequencies to use. Current options are
-  'combined', 'twitter', and 'large'.
+  'small', 'large', and 'best'.
 
 - `minimum`: If the word is not in the list or has a frequency lower than
   `minimum`, return `minimum` instead. You may want to set this to the minimum
@@ -108,7 +109,7 @@ Other functions:
 way that the words in wordfreq's data were counted in the first place. See
 *Tokenization*.
 
-`top_n_list(lang, n, wordlist='combined')` returns the most common *n* words in
+`top_n_list(lang, n, wordlist='best')` returns the most common *n* words in
 the list, in descending frequency order.
 
     >>> from wordfreq import top_n_list
@@ -118,18 +119,18 @@ the list, in descending frequency order.
     >>> top_n_list('es', 10)
     ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']
 
-`iter_wordlist(lang, wordlist='combined')` iterates through all the words in a
+`iter_wordlist(lang, wordlist='best')` iterates through all the words in a
 wordlist, in descending frequency order.
 
-`get_frequency_dict(lang, wordlist='combined')` returns all the frequencies in
+`get_frequency_dict(lang, wordlist='best')` returns all the frequencies in
 a wordlist as a dictionary, for cases where you'll want to look up a lot of
 words and don't need the wrapper that `word_frequency` provides.
 
-`supported_languages(wordlist='combined')` returns a dictionary whose keys are
+`supported_languages(wordlist='best')` returns a dictionary whose keys are
 language codes, and whose values are the data file that will be loaded to
 provide the requested wordlist in each language.
 
-`random_words(lang='en', wordlist='combined', nwords=5, bits_per_word=12)`
+`random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12)`
 returns a selection of random words, separated by spaces. `bits_per_word=n`
 will select each random word from 2^n words.
 
@@ -256,7 +257,7 @@ into multiple tokens:
     >>> zipf_frequency('New York', 'en')
     5.35
     >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.55
+    3.54
 
 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be. In Chinese,
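A small sketch of the arithmetic behind these numbers, following the `_word_frequency` and `freq_to_zipf` code later in this diff; the token frequencies in the example are hypothetical:

```python
import math

def combined_frequency(token_freqs, inferred_boundaries=0):
    # Half-harmonic-mean: the reciprocal of the summed reciprocals of the
    # token frequencies, so the estimate is never greater than any single
    # token's frequency.
    freq = 1.0 / sum(1.0 / f for f in token_freqs)
    # For languages tokenized with jieba (Chinese), each inferred word
    # boundary divides the estimate by 10.
    return freq / (10.0 ** inferred_boundaries)

def zipf(freq):
    # Zipf scale: log10 of the frequency per billion words, so a frequency
    # of 1e-6 (once per million words) corresponds to Zipf 3.0.
    return round(math.log10(freq) + 9.0, 2)

# A hypothetical two-token phrase with one inferred word boundary:
print(zipf(combined_frequency([1e-5, 2e-5], inferred_boundaries=1)))  # 2.82
```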
setup.cfg (new file, 5 lines)

@@ -0,0 +1,5 @@
+[nosetests]
+verbosity=2
+with-doctest=1
+with-coverage=0
+cover-package=wordfreq
setup.py (4 lines changed)

@@ -28,7 +28,7 @@ README_contents = open(os.path.join(current_dir, 'README.md'),
                        encoding='utf-8').read()
 doclines = README_contents.split("\n")
 dependencies = [
-    'ftfy >= 5', 'msgpack', 'langcodes >= 1.4', 'regex == 2017.07.28'
+    'msgpack', 'langcodes >= 1.4.1', 'regex == 2018.02.21'
 ]
 if sys.version_info < (3, 4):
     dependencies.append('pathlib')
@@ -36,7 +36,7 @@ if sys.version_info < (3, 4):
 
 setup(
     name="wordfreq",
-    version='1.7.0',
+    version='2.0',
     maintainer='Luminoso Technologies, Inc.',
     maintainer_email='info@luminoso.com',
     url='http://github.com/LuminosoInsight/wordfreq/',
@@ -1,9 +1,9 @@
 from wordfreq import (
     word_frequency, available_languages, cB_to_freq,
-    top_n_list, random_words, random_ascii_words, tokenize
+    top_n_list, random_words, random_ascii_words, tokenize, lossy_tokenize
 )
 from nose.tools import (
-    eq_, assert_almost_equal, assert_greater, raises
+    eq_, assert_almost_equal, assert_greater, raises, assert_not_equal
 )
 
 
@@ -15,35 +15,29 @@ def test_freq_examples():
     assert_greater(word_frequency('de', 'es'),
                    word_frequency('the', 'es'))
 
-
-# To test the reasonableness of the Twitter list, we want to look up a
-# common word representing laughter in each language. The default for
-# languages not listed here is 'haha'.
-LAUGHTER_WORDS = {
-    'en': 'lol',
-    'hi': 'lol',
-    'cs': 'lol',
-    'ru': 'лол',
-    'zh': '笑',
-    'ja': '笑',
-    'ar': 'ﻪﻬﻬﻬﻫ',
-    'fa': 'خخخخ',
-    'ca': 'jaja',
-    'es': 'jaja',
-    'fr': 'ptdr',
-    'pt': 'kkkk',
-    'he': 'חחח',
-    'bg': 'ахаха',
-    'uk': 'хаха',
-    'bn': 'হা হা',
-    'mk': 'хаха'
-}
+    # We get word frequencies from the 'large' list when available
+    assert_greater(word_frequency('infrequency', 'en'), 0.)
 
 
 def test_languages():
-    # Make sure the number of available languages doesn't decrease
+    # Make sure we get all the languages when looking for the default
+    # 'best' wordlist
     avail = available_languages()
-    assert_greater(len(avail), 26)
+    assert_greater(len(avail), 32)
+
+    # 'small' covers the same languages, but with some different lists
+    avail_small = available_languages('small')
+    eq_(len(avail_small), len(avail))
+    assert_not_equal(avail_small, avail)
+
+    # 'combined' is the same as 'small'
+    avail_old_name = available_languages('combined')
+    eq_(avail_old_name, avail_small)
+
+    # 'large' covers fewer languages
+    avail_large = available_languages('large')
+    assert_greater(len(avail_large), 12)
+    assert_greater(len(avail), len(avail_large))
 
     # Look up the digit '2' in the main word list for each language
     for lang in avail:
@@ -55,17 +49,6 @@ def test_languages():
         assert_greater(word_frequency('2', new_lang_code), 0, new_lang_code)
 
 
-def test_twitter():
-    avail = available_languages('twitter')
-    assert_greater(len(avail), 15)
-
-    for lang in avail:
-        assert_greater(word_frequency('rt', lang, 'twitter'),
-                       word_frequency('rt', lang, 'combined'))
-        text = LAUGHTER_WORDS.get(lang, 'haha')
-        assert_greater(word_frequency(text, lang, wordlist='twitter'), 0, (text, lang))
-
-
 def test_minimums():
     eq_(word_frequency('esquivalience', 'en'), 0)
     eq_(word_frequency('esquivalience', 'en', minimum=1e-6), 1e-6)
@@ -164,13 +147,13 @@ def test_casefolding():
 def test_number_smashing():
     eq_(tokenize('"715 - CRΣΣKS" by Bon Iver', 'en'),
         ['715', 'crσσks', 'by', 'bon', 'iver'])
-    eq_(tokenize('"715 - CRΣΣKS" by Bon Iver', 'en', combine_numbers=True),
+    eq_(lossy_tokenize('"715 - CRΣΣKS" by Bon Iver', 'en'),
         ['000', 'crσσks', 'by', 'bon', 'iver'])
-    eq_(tokenize('"715 - CRΣΣKS" by Bon Iver', 'en', combine_numbers=True, include_punctuation=True),
+    eq_(lossy_tokenize('"715 - CRΣΣKS" by Bon Iver', 'en', include_punctuation=True),
         ['"', '000', '-', 'crσσks', '"', 'by', 'bon', 'iver'])
-    eq_(tokenize('1', 'en', combine_numbers=True), ['1'])
-    eq_(tokenize('3.14', 'en', combine_numbers=True), ['0.00'])
-    eq_(tokenize('24601', 'en', combine_numbers=True), ['00000'])
+    eq_(lossy_tokenize('1', 'en'), ['1'])
+    eq_(lossy_tokenize('3.14', 'en'), ['0.00'])
+    eq_(lossy_tokenize('24601', 'en'), ['00000'])
     eq_(word_frequency('24601', 'en'), word_frequency('90210', 'en'))
@@ -231,6 +214,7 @@ def test_ideographic_fallback():
         ['ひらがな', 'カタカナ', 'romaji']
     )
 
+
 def test_other_languages():
     # Test that we leave Thai letters stuck together. If we had better Thai support,
     # we would actually split this into a three-word phrase.
@@ -55,10 +55,19 @@ def test_tokens():
         ]
     )
 
-    # You match the same tokens if you look it up in Traditional Chinese.
-    eq_(tokenize(fact_simplified, 'zh'), tokenize(fact_traditional, 'zh'))
+    # Check that Traditional Chinese works at all
     assert_greater(word_frequency(fact_traditional, 'zh'), 0)
 
+    # You get the same token lengths if you look it up in Traditional Chinese,
+    # but the words are different
+    simp_tokens = tokenize(fact_simplified, 'zh', include_punctuation=True)
+    trad_tokens = tokenize(fact_traditional, 'zh', include_punctuation=True)
+    eq_(''.join(simp_tokens), fact_simplified)
+    eq_(''.join(trad_tokens), fact_traditional)
+    simp_lengths = [len(token) for token in simp_tokens]
+    trad_lengths = [len(token) for token in trad_tokens]
+    eq_(simp_lengths, trad_lengths)
+
 
 def test_combination():
     xiexie_freq = word_frequency('谢谢', 'zh')  # "Thanks"
@@ -83,5 +92,3 @@ def test_alternate_codes():
     # Separate codes for Mandarin and Cantonese
     eq_(tokenize('谢谢谢谢', 'cmn'), tokens)
     eq_(tokenize('谢谢谢谢', 'yue'), tokens)
-
-
@@ -1,5 +1,6 @@
 from nose.tools import eq_
 from wordfreq import tokenize
+from wordfreq.preprocess import preprocess_text
 
 
 def test_transliteration():
@@ -10,6 +11,21 @@ def test_transliteration():
     eq_(tokenize("Pa, ima tu mnogo stvari koje ne shvataš.", 'sr'),
         ['pa', 'ima', 'tu', 'mnogo', 'stvari', 'koje', 'ne', 'shvataš'])
 
+    # I don't have examples of complete sentences in Azerbaijani that are
+    # naturally in Cyrillic, because it turns out everyone writes Azerbaijani
+    # in Latin letters on the Internet, _except_ sometimes for Wiktionary.
+    # So here are some individual words.
+
+    # 'library' in Azerbaijani Cyrillic
+    eq_(preprocess_text('китабхана', 'az'), 'kitabxana')
+    eq_(preprocess_text('КИТАБХАНА', 'az'), 'kitabxana')
+    eq_(preprocess_text('KİTABXANA', 'az'), 'kitabxana')
+
+    # 'scream' in Azerbaijani Cyrillic
+    eq_(preprocess_text('бағырты', 'az'), 'bağırtı')
+    eq_(preprocess_text('БАҒЫРТЫ', 'az'), 'bağırtı')
+    eq_(preprocess_text('BAĞIRTI', 'az'), 'bağırtı')
+
 
 def test_actually_russian():
     # This looks mostly like Serbian, but was probably actually Russian.
@@ -1,4 +1,3 @@
-from wordfreq.tokens import tokenize, simple_tokenize
 from pkg_resources import resource_filename
 from functools import lru_cache
 import langcodes
@@ -10,18 +9,15 @@ import random
 import logging
 import math
 
+from .tokens import tokenize, simple_tokenize, lossy_tokenize
+from .language_info import get_language_info
+
 logger = logging.getLogger(__name__)
 
 
 CACHE_SIZE = 100000
 DATA_PATH = pathlib.Path(resource_filename('wordfreq', 'data'))
 
-# Chinese and Japanese are written without spaces. In Chinese, in particular,
-# we have to infer word boundaries from the frequencies of the words they
-# would create. When this happens, we should adjust the resulting frequency
-# to avoid creating a bias toward improbable word combinations.
-INFERRED_SPACE_LANGUAGES = {'zh'}
-
 # We'll divide the frequency by 10 for each token boundary that was inferred.
 # (We determined the factor of 10 empirically by looking at words in the
 # Chinese wordlist that weren't common enough to be identified by the
@@ -30,8 +26,9 @@ INFERRED_SPACE_LANGUAGES = {'zh'}
 # frequency.)
 INFERRED_SPACE_FACTOR = 10.0
 
-# simple_tokenize is imported so that other things can import it from here.
-# Suppress the pyflakes warning.
+# tokenize and simple_tokenize are imported so that other things can import
+# them from here. Suppress the pyflakes warning.
 tokenize = tokenize
 simple_tokenize = simple_tokenize
 
 
@@ -87,11 +84,21 @@ def read_cBpack(filename):
     return data[1:]
 
 
-def available_languages(wordlist='combined'):
+def available_languages(wordlist='best'):
     """
-    List the languages (as language-code strings) that the wordlist of a given
-    name is available in.
+    Given a wordlist name, return a dictionary of language codes to filenames,
+    representing all the languages in which that wordlist is available.
     """
+    if wordlist == 'best':
+        available = available_languages('small')
+        available.update(available_languages('large'))
+        return available
+    elif wordlist == 'combined':
+        logger.warning(
+            "The 'combined' wordlists have been renamed to 'small'."
+        )
+        wordlist = 'small'
+
     available = {}
     for path in DATA_PATH.glob('*.msgpack.gz'):
         if not path.name.startswith('_'):
@@ -103,7 +110,7 @@ def available_languages(wordlist='combined'):
 
 
 @lru_cache(maxsize=None)
-def get_frequency_list(lang, wordlist='combined', match_cutoff=30):
+def get_frequency_list(lang, wordlist='best', match_cutoff=30):
     """
     Read the raw data from a wordlist file, returning it as a list of
     lists. (See `read_cBpack` for what this represents.)
@@ -117,7 +124,8 @@ def get_frequency_list(lang, wordlist='combined', match_cutoff=30):
     best, score = langcodes.best_match(lang, list(available),
                                        min_score=match_cutoff)
     if score == 0:
-        raise LookupError("No wordlist available for language %r" % lang)
+        raise LookupError("No wordlist %r available for language %r"
+                          % (wordlist, lang))
 
     if best != lang:
         logger.warning(
@@ -184,7 +192,7 @@ def freq_to_zipf(freq):
 
 
 @lru_cache(maxsize=None)
-def get_frequency_dict(lang, wordlist='combined', match_cutoff=30):
+def get_frequency_dict(lang, wordlist='best', match_cutoff=30):
     """
     Get a word frequency list as a dictionary, mapping tokens to
     frequencies as floating-point probabilities.
@@ -198,7 +206,7 @@ def get_frequency_dict(lang, wordlist='combined', match_cutoff=30):
     return freqs
 
 
-def iter_wordlist(lang, wordlist='combined'):
+def iter_wordlist(lang, wordlist='best'):
     """
     Yield the words in a wordlist in approximate descending order of
     frequency.
@@ -215,8 +223,9 @@ def iter_wordlist(lang, wordlist='combined'):
 # it takes to look up frequencies from scratch, so something faster is needed.
 _wf_cache = {}
 
+
 def _word_frequency(word, lang, wordlist, minimum):
-    tokens = tokenize(word, lang, combine_numbers=True)
+    tokens = lossy_tokenize(word, lang)
     if not tokens:
         return minimum
 
@@ -234,39 +243,31 @@ def _word_frequency(word, lang, wordlist, minimum):
 
     freq = 1.0 / one_over_result
 
-    if lang in INFERRED_SPACE_LANGUAGES:
+    if get_language_info(lang)['tokenizer'] == 'jieba':
+        # If we used the Jieba tokenizer, we could tokenize anything to match
+        # our wordlist, even nonsense. To counteract this, we multiply by a
+        # probability for each word break that was inferred.
         freq /= INFERRED_SPACE_FACTOR ** (len(tokens) - 1)
 
     return max(freq, minimum)
 
 
-def word_frequency(word, lang, wordlist='combined', minimum=0.):
+def word_frequency(word, lang, wordlist='best', minimum=0.):
     """
     Get the frequency of `word` in the language with code `lang`, from the
-    specified `wordlist`. The default wordlist is 'combined', built from
-    whichever of these five sources have sufficient data for the language:
+    specified `wordlist`.
 
-    - Full text of Wikipedia
-    - A sample of 72 million tweets collected from Twitter in 2014,
-      divided roughly into languages using automatic language detection
-    - Frequencies extracted from OpenSubtitles
-    - The Leeds Internet Corpus
-    - Google Books Syntactic Ngrams 2013
+    These wordlists can be specified:
 
-    Another available wordlist is 'twitter', which uses only the data from
-    Twitter.
+    - 'large': a wordlist built from at least 5 sources, containing word
+      frequencies of 10^-8 and higher
+    - 'small': a wordlist built from at least 3 sources, containing word
+      frequencies of 10^-6 and higher
+    - 'best': uses 'large' if available, and 'small' otherwise
 
-    Words that we believe occur at least once per million tokens, based on
-    the average of these lists, will appear in the word frequency list.
+    The value returned will always be at least as large as `minimum`.
 
     If a word decomposes into multiple tokens, we'll return a smoothed estimate
     of the word frequency that is no greater than the frequency of any of its
     individual tokens.
 
-    It should be noted that the current tokenizer does not support
-    multi-word Chinese phrases.
+    You could set this value to 10^-8, for example, to return 10^-8 for
+    unknown words in the 'large' list instead of 0, avoiding a discontinuity.
     """
     args = (word, lang, wordlist, minimum)
     try:
@@ -278,7 +279,7 @@ def word_frequency(word, lang, wordlist='combined', minimum=0.):
     return _wf_cache[args]
 
 
-def zipf_frequency(word, lang, wordlist='combined', minimum=0.):
+def zipf_frequency(word, lang, wordlist='best', minimum=0.):
     """
     Get the frequency of `word`, in the language with code `lang`, on the Zipf
     scale.
@@ -306,7 +307,7 @@ def zipf_frequency(word, lang, wordlist='combined', minimum=0.):
 
 
 @lru_cache(maxsize=100)
-def top_n_list(lang, n, wordlist='combined', ascii_only=False):
+def top_n_list(lang, n, wordlist='best', ascii_only=False):
     """
     Return a frequency list of length `n` in descending order of frequency.
     This list contains words from `wordlist`, of the given language.
@@ -321,7 +322,7 @@ def top_n_list(lang, n, wordlist='combined', ascii_only=False):
     return results
 
 
-def random_words(lang='en', wordlist='combined', nwords=5, bits_per_word=12,
+def random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12,
                  ascii_only=False):
     """
     Returns a string of random, space separated words.
@@ -346,7 +347,7 @@ def random_words(lang='en', wordlist='combined', nwords=5, bits_per_word=12,
     return ' '.join([random.choice(choices) for i in range(nwords)])
 
 
-def random_ascii_words(lang='en', wordlist='combined', nwords=5,
+def random_ascii_words(lang='en', wordlist='best', nwords=5,
                        bits_per_word=12):
     """
     Returns a string of random, space separated, ASCII words.
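To make the new `available_languages` behavior above concrete, a short sketch; the relationships asserted here mirror this commit's tests:

```python
from wordfreq import available_languages

best = available_languages()              # the default is now 'best'
small = available_languages('small')
large = available_languages('large')

assert set(best) == set(small)            # same language codes...
assert best != small                      # ...but some map to 'large' files
assert len(large) < len(best)             # 'large' covers fewer languages

# The old name is a deprecated alias, and using it logs a warning.
assert available_languages('combined') == small
```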
@@ -49,4 +49,11 @@ def jieba_tokenize(text, external_wordlist=False):
     else:
         if jieba_tokenizer is None:
             jieba_tokenizer = jieba.Tokenizer(dictionary=DICT_FILENAME)
-        return jieba_tokenizer.lcut(simplify_chinese(text), HMM=False)
+
+        # Tokenize the Simplified Chinese version of the text, but return
+        # those spans from the original text, even if it's in Traditional
+        # Chinese
+        tokens = []
+        for _token, start, end in jieba_tokenizer.tokenize(simplify_chinese(text), HMM=False):
+            tokens.append(text[start:end])
+        return tokens
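To illustrate the span-preserving behavior, a hypothetical example: '謝謝' is the Traditional form of Simplified '谢谢' ("thanks"), which appears in this commit's tests. The exact token boundaries depend on the jieba wordlist, so the outputs shown are expectations, not recorded results.

```python
from wordfreq import tokenize, lossy_tokenize

# tokenize returns spans of the original text, so Traditional characters
# are preserved:
print(tokenize('謝謝', 'zh'))        # expected: ['謝謝']

# lossy_tokenize maps the text to Simplified, the form the wordlists use:
print(lossy_tokenize('謝謝', 'zh'))  # expected: ['谢谢']
```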
(The remaining changes are binary wordlist data files under wordfreq/data/, not shown in this view. New small_*.msgpack.gz files were added for these languages: ar, bg, bn, ca, cs, da, de, el, en, es, fa, fi, fr, he, hi, hu, id, it, ja, ko, mk, ms, nb, nl, pl, pt, ro, ru, sh, sv, tr, uk, zh. Other binary files were updated, and one text file's diff was suppressed because it is too large.)