Mirror of https://github.com/rspeer/wordfreq.git (synced 2024-12-23 09:21:37 +00:00)

Merge pull request #55 from LuminosoInsight/version2

Version 2, with standalone text pre-processing

commit 18f176dbf6
CHANGELOG.md (63 lines changed)
@@ -1,3 +1,66 @@
+## Version 2.0 (2018-03-14)
+
+The big change in this version is that text preprocessing, tokenization, and
+postprocessing to look up words in a list are separate steps.
+
+If all you need is preprocessing to make text more consistent, use
+`wordfreq.preprocess.preprocess_text(text, lang)`. If you need preprocessing
+and tokenization, use `wordfreq.tokenize(text, lang)` as before. If you need
+all three steps, use the new function `wordfreq.lossy_tokenize(text, lang)`.
+
+As a breaking change, this means that the `tokenize` function no longer has
+the `combine_numbers` option, because that's a postprocessing step. For
+the same behavior, use `lossy_tokenize`, which always combines numbers.
+
+Similarly, `tokenize` will no longer replace Chinese characters with their
+Simplified Chinese version, while `lossy_tokenize` will.
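As an illustration of the three levels (this sketch is not part of the changelog text itself; the outputs shown are the ones asserted by this version's test suite, later in this diff):

    >>> from wordfreq.preprocess import preprocess_text
    >>> from wordfreq import tokenize, lossy_tokenize

    >>> # Preprocessing only: normalize the text without splitting it
    >>> preprocess_text('КИТАБХАНА', 'az')    # Azerbaijani Cyrillic -> Latin
    'kitabxana'

    >>> # Preprocessing + tokenization
    >>> tokenize('"715 - CRΣΣKS" by Bon Iver', 'en')
    ['715', 'crσσks', 'by', 'bon', 'iver']

    >>> # Preprocessing + tokenization + lossy postprocessing (numbers combined)
    >>> lossy_tokenize('"715 - CRΣΣKS" by Bon Iver', 'en')
    ['000', 'crσσks', 'by', 'bon', 'iver']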
+
+Other changes:
+
+- There's a new default wordlist for each language, called "best". This
+  chooses the "large" wordlist for that language, or if that list doesn't
+  exist, it falls back on "small".
+
+- The wordlist formerly named "combined" (this name made sense long ago)
+  is now named "small". "combined" remains as a deprecated alias.
+
+- The "twitter" wordlist has been removed. If you need to compare word
+  frequencies from individual sources, you can work with the separate files in
+  [exquisite-corpus][].
+
+- Tokenizing Chinese will preserve the original characters, no matter whether
+  they are Simplified or Traditional, instead of replacing them all with
+  Simplified characters.
+
+- Different languages require different processing steps, and the decisions
+  about what these steps are now appear in the `wordfreq.language_info` module,
+  replacing a bunch of scattered and inconsistent `if` statements.
+
+- Tokenizing CJK languages while preserving punctuation now has a less confusing
+  implementation.
+
+- The preprocessing step can transliterate Azerbaijani, although we don't yet
+  have wordlists in this language. This is similar to how the tokenizer
+  supports many more languages than the ones with wordlists, making future
+  wordlists possible.
+
+- Speaking of that, the tokenizer will log a warning (once) if you ask to tokenize
+  text written in a script we can't tokenize (such as Thai).
+
+- New source data from [exquisite-corpus][] includes OPUS OpenSubtitles 2018.
+
+Nitty gritty dependency changes:
+
+- Updated the regex dependency to 2018.02.21. (We would love suggestions on
+  how to coexist with other libraries that use other versions of `regex`,
+  without a `>=` requirement that could introduce unexpected data-altering
+  changes.)
+
+- We now depend on `msgpack`, the new name for `msgpack-python`.
+
+[exquisite-corpus]: https://github.com/LuminosoInsight/exquisite-corpus
+
+
 ## Version 1.7.0 (2017-08-25)

 - Tokenization will always keep Unicode graphemes together, including
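To make the "best" fallback concrete, here is a quick illustration; the numbers are the ones that appear in the updated README later in this diff ('zipf' is only in the 'large' English list, which 'best' picks up):

    >>> from wordfreq import zipf_frequency
    >>> zipf_frequency('zipf', 'en')                     # 'best' -> 'large' for English
    1.32
    >>> zipf_frequency('zipf', 'en', wordlist='small')   # not in the 'small' list
    0.0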
README.md (53 lines changed)
@@ -7,7 +7,7 @@ Author: Robyn Speer
## Installation

wordfreq requires Python 3 and depends on a few other Python modules
-(msgpack, langcodes, and ftfy). You can install it and its dependencies
+(msgpack, langcodes, and regex). You can install it and its dependencies
in the usual way, either by getting it from pip:

    pip3 install wordfreq
@@ -23,20 +23,21 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.

## Usage

wordfreq provides access to estimates of the frequency with which a word is
-used, in 27 languages (see *Supported languages* below).
+used, in 35 languages (see *Supported languages* below).

-It provides three kinds of pre-built wordlists:
+It provides both 'small' and 'large' wordlists:

-- `'combined'` lists, containing words that appear at least once per
-  million words, averaged across all data sources.
-- `'twitter'` lists, containing words that appear at least once per
-  million words on Twitter alone.
-- `'large'` lists, containing words that appear at least once per 100
-  million words, averaged across all data sources.
+- The 'small' lists take up very little memory and cover words that appear at
+  least once per million words.
+- The 'large' lists cover words that appear at least once per 100 million
+  words.

-The most straightforward function is:
+The default list is 'best', which uses 'large' if it's available for the
+language, and 'small' otherwise.

-    word_frequency(word, lang, wordlist='combined', minimum=0.0)
+The most straightforward function for looking up frequencies is:
+
+    word_frequency(word, lang, wordlist='best', minimum=0.0)

This function looks up a word's frequency in the given language, returning its
frequency as a decimal between 0 and 1. In these examples, we'll multiply the
@@ -47,10 +48,10 @@ frequencies by a million (1e6) to get more readable numbers:

    11.748975549395302

    >>> word_frequency('café', 'en') * 1e6
-    3.981071705534969
+    3.890451449942805

    >>> word_frequency('cafe', 'fr') * 1e6
-    1.4125375446227555
+    1.4454397707459279

    >>> word_frequency('café', 'fr') * 1e6
    53.70317963702532
@@ -65,25 +66,25 @@ example, and a word with Zipf value 3 appears once per million words.

Reasonable Zipf values are between 0 and 8, but because of the cutoffs
described above, the minimum Zipf value appearing in these lists is 1.0 for the
-'large' wordlists and 3.0 for all others. We use 0 as the default Zipf value
+'large' wordlists and 3.0 for 'small'. We use 0 as the default Zipf value
for words that do not appear in the given wordlist, although it should mean
one occurrence per billion words.

    >>> from wordfreq import zipf_frequency
    >>> zipf_frequency('the', 'en')
-    7.75
+    7.77

    >>> zipf_frequency('word', 'en')
    5.32

    >>> zipf_frequency('frequency', 'en')
-    4.36
+    4.38

    >>> zipf_frequency('zipf', 'en')
-    0.0
+    1.32

-    >>> zipf_frequency('zipf', 'en', wordlist='large')
-    1.28
+    >>> zipf_frequency('zipf', 'en', wordlist='small')
+    0.0

The parameters to `word_frequency` and `zipf_frequency` are:
@@ -95,7 +96,7 @@ The parameters to `word_frequency` and `zipf_frequency` are:

- `lang`: the BCP 47 or ISO 639 code of the language to use, such as 'en'.

- `wordlist`: which set of word frequencies to use. Current options are
-  'combined', 'twitter', and 'large'.
+  'small', 'large', and 'best'.

- `minimum`: If the word is not in the list or has a frequency lower than
  `minimum`, return `minimum` instead. You may want to set this to the minimum
@@ -108,7 +109,7 @@ Other functions:

way that the words in wordfreq's data were counted in the first place. See
*Tokenization*.

-`top_n_list(lang, n, wordlist='combined')` returns the most common *n* words in
+`top_n_list(lang, n, wordlist='best')` returns the most common *n* words in
the list, in descending frequency order.

    >>> from wordfreq import top_n_list
@@ -118,18 +119,18 @@ the list, in descending frequency order.

    >>> top_n_list('es', 10)
    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']

-`iter_wordlist(lang, wordlist='combined')` iterates through all the words in a
+`iter_wordlist(lang, wordlist='best')` iterates through all the words in a
wordlist, in descending frequency order.

-`get_frequency_dict(lang, wordlist='combined')` returns all the frequencies in
+`get_frequency_dict(lang, wordlist='best')` returns all the frequencies in
a wordlist as a dictionary, for cases where you'll want to look up a lot of
words and don't need the wrapper that `word_frequency` provides.

-`supported_languages(wordlist='combined')` returns a dictionary whose keys are
+`supported_languages(wordlist='best')` returns a dictionary whose keys are
language codes, and whose values are the data file that will be loaded to
provide the requested wordlist in each language.

-`random_words(lang='en', wordlist='combined', nwords=5, bits_per_word=12)`
+`random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12)`
returns a selection of random words, separated by spaces. `bits_per_word=n`
will select each random word from 2^n words.
@@ -256,7 +257,7 @@ into multiple tokens:

    >>> zipf_frequency('New York', 'en')
    5.35
    >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.55
+    3.54

The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be. In Chinese,
setup.cfg (new file, 5 lines)
@@ -0,0 +1,5 @@
+[nosetests]
+verbosity=2
+with-doctest=1
+with-coverage=0
+cover-package=wordfreq
setup.py (4 lines changed)
@@ -28,7 +28,7 @@ README_contents = open(os.path.join(current_dir, 'README.md'),
                       encoding='utf-8').read()
doclines = README_contents.split("\n")
dependencies = [
-    'ftfy >= 5', 'msgpack', 'langcodes >= 1.4', 'regex == 2017.07.28'
+    'msgpack', 'langcodes >= 1.4.1', 'regex == 2018.02.21'
]
if sys.version_info < (3, 4):
    dependencies.append('pathlib')
@@ -36,7 +36,7 @@ if sys.version_info < (3, 4):

setup(
    name="wordfreq",
-    version='1.7.0',
+    version='2.0',
    maintainer='Luminoso Technologies, Inc.',
    maintainer_email='info@luminoso.com',
    url='http://github.com/LuminosoInsight/wordfreq/',
(diff continues in another file; filename not shown in this capture)
@@ -1,9 +1,9 @@
from wordfreq import (
    word_frequency, available_languages, cB_to_freq,
-    top_n_list, random_words, random_ascii_words, tokenize
+    top_n_list, random_words, random_ascii_words, tokenize, lossy_tokenize
)
from nose.tools import (
-    eq_, assert_almost_equal, assert_greater, raises
+    eq_, assert_almost_equal, assert_greater, raises, assert_not_equal
)
@@ -15,35 +15,29 @@ def test_freq_examples():
    assert_greater(word_frequency('de', 'es'),
                   word_frequency('the', 'es'))

+    # We get word frequencies from the 'large' list when available
+    assert_greater(word_frequency('infrequency', 'en'), 0.)

-# To test the reasonableness of the Twitter list, we want to look up a
-# common word representing laughter in each language. The default for
-# languages not listed here is 'haha'.
-LAUGHTER_WORDS = {
-    'en': 'lol',
-    'hi': 'lol',
-    'cs': 'lol',
-    'ru': 'лол',
-    'zh': '笑',
-    'ja': '笑',
-    'ar': 'ﻪﻬﻬﻬﻫ',
-    'fa': 'خخخخ',
-    'ca': 'jaja',
-    'es': 'jaja',
-    'fr': 'ptdr',
-    'pt': 'kkkk',
-    'he': 'חחח',
-    'bg': 'ахаха',
-    'uk': 'хаха',
-    'bn': 'হা হা',
-    'mk': 'хаха'
-}

def test_languages():
-    # Make sure the number of available languages doesn't decrease
+    # Make sure we get all the languages when looking for the default
+    # 'best' wordlist
    avail = available_languages()
-    assert_greater(len(avail), 26)
+    assert_greater(len(avail), 32)
+
+    # 'small' covers the same languages, but with some different lists
+    avail_small = available_languages('small')
+    eq_(len(avail_small), len(avail))
+    assert_not_equal(avail_small, avail)
+
+    # 'combined' is the same as 'small'
+    avail_old_name = available_languages('combined')
+    eq_(avail_old_name, avail_small)
+
+    # 'large' covers fewer languages
+    avail_large = available_languages('large')
+    assert_greater(len(avail_large), 12)
+    assert_greater(len(avail), len(avail_large))

    # Look up the digit '2' in the main word list for each language
    for lang in avail:
@@ -55,17 +49,6 @@ def test_languages():
        assert_greater(word_frequency('2', new_lang_code), 0, new_lang_code)


-def test_twitter():
-    avail = available_languages('twitter')
-    assert_greater(len(avail), 15)
-
-    for lang in avail:
-        assert_greater(word_frequency('rt', lang, 'twitter'),
-                       word_frequency('rt', lang, 'combined'))
-        text = LAUGHTER_WORDS.get(lang, 'haha')
-        assert_greater(word_frequency(text, lang, wordlist='twitter'), 0, (text, lang))
-
-
def test_minimums():
    eq_(word_frequency('esquivalience', 'en'), 0)
    eq_(word_frequency('esquivalience', 'en', minimum=1e-6), 1e-6)
@@ -164,13 +147,13 @@ def test_casefolding():
def test_number_smashing():
    eq_(tokenize('"715 - CRΣΣKS" by Bon Iver', 'en'),
        ['715', 'crσσks', 'by', 'bon', 'iver'])
-    eq_(tokenize('"715 - CRΣΣKS" by Bon Iver', 'en', combine_numbers=True),
+    eq_(lossy_tokenize('"715 - CRΣΣKS" by Bon Iver', 'en'),
        ['000', 'crσσks', 'by', 'bon', 'iver'])
-    eq_(tokenize('"715 - CRΣΣKS" by Bon Iver', 'en', combine_numbers=True, include_punctuation=True),
+    eq_(lossy_tokenize('"715 - CRΣΣKS" by Bon Iver', 'en', include_punctuation=True),
        ['"', '000', '-', 'crσσks', '"', 'by', 'bon', 'iver'])
-    eq_(tokenize('1', 'en', combine_numbers=True), ['1'])
-    eq_(tokenize('3.14', 'en', combine_numbers=True), ['0.00'])
-    eq_(tokenize('24601', 'en', combine_numbers=True), ['00000'])
+    eq_(lossy_tokenize('1', 'en'), ['1'])
+    eq_(lossy_tokenize('3.14', 'en'), ['0.00'])
+    eq_(lossy_tokenize('24601', 'en'), ['00000'])
    eq_(word_frequency('24601', 'en'), word_frequency('90210', 'en'))
@@ -231,6 +214,7 @@ def test_ideographic_fallback():
        ['ひらがな', 'カタカナ', 'romaji']
    )


def test_other_languages():
    # Test that we leave Thai letters stuck together. If we had better Thai support,
    # we would actually split this into a three-word phrase.
(diff continues in another file; filename not shown in this capture)
@@ -55,10 +55,19 @@ def test_tokens():
        ]
    )

-    # You match the same tokens if you look it up in Traditional Chinese.
-    eq_(tokenize(fact_simplified, 'zh'), tokenize(fact_traditional, 'zh'))
+    # Check that Traditional Chinese works at all
    assert_greater(word_frequency(fact_traditional, 'zh'), 0)

+    # You get the same token lengths if you look it up in Traditional Chinese,
+    # but the words are different
+    simp_tokens = tokenize(fact_simplified, 'zh', include_punctuation=True)
+    trad_tokens = tokenize(fact_traditional, 'zh', include_punctuation=True)
+    eq_(''.join(simp_tokens), fact_simplified)
+    eq_(''.join(trad_tokens), fact_traditional)
+    simp_lengths = [len(token) for token in simp_tokens]
+    trad_lengths = [len(token) for token in trad_tokens]
+    eq_(simp_lengths, trad_lengths)


def test_combination():
    xiexie_freq = word_frequency('谢谢', 'zh')  # "Thanks"
@@ -83,5 +92,3 @@ def test_alternate_codes():
    # Separate codes for Mandarin and Cantonese
    eq_(tokenize('谢谢谢谢', 'cmn'), tokens)
    eq_(tokenize('谢谢谢谢', 'yue'), tokens)
-
-
(diff continues in another file; filename not shown in this capture)
@@ -1,5 +1,6 @@
from nose.tools import eq_
from wordfreq import tokenize
+from wordfreq.preprocess import preprocess_text


def test_transliteration():
@@ -10,6 +11,21 @@ def test_transliteration():
    eq_(tokenize("Pa, ima tu mnogo stvari koje ne shvataš.", 'sr'),
        ['pa', 'ima', 'tu', 'mnogo', 'stvari', 'koje', 'ne', 'shvataš'])

+    # I don't have examples of complete sentences in Azerbaijani that are
+    # naturally in Cyrillic, because it turns out everyone writes Azerbaijani
+    # in Latin letters on the Internet, _except_ sometimes for Wiktionary.
+    # So here are some individual words.
+
+    # 'library' in Azerbaijani Cyrillic
+    eq_(preprocess_text('китабхана', 'az'), 'kitabxana')
+    eq_(preprocess_text('КИТАБХАНА', 'az'), 'kitabxana')
+    eq_(preprocess_text('KİTABXANA', 'az'), 'kitabxana')
+
+    # 'scream' in Azerbaijani Cyrillic
+    eq_(preprocess_text('бағырты', 'az'), 'bağırtı')
+    eq_(preprocess_text('БАҒЫРТЫ', 'az'), 'bağırtı')
+    eq_(preprocess_text('BAĞIRTI', 'az'), 'bağırtı')


def test_actually_russian():
    # This looks mostly like Serbian, but was probably actually Russian.
(diff continues in another file; filename not shown in this capture)
@@ -1,4 +1,3 @@
-from wordfreq.tokens import tokenize, simple_tokenize
from pkg_resources import resource_filename
from functools import lru_cache
import langcodes
@@ -10,18 +9,15 @@ import random
import logging
import math

+from .tokens import tokenize, simple_tokenize, lossy_tokenize
+from .language_info import get_language_info
+
logger = logging.getLogger(__name__)

CACHE_SIZE = 100000
DATA_PATH = pathlib.Path(resource_filename('wordfreq', 'data'))

-# Chinese and Japanese are written without spaces. In Chinese, in particular,
-# we have to infer word boundaries from the frequencies of the words they
-# would create. When this happens, we should adjust the resulting frequency
-# to avoid creating a bias toward improbable word combinations.
-INFERRED_SPACE_LANGUAGES = {'zh'}
-
# We'll divide the frequency by 10 for each token boundary that was inferred.
# (We determined the factor of 10 empirically by looking at words in the
# Chinese wordlist that weren't common enough to be identified by the
@@ -30,8 +26,9 @@ INFERRED_SPACE_LANGUAGES = {'zh'}
# frequency.)
INFERRED_SPACE_FACTOR = 10.0

-# simple_tokenize is imported so that other things can import it from here.
-# Suppress the pyflakes warning.
+# tokenize and simple_tokenize are imported so that other things can import
+# them from here. Suppress the pyflakes warning.
+tokenize = tokenize
simple_tokenize = simple_tokenize
@@ -87,11 +84,21 @@ def read_cBpack(filename):
    return data[1:]


-def available_languages(wordlist='combined'):
+def available_languages(wordlist='best'):
    """
-    List the languages (as language-code strings) that the wordlist of a given
-    name is available in.
+    Given a wordlist name, return a dictionary of language codes to filenames,
+    representing all the languages in which that wordlist is available.
    """
+    if wordlist == 'best':
+        available = available_languages('small')
+        available.update(available_languages('large'))
+        return available
+    elif wordlist == 'combined':
+        logger.warning(
+            "The 'combined' wordlists have been renamed to 'small'."
+        )
+        wordlist = 'small'
+
    available = {}
    for path in DATA_PATH.glob('*.msgpack.gz'):
        if not path.name.startswith('_'):
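As a usage sketch (not part of the diff itself), the relationships this new code establishes, and which test_languages earlier in this diff checks, look like this; exact language counts are omitted because they depend on the shipped data:

    >>> from wordfreq import available_languages

    >>> # 'best' covers the same languages as 'small', preferring 'large' files
    >>> avail = available_languages()        # same as available_languages('best')
    >>> sorted(avail) == sorted(available_languages('small'))
    True
    >>> len(available_languages('large')) < len(avail)
    True

    >>> # The old name 'combined' still works, with a deprecation warning
    >>> available_languages('combined') == available_languages('small')
    True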
@@ -103,7 +110,7 @@ def available_languages(wordlist='combined'):


@lru_cache(maxsize=None)
-def get_frequency_list(lang, wordlist='combined', match_cutoff=30):
+def get_frequency_list(lang, wordlist='best', match_cutoff=30):
    """
    Read the raw data from a wordlist file, returning it as a list of
    lists. (See `read_cBpack` for what this represents.)
@@ -117,7 +124,8 @@ def get_frequency_list(lang, wordlist='combined', match_cutoff=30):
    best, score = langcodes.best_match(lang, list(available),
                                       min_score=match_cutoff)
    if score == 0:
-        raise LookupError("No wordlist available for language %r" % lang)
+        raise LookupError("No wordlist %r available for language %r"
+                          % (wordlist, lang))

    if best != lang:
        logger.warning(
|
|||||||
|
|
||||||
|
|
||||||
@lru_cache(maxsize=None)
|
@lru_cache(maxsize=None)
|
||||||
def get_frequency_dict(lang, wordlist='combined', match_cutoff=30):
|
def get_frequency_dict(lang, wordlist='best', match_cutoff=30):
|
||||||
"""
|
"""
|
||||||
Get a word frequency list as a dictionary, mapping tokens to
|
Get a word frequency list as a dictionary, mapping tokens to
|
||||||
frequencies as floating-point probabilities.
|
frequencies as floating-point probabilities.
|
||||||
@@ -198,7 +206,7 @@ def get_frequency_dict(lang, wordlist='combined', match_cutoff=30):
    return freqs


-def iter_wordlist(lang, wordlist='combined'):
+def iter_wordlist(lang, wordlist='best'):
    """
    Yield the words in a wordlist in approximate descending order of
    frequency.
@@ -215,8 +223,9 @@ def iter_wordlist(lang, wordlist='combined'):
# it takes to look up frequencies from scratch, so something faster is needed.
_wf_cache = {}


def _word_frequency(word, lang, wordlist, minimum):
-    tokens = tokenize(word, lang, combine_numbers=True)
+    tokens = lossy_tokenize(word, lang)
    if not tokens:
        return minimum
@@ -234,39 +243,31 @@ def _word_frequency(word, lang, wordlist, minimum):

    freq = 1.0 / one_over_result

-    if lang in INFERRED_SPACE_LANGUAGES:
+    if get_language_info(lang)['tokenizer'] == 'jieba':
+        # If we used the Jieba tokenizer, we could tokenize anything to match
+        # our wordlist, even nonsense. To counteract this, we multiply by a
+        # probability for each word break that was inferred.
        freq /= INFERRED_SPACE_FACTOR ** (len(tokens) - 1)

    return max(freq, minimum)
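A rough worked example of that adjustment (hypothetical numbers; this assumes, as the surrounding code suggests, that `one_over_result` accumulates the reciprocal of each token's frequency):

    # Suppose a Chinese query is split by Jieba into two tokens with
    # frequencies 1e-3 and 1e-4.
    one_over_result = 1 / 1e-3 + 1 / 1e-4   # 11000.0
    freq = 1.0 / one_over_result            # ~9.09e-5, the half-harmonic mean
    freq /= 10.0 ** (2 - 1)                 # one inferred boundary: ~9.09e-6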
-def word_frequency(word, lang, wordlist='combined', minimum=0.):
+def word_frequency(word, lang, wordlist='best', minimum=0.):
    """
    Get the frequency of `word` in the language with code `lang`, from the
-    specified `wordlist`. The default wordlist is 'combined', built from
-    whichever of these five sources have sufficient data for the language:
-
-    - Full text of Wikipedia
-    - A sample of 72 million tweets collected from Twitter in 2014,
-      divided roughly into languages using automatic language detection
-    - Frequencies extracted from OpenSubtitles
-    - The Leeds Internet Corpus
-    - Google Books Syntactic Ngrams 2013
-
-    Another available wordlist is 'twitter', which uses only the data from
-    Twitter.
-
-    Words that we believe occur at least once per million tokens, based on
-    the average of these lists, will appear in the word frequency list.
+    specified `wordlist`.
+
+    These wordlists can be specified:
+
+    - 'large': a wordlist built from at least 5 sources, containing word
+      frequencies of 10^-8 and higher
+    - 'small': a wordlist built from at least 3 sources, containing word
+      frequencies of 10^-6 and higher
+    - 'best': uses 'large' if available, and 'small' otherwise

    The value returned will always be at least as large as `minimum`.
-
-    If a word decomposes into multiple tokens, we'll return a smoothed estimate
-    of the word frequency that is no greater than the frequency of any of its
-    individual tokens.
-
-    It should be noted that the current tokenizer does not support
-    multi-word Chinese phrases.
+    You could set this value to 10^-8, for example, to return 10^-8 for
+    unknown words in the 'large' list instead of 0, avoiding a discontinuity.
    """
    args = (word, lang, wordlist, minimum)
    try:
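For illustration, the `minimum` behavior described in the new docstring works like this (the values are the ones asserted by test_minimums earlier in this diff):

    from wordfreq import word_frequency

    word_frequency('esquivalience', 'en')                 # 0: not in any wordlist
    word_frequency('esquivalience', 'en', minimum=1e-6)   # 1e-6: the floor is returned instead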
@@ -278,7 +279,7 @@ def word_frequency(word, lang, wordlist='combined', minimum=0.):
        return _wf_cache[args]


-def zipf_frequency(word, lang, wordlist='combined', minimum=0.):
+def zipf_frequency(word, lang, wordlist='best', minimum=0.):
    """
    Get the frequency of `word`, in the language with code `lang`, on the Zipf
    scale.
@@ -306,7 +307,7 @@ def zipf_frequency(word, lang, wordlist='combined', minimum=0.):


@lru_cache(maxsize=100)
-def top_n_list(lang, n, wordlist='combined', ascii_only=False):
+def top_n_list(lang, n, wordlist='best', ascii_only=False):
    """
    Return a frequency list of length `n` in descending order of frequency.
    This list contains words from `wordlist`, of the given language.
@@ -321,7 +322,7 @@ def top_n_list(lang, n, wordlist='combined', ascii_only=False):
    return results


-def random_words(lang='en', wordlist='combined', nwords=5, bits_per_word=12,
+def random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12,
                 ascii_only=False):
    """
    Returns a string of random, space separated words.
@@ -346,7 +347,7 @@ def random_words(lang='en', wordlist='combined', nwords=5, bits_per_word=12,
    return ' '.join([random.choice(choices) for i in range(nwords)])


-def random_ascii_words(lang='en', wordlist='combined', nwords=5,
+def random_ascii_words(lang='en', wordlist='best', nwords=5,
                       bits_per_word=12):
    """
    Returns a string of random, space separated, ASCII words.
(diff continues in another file; filename not shown in this capture)
@@ -49,4 +49,11 @@ def jieba_tokenize(text, external_wordlist=False):
    else:
        if jieba_tokenizer is None:
            jieba_tokenizer = jieba.Tokenizer(dictionary=DICT_FILENAME)
-        return jieba_tokenizer.lcut(simplify_chinese(text), HMM=False)
+
+        # Tokenize the Simplified Chinese version of the text, but return
+        # those spans from the original text, even if it's in Traditional
+        # Chinese
+        tokens = []
+        for _token, start, end in jieba_tokenizer.tokenize(simplify_chinese(text), HMM=False):
+            tokens.append(text[start:end])
+        return tokens
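A brief sketch of the effect (the segmentation shown is hypothetical; the invariant that the tokens join back into the original string is what the Chinese tests earlier in this diff assert):

    # The Simplified conversion is used only to find token boundaries; the
    # returned tokens are slices of the original text, so Traditional
    # characters are no longer rewritten as Simplified.
    #
    #   jieba_tokenize('谢谢你')  ->  ['谢谢', '你']    (hypothetical split)
    #   jieba_tokenize('謝謝你')  ->  ['謝謝', '你']    (Traditional preserved)
    #
    # In both cases ''.join(tokens) == the original input.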
The remaining changes are binary files, which the diff viewer does not display.
New 'small' wordlist files were added for these languages:

    wordfreq/data/small_{ar,bg,bn,ca,cs,da,de,el,en,es,fa,fi,fr,he,hi,hu,id,it,
    ja,ko,mk,ms,nb,nl,pl,pt,ro,ru,sh,sv,tr,uk,zh}.msgpack.gz

A number of other binary files changed (contents not shown), one file's diff
was suppressed because it was too large, and some files were not shown because
too many files have changed in this diff.