Merge pull request #58 from LuminosoInsight/significant-figures

Round wordfreq output to 3 sig. figs, and update documentation
Lance Nathan, 2018-06-25 18:53:39 -04:00, committed by GitHub
commit 79caa526c3
7 changed files with 137 additions and 69 deletions


@@ -1,3 +1,20 @@
+## Version 2.1 (2018-06-18)
+
+Data changes:
+
+- Updated to the data from the latest Exquisite Corpus, which adds the
+  ParaCrawl web crawl and updates to OpenSubtitles 2018
+- Added small word list for Latvian
+- Added large word list for Czech
+- The Dutch large word list once again has 5 data sources
+
+Library change:
+
+- The output of `word_frequency` is rounded to three significant digits. This
+  provides friendlier output, and better reflects the precision of the
+  underlying data anyway.
+
## Version 2.0.1 (2018-05-01)
Fixed edge cases that inserted spurious token boundaries when Japanese text is
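To illustrate the rounding change announced in Version 2.1 above: a minimal sketch of the new user-visible behavior (the exact values depend on the wordlist data shipped with the release; the real implementation appears in the last file of this diff):

```python
from wordfreq import word_frequency

# Output now carries at most three significant digits, e.g. 1.07e-05
# rather than 1.1748975549395302e-05 (illustrative values).
print(word_frequency('cafe', 'en'))
```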

README.md (168 changed lines)

@@ -1,4 +1,5 @@
-wordfreq is a Python library for looking up the frequencies of words in many languages, based on many sources of data.
+wordfreq is a Python library for looking up the frequencies of words in many
+languages, based on many sources of data.
Author: Robyn Speer
@@ -22,7 +23,8 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
## Usage

wordfreq provides access to estimates of the frequency with which a word is
-used, in 35 languages (see *Supported languages* below).
+used, in 36 languages (see *Supported languages* below). It uses many different
+data sources, not just one corpus.
It provides both 'small' and 'large' wordlists:
@@ -39,21 +41,20 @@ The most straightforward function for looking up frequencies is:

    word_frequency(word, lang, wordlist='best', minimum=0.0)

This function looks up a word's frequency in the given language, returning its
-frequency as a decimal between 0 and 1. In these examples, we'll multiply the
-frequencies by a million (1e6) to get more readable numbers:
+frequency as a decimal between 0 and 1.
    >>> from wordfreq import word_frequency
-    >>> word_frequency('cafe', 'en') * 1e6
-    11.748975549395302
+    >>> word_frequency('cafe', 'en')
+    1.07e-05

-    >>> word_frequency('café', 'en') * 1e6
-    3.890451449942805
+    >>> word_frequency('café', 'en')
+    5.89e-06

-    >>> word_frequency('cafe', 'fr') * 1e6
-    1.4454397707459279
+    >>> word_frequency('cafe', 'fr')
+    1.51e-06

-    >>> word_frequency('café', 'fr') * 1e6
-    53.70317963702532
+    >>> word_frequency('café', 'fr')
+    5.25e-05
`zipf_frequency` is a variation on `word_frequency` that aims to return the
@@ -74,13 +75,13 @@ one occurrence per billion words.
    7.77

    >>> zipf_frequency('word', 'en')
-    5.32
+    5.29

    >>> zipf_frequency('frequency', 'en')
-    4.38
+    4.42

    >>> zipf_frequency('zipf', 'en')
-    1.32
+    1.55

    >>> zipf_frequency('zipf', 'en', wordlist='small')
    0.0
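For orientation, the Zipf scale used above is the base-10 logarithm of a word's frequency per billion words. A minimal sketch of that relationship (outputs depend on the shipped data, and both functions round their results, so agreement is approximate):

```python
from math import log10
from wordfreq import word_frequency, zipf_frequency

freq = word_frequency('word', 'en')
# The Zipf value is, up to rounding, log10 of occurrences per billion words
print(round(log10(freq * 1e9), 2))   # approximately 5.29
print(zipf_frequency('word', 'en'))  # 5.29
```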
@@ -102,45 +103,39 @@ The parameters to `word_frequency` and `zipf_frequency` are:
value contained in the wordlist, to avoid a discontinuity where the wordlist
ends.
-Other functions:
-
-`tokenize(text, lang)` splits text in the given language into words, in the same
-way that the words in wordfreq's data were counted in the first place. See
-*Tokenization*.

+## Frequency bins

-`top_n_list(lang, n, wordlist='best')` returns the most common *n* words in
-the list, in descending frequency order.

+wordfreq's wordlists are designed to load quickly and take up little space in
+the repository. We accomplish this by avoiding meaningless precision and
+packing the words into frequency bins.

-    >>> from wordfreq import top_n_list
-    >>> top_n_list('en', 10)
-    ['the', 'of', 'to', 'and', 'a', 'in', 'i', 'is', 'that', 'for']

+In wordfreq, all words that have the same Zipf frequency rounded to the nearest
+hundredth have the same frequency. We don't store any more precision than that.
+So instead of having to store that the frequency of a word is
+.000011748975549395302, where most of those digits are meaningless, we just store
+the frequency bins and the words they contain.

-    >>> top_n_list('es', 10)
-    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']

+Because the Zipf scale is a logarithmic scale, this preserves the same relative
+precision no matter how far down you are in the word list. The frequency of any
+word is precise to within 1%.

-`iter_wordlist(lang, wordlist='best')` iterates through all the words in a
-wordlist, in descending frequency order.

+(This is not a claim about _accuracy_, but about _precision_. We believe that
+the way we use multiple data sources and discard outliers makes wordfreq a
+more accurate measurement of the way these words are really used in written
+language, but it's unclear how one would measure this accuracy.)
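A sketch of the binning idea described above (illustrative only; wordfreq applies this when building its wordlists, not at query time):

```python
from math import log10

def binned_frequency(raw_freq):
    """Snap a raw frequency to the nearest hundredth on the Zipf scale."""
    zipf = log10(raw_freq) + 9         # Zipf scale: log10 of frequency per billion words
    return 10 ** (round(zipf, 2) - 9)

# A half-bin of 0.005 Zipf units is a relative error of 10**0.005 - 1, about 1.2%,
# which is why any stored frequency is precise to within roughly 1%.
print(binned_frequency(1.19e-05))  # ~1.2e-05, snapped to its bin
```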
-`get_frequency_dict(lang, wordlist='best')` returns all the frequencies in
-a wordlist as a dictionary, for cases where you'll want to look up a lot of
-words and don't need the wrapper that `word_frequency` provides.
-
-`supported_languages(wordlist='best')` returns a dictionary whose keys are
-language codes, and whose values are the data file that will be loaded to
-provide the requested wordlist in each language.

+## The figure-skating metric

-`random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12)`
-returns a selection of random words, separated by spaces. `bits_per_word=n`
-will select each random word from 2^n words.

+We combine word frequencies from different sources in a way that's designed
+to minimize the impact of outliers. The method reminds me of the scoring system
+in Olympic figure skating:

-If you happen to want an easy way to get [a memorable, xkcd-style
-password][xkcd936] with 60 bits of entropy, this function will almost do the
-job. In this case, you should actually run the similar function
-`random_ascii_words`, limiting the selection to words that can be typed in
-ASCII. But maybe you should just use [xkpa][].
-
-[xkcd936]: https://xkcd.com/936/
-[xkpa]: https://github.com/beala/xkcd-password

+- Find the frequency of each word according to each data source.
+- For each word, drop the sources that give it the highest and lowest frequency.
+- Average the remaining frequencies.
+- Rescale the resulting frequency list to add up to 1.
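A sketch of the trimming scheme in the list above, under the simplifying assumption that all sources are weighted equally (the real combination happens in the Exquisite Corpus build pipeline, not in this library):

```python
def combine_sources(per_source_freqs):
    """Drop the highest and lowest estimate for one word, then average the rest."""
    if len(per_source_freqs) <= 2:
        return sum(per_source_freqs) / len(per_source_freqs)
    trimmed = sorted(per_source_freqs)[1:-1]
    return sum(trimmed) / len(trimmed)

# The outlier estimate 0.009 is discarded instead of skewing the average;
# the combined values for all words would then be rescaled to sum to 1.
print(combine_sources([0.0010, 0.0012, 0.0011, 0.009]))  # 0.00115
```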
## Sources and supported languages
@@ -155,14 +150,16 @@ Exquisite Corpus compiles 8 different domains of text, some of which themselves
come from multiple sources:
- **Wikipedia**, representing encyclopedic text
-- **Subtitles**, from OPUS OpenSubtitles 2016 and SUBTLEX
+- **Subtitles**, from OPUS OpenSubtitles 2018 and SUBTLEX
- **News**, from NewsCrawl 2014 and GlobalVoices
- **Books**, from Google Books Ngrams 2012
-- **Web** text, from the Leeds Internet Corpus and the MOKK Hungarian Webcorpus
+- **Web** text, from ParaCrawl, the Leeds Internet Corpus, and the MOKK
+  Hungarian Webcorpus
- **Twitter**, representing short-form social media
- **Reddit**, representing potentially longer Internet comments
- **Miscellaneous** word frequencies: in Chinese, we import a free wordlist
-that comes with the Jieba word segmenter, whose provenance we don't really know
+that comes with the Jieba word segmenter, whose provenance we don't really
+know
The following languages are supported, with reasonable tokenization and at
least 3 different sources of word frequencies:
@@ -224,6 +221,52 @@ between 1.0 and 3.0. These are available in 14 languages that are covered by
enough data sources.
+## Other functions
+
+`tokenize(text, lang)` splits text in the given language into words, in the same
+way that the words in wordfreq's data were counted in the first place. See
+*Tokenization*.
+
+`top_n_list(lang, n, wordlist='best')` returns the most common *n* words in
+the list, in descending frequency order.
+
+    >>> from wordfreq import top_n_list
+    >>> top_n_list('en', 10)
+    ['the', 'of', 'to', 'and', 'a', 'in', 'i', 'is', 'that', 'for']
+
+    >>> top_n_list('es', 10)
+    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']
+
+`iter_wordlist(lang, wordlist='best')` iterates through all the words in a
+wordlist, in descending frequency order.
+
+`get_frequency_dict(lang, wordlist='best')` returns all the frequencies in
+a wordlist as a dictionary, for cases where you'll want to look up a lot of
+words and don't need the wrapper that `word_frequency` provides.
+
+`available_languages(wordlist='best')` returns a dictionary whose keys are
+language codes, and whose values are the data file that will be loaded to
+provide the requested wordlist in each language.
+
+`get_language_info(lang)` returns a dictionary of information about how we
+preprocess text in this language, such as what script we expect it to be
+written in, which characters we normalize together, and how we tokenize it.
+See its docstring for more information.
+
+`random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12)`
+returns a selection of random words, separated by spaces. `bits_per_word=n`
+will select each random word from 2^n words.
+
+If you happen to want an easy way to get [a memorable, xkcd-style
+password][xkcd936] with 60 bits of entropy, this function will almost do the
+job. In this case, you should actually run the similar function
+`random_ascii_words`, limiting the selection to words that can be typed in
+ASCII. But maybe you should just use [xkpa][].
+
+[xkcd936]: https://xkcd.com/936/
+[xkpa]: https://github.com/beala/xkcd-password
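A hypothetical invocation of the password use case described above, assuming `random_ascii_words` takes the same parameters as `random_words` (as the text implies): five words at 12 bits each gives the 60 bits of entropy mentioned.

```python
from wordfreq import random_ascii_words

# 5 words x 12 bits/word = 60 bits of entropy; output varies on every call
print(random_ascii_words(lang='en', nwords=5, bits_per_word=12))
```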
## Tokenization
wordfreq uses the Python package `regex`, which is a more advanced
@@ -255,9 +298,9 @@ also try to deal gracefully when you query it with texts that actually break
into multiple tokens:
    >>> zipf_frequency('New York', 'en')
-    5.35
+    5.28

    >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.54
+    3.57
The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be. In Chinese,
@@ -272,7 +315,7 @@ you give it an uncommon combination of tokens, it will hugely over-estimate
their frequency:
    >>> zipf_frequency('owl-flavored', 'en')
-    3.18
+    3.2
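The half-harmonic-mean combination mentioned above can be sketched as the reciprocal of the summed reciprocals, which is exactly the identity the tests later in this commit assert (this sketch ignores the `INFERRED_SPACE_FACTOR` penalty that the implementation also applies for languages written without spaces):

```python
def combined_frequency(token_freqs):
    """Estimate a phrase's frequency from its tokens' frequencies:
    1/f_phrase = 1/f_1 + 1/f_2 + ... (illustrative numbers below)."""
    return 1.0 / sum(1.0 / f for f in token_freqs)

# Always lower than either token's own frequency, as a phrase should be
print(combined_frequency([3.3e-05, 2.6e-05]))  # ~1.45e-05
```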
## Multi-script languages
@@ -411,14 +454,14 @@ sources:
- The Leeds Internet Corpus, from the University of Leeds Centre for Translation
Studies (http://corpus.leeds.ac.uk/list.html)
-- The OpenSubtitles Frequency Word Lists, compiled by Hermit Dave
-  (https://invokeit.wordpress.com/frequency-word-lists/)
- Wikipedia, the free encyclopedia (http://www.wikipedia.org)
+- ParaCrawl, a multilingual Web crawl (https://paracrawl.eu)
It contains data from OPUS OpenSubtitles 2018
(http://opus.nlpl.eu/OpenSubtitles.php), whose data originates from the
-OpenSubtitles project (http://www.opensubtitles.org/).
+OpenSubtitles project (http://www.opensubtitles.org/) and may be used with
+attribution to OpenSubtitles.
It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK,
SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al.
@@ -483,10 +526,6 @@ The same citation in BibTex format:
Methods, 41 (4), 977-990.
http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf
-- Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A. M., Bölte, J., & Böhl, A.
-  (2015). The word frequency effect. Experimental Psychology.
-  http://econtent.hogrefe.com/doi/abs/10.1027/1618-3169/a000123?journalCode=zea
- Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A.
(2011). The word frequency effect: A review of recent developments and
implications for the choice of frequency estimates in German. Experimental
@@ -496,9 +535,6 @@ The same citation in BibTex format:
frequencies based on film subtitles. PLoS One, 5(6), e10729.
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729
-- Dave, H. (2011). Frequency word lists.
-  https://invokeit.wordpress.com/frequency-word-lists/
- Davis, M. (2012). Unicode text segmentation. Unicode Standard Annex, 29.
http://unicode.org/reports/tr29/
@@ -516,11 +552,19 @@ The same citation in BibTex format:
analyzer.
http://mecab.sourceforge.net/
+- Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W., and Petrov,
+  S. (2012). Syntactic annotations for the Google Books Ngram Corpus.
+  Proceedings of the ACL 2012 system demonstrations, 169-174.
+  http://aclweb.org/anthology/P12-3029
- Lison, P. and Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large
Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th
International Conference on Language Resources and Evaluation (LREC 2016).
http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf
+- ParaCrawl (2018). Provision of Web-Scale Parallel Corpora for Official
+  European Languages. https://paracrawl.eu/
- van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014).
SUBTLEX-UK: A new and improved word frequency database for British English.
The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.


@@ -1,2 +1,2 @@
[pytest]
-addopts = --doctest-modules
+addopts = --doctest-modules --doctest-glob=README.md
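The added option makes pytest collect the interpreter-style examples in README.md as doctests, which is why every frequency shown in the README diff above had to be updated to the new rounded output. A standalone equivalent, for illustration:

```python
import doctest

# Runs the '>>>' examples in the README and compares their printed output
results = doctest.testfile('README.md', module_relative=False)
print(results)  # TestResults(failed=0, attempted=...) when everything matches
```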


@@ -162,7 +162,7 @@ def test_phrase_freq():
    ff = word_frequency("flip-flop", 'en')
    assert ff > 0
    phrase_freq = 1.0 / word_frequency('flip', 'en') + 1.0 / word_frequency('flop', 'en')
-    assert 1.0 / ff == pytest.approx(phrase_freq)
+    assert 1.0 / ff == pytest.approx(phrase_freq, rel=0.01)
def test_not_really_random():
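Why `rel=0.01` here and in the next two test files: with outputs rounded to three significant digits, identities like `1/f == 1/f1 + 1/f2` only hold to within roughly a percent. A small sketch of the `pytest.approx` tolerance semantics (hypothetical numbers):

```python
import pytest

# rel=0.01 accepts up to 1% relative deviation from the expected value
assert 100.0 == pytest.approx(100.9, rel=0.01)        # 0.9% off: passes
assert not (100.0 == pytest.approx(102.0, rel=0.01))  # 2% off: fails the bound
```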


@@ -49,11 +49,11 @@ def test_combination():
    gozai_freq = word_frequency('ござい', 'ja')
    masu_freq = word_frequency('ます', 'ja')

-    assert word_frequency('おはようおはよう', 'ja') == pytest.approx(ohayou_freq / 2)
+    assert word_frequency('おはようおはよう', 'ja') == pytest.approx(ohayou_freq / 2, rel=0.01)
    assert (
        1.0 / word_frequency('おはようございます', 'ja') ==
-        pytest.approx(1.0 / ohayou_freq + 1.0 / gozai_freq + 1.0 / masu_freq)
+        pytest.approx(1.0 / ohayou_freq + 1.0 / gozai_freq + 1.0 / masu_freq, rel=0.01)
    )


@@ -10,9 +10,9 @@ def test_combination():
    gamsa_freq = word_frequency('감사', 'ko')
    habnida_freq = word_frequency('합니다', 'ko')

-    assert word_frequency('감사감사', 'ko') == pytest.approx(gamsa_freq / 2)
+    assert word_frequency('감사감사', 'ko') == pytest.approx(gamsa_freq / 2, rel=0.01)
    assert (
        1.0 / word_frequency('감사합니다', 'ko') ==
-        pytest.approx(1.0 / gamsa_freq + 1.0 / habnida_freq)
+        pytest.approx(1.0 / gamsa_freq + 1.0 / habnida_freq, rel=0.01)
    )


@@ -249,7 +249,14 @@ def _word_frequency(word, lang, wordlist, minimum):
        # probability for each word break that was inferred.
        freq /= INFERRED_SPACE_FACTOR ** (len(tokens) - 1)
-    return max(freq, minimum)
+
+    # All our frequency data is only precise to within 1% anyway, so round
+    # it to 3 significant digits
+    unrounded = max(freq, minimum)
+    if unrounded == 0.:
+        return 0.
+    else:
+        leading_zeroes = math.floor(-math.log(unrounded, 10))
+        return round(unrounded, leading_zeroes + 3)
def word_frequency(word, lang, wordlist='best', minimum=0.):
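A worked example of the arithmetic in the added lines above (illustrative value; `math` is presumably imported at the top of the module, outside this hunk):

```python
import math

unrounded = 1.0745e-05
# -log10(1.0745e-05) is about 4.97, so floor() counts 4 leading zeroes
leading_zeroes = math.floor(-math.log(unrounded, 10))
# Rounding to 4 + 3 = 7 decimal places keeps 3 significant digits
print(round(unrounded, leading_zeroes + 3))  # 1.07e-05
```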