Merge pull request #58 from LuminosoInsight/significant-figures

Round wordfreq output to 3 sig. figs, and update documentation
Lance Nathan 2018-06-25 18:53:39 -04:00 committed by GitHub
commit 79caa526c3
7 changed files with 137 additions and 69 deletions


@@ -1,3 +1,20 @@
## Version 2.1 (2018-06-18)

Data changes:

- Updated to the data from the latest Exquisite Corpus, which adds the
  ParaCrawl web crawl and updates OpenSubtitles to the 2018 version
- Added small word list for Latvian
- Added large word list for Czech
- The Dutch large word list once again has 5 data sources

Library change:

- The output of `word_frequency` is rounded to three significant digits. This
  provides friendlier output, and better reflects the precision of the
  underlying data anyway.

## Version 2.0.1 (2018-05-01)

Fixed edge cases that inserted spurious token boundaries when Japanese text is

README.md

@@ -1,4 +1,5 @@
wordfreq is a Python library for looking up the frequencies of words in many
languages, based on many sources of data.

Author: Robyn Speer

@@ -22,7 +23,8 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
## Usage

wordfreq provides access to estimates of the frequency with which a word is
used, in 36 languages (see *Supported languages* below). It uses many different
data sources, not just one corpus.

It provides both 'small' and 'large' wordlists:

@@ -39,21 +41,20 @@ The most straightforward function for looking up frequencies is:

    word_frequency(word, lang, wordlist='best', minimum=0.0)

This function looks up a word's frequency in the given language, returning its
frequency as a decimal between 0 and 1.

    >>> from wordfreq import word_frequency
    >>> word_frequency('cafe', 'en')
    1.07e-05
    >>> word_frequency('café', 'en')
    5.89e-06
    >>> word_frequency('cafe', 'fr')
    1.51e-06
    >>> word_frequency('café', 'fr')
    5.25e-05

`zipf_frequency` is a variation on `word_frequency` that aims to return the

@@ -74,13 +75,13 @@ one occurrence per billion words.
    7.77
    >>> zipf_frequency('word', 'en')
    5.29
    >>> zipf_frequency('frequency', 'en')
    4.42
    >>> zipf_frequency('zipf', 'en')
    1.55
    >>> zipf_frequency('zipf', 'en', wordlist='small')
    0.0
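
The Zipf scale is a logarithmic transformation of `word_frequency`'s output:
a word's Zipf value is the base-10 logarithm of its frequency per billion
words. A minimal sketch of the conversion (these helpers are illustrative,
not part of wordfreq's API):

    import math

    def frequency_to_zipf(freq):
        # Zipf value = log10 of occurrences per billion (1e9) words,
        # e.g. frequency_to_zipf(1.07e-05) is about 4.03.
        return math.log10(freq * 1e9)

    def zipf_to_frequency(zipf):
        # Invert the definition: 10**zipf occurrences per billion words.
        return 10 ** zipf / 1e9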

@@ -102,45 +103,39 @@ The parameters to `word_frequency` and `zipf_frequency` are:
value contained in the wordlist, to avoid a discontinuity where the wordlist
ends.

## Frequency bins

wordfreq's wordlists are designed to load quickly and take up little space in
the repository. We accomplish this by avoiding meaningless precision and
packing the words into frequency bins.

In wordfreq, all words that have the same Zipf frequency rounded to the nearest
hundredth have the same frequency. We don't store any more precision than that.
So instead of having to store that the frequency of a word is
.000011748975549395302, where most of those digits are meaningless, we just
store the frequency bins and the words they contain.

Because the Zipf scale is a logarithmic scale, this preserves the same relative
precision no matter how far down you are in the word list. The frequency of any
word is precise to within 1%.

(This is not a claim about _accuracy_, but about _precision_. We believe that
the way we use multiple data sources and discard outliers makes wordfreq a
more accurate measurement of the way these words are really used in written
language, but it's unclear how one would measure this accuracy.)
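
As an illustration of the binning (with hypothetical helpers; wordfreq's
actual storage is a packed wordlist, not these functions):

    import math

    def zipf_bin(freq):
        # The bin is the Zipf frequency rounded to the nearest hundredth;
        # every word in the same bin reports the same frequency.
        return round(math.log10(freq * 1e9), 2)

    def bin_frequency(zipf):
        # The single frequency reported for the whole bin.
        return 10 ** zipf / 1e9

    # The over-precise value quoted above falls into bin 4.07:
    assert zipf_bin(0.000011748975549395302) == 4.07

Adjacent bins differ by a factor of `10**0.01`, about 2.3%, which is where
the "precise to within 1%" figure above comes from.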

## The figure-skating metric

We combine word frequencies from different sources in a way that's designed
to minimize the impact of outliers. The method reminds me of the scoring system
in Olympic figure skating:

- Find the frequency of each word according to each data source.
- For each word, drop the sources that give it the highest and lowest frequency.
- Average the remaining frequencies.
- Rescale the resulting frequency list to add up to 1.
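
A sketch of that procedure, assuming each source is represented as a dict
mapping words to frequencies (the function and data shapes here are
hypothetical; the real pipeline lives in Exquisite Corpus):

    def combine_sources(sources):
        # sources: a list of dicts, each mapping word -> frequency.
        combined = {}
        for word in set().union(*sources):
            freqs = sorted(src[word] for src in sources if word in src)
            if len(freqs) > 2:
                # Drop the highest and lowest estimate, like the judges'
                # scores in figure skating.
                freqs = freqs[1:-1]
            combined[word] = sum(freqs) / len(freqs)
        # Rescale so the combined frequencies add up to 1.
        total = sum(combined.values())
        return {word: freq / total for word, freq in combined.items()}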

## Sources and supported languages

@@ -155,14 +150,16 @@ Exquisite Corpus compiles 8 different domains of text, some of which themselves
come from multiple sources:

- **Wikipedia**, representing encyclopedic text
- **Subtitles**, from OPUS OpenSubtitles 2018 and SUBTLEX
- **News**, from NewsCrawl 2014 and GlobalVoices
- **Books**, from Google Books Ngrams 2012
- **Web** text, from ParaCrawl, the Leeds Internet Corpus, and the MOKK
  Hungarian Webcorpus
- **Twitter**, representing short-form social media
- **Reddit**, representing potentially longer Internet comments
- **Miscellaneous** word frequencies: in Chinese, we import a free wordlist
  that comes with the Jieba word segmenter, whose provenance we don't really
  know

The following languages are supported, with reasonable tokenization and at
least 3 different sources of word frequencies:

@@ -224,6 +221,52 @@ between 1.0 and 3.0. These are available in 14 languages that are covered by
enough data sources.

## Other functions

`tokenize(text, lang)` splits text in the given language into words, in the same
way that the words in wordfreq's data were counted in the first place. See
*Tokenization*.
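
For example (an illustrative session; exact tokenization can vary by language
and wordfreq version):

    >>> from wordfreq import tokenize
    >>> tokenize('New York is a big city', 'en')
    ['new', 'york', 'is', 'a', 'big', 'city']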

`top_n_list(lang, n, wordlist='best')` returns the most common *n* words in
the list, in descending frequency order.

    >>> from wordfreq import top_n_list
    >>> top_n_list('en', 10)
    ['the', 'of', 'to', 'and', 'a', 'in', 'i', 'is', 'that', 'for']

    >>> top_n_list('es', 10)
    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']

`iter_wordlist(lang, wordlist='best')` iterates through all the words in a
wordlist, in descending frequency order.

`get_frequency_dict(lang, wordlist='best')` returns all the frequencies in
a wordlist as a dictionary, for cases where you'll want to look up a lot of
words and don't need the wrapper that `word_frequency` provides.

`available_languages(wordlist='best')` returns a dictionary whose keys are
language codes, and whose values are the data file that will be loaded to
provide the requested wordlist in each language.

`get_language_info(lang)` returns a dictionary of information about how we
preprocess text in this language, such as what script we expect it to be
written in, which characters we normalize together, and how we tokenize it.
See its docstring for more information.

`random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12)`
returns a selection of random words, separated by spaces. `bits_per_word=n`
will select each random word from 2^n words.

If you happen to want an easy way to get [a memorable, xkcd-style
password][xkcd936] with 60 bits of entropy, this function will almost do the
job. In this case, you should actually run the similar function
`random_ascii_words`, limiting the selection to words that can be typed in
ASCII. But maybe you should just use [xkpa][].

[xkcd936]: https://xkcd.com/936/
[xkpa]: https://github.com/beala/xkcd-password
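
For instance, assuming `random_ascii_words` accepts the same arguments as
`random_words` (as the paragraph above implies), five words at 12 bits each
gives the 60 bits mentioned. The output shown is made up, since the result
is random by design:

    >>> from wordfreq import random_ascii_words
    >>> random_ascii_words(lang='en', nwords=5, bits_per_word=12)
    'vastly nuclear drawer cope dose'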

## Tokenization

wordfreq uses the Python package `regex`, which is a more advanced

@@ -255,9 +298,9 @@ also try to deal gracefully when you query it with texts that actually break
into multiple tokens:

    >>> zipf_frequency('New York', 'en')
    5.28
    >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
    3.57

The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be. In Chinese,

@@ -272,7 +315,7 @@ you give it an uncommon combination of tokens, it will hugely over-estimate
their frequency:

    >>> zipf_frequency('owl-flavored', 'en')
    3.2
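
A minimal sketch of that half-harmonic-mean combination (a simplified
stand-in for wordfreq's internals; as the `_word_frequency` hunk later in
this diff shows, inferred word breaks in languages written without spaces
also incur an `INFERRED_SPACE_FACTOR` penalty):

    def combined_frequency(token_freqs):
        # Half harmonic mean: the reciprocal of the summed reciprocals.
        # For two tokens f1 and f2 this is (f1 * f2) / (f1 + f2), so a
        # phrase can never appear more frequent than its rarest token.
        return 1.0 / sum(1.0 / f for f in token_freqs)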

## Multi-script languages

@@ -411,14 +454,14 @@ sources:
- The Leeds Internet Corpus, from the University of Leeds Centre for Translation
  Studies (http://corpus.leeds.ac.uk/list.html)
- Wikipedia, the free encyclopedia (http://www.wikipedia.org)
- ParaCrawl, a multilingual Web crawl (https://paracrawl.eu)

It contains data from OPUS OpenSubtitles 2018
(http://opus.nlpl.eu/OpenSubtitles.php), whose data originates from the
OpenSubtitles project (http://www.opensubtitles.org/) and may be used with
attribution to OpenSubtitles.

It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK,
SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al.

@@ -483,10 +526,6 @@ The same citation in BibTex format:
  Methods, 41 (4), 977-990.
  http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf

- Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A.
  (2011). The word frequency effect: A review of recent developments and
  implications for the choice of frequency estimates in German. Experimental

@@ -496,9 +535,6 @@ The same citation in BibTex format:
  frequencies based on film subtitles. PLoS One, 5(6), e10729.
  http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729

- Davis, M. (2012). Unicode text segmentation. Unicode Standard Annex, 29.
  http://unicode.org/reports/tr29/

@@ -516,11 +552,19 @@ The same citation in BibTex format:
  analyzer.
  http://mecab.sourceforge.net/

- Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W., and Petrov,
  S. (2012). Syntactic annotations for the Google Books Ngram Corpus.
  Proceedings of the ACL 2012 system demonstrations, 169-174.
  http://aclweb.org/anthology/P12-3029

- Lison, P. and Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large
  Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th
  International Conference on Language Resources and Evaluation (LREC 2016).
  http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf

- ParaCrawl (2018). Provision of Web-Scale Parallel Corpora for Official
  European Languages. https://paracrawl.eu/

- van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014).
  SUBTLEX-UK: A new and improved word frequency database for British English.
  The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.


@@ -1,2 +1,2 @@
[pytest]
addopts = --doctest-modules --doctest-glob=README.md


@@ -162,7 +162,7 @@ def test_phrase_freq():
    ff = word_frequency("flip-flop", 'en')
    assert ff > 0
    phrase_freq = 1.0 / word_frequency('flip', 'en') + 1.0 / word_frequency('flop', 'en')
    assert 1.0 / ff == pytest.approx(phrase_freq, rel=0.01)


def test_not_really_random():


@@ -49,11 +49,11 @@ def test_combination():
    gozai_freq = word_frequency('ござい', 'ja')
    masu_freq = word_frequency('ます', 'ja')

    assert word_frequency('おはようおはよう', 'ja') == pytest.approx(ohayou_freq / 2, rel=0.01)
    assert (
        1.0 / word_frequency('おはようございます', 'ja') ==
        pytest.approx(1.0 / ohayou_freq + 1.0 / gozai_freq + 1.0 / masu_freq, rel=0.01)
    )


@@ -10,9 +10,9 @@ def test_combination():
    gamsa_freq = word_frequency('감사', 'ko')
    habnida_freq = word_frequency('합니다', 'ko')

    assert word_frequency('감사감사', 'ko') == pytest.approx(gamsa_freq / 2, rel=0.01)
    assert (
        1.0 / word_frequency('감사합니다', 'ko') ==
        pytest.approx(1.0 / gamsa_freq + 1.0 / habnida_freq, rel=0.01)
    )


@@ -249,7 +249,14 @@ def _word_frequency(word, lang, wordlist, minimum):
        # probability for each word break that was inferred.
        freq /= INFERRED_SPACE_FACTOR ** (len(tokens) - 1)

    # All our frequency data is only precise to within 1% anyway, so round
    # it to 3 significant digits
    unrounded = max(freq, minimum)
    if unrounded == 0.:
        return 0.
    else:
        leading_zeroes = math.floor(-math.log(unrounded, 10))
        return round(unrounded, leading_zeroes + 3)


def word_frequency(word, lang, wordlist='best', minimum=0.):
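
For reference, the same rounding rule as a self-contained sketch
(`round_to_3_sig_figs` is a hypothetical name, not part of wordfreq's API):

    import math

    def round_to_3_sig_figs(freq):
        # Count the leading zeroes after the decimal point, then keep
        # three significant digits past them.
        if freq == 0.:
            return 0.
        leading_zeroes = math.floor(-math.log(freq, 10))
        return round(freq, leading_zeroes + 3)

    # 1.1748975549395302e-05 has four leading zeroes, so it rounds at
    # seven decimal places:
    assert round_to_3_sig_figs(1.1748975549395302e-05) == 1.17e-05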