mirror of https://github.com/rspeer/wordfreq.git
synced 2024-12-23 09:21:37 +00:00

Merge pull request #58 from LuminosoInsight/significant-figures

Round wordfreq output to 3 sig. figs, and update documentation

commit 79caa526c3

CHANGELOG.md (17 lines changed)
@@ -1,3 +1,20 @@
+## Version 2.1 (2018-06-18)
+
+Data changes:
+
+- Updated to the data from the latest Exquisite Corpus, which adds the
+  ParaCrawl web crawl and updates to OpenSubtitles 2018
+- Added small word list for Latvian
+- Added large word list for Czech
+- The Dutch large word list once again has 5 data sources
+
+Library change:
+
+- The output of `word_frequency` is rounded to three significant digits. This
+  provides friendlier output, and better reflects the precision of the
+  underlying data anyway.
+
 ## Version 2.0.1 (2018-05-01)
 
 Fixed edge cases that inserted spurious token boundaries when Japanese text is

README.md (168 lines changed)
@@ -1,4 +1,5 @@
-wordfreq is a Python library for looking up the frequencies of words in many languages, based on many sources of data.
+wordfreq is a Python library for looking up the frequencies of words in many
+languages, based on many sources of data.
 
 Author: Robyn Speer
 
@@ -22,7 +23,8 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
 ## Usage
 
 wordfreq provides access to estimates of the frequency with which a word is
-used, in 35 languages (see *Supported languages* below).
+used, in 36 languages (see *Supported languages* below). It uses many different
+data sources, not just one corpus.
 
 It provides both 'small' and 'large' wordlists:
 
@@ -39,21 +41,20 @@ The most straightforward function for looking up frequencies is:
     word_frequency(word, lang, wordlist='best', minimum=0.0)
 
 This function looks up a word's frequency in the given language, returning its
-frequency as a decimal between 0 and 1. In these examples, we'll multiply the
-frequencies by a million (1e6) to get more readable numbers:
+frequency as a decimal between 0 and 1.
 
     >>> from wordfreq import word_frequency
-    >>> word_frequency('cafe', 'en') * 1e6
-    11.748975549395302
+    >>> word_frequency('cafe', 'en')
+    1.07e-05
 
-    >>> word_frequency('café', 'en') * 1e6
-    3.890451449942805
+    >>> word_frequency('café', 'en')
+    5.89e-06
 
-    >>> word_frequency('cafe', 'fr') * 1e6
-    1.4454397707459279
+    >>> word_frequency('cafe', 'fr')
+    1.51e-06
 
-    >>> word_frequency('café', 'fr') * 1e6
-    53.70317963702532
+    >>> word_frequency('café', 'fr')
+    5.25e-05
 
 
 `zipf_frequency` is a variation on `word_frequency` that aims to return the
@@ -74,13 +75,13 @@ one occurrence per billion words.
     7.77
 
     >>> zipf_frequency('word', 'en')
-    5.32
+    5.29
 
     >>> zipf_frequency('frequency', 'en')
-    4.38
+    4.42
 
     >>> zipf_frequency('zipf', 'en')
-    1.32
+    1.55
 
     >>> zipf_frequency('zipf', 'en', wordlist='small')
     0.0
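
A note on the Zipf values in this hunk: a Zipf value is the base-10 logarithm
of a word's frequency per billion words, so it can be derived from
`word_frequency` output. A minimal sketch of the relationship (this is the
standard definition; the real `zipf_frequency` also rounds to two decimals,
and `math.log10(0)` would fail for words that aren't in the list):

    import math
    from wordfreq import word_frequency

    def zipf_from_frequency(freq):
        # Zipf value = log10(frequency per billion words) = log10(freq) + 9
        return round(math.log10(freq) + 9, 2)

    # Should land near the 5.29 shown above, within the precision of
    # word_frequency's 3-significant-digit output.
    print(zipf_from_frequency(word_frequency('word', 'en')))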
@@ -102,45 +103,39 @@ The parameters to `word_frequency` and `zipf_frequency` are:
 value contained in the wordlist, to avoid a discontinuity where the wordlist
 ends.
 
-Other functions:
-
-`tokenize(text, lang)` splits text in the given language into words, in the same
-way that the words in wordfreq's data were counted in the first place. See
-*Tokenization*.
+## Frequency bins
 
-`top_n_list(lang, n, wordlist='best')` returns the most common *n* words in
-the list, in descending frequency order.
+wordfreq's wordlists are designed to load quickly and take up little space in
+the repository. We accomplish this by avoiding meaningless precision and
+packing the words into frequency bins.
 
-    >>> from wordfreq import top_n_list
-    >>> top_n_list('en', 10)
-    ['the', 'of', 'to', 'and', 'a', 'in', 'i', 'is', 'that', 'for']
+In wordfreq, all words that have the same Zipf frequency rounded to the nearest
+hundredth have the same frequency. We don't store any more precision than that.
+So instead of having to store that the frequency of a word is
+.000011748975549395302, where most of those digits are meaningless, we just store
+the frequency bins and the words they contain.
 
-    >>> top_n_list('es', 10)
-    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']
+Because the Zipf scale is a logarithmic scale, this preserves the same relative
+precision no matter how far down you are in the word list. The frequency of any
+word is precise to within 1%.
 
-`iter_wordlist(lang, wordlist='best')` iterates through all the words in a
-wordlist, in descending frequency order.
+(This is not a claim about _accuracy_, but about _precision_. We believe that
+the way we use multiple data sources and discard outliers makes wordfreq a
+more accurate measurement of the way these words are really used in written
+language, but it's unclear how one would measure this accuracy.)
 
-`get_frequency_dict(lang, wordlist='best')` returns all the frequencies in
-a wordlist as a dictionary, for cases where you'll want to look up a lot of
-words and don't need the wrapper that `word_frequency` provides.
-
-`supported_languages(wordlist='best')` returns a dictionary whose keys are
-language codes, and whose values are the data file that will be loaded to
-provide the requested wordlist in each language.
+## The figure-skating metric
 
-`random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12)`
-returns a selection of random words, separated by spaces. `bits_per_word=n`
-will select each random word from 2^n words.
+We combine word frequencies from different sources in a way that's designed
+to minimize the impact of outliers. The method reminds me of the scoring system
+in Olympic figure skating:
 
-If you happen to want an easy way to get [a memorable, xkcd-style
-password][xkcd936] with 60 bits of entropy, this function will almost do the
-job. In this case, you should actually run the similar function
-`random_ascii_words`, limiting the selection to words that can be typed in
-ASCII. But maybe you should just use [xkpa][].
-
-[xkcd936]: https://xkcd.com/936/
-[xkpa]: https://github.com/beala/xkcd-password
+- Find the frequency of each word according to each data source.
+- For each word, drop the sources that give it the highest and lowest frequency.
+- Average the remaining frequencies.
+- Rescale the resulting frequency list to add up to 1.
 
 
 ## Sources and supported languages
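
The binning scheme added above is easy to sketch: rounding the Zipf value to
the nearest hundredth makes each bin a factor of 10^0.01 (about 2.3%) wide, so
a stored frequency is within roughly 1% of the true value. A hypothetical
illustration of the idea (not wordfreq's actual storage code):

    import math

    def zipf_bin(freq):
        # Zipf value (log10 of frequency per billion words), rounded to the
        # nearest hundredth; every word in the same bin shares this value.
        return round(math.log10(freq) + 9, 2)

    def binned_frequency(freq):
        return 10 ** (zipf_bin(freq) - 9)

    raw = 1.1748975549395302e-05  # the over-precise value quoted above
    print(zipf_bin(raw))          # 4.07
    print(binned_frequency(raw))  # ~1.17e-05, within 1% of the raw value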
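
The four steps of the figure-skating metric described above can also be
written out directly. A sketch under stated assumptions (the input maps each
source name to a word-to-frequency dict; this illustrates the described
method, not the actual Exquisite Corpus code):

    def combine_sources(freqs_by_source):
        # freqs_by_source: {source_name: {word: frequency}}
        combined = {}
        for word in set().union(*freqs_by_source.values()):
            estimates = sorted(
                freqs[word]
                for freqs in freqs_by_source.values()
                if word in freqs
            )
            if len(estimates) > 2:
                # Drop the highest and lowest estimates, like discarding
                # the outlying judges' scores in figure skating.
                estimates = estimates[1:-1]
            combined[word] = sum(estimates) / len(estimates)
        # Rescale the resulting frequency list to add up to 1.
        total = sum(combined.values())
        return {word: freq / total for word, freq in combined.items()}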
@@ -155,14 +150,16 @@ Exquisite Corpus compiles 8 different domains of text, some of which themselves
 come from multiple sources:
 
 - **Wikipedia**, representing encyclopedic text
-- **Subtitles**, from OPUS OpenSubtitles 2016 and SUBTLEX
+- **Subtitles**, from OPUS OpenSubtitles 2018 and SUBTLEX
 - **News**, from NewsCrawl 2014 and GlobalVoices
 - **Books**, from Google Books Ngrams 2012
-- **Web** text, from the Leeds Internet Corpus and the MOKK Hungarian Webcorpus
+- **Web** text, from ParaCrawl, the Leeds Internet Corpus, and the MOKK
+  Hungarian Webcorpus
 - **Twitter**, representing short-form social media
 - **Reddit**, representing potentially longer Internet comments
 - **Miscellaneous** word frequencies: in Chinese, we import a free wordlist
-  that comes with the Jieba word segmenter, whose provenance we don't really know
+  that comes with the Jieba word segmenter, whose provenance we don't really
+  know
 
 The following languages are supported, with reasonable tokenization and at
 least 3 different sources of word frequencies:
@@ -224,6 +221,52 @@ between 1.0 and 3.0. These are available in 14 languages that are covered by
 enough data sources.
 
 
+## Other functions
+
+`tokenize(text, lang)` splits text in the given language into words, in the same
+way that the words in wordfreq's data were counted in the first place. See
+*Tokenization*.
+
+`top_n_list(lang, n, wordlist='best')` returns the most common *n* words in
+the list, in descending frequency order.
+
+    >>> from wordfreq import top_n_list
+    >>> top_n_list('en', 10)
+    ['the', 'of', 'to', 'and', 'a', 'in', 'i', 'is', 'that', 'for']
+
+    >>> top_n_list('es', 10)
+    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']
+
+`iter_wordlist(lang, wordlist='best')` iterates through all the words in a
+wordlist, in descending frequency order.
+
+`get_frequency_dict(lang, wordlist='best')` returns all the frequencies in
+a wordlist as a dictionary, for cases where you'll want to look up a lot of
+words and don't need the wrapper that `word_frequency` provides.
+
+`available_languages(wordlist='best')` returns a dictionary whose keys are
+language codes, and whose values are the data file that will be loaded to
+provide the requested wordlist in each language.
+
+`get_language_info(lang)` returns a dictionary of information about how we
+preprocess text in this language, such as what script we expect it to be
+written in, which characters we normalize together, and how we tokenize it.
+See its docstring for more information.
+
+`random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12)`
+returns a selection of random words, separated by spaces. `bits_per_word=n`
+will select each random word from 2^n words.
+
+If you happen to want an easy way to get [a memorable, xkcd-style
+password][xkcd936] with 60 bits of entropy, this function will almost do the
+job. In this case, you should actually run the similar function
+`random_ascii_words`, limiting the selection to words that can be typed in
+ASCII. But maybe you should just use [xkpa][].
+
+[xkcd936]: https://xkcd.com/936/
+[xkpa]: https://github.com/beala/xkcd-password
+
+
 ## Tokenization
 
 wordfreq uses the Python package `regex`, which is a more advanced
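
On the entropy arithmetic in the `random_words` passage re-added above: each
word drawn uniformly from 2^n candidates contributes n bits, so the defaults
give nwords * bits_per_word = 5 * 12 = 60 bits. A usage sketch (assuming
`random_ascii_words` accepts the same parameters as `random_words`, as the
README implies; the output is random, so the comment describes only its shape):

    from wordfreq import random_ascii_words

    # Prints a space-separated string of five words, each carrying
    # 12 bits of entropy: 60 bits in total.
    print(random_ascii_words(lang='en', nwords=5, bits_per_word=12))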
@@ -255,9 +298,9 @@ also try to deal gracefully when you query it with texts that actually break
 into multiple tokens:
 
     >>> zipf_frequency('New York', 'en')
-    5.35
+    5.28
     >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.54
+    3.57
 
 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be. In Chinese,
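
The half-harmonic-mean combination mentioned above is exactly the relationship
the updated tests below check: the reciprocal of a phrase's frequency is the
sum of the reciprocals of its tokens' frequencies. A sketch using the same
'flip-flop' example as the test hunk further down:

    from wordfreq import word_frequency

    f_flip = word_frequency('flip', 'en')
    f_flop = word_frequency('flop', 'en')

    # For two tokens this is half their harmonic mean:
    # 1 / combined == 1 / f_flip + 1 / f_flop
    combined = 1.0 / (1.0 / f_flip + 1.0 / f_flop)

    # Should match word_frequency('flip-flop', 'en') to within about 1%,
    # the precision of the rounded data; hence pytest.approx(..., rel=0.01)
    # in the test changes below.
    print(combined, word_frequency('flip-flop', 'en'))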
@@ -272,7 +315,7 @@ you give it an uncommon combination of tokens, it will hugely over-estimate
 their frequency:
 
     >>> zipf_frequency('owl-flavored', 'en')
-    3.18
+    3.2
 
 
 ## Multi-script languages
@@ -411,14 +454,14 @@ sources:
 - The Leeds Internet Corpus, from the University of Leeds Centre for Translation
   Studies (http://corpus.leeds.ac.uk/list.html)
 
-- The OpenSubtitles Frequency Word Lists, compiled by Hermit Dave
-  (https://invokeit.wordpress.com/frequency-word-lists/)
-
 - Wikipedia, the free encyclopedia (http://www.wikipedia.org)
 
+- ParaCrawl, a multilingual Web crawl (https://paracrawl.eu)
+
 It contains data from OPUS OpenSubtitles 2018
 (http://opus.nlpl.eu/OpenSubtitles.php), whose data originates from the
-OpenSubtitles project (http://www.opensubtitles.org/).
+OpenSubtitles project (http://www.opensubtitles.org/) and may be used with
+attribution to OpenSubtitles.
 
 It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK,
 SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al.
@@ -483,10 +526,6 @@ The same citation in BibTex format:
   Methods, 41 (4), 977-990.
   http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf
 
-- Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A. M., Bölte, J., & Böhl, A.
-  (2015). The word frequency effect. Experimental Psychology.
-  http://econtent.hogrefe.com/doi/abs/10.1027/1618-3169/a000123?journalCode=zea
-
 - Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A.
   (2011). The word frequency effect: A review of recent developments and
   implications for the choice of frequency estimates in German. Experimental
@@ -496,9 +535,6 @@ The same citation in BibTex format:
   frequencies based on film subtitles. PLoS One, 5(6), e10729.
   http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729
 
-- Dave, H. (2011). Frequency word lists.
-  https://invokeit.wordpress.com/frequency-word-lists/
-
 - Davis, M. (2012). Unicode text segmentation. Unicode Standard Annex, 29.
   http://unicode.org/reports/tr29/
 
@@ -516,11 +552,19 @@ The same citation in BibTex format:
   analyzer.
   http://mecab.sourceforge.net/
 
+- Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W., and Petrov,
+  S. (2012). Syntactic annotations for the Google Books Ngram Corpus.
+  Proceedings of the ACL 2012 system demonstrations, 169-174.
+  http://aclweb.org/anthology/P12-3029
+
 - Lison, P. and Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large
   Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th
   International Conference on Language Resources and Evaluation (LREC 2016).
   http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf
 
+- ParaCrawl (2018). Provision of Web-Scale Parallel Corpora for Official
+  European Languages. https://paracrawl.eu/
+
 - van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014).
   SUBTLEX-UK: A new and improved word frequency database for British English.
   The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.
@@ -1,2 +1,2 @@
 [pytest]
-addopts = --doctest-modules
+addopts = --doctest-modules --doctest-glob=README.md
@@ -162,7 +162,7 @@ def test_phrase_freq():
     ff = word_frequency("flip-flop", 'en')
     assert ff > 0
     phrase_freq = 1.0 / word_frequency('flip', 'en') + 1.0 / word_frequency('flop', 'en')
-    assert 1.0 / ff == pytest.approx(phrase_freq)
+    assert 1.0 / ff == pytest.approx(phrase_freq, rel=0.01)
 
 
 def test_not_really_random():
@@ -49,11 +49,11 @@ def test_combination():
     gozai_freq = word_frequency('ござい', 'ja')
     masu_freq = word_frequency('ます', 'ja')
 
-    assert word_frequency('おはようおはよう', 'ja') == pytest.approx(ohayou_freq / 2)
+    assert word_frequency('おはようおはよう', 'ja') == pytest.approx(ohayou_freq / 2, rel=0.01)
 
     assert (
         1.0 / word_frequency('おはようございます', 'ja') ==
-        pytest.approx(1.0 / ohayou_freq + 1.0 / gozai_freq + 1.0 / masu_freq)
+        pytest.approx(1.0 / ohayou_freq + 1.0 / gozai_freq + 1.0 / masu_freq, rel=0.01)
     )
 
 
@@ -10,9 +10,9 @@ def test_combination():
     gamsa_freq = word_frequency('감사', 'ko')
     habnida_freq = word_frequency('합니다', 'ko')
 
-    assert word_frequency('감사감사', 'ko') == pytest.approx(gamsa_freq / 2)
+    assert word_frequency('감사감사', 'ko') == pytest.approx(gamsa_freq / 2, rel=0.01)
     assert (
         1.0 / word_frequency('감사합니다', 'ko') ==
-        pytest.approx(1.0 / gamsa_freq + 1.0 / habnida_freq)
+        pytest.approx(1.0 / gamsa_freq + 1.0 / habnida_freq, rel=0.01)
     )
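
The looser tolerances in these test hunks follow from the new rounding: a
value rounded to three significant digits can be off by up to half a unit in
the third digit, about 0.5% relative error, and the combination identities
compare several such rounded values, so rel=0.01 gives comfortable headroom.
A quick numeric check of the worst case:

    # Worst case for 3 significant digits: a true value of 1.005e-05 rounds
    # to 1.00e-05 (or 1.01e-05), a relative error of about 0.5%.
    true_value = 1.005e-05
    rounded = 1.00e-05
    print(abs(rounded - true_value) / true_value)  # ~0.005 < rel=0.01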
@@ -249,7 +249,14 @@ def _word_frequency(word, lang, wordlist, minimum):
         # probability for each word break that was inferred.
         freq /= INFERRED_SPACE_FACTOR ** (len(tokens) - 1)
 
-    return max(freq, minimum)
+    # All our frequency data is only precise to within 1% anyway, so round
+    # it to 3 significant digits
+    unrounded = max(freq, minimum)
+    if unrounded == 0.:
+        return 0.
+    else:
+        leading_zeroes = math.floor(-math.log(unrounded, 10))
+        return round(unrounded, leading_zeroes + 3)
 
 
 def word_frequency(word, lang, wordlist='best', minimum=0.):
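
A standalone sketch of the rounding logic this hunk adds (it assumes `math`
is imported at module level, as the hunk's use of `math.floor` implies):

    import math

    def round_to_3_significant_digits(value):
        if value == 0.0:
            return 0.0
        # floor(-log10(value)) counts the zeroes between the decimal point
        # and the first significant digit; rounding to that many decimal
        # places plus 3 keeps exactly three significant digits.
        leading_zeroes = math.floor(-math.log(value, 10))
        return round(value, leading_zeroes + 3)

    print(round_to_3_significant_digits(1.1748975549395302e-05))  # 1.17e-05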