Add and document large wordlists

Former-commit-id: d79ee37da9
Robyn Speer 2016-01-22 16:23:43 -05:00
parent 12e779fc79
commit 6344b38194
6 changed files with 58 additions and 21 deletions


@@ -39,11 +39,18 @@ For example:
 ## Usage
 wordfreq provides access to estimates of the frequency with which a word is
-used, in 18 languages (see *Supported languages* below). It loads
-efficiently-packed data structures that contain all words that appear at least
-once per million words.
-The most useful function is:
+used, in 18 languages (see *Supported languages* below).
+It provides three kinds of pre-built wordlists:
+- `'combined'` lists, containing words that appear at least once per
+  million words, averaged across all data sources.
+- `'twitter'` lists, containing words that appear at least once per
+  million words on Twitter alone.
+- `'large'` lists, containing words that appear at least once per 100
+  million words, averaged across all data sources.
+The most straightforward function is:
     word_frequency(word, lang, wordlist='combined', minimum=0.0)
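The three wordlists introduced above share the same lookup API; only the `wordlist` argument changes. A minimal sketch of comparing them for one word (illustrative only: the printed numbers depend on the data files installed, and it assumes the French 'twitter' and 'large' lists are present, as this commit provides):

    from wordfreq import word_frequency

    # Look up the same word in each pre-built wordlist. The 'large' list
    # reaches down to words occurring once per 100 million words, so it can
    # return much smaller frequencies than the other two.
    for wordlist in ('combined', 'twitter', 'large'):
        print(wordlist, word_frequency('café', 'fr', wordlist=wordlist))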
@@ -64,7 +71,37 @@ frequencies by a million (1e6) to get more readable numbers:
     >>> word_frequency('café', 'fr') * 1e6
     77.62471166286912
-The parameters are:
+`zipf_frequency` is a variation on `word_frequency` that aims to return the
+word frequency on a human-friendly logarithmic scale. The Zipf scale was
+proposed by Marc Brysbaert, who created the SUBTLEX lists. The Zipf frequency
+of a word is the base-10 logarithm of the number of times it appears per
+billion words. A word with Zipf value 6 appears once per thousand words, for
+example, and a word with Zipf value 3 appears once per million words.
+Reasonable Zipf values are between 0 and 8, but because of the cutoffs
+described above, the minimum Zipf value appearing in these lists is 1.0 for the
+'large' wordlists and 3.0 for all others. We use 0 as the default Zipf value
+for words that do not appear in the given wordlist, although it should mean
+one occurrence per billion words.
+    >>> zipf_frequency('the', 'en')
+    7.59
+    >>> zipf_frequency('word', 'en')
+    5.34
+    >>> zipf_frequency('frequency', 'en')
+    4.44
+    >>> zipf_frequency('zipf', 'en')
+    0.0
+    >>> zipf_frequency('zipf', 'en', 'large')
+    1.42
+The parameters to `word_frequency` and `zipf_frequency` are:
 - `word`: a Unicode string containing the word to look up. Ideally the word
   is a single token according to our tokenizer, but if not, there is still
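The Zipf scale documented in this hunk is a fixed rescaling of the proportions that `word_frequency` returns: the Zipf value of a frequency f is log10(f * 10^9), the base-10 log of occurrences per billion words. A rough sketch of that conversion (not wordfreq's internal implementation, which also rounds the result and floors missing words at 0.0):

    import math

    def frequency_to_zipf(freq):
        # Base-10 log of occurrences per billion words.
        if freq <= 0:
            return 0.0  # wordfreq reports 0 for words missing from the wordlist
        return math.log10(freq * 1e9)

    # Once per million words (1e-6) corresponds to Zipf 3, matching the
    # cutoff of the 'combined' and 'twitter' lists described above.
    print(frequency_to_zipf(1e-6))  # 3.0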
@@ -73,21 +110,18 @@ The parameters are:
 - `lang`: the BCP 47 or ISO 639 code of the language to use, such as 'en'.
 - `wordlist`: which set of word frequencies to use. Current options are
-  'combined', which combines up to five different sources, and
-  'twitter', which returns frequencies observed on Twitter alone.
+  'combined', 'twitter', and 'large'.
 - `minimum`: If the word is not in the list or has a frequency lower than
-  `minimum`, return `minimum` instead. In some applications, you'll want
-  to set `minimum=1e-6` to avoid a discontinuity where the list ends, because
-  a frequency of 1e-6 (1 per million) is the threshold for being included in
-  the list at all.
+  `minimum`, return `minimum` instead. You may want to set this to the minimum
+  value contained in the wordlist, to avoid a discontinuity where the wordlist
+  ends.
 Other functions:
 `tokenize(text, lang)` splits text in the given language into words, in the same
 way that the words in wordfreq's data were counted in the first place. See
-*Tokenization*. Tokenizing Japanese requires the optional dependency `mecab-python3`
-to be installed.
+*Tokenization*.
 `top_n_list(lang, n, wordlist='combined')` returns the most common *n* words in
 the list, in descending frequency order.
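For context on how the functions in this hunk are typically called together, here is a hedged sketch; the outputs are omitted because they depend on which wordlists are installed:

    from wordfreq import word_frequency, tokenize, top_n_list

    # Floor lookups at the 'combined' list's cutoff of one occurrence per
    # million words (1e-6) to avoid the discontinuity where the list ends.
    freq = word_frequency('owl-flavored', 'en', minimum=1e-6)

    # Split text the same way the underlying data was tokenized.
    tokens = tokenize('New York is a big city', 'en')

    # The 100 most common English words in the new 'large' list,
    # in descending frequency order.
    common = top_n_list('en', 100, wordlist='large')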
@@ -168,6 +202,8 @@ it, but we have too few data sources for it so far:
 ──────────────────┼───────────────────────────────────────
 Korean ko │ - - - Yes Yes
+The 'large' wordlists are available in English, Spanish, French, and Portuguese.
 [1] We've counted the frequencies from tweets in German, such as they are, but
 you should be aware that German is not a frequently-used language on Twitter.
 Germans just don't tweet that much.
@@ -179,7 +215,8 @@ wordfreq uses the Python package `regex`, which is a more advanced
 implementation of regular expressions than the standard library, to
 separate text into tokens that can be counted consistently. `regex`
 produces tokens that follow the recommendations in [Unicode
-Annex #29, Text Segmentation][uax29].
+Annex #29, Text Segmentation][uax29], including the optional rule that
+splits words between apostrophes and vowels.
 There are language-specific exceptions:
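The optional apostrophe rule mentioned in this hunk is easiest to see by calling the tokenizer directly. A small sketch, with the exact token boundaries left unasserted since they depend on the tokenizer version:

    from wordfreq import tokenize

    # An apostrophe followed by a vowel is treated as a word boundary, so a
    # French elision like "l'heure" should not survive as a single token,
    # while an English contraction like "don't" (apostrophe before a
    # consonant) should remain one token.
    print(tokenize("l'heure", 'fr'))
    print(tokenize("don't", 'en'))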
@@ -199,10 +236,10 @@ Because tokenization in the real world is far from consistent, wordfreq will
 also try to deal gracefully when you query it with texts that actually break
 into multiple tokens:
-    >>> word_frequency('New York', 'en')
-    0.0002315934248950231
-    >>> word_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.2187603965715087e-06
+    >>> zipf_frequency('New York', 'en')
+    5.31
+    >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
+    3.51
 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be. In Chinese,
@@ -216,8 +253,8 @@ frequencies, because that would assume they are statistically unrelated. So if
 you give it an uncommon combination of tokens, it will hugely over-estimate
 their frequency:
-    >>> word_frequency('owl-flavored', 'en')
-    1.3557098723512335e-06
+    >>> zipf_frequency('owl-flavored', 'en')
+    3.18
 ## License
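The half-harmonic-mean combination referred to above can be sketched as follows. This is an illustration of the formula, not wordfreq's actual code, and `combine_frequencies` is an invented name for the example:

    from functools import reduce

    def half_harmonic_mean(a, b):
        # Half the harmonic mean of two frequencies: 1 / (1/a + 1/b).
        return (a * b) / (a + b)

    def combine_frequencies(freqs):
        # Fold per-token frequencies into one estimate, as done for
        # multi-token queries such as 'New York'.
        return reduce(half_harmonic_mean, freqs)

    # Two tokens that each occur once per thousand words combine to an
    # estimated once per two thousand words (5e-4).
    print(combine_frequencies([1e-3, 1e-3]))

Because the formula treats the tokens as statistically unrelated, rare but fixed combinations like 'owl-flavored' come out with a much higher estimate than their true frequency, which is what the example in the hunk above shows.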

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.


@@ -56,7 +56,7 @@ CONFIG = {
     'reddit': 'generated/reddit/reddit_{lang}.{ext}',
     'combined': 'generated/combined/combined_{lang}.{ext}',
     'combined-dist': 'dist/combined_{lang}.{ext}',
-    'combined-dist-large': 'dist/combined-large_{lang}.{ext}',
+    'combined-dist-large': 'dist/large_{lang}.{ext}',
     'twitter-dist': 'dist/twitter_{lang}.{ext}',
     'jieba-dist': 'dist/jieba_{lang}.{ext}'
     },
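The values in this hunk are filename templates that the build configuration fills in per language; the commit renames the distributed 'large' files from combined-large_{lang} to large_{lang}. A small illustration of how such a template expands (the 'en' language code and the 'msgpack.gz' extension are only assumptions for the example):

    # Hypothetical expansion of the renamed template from this hunk.
    template = 'dist/large_{lang}.{ext}'
    print(template.format(lang='en', ext='msgpack.gz'))  # dist/large_en.msgpack.gz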