wordfreq (mirror of https://github.com/rspeer/wordfreq.git)
commit 6344b38194 (parent 12e779fc79)

README.md: 77 changed lines
@@ -39,11 +39,18 @@ For example:
 
 ## Usage
 
 wordfreq provides access to estimates of the frequency with which a word is
-used, in 18 languages (see *Supported languages* below). It loads
-efficiently-packed data structures that contain all words that appear at least
-once per million words.
+used, in 18 languages (see *Supported languages* below).
 
-The most useful function is:
+It provides three kinds of pre-built wordlists:
+
+- `'combined'` lists, containing words that appear at least once per
+  million words, averaged across all data sources.
+- `'twitter'` lists, containing words that appear at least once per
+  million words on Twitter alone.
+- `'large'` lists, containing words that appear at least once per 100
+  million words, averaged across all data sources.
+
+The most straightforward function is:
 
     word_frequency(word, lang, wordlist='combined', minimum=0.0)
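As a quick illustration of the API this hunk documents, here is a minimal usage sketch; the 0.039 figure for 'the' is back-derived from the Zipf value 7.59 quoted later in this README, so treat it as approximate:

    from wordfreq import word_frequency

    # 'combined' is the default wordlist; roughly 0.039, i.e. about 3.9%
    # of English tokens are 'the' (back-derived from Zipf 7.59 below).
    word_frequency('the', 'en')

    # The same query against the Twitter-only list:
    word_frequency('the', 'en', 'twitter')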
@@ -64,7 +71,37 @@ frequencies by a million (1e6) to get more readable numbers:
 
     >>> word_frequency('café', 'fr') * 1e6
     77.62471166286912
 
-The parameters are:
+`zipf_frequency` is a variation on `word_frequency` that aims to return the
+word frequency on a human-friendly logarithmic scale. The Zipf scale was
+proposed by Marc Brysbaert, who created the SUBTLEX lists. The Zipf frequency
+of a word is the base-10 logarithm of the number of times it appears per
+billion words. For example, a word with Zipf value 6 appears once per
+thousand words, and a word with Zipf value 3 appears once per million words.
+
+Reasonable Zipf values are between 0 and 8, but because of the cutoffs
+described above, the minimum Zipf value appearing in these lists is 1.0 for
+the 'large' wordlists and 3.0 for all others. We use 0 as the default Zipf
+value for words that do not appear in the given wordlist, even though, taken
+literally, a Zipf value of 0 would mean one occurrence per billion words.
+
+    >>> zipf_frequency('the', 'en')
+    7.59
+
+    >>> zipf_frequency('word', 'en')
+    5.34
+
+    >>> zipf_frequency('frequency', 'en')
+    4.44
+
+    >>> zipf_frequency('zipf', 'en')
+    0.0
+
+    >>> zipf_frequency('zipf', 'en', 'large')
+    1.42
+
+The parameters to `word_frequency` and `zipf_frequency` are:
 
 - `word`: a Unicode string containing the word to look up. Ideally the word
   is a single token according to our tokenizer, but if not, there is still
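The relationship between the two scales can be made concrete. A minimal sketch, assuming only the definition above (Zipf is the base-10 log of occurrences per billion) and that `word_frequency` returns a proportion; the rounding to two decimal places matches the example outputs:

    import math
    from wordfreq import word_frequency, zipf_frequency

    def zipf_from_proportion(freq):
        """Convert a proportion (occurrences per word) to the Zipf scale."""
        if freq <= 0:
            return 0.0  # the default for words missing from the wordlist
        return round(math.log10(freq * 1e9), 2)

    # These two should agree, up to rounding:
    zipf_from_proportion(word_frequency('café', 'fr'))  # expect about 4.89
    zipf_frequency('café', 'fr')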
@@ -73,21 +110,18 @@ The parameters are:
 
 - `lang`: the BCP 47 or ISO 639 code of the language to use, such as 'en'.
 
 - `wordlist`: which set of word frequencies to use. Current options are
-  'combined', which combines up to five different sources, and
-  'twitter', which returns frequencies observed on Twitter alone.
+  'combined', 'twitter', and 'large'.
 
 - `minimum`: If the word is not in the list or has a frequency lower than
-  `minimum`, return `minimum` instead. In some applications, you'll want
-  to set `minimum=1e-6` to avoid a discontinuity where the list ends, because
-  a frequency of 1e-6 (1 per million) is the threshold for being included in
-  the list at all.
+  `minimum`, return `minimum` instead. You may want to set this to the minimum
+  value contained in the wordlist, to avoid a discontinuity where the wordlist
+  ends.
 
 Other functions:
 
 `tokenize(text, lang)` splits text in the given language into words, in the same
 way that the words in wordfreq's data were counted in the first place. See
-*Tokenization*. Tokenizing Japanese requires the optional dependency `mecab-python3`
-to be installed.
+*Tokenization*.
 
 `top_n_list(lang, n, wordlist='combined')` returns the most common *n* words in
 the list, in descending frequency order.
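A short sketch of the `minimum` parameter and `top_n_list` as described above; the 1e-6 floor for 'combined' follows from the once-per-million cutoff, and 1e-8 for 'large' from the once-per-100-million cutoff:

    from wordfreq import word_frequency, top_n_list

    # Clamp out-of-vocabulary words to the wordlist's own cutoff, so that
    # downstream log-probabilities have no discontinuity where the list ends.
    word_frequency('zipf', 'en', minimum=1e-6)           # 'combined' cutoff
    word_frequency('zipf', 'en', 'large', minimum=1e-8)  # 'large' cutoff

    # The ten most frequent English words, in descending frequency order.
    top_n_list('en', 10)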
@@ -168,6 +202,8 @@ it, but we have too few data sources for it so far:
 ──────────────────┼───────────────────────────────────────
 Korean ko         │    -       -       -      Yes     Yes
 
+The 'large' wordlists are available in English, Spanish, French, and Portuguese.
+
 [1] We've counted the frequencies from tweets in German, such as they are, but
 you should be aware that German is not a frequently-used language on Twitter.
 Germans just don't tweet that much.
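To check programmatically which languages a given wordlist covers, wordfreq exposes `available_languages`; assuming it accepts a `wordlist` argument and returns a mapping keyed by language code, this should reflect the sentence added above:

    from wordfreq import available_languages

    # Per the README, this should come out as ['en', 'es', 'fr', 'pt'].
    sorted(available_languages(wordlist='large'))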
@@ -179,7 +215,8 @@ wordfreq uses the Python package `regex`, which is a more advanced
 implementation of regular expressions than the standard library, to
 separate text into tokens that can be counted consistently. `regex`
 produces tokens that follow the recommendations in [Unicode
-Annex #29, Text Segmentation][uax29].
+Annex #29, Text Segmentation][uax29], including the optional rule that
+splits words between apostrophes and vowels.
 
 There are language-specific exceptions:
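The effect of the apostrophe rule is easiest to see on French clitics. A small sketch; the exact token output is an assumption based on the rule as described, plus the case-folding implied by how the source data was counted:

    from wordfreq import tokenize

    # The optional UAX #29 rule splits at the apostrophe before a vowel:
    tokenize("l'esprit", 'fr')   # expected: ['l', 'esprit']

    # Tokens are case-folded the same way the counted data was:
    tokenize('New York', 'en')   # expected: ['new', 'york']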
@@ -199,10 +236,10 @@ Because tokenization in the real world is far from consistent, wordfreq will
 also try to deal gracefully when you query it with texts that actually break
 into multiple tokens:
 
-    >>> word_frequency('New York', 'en')
-    0.0002315934248950231
-    >>> word_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.2187603965715087e-06
+    >>> zipf_frequency('New York', 'en')
+    5.31
+    >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
+    3.51
 
 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be. In Chinese,
@@ -216,8 +253,8 @@ frequencies, because that would assume they are statistically unrelated. So if
 you give it an uncommon combination of tokens, it will hugely over-estimate
 their frequency:
 
-    >>> word_frequency('owl-flavored', 'en')
-    1.3557098723512335e-06
+    >>> zipf_frequency('owl-flavored', 'en')
+    3.18
 
 ## License
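For reference, the half-harmonic-mean combination described above can be sketched as follows; the two-argument form `(a * b) / (a + b)` is an assumption consistent with the name, since it is half the harmonic mean of two values:

    from functools import reduce
    from wordfreq import word_frequency

    def half_harmonic_mean(a, b):
        # Equal to 1 / (1/a + 1/b): adding a token can only lower the
        # estimate, and a rare token dominates the combined result.
        return (a * b) / (a + b)

    tokens = ['new', 'york']
    reduce(half_harmonic_mean, (word_frequency(t, 'en') for t in tokens))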
New binary files (contents not shown):

BIN  wordfreq/data/large_en.msgpack.gz
BIN  wordfreq/data/large_es.msgpack.gz
BIN  wordfreq/data/large_fr.msgpack.gz
BIN  wordfreq/data/large_pt.msgpack.gz
@@ -56,7 +56,7 @@ CONFIG = {
         'reddit': 'generated/reddit/reddit_{lang}.{ext}',
         'combined': 'generated/combined/combined_{lang}.{ext}',
         'combined-dist': 'dist/combined_{lang}.{ext}',
-        'combined-dist-large': 'dist/combined-large_{lang}.{ext}',
+        'combined-dist-large': 'dist/large_{lang}.{ext}',
         'twitter-dist': 'dist/twitter_{lang}.{ext}',
         'jieba-dist': 'dist/jieba_{lang}.{ext}'
     },
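These entries are filename templates expanded with ordinary `str.format`, which is how the renamed target lines up with the `large_*.msgpack.gz` files added above:

    template = 'dist/large_{lang}.{ext}'
    template.format(lang='en', ext='msgpack.gz')  # 'dist/large_en.msgpack.gz'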