mirror of
https://github.com/rspeer/wordfreq.git
synced 2024-12-23 09:21:37 +00:00
update the README
parent d8e3669a73
commit c5f64a5de8
49
README.md
@@ -23,20 +23,21 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
 ## Usage
 
 wordfreq provides access to estimates of the frequency with which a word is
-used, in 27 languages (see *Supported languages* below).
+used, in 35 languages (see *Supported languages* below).
 
-It provides three kinds of pre-built wordlists:
+It provides both 'small' and 'large' wordlists:
 
-- `'combined'` lists, containing words that appear at least once per
-  million words, averaged across all data sources.
-- `'twitter'` lists, containing words that appear at least once per
-  million words on Twitter alone.
-- `'large'` lists, containing words that appear at least once per 100
-  million words, averaged across all data sources.
+- The 'small' lists take up very little memory and cover words that appear at
+  least once per million words.
+- The 'large' lists cover words that appear at least once per 100 million
+  words.
 
-The most straightforward function is:
+The default list is 'best', which uses 'large' if it's available for the
+language, and 'small' otherwise.
 
-    word_frequency(word, lang, wordlist='combined', minimum=0.0)
+The most straightforward function for looking up frequencies is:
+
+    word_frequency(word, lang, wordlist='best', minimum=0.0)
 
 This function looks up a word's frequency in the given language, returning its
 frequency as a decimal between 0 and 1. In these examples, we'll multiply the
@@ -47,10 +48,10 @@ frequencies by a million (1e6) to get more readable numbers:
     11.748975549395302
 
     >>> word_frequency('café', 'en') * 1e6
-    3.981071705534969
+    3.890451449942805
 
     >>> word_frequency('cafe', 'fr') * 1e6
-    1.4125375446227555
+    1.4454397707459279
 
     >>> word_frequency('café', 'fr') * 1e6
     53.70317963702532
@@ -65,25 +66,25 @@ example, and a word with Zipf value 3 appears once per million words.
 
 Reasonable Zipf values are between 0 and 8, but because of the cutoffs
 described above, the minimum Zipf value appearing in these lists is 1.0 for the
-'large' wordlists and 3.0 for all others. We use 0 as the default Zipf value
+'large' wordlists and 3.0 for 'small'. We use 0 as the default Zipf value
 for words that do not appear in the given wordlist, although it should mean
 one occurrence per billion words.
 
     >>> from wordfreq import zipf_frequency
     >>> zipf_frequency('the', 'en')
-    7.75
+    7.77
 
     >>> zipf_frequency('word', 'en')
     5.32
 
     >>> zipf_frequency('frequency', 'en')
-    4.36
+    4.38
 
     >>> zipf_frequency('zipf', 'en')
-    0.0
+    1.32
 
-    >>> zipf_frequency('zipf', 'en', wordlist='large')
-    1.28
+    >>> zipf_frequency('zipf', 'en', wordlist='small')
+    0.0
 
 
 The parameters to `word_frequency` and `zipf_frequency` are:
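Note on the hunk above: the README text ties the Zipf scale to plain frequencies (Zipf 3 is once per million words, Zipf 0 is once per billion). A minimal sketch of that conversion follows, assuming the scale is the base-10 logarithm of occurrences per billion words. The helper names `freq_to_zipf` and `zipf_to_freq` are hypothetical and are not part of wordfreq.

```python
import math


def freq_to_zipf(freq: float) -> float:
    """Convert a proportion (0 < freq <= 1) to the Zipf scale.

    Assumption: Zipf value = log10(occurrences per billion words),
    which matches "Zipf 3 = once per million words" in the README.
    """
    return math.log10(freq) + 9


def zipf_to_freq(zipf: float) -> float:
    """Inverse of freq_to_zipf: Zipf value back to a proportion."""
    return 10 ** (zipf - 9)


# Once per million words, and the new 'café' (en) value from the diff,
# both rounded to two decimals for display.
print(round(freq_to_zipf(1e-6), 2))
print(round(freq_to_zipf(3.890451449942805e-6), 2))
```

Under this assumption, the 1.0 and 3.0 cutoffs quoted for the 'large' and 'small' lists correspond to once per 100 million and once per million words respectively.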
@@ -95,7 +96,7 @@ The parameters to `word_frequency` and `zipf_frequency` are:
 - `lang`: the BCP 47 or ISO 639 code of the language to use, such as 'en'.
 
 - `wordlist`: which set of word frequencies to use. Current options are
-  'combined', 'twitter', and 'large'.
+  'small', 'large', and 'best'.
 
 - `minimum`: If the word is not in the list or has a frequency lower than
   `minimum`, return `minimum` instead. You may want to set this to the minimum
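Note on the parameters above: the described behaviour of `wordlist='best'` and `minimum` can be illustrated without the library. This is only a sketch over toy in-memory data. `TOY_LISTS`, `resolve_wordlist`, and `lookup` are hypothetical names, the frequencies are made up, and none of this is wordfreq's actual implementation.

```python
# Toy data standing in for real wordlists; every value here is made up.
TOY_LISTS = {
    ("en", "small"): {"the": 5e-2, "word": 2e-4},
    ("en", "large"): {"the": 5e-2, "word": 2e-4, "zipf": 2e-8},
    ("xx", "small"): {"foo": 1e-5},  # a language with no 'large' list
}


def resolve_wordlist(lang, wordlist="best"):
    """'best' means 'large' when it exists for the language, else 'small'."""
    if wordlist == "best":
        return "large" if (lang, "large") in TOY_LISTS else "small"
    return wordlist


def lookup(word, lang, wordlist="best", minimum=0.0):
    """Return the word's frequency, but never less than `minimum`."""
    freqs = TOY_LISTS[(lang, resolve_wordlist(lang, wordlist))]
    # A missing word counts as 0.0, so it also falls back to `minimum`.
    return max(freqs.get(word, 0.0), minimum)


print(resolve_wordlist("en"), resolve_wordlist("xx"))
print(lookup("zipf", "en"), lookup("zipf", "en", wordlist="small"))
```

The toy data mirrors the changed doctest in the previous hunk: a rare word is found through the default list but comes back as 0.0 when 'small' is requested explicitly.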
@@ -108,7 +109,7 @@ Other functions:
 way that the words in wordfreq's data were counted in the first place. See
 *Tokenization*.
 
-`top_n_list(lang, n, wordlist='combined')` returns the most common *n* words in
+`top_n_list(lang, n, wordlist='best')` returns the most common *n* words in
 the list, in descending frequency order.
 
     >>> from wordfreq import top_n_list
@@ -118,18 +119,18 @@ the list, in descending frequency order.
     >>> top_n_list('es', 10)
     ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']
 
-`iter_wordlist(lang, wordlist='combined')` iterates through all the words in a
+`iter_wordlist(lang, wordlist='best')` iterates through all the words in a
 wordlist, in descending frequency order.
 
-`get_frequency_dict(lang, wordlist='combined')` returns all the frequencies in
+`get_frequency_dict(lang, wordlist='best')` returns all the frequencies in
 a wordlist as a dictionary, for cases where you'll want to look up a lot of
 words and don't need the wrapper that `word_frequency` provides.
 
-`supported_languages(wordlist='combined')` returns a dictionary whose keys are
+`supported_languages(wordlist='best')` returns a dictionary whose keys are
 language codes, and whose values are the data file that will be loaded to
 provide the requested wordlist in each language.
 
-`random_words(lang='en', wordlist='combined', nwords=5, bits_per_word=12)`
+`random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12)`
 returns a selection of random words, separated by spaces. `bits_per_word=n`
 will select each random word from 2^n words.
 
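Note on the last two hunks: the relationship between a frequency dictionary, a top-*n* list, and `bits_per_word` can be sketched with the standard library alone. This assumes the pool of 2^n candidates is the 2^n most frequent words, which the README text does not state explicitly. `top_n` and `random_words_sketch` are hypothetical helpers, and the frequencies in `TOY_ES` are made up.

```python
import random

# Made-up frequencies for a few Spanish words; not real wordfreq data.
TOY_ES = {"de": 0.05, "la": 0.04, "que": 0.03, "el": 0.025}


def top_n(freqs, n):
    """Most common n words of a {word: frequency} dict, most frequent first."""
    return sorted(freqs, key=freqs.get, reverse=True)[:n]


def random_words_sketch(freqs, nwords=5, bits_per_word=12, rng=random):
    """Join `nwords` words, each drawn uniformly from the top 2**bits_per_word.

    Assumption: the candidate pool is the most frequent 2**n words.
    """
    pool = top_n(freqs, 2 ** bits_per_word)
    return " ".join(rng.choice(pool) for _ in range(nwords))


print(top_n(TOY_ES, 2))
print(random_words_sketch(TOY_ES, nwords=3, bits_per_word=1))
```

With `bits_per_word=1` the pool holds only two words, so each draw contributes one bit of entropy. The default of 12 would draw from 4096 words when the list is that long.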