update the README

Robyn Speer 2018-03-08 18:16:15 -05:00
parent d8e3669a73
commit c5f64a5de8


@@ -23,20 +23,21 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
 ## Usage
 
 wordfreq provides access to estimates of the frequency with which a word is
-used, in 27 languages (see *Supported languages* below).
+used, in 35 languages (see *Supported languages* below).
 
-It provides three kinds of pre-built wordlists:
+It provides both 'small' and 'large' wordlists:
 
-- `'combined'` lists, containing words that appear at least once per
-  million words, averaged across all data sources.
-- `'twitter'` lists, containing words that appear at least once per
-  million words on Twitter alone.
-- `'large'` lists, containing words that appear at least once per 100
-  million words, averaged across all data sources.
+- The 'small' lists take up very little memory and cover words that appear at
+  least once per million words.
+- The 'large' lists cover words that appear at least once per 100 million
+  words.
 
-The most straightforward function is:
+The default list is 'best', which uses 'large' if it's available for the
+language, and 'small' otherwise.
+
+The most straightforward function for looking up frequencies is:
 
-    word_frequency(word, lang, wordlist='combined', minimum=0.0)
+    word_frequency(word, lang, wordlist='best', minimum=0.0)
 
 This function looks up a word's frequency in the given language, returning its
 frequency as a decimal between 0 and 1. In these examples, we'll multiply the
@@ -47,10 +48,10 @@ frequencies by a million (1e6) to get more readable numbers:
     11.748975549395302
 
     >>> word_frequency('café', 'en') * 1e6
-    3.981071705534969
+    3.890451449942805
 
     >>> word_frequency('cafe', 'fr') * 1e6
-    1.4125375446227555
+    1.4454397707459279
 
     >>> word_frequency('café', 'fr') * 1e6
     53.70317963702532
@@ -65,25 +66,25 @@ example, and a word with Zipf value 3 appears once per million words.
 
 Reasonable Zipf values are between 0 and 8, but because of the cutoffs
 described above, the minimum Zipf value appearing in these lists is 1.0 for the
-'large' wordlists and 3.0 for all others. We use 0 as the default Zipf value
+'large' wordlists and 3.0 for 'small'. We use 0 as the default Zipf value
 for words that do not appear in the given wordlist, although it should mean
 one occurrence per billion words.
 
     >>> from wordfreq import zipf_frequency
     >>> zipf_frequency('the', 'en')
-    7.75
+    7.77
 
     >>> zipf_frequency('word', 'en')
     5.32
 
     >>> zipf_frequency('frequency', 'en')
-    4.36
+    4.38
 
     >>> zipf_frequency('zipf', 'en')
-    0.0
+    1.32
 
-    >>> zipf_frequency('zipf', 'en', wordlist='large')
-    1.28
+    >>> zipf_frequency('zipf', 'en', wordlist='small')
+    0.0
 
 
 The parameters to `word_frequency` and `zipf_frequency` are:
@@ -95,7 +96,7 @@ The parameters to `word_frequency` and `zipf_frequency` are:
 - `lang`: the BCP 47 or ISO 639 code of the language to use, such as 'en'.
 
 - `wordlist`: which set of word frequencies to use. Current options are
-  'combined', 'twitter', and 'large'.
+  'small', 'large', and 'best'.
 
 - `minimum`: If the word is not in the list or has a frequency lower than
   `minimum`, return `minimum` instead. You may want to set this to the minimum
@@ -108,7 +109,7 @@ Other functions:
 way that the words in wordfreq's data were counted in the first place. See
 *Tokenization*.
 
-`top_n_list(lang, n, wordlist='combined')` returns the most common *n* words in
+`top_n_list(lang, n, wordlist='best')` returns the most common *n* words in
 the list, in descending frequency order.
 
     >>> from wordfreq import top_n_list
@@ -118,18 +119,18 @@ the list, in descending frequency order.
     >>> top_n_list('es', 10)
     ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']
 
-`iter_wordlist(lang, wordlist='combined')` iterates through all the words in a
+`iter_wordlist(lang, wordlist='best')` iterates through all the words in a
 wordlist, in descending frequency order.
 
-`get_frequency_dict(lang, wordlist='combined')` returns all the frequencies in
+`get_frequency_dict(lang, wordlist='best')` returns all the frequencies in
 a wordlist as a dictionary, for cases where you'll want to look up a lot of
 words and don't need the wrapper that `word_frequency` provides.
 
-`supported_languages(wordlist='combined')` returns a dictionary whose keys are
+`supported_languages(wordlist='best')` returns a dictionary whose keys are
 language codes, and whose values are the data file that will be loaded to
 provide the requested wordlist in each language.
 
-`random_words(lang='en', wordlist='combined', nwords=5, bits_per_word=12)`
+`random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12)`
 returns a selection of random words, separated by spaces. `bits_per_word=n`
 will select each random word from 2^n words.
 
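
Note on the new defaults: the README text in this commit describes two behaviors that may be easier to follow as code. The sketch below is not wordfreq's implementation. It is a stdlib-only illustration, and `AVAILABLE`, `resolve_wordlist`, and `zipf_from_frequency` are hypothetical names invented here. It assumes only what the README states: 'best' means 'large' when that list exists for the language and 'small' otherwise, and the Zipf scale is the base-10 logarithm of occurrences per billion words, with 0 used for words that are not in the list.

```python
import math

# Hypothetical availability table, for illustration only. It does not
# reflect wordfreq's real data files; "xx" is a made-up language code.
AVAILABLE = {
    "en": {"small", "large"},
    "xx": {"small"},
}


def resolve_wordlist(lang, wordlist="best", available=AVAILABLE):
    """Return the concrete list name that a request would resolve to.

    'best' resolves to 'large' if the language has one, else 'small',
    following the rule described in the README text of this commit.
    """
    sizes = available.get(lang, set())
    if wordlist == "best":
        for candidate in ("large", "small"):
            if candidate in sizes:
                return candidate
        raise LookupError(f"no wordlist available for {lang!r}")
    if wordlist not in sizes:
        raise LookupError(f"no {wordlist!r} wordlist for {lang!r}")
    return wordlist


def zipf_from_frequency(freq):
    """Convert a frequency between 0 and 1 to the Zipf scale.

    Zipf value = log10(occurrences per billion words) = log10(freq) + 9.
    A non-positive frequency (word not found) maps to 0.0, matching the
    default described in the README.
    """
    if freq <= 0:
        return 0.0
    return round(math.log10(freq) + 9, 2)


if __name__ == "__main__":
    print(resolve_wordlist("en"))      # list chosen when 'large' exists
    print(resolve_wordlist("xx"))      # list chosen when only 'small' exists
    print(zipf_from_frequency(1e-6))   # cutoff of the 'small' lists
    print(zipf_from_frequency(1e-8))   # cutoff of the 'large' lists
```

The two cutoffs line up with the hunk at line 65: once per million words is Zipf 3.0 and once per 100 million words is Zipf 1.0. That is why `zipf_frequency('zipf', 'en')` changes from 0.0 to 1.32 under the new default, while requesting `wordlist='small'` still gives 0.0.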