From c5f64a5de8fa6002fce2330371e1d2901e10acb6 Mon Sep 17 00:00:00 2001
From: Robyn Speer
Date: Thu, 8 Mar 2018 18:16:15 -0500
Subject: [PATCH] update the README

---
 README.md | 49 +++++++++++++++++++++++++------------------------
 1 file changed, 25 insertions(+), 24 deletions(-)

diff --git a/README.md b/README.md
index e1b4729..aa5e9a6 100644
--- a/README.md
+++ b/README.md
@@ -23,20 +23,21 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
 ## Usage
 
 wordfreq provides access to estimates of the frequency with which a word is
-used, in 27 languages (see *Supported languages* below).
+used, in 35 languages (see *Supported languages* below).
 
-It provides three kinds of pre-built wordlists:
+It provides both 'small' and 'large' wordlists:
 
-- `'combined'` lists, containing words that appear at least once per
-  million words, averaged across all data sources.
-- `'twitter'` lists, containing words that appear at least once per
-  million words on Twitter alone.
-- `'large'` lists, containing words that appear at least once per 100
-  million words, averaged across all data sources.
+- The 'small' lists take up very little memory and cover words that appear at
+  least once per million words.
+- The 'large' lists cover words that appear at least once per 100 million
+  words.
 
-The most straightforward function is:
+The default list is 'best', which uses 'large' if it's available for the
+language, and 'small' otherwise.
 
-    word_frequency(word, lang, wordlist='combined', minimum=0.0)
+The most straightforward function for looking up frequencies is:
+
+    word_frequency(word, lang, wordlist='best', minimum=0.0)
 
 This function looks up a word's frequency in the given language, returning its
 frequency as a decimal between 0 and 1. In these examples, we'll multiply the
@@ -47,10 +48,10 @@ frequencies by a million (1e6) to get more readable numbers:
     11.748975549395302
 
     >>> word_frequency('café', 'en') * 1e6
-    3.981071705534969
+    3.890451449942805
 
     >>> word_frequency('cafe', 'fr') * 1e6
-    1.4125375446227555
+    1.4454397707459279
 
     >>> word_frequency('café', 'fr') * 1e6
     53.70317963702532
@@ -65,25 +66,25 @@ example, and a word with Zipf value 3 appears once per million words.
 
 Reasonable Zipf values are between 0 and 8, but because of the cutoffs
 described above, the minimum Zipf value appearing in these lists is 1.0 for the
-'large' wordlists and 3.0 for all others. We use 0 as the default Zipf value
+'large' wordlists and 3.0 for 'small'. We use 0 as the default Zipf value
 for words that do not appear in the given wordlist, although it should mean
 one occurrence per billion words.
 
     >>> from wordfreq import zipf_frequency
     >>> zipf_frequency('the', 'en')
-    7.75
+    7.77
 
     >>> zipf_frequency('word', 'en')
     5.32
 
     >>> zipf_frequency('frequency', 'en')
-    4.36
+    4.38
 
     >>> zipf_frequency('zipf', 'en')
-    0.0
+    1.32
 
-    >>> zipf_frequency('zipf', 'en', wordlist='large')
-    1.28
+    >>> zipf_frequency('zipf', 'en', wordlist='small')
+    0.0
 
 
 The parameters to `word_frequency` and `zipf_frequency` are:
@@ -95,7 +96,7 @@ The parameters to `word_frequency` and `zipf_frequency` are:
 - `lang`: the BCP 47 or ISO 639 code of the language to use, such as 'en'.
 
 - `wordlist`: which set of word frequencies to use. Current options are
-  'combined', 'twitter', and 'large'.
+  'small', 'large', and 'best'.
 
 - `minimum`: If the word is not in the list or has a frequency lower than
   `minimum`, return `minimum` instead. You may want to set this to the minimum
@@ -108,7 +109,7 @@ Other functions:
 way that the words in wordfreq's data were counted in the first place. See
 *Tokenization*.
 
-`top_n_list(lang, n, wordlist='combined')` returns the most common *n* words in
+`top_n_list(lang, n, wordlist='best')` returns the most common *n* words in
 the list, in descending frequency order.
 
     >>> from wordfreq import top_n_list
@@ -118,18 +119,18 @@ the list, in descending frequency order.
     >>> top_n_list('es', 10)
     ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']
 
-`iter_wordlist(lang, wordlist='combined')` iterates through all the words in a
+`iter_wordlist(lang, wordlist='best')` iterates through all the words in a
 wordlist, in descending frequency order.
 
-`get_frequency_dict(lang, wordlist='combined')` returns all the frequencies in
+`get_frequency_dict(lang, wordlist='best')` returns all the frequencies in
 a wordlist as a dictionary, for cases where you'll want to look up a lot of
 words and don't need the wrapper that `word_frequency` provides.
 
-`supported_languages(wordlist='combined')` returns a dictionary whose keys are
+`supported_languages(wordlist='best')` returns a dictionary whose keys are
 language codes, and whose values are the data file that will be loaded to
 provide the requested wordlist in each language.
 
-`random_words(lang='en', wordlist='combined', nwords=5, bits_per_word=12)`
+`random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12)`
 returns a selection of random words, separated by spaces. `bits_per_word=n`
 will select each random word from 2^n words.
 
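
Note on the Zipf values in this patch: the README text says a Zipf value of 3 means once per million words and that 0 should mean once per billion, which matches the base-10 logarithm of occurrences per billion words. The following is a minimal sketch of that conversion, written without wordfreq itself. The helper names `freq_to_zipf` and `zipf_to_freq` are hypothetical and are not part of the library's API.

```python
import math


def freq_to_zipf(freq: float) -> float:
    """Convert a proportion (0 < freq <= 1) to a Zipf value.

    Assumes Zipf = log10(occurrences per billion words) = log10(freq) + 9.
    Hypothetical helper, not a wordfreq function.
    """
    if freq <= 0:
        raise ValueError("frequency must be positive")
    return math.log10(freq) + 9.0


def zipf_to_freq(zipf: float) -> float:
    """Inverse of freq_to_zipf: Zipf value back to a proportion."""
    return 10.0 ** (zipf - 9.0)


# Once per million words corresponds to Zipf 3.
print(round(freq_to_zipf(1e-6), 2))  # 3.0

# 'café' in French is shown in the README as about 53.70 per million words.
print(round(freq_to_zipf(53.70317963702532e-6), 2))  # 4.73
```

Under this reading, the 'small' cutoff of once per million words corresponds to Zipf 3.0 and the 'large' cutoff of once per 100 million words to Zipf 1.0, which agrees with the minimum values quoted in the patched text.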