Add and document large wordlists

Former-commit-id: d79ee37da9
2024-12-23 17:31:41 +00:00 · 2016-01-22 16:23:43 -05:00 · 6344b38194
commit 6344b38194
parent 12e779fc79
6 changed files with 58 additions and 21 deletions
--- a/README.md
+++ b/README.md
@@ -39,11 +39,18 @@ For example:
 ## Usage

 wordfreq provides access to estimates of the frequency with which a word is
-used, in 18 languages (see *Supported languages* below). It loads
-efficiently-packed data structures that contain all words that appear at least
-once per million words.
+used, in 18 languages (see *Supported languages* below).

-The most useful function is:
+It provides three kinds of pre-built wordlists:
+
+- `'combined'` lists, containing words that appear at least once per
+  million words, averaged across all data sources.
+- `'twitter'` lists, containing words that appear at least once per
+  million words on Twitter alone.
+- `'large'` lists, containing words that appear at least once per 100
+  million words, averaged across all data sources.
+
+The most straightforward function is:

    word_frequency(word, lang, wordlist='combined', minimum=0.0)

@@ -64,7 +71,37 @@ frequencies by a million (1e6) to get more readable numbers:
    >>> word_frequency('café', 'fr') * 1e6
    77.62471166286912

-The parameters are:
+
+`zipf_frequency` is a variation on `word_frequency` that aims to return the
+word frequency on a human-friendly logarithmic scale. The Zipf scale was
+proposed by Marc Brysbaert, who created the SUBTLEX lists. The Zipf frequency
+of a word is the base-10 logarithm of the number of times it appears per
+billion words. A word with Zipf value 6 appears once per thousand words, for
+example, and a word with Zipf value 3 appears once per million words.
+
+Reasonable Zipf values are between 0 and 8, but because of the cutoffs
+described above, the minimum Zipf value appearing in these lists is 1.0 for the
+'large' wordlists and 3.0 for all others. We use 0 as the default Zipf value
+for words that do not appear in the given wordlist, although it should mean
+one occurrence per billion words.
+
+    >>> zipf_frequency('the', 'en')
+    7.59
+
+    >>> zipf_frequency('word', 'en')
+    5.34
+
+    >>> zipf_frequency('frequency', 'en')
+    4.44
+
+    >>> zipf_frequency('zipf', 'en')
+    0.0
+
+    >>> zipf_frequency('zipf', 'en', 'large')
+    1.42
+
+
+The parameters to `word_frequency` and `zipf_frequency` are:

 - `word`: a Unicode string containing the word to look up. Ideally the word
  is a single token according to our tokenizer, but if not, there is still
@@ -73,21 +110,18 @@ The parameters are:
 - `lang`: the BCP 47 or ISO 639 code of the language to use, such as 'en'.

 - `wordlist`: which set of word frequencies to use. Current options are
-  'combined', which combines up to five different sources, and
-  'twitter', which returns frequencies observed on Twitter alone.
+  'combined', 'twitter', and 'large'.

 - `minimum`: If the word is not in the list or has a frequency lower than
-  `minimum`, return `minimum` instead. In some applications, you'll want
-  to set `minimum=1e-6` to avoid a discontinuity where the list ends, because
-  a frequency of 1e-6 (1 per million) is the threshold for being included in
-  the list at all.
+  `minimum`, return `minimum` instead. You may want to set this to the minimum
+  value contained in the wordlist, to avoid a discontinuity where the wordlist
+  ends.

 Other functions:

 `tokenize(text, lang)` splits text in the given language into words, in the same
 way that the words in wordfreq's data were counted in the first place. See
-*Tokenization*. Tokenizing Japanese requires the optional dependency `mecab-python3`
-to be installed.
+*Tokenization*.

 `top_n_list(lang, n, wordlist='combined')` returns the most common *n* words in
 the list, in descending frequency order.
@@ -168,6 +202,8 @@ it, but we have too few data sources for it so far:
    ──────────────────┼───────────────────────────────────────
    Korean      ko    │ -       -       -       Yes     Yes

+The 'large' wordlists are available in English, Spanish, French, and Portuguese.
+
 [1] We've counted the frequencies from tweets in German, such as they are, but
 you should be aware that German is not a frequently-used language on Twitter.
 Germans just don't tweet that much.
@@ -179,7 +215,8 @@ wordfreq uses the Python package `regex`, which is a more advanced
 implementation of regular expressions than the standard library, to
 separate text into tokens that can be counted consistently. `regex`
 produces tokens that follow the recommendations in [Unicode
-Annex #29, Text Segmentation][uax29].
+Annex #29, Text Segmentation][uax29], including the optional rule that
+splits words between apostrophes and vowels.

 There are language-specific exceptions:

@@ -199,10 +236,10 @@ Because tokenization in the real world is far from consistent, wordfreq will
 also try to deal gracefully when you query it with texts that actually break
 into multiple tokens:

-    >>> word_frequency('New York', 'en')
-    0.0002315934248950231
-    >>> word_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.2187603965715087e-06
+    >>> zipf_frequency('New York', 'en')
+    5.31
+    >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
+    3.51

 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be. In Chinese,
@@ -216,8 +253,8 @@ frequencies, because that would assume they are statistically unrelated. So if
 you give it an uncommon combination of tokens, it will hugely over-estimate
 their frequency:

-    >>> word_frequency('owl-flavored', 'en')
-    1.3557098723512335e-06
+    >>> zipf_frequency('owl-flavored', 'en')
+    3.18


 ## License
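
Review note on the README changes above: the Zipf scale and the token-combining rule are both simple arithmetic, so a small sketch may help when checking the new examples. This is not wordfreq's implementation. `zipf_from_frequency` and `half_harmonic_mean` are hypothetical helpers written for this note; rounding to two decimals is assumed from the README's sample output, and "half-harmonic-mean" is read here as half of the harmonic mean, i.e. `a * b / (a + b)`.

```python
import math


def zipf_from_frequency(freq: float) -> float:
    """Hypothetical helper: convert a proportion (e.g. 1e-6) to the Zipf scale."""
    if freq <= 0.0:
        # The README documents 0 as the default for words missing from a list.
        return 0.0
    # log10(occurrences per billion words) = log10(freq * 1e9) = log10(freq) + 9
    return round(math.log10(freq) + 9.0, 2)


def half_harmonic_mean(a: float, b: float) -> float:
    """Half of the harmonic mean of two positive frequencies (assumed reading)."""
    return (a * b) / (a + b)


# Cutoffs described in the README: once per 100 million words ('large')
# and once per million words ('combined' and 'twitter').
large_cutoff = zipf_from_frequency(1e-8)      # 1.0
default_cutoff = zipf_from_frequency(1e-6)    # 3.0

# The 'café' example from the README, expressed on the Zipf scale.
cafe_zipf = zipf_from_frequency(77.62471166286912e-6)  # 4.89
```

Under this reading, two tokens with the same frequency combine to half that frequency, which fits the README's warning that uncommon combinations are over-estimated: the result never drops far below the rarest token.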
--- a/wordfreq/data/large_en.msgpack.gz
+++ b/wordfreq/data/large_en.msgpack.gz
--- a/wordfreq/data/large_es.msgpack.gz
+++ b/wordfreq/data/large_es.msgpack.gz
--- a/wordfreq/data/large_fr.msgpack.gz
+++ b/wordfreq/data/large_fr.msgpack.gz
--- a/wordfreq/data/large_pt.msgpack.gz
+++ b/wordfreq/data/large_pt.msgpack.gz
--- a/wordfreq_builder/wordfreq_builder/config.py
+++ b/wordfreq_builder/wordfreq_builder/config.py
@ -56,7 +56,7 @@ CONFIG = {
        'reddit': 'generated/reddit/reddit_{lang}.{ext}',
        'combined': 'generated/combined/combined_{lang}.{ext}',
        'combined-dist': 'dist/combined_{lang}.{ext}',
-        'combined-dist-large': 'dist/combined-large_{lang}.{ext}',
+        'combined-dist-large': 'dist/large_{lang}.{ext}',
        'twitter-dist': 'dist/twitter_{lang}.{ext}',
        'jieba-dist': 'dist/jieba_{lang}.{ext}'
    },
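
Review note on the config change above: the renamed template lines up with the data files added in this commit. The sketch below assumes the builder fills the template with `str.format`, and that `ext` is `msgpack.gz`; both are inferred from the file names in this diff, not confirmed from the builder code.

```python
# Template taken from the new 'combined-dist-large' entry in CONFIG.
template = 'dist/large_{lang}.{ext}'

# Languages with 'large' wordlists, per the README change in this commit.
large_languages = ['en', 'es', 'fr', 'pt']

# Assumption: the extension matches the data files added here (*.msgpack.gz).
paths = [template.format(lang=lang, ext='msgpack.gz') for lang in large_languages]

for path in paths:
    print(path)
```

The resulting base names (`large_en.msgpack.gz` and so on) match the files added under `wordfreq/data/`, which is presumably why the `combined-` prefix was dropped.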