Merge pull request #58 from LuminosoInsight/significant-figures

Round wordfreq output to 3 sig. figs, and update documentation
Lance Nathan, 2018-06-25 18:53:39 -04:00, committed by GitHub
commit 79caa526c3
7 changed files with 137 additions and 69 deletions


@@ -1,3 +1,20 @@
+## Version 2.1 (2018-06-18)
+
+Data changes:
+
+- Updated to the data from the latest Exquisite Corpus, which adds the
+  ParaCrawl web crawl and updates to OpenSubtitles 2018
+- Added small word list for Latvian
+- Added large word list for Czech
+- The Dutch large word list once again has 5 data sources
+
+Library change:
+
+- The output of `word_frequency` is rounded to three significant digits. This
+  provides friendlier output, and better reflects the precision of the
+  underlying data anyway.
+
## Version 2.0.1 (2018-05-01)
Fixed edge cases that inserted spurious token boundaries when Japanese text is
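To illustrate the rounding change announced in Version 2.1 above: a minimal sketch of the new user-visible behavior (the exact values depend on the wordlist data shipped with the release; the real implementation appears in the last file of this diff):

```python
from wordfreq import word_frequency

# Output now carries at most three significant digits, e.g. 1.07e-05
# rather than 1.1748975549395302e-05 (illustrative values).
print(word_frequency('cafe', 'en'))
```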

README.md (168 changed lines)

@@ -1,4 +1,5 @@
-wordfreq is a Python library for looking up the frequencies of words in many languages, based on many sources of data.
+wordfreq is a Python library for looking up the frequencies of words in many
+languages, based on many sources of data.
Author: Robyn Speer
@@ -22,7 +23,8 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
## Usage

wordfreq provides access to estimates of the frequency with which a word is
-used, in 35 languages (see *Supported languages* below).
+used, in 36 languages (see *Supported languages* below). It uses many different
+data sources, not just one corpus.
It provides both 'small' and 'large' wordlists:
@@ -39,21 +41,20 @@ The most straightforward function for looking up frequencies is:

    word_frequency(word, lang, wordlist='best', minimum=0.0)

This function looks up a word's frequency in the given language, returning its
-frequency as a decimal between 0 and 1. In these examples, we'll multiply the
-frequencies by a million (1e6) to get more readable numbers:
+frequency as a decimal between 0 and 1.
    >>> from wordfreq import word_frequency
-    >>> word_frequency('cafe', 'en') * 1e6
-    11.748975549395302
+    >>> word_frequency('cafe', 'en')
+    1.07e-05

-    >>> word_frequency('café', 'en') * 1e6
-    3.890451449942805
+    >>> word_frequency('café', 'en')
+    5.89e-06

-    >>> word_frequency('cafe', 'fr') * 1e6
-    1.4454397707459279
+    >>> word_frequency('cafe', 'fr')
+    1.51e-06

-    >>> word_frequency('café', 'fr') * 1e6
-    53.70317963702532
+    >>> word_frequency('café', 'fr')
+    5.25e-05
`zipf_frequency` is a variation on `word_frequency` that aims to return the
@@ -74,13 +75,13 @@ one occurrence per billion words.
    7.77

    >>> zipf_frequency('word', 'en')
-    5.32
+    5.29

    >>> zipf_frequency('frequency', 'en')
-    4.38
+    4.42

    >>> zipf_frequency('zipf', 'en')
-    1.32
+    1.55

    >>> zipf_frequency('zipf', 'en', wordlist='small')
    0.0
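For orientation, the Zipf scale used above is the base-10 logarithm of a word's frequency per billion words. A minimal sketch of that relationship (outputs depend on the shipped data, and both functions round their results, so agreement is approximate):

```python
from math import log10
from wordfreq import word_frequency, zipf_frequency

freq = word_frequency('word', 'en')
# The Zipf value is, up to rounding, log10 of occurrences per billion words
print(round(log10(freq * 1e9), 2))   # approximately 5.29
print(zipf_frequency('word', 'en'))  # 5.29
```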
@@ -102,45 +103,39 @@ The parameters to `word_frequency` and `zipf_frequency` are:
value contained in the wordlist, to avoid a discontinuity where the wordlist
ends.
-Other functions:
-
-`tokenize(text, lang)` splits text in the given language into words, in the same
-way that the words in wordfreq's data were counted in the first place. See
-*Tokenization*.

+## Frequency bins

-`top_n_list(lang, n, wordlist='best')` returns the most common *n* words in
-the list, in descending frequency order.

+wordfreq's wordlists are designed to load quickly and take up little space in
+the repository. We accomplish this by avoiding meaningless precision and
+packing the words into frequency bins.

-    >>> from wordfreq import top_n_list
-    >>> top_n_list('en', 10)
-    ['the', 'of', 'to', 'and', 'a', 'in', 'i', 'is', 'that', 'for']

+In wordfreq, all words that have the same Zipf frequency rounded to the nearest
+hundredth have the same frequency. We don't store any more precision than that.
+So instead of having to store that the frequency of a word is
+.000011748975549395302, where most of those digits are meaningless, we just store
+the frequency bins and the words they contain.

-    >>> top_n_list('es', 10)
-    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']

+Because the Zipf scale is a logarithmic scale, this preserves the same relative
+precision no matter how far down you are in the word list. The frequency of any
+word is precise to within 1%.

-`iter_wordlist(lang, wordlist='best')` iterates through all the words in a
-wordlist, in descending frequency order.

+(This is not a claim about _accuracy_, but about _precision_. We believe that
+the way we use multiple data sources and discard outliers makes wordfreq a
+more accurate measurement of the way these words are really used in written
+language, but it's unclear how one would measure this accuracy.)
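A sketch of the binning idea described above (illustrative only; wordfreq applies this when building its wordlists, not at query time):

```python
from math import log10

def binned_frequency(raw_freq):
    """Snap a raw frequency to the nearest hundredth on the Zipf scale."""
    zipf = log10(raw_freq) + 9         # Zipf scale: log10 of frequency per billion words
    return 10 ** (round(zipf, 2) - 9)

# A half-bin of 0.005 Zipf units is a relative error of 10**0.005 - 1, about 1.2%,
# which is why any stored frequency is precise to within roughly 1%.
print(binned_frequency(1.19e-05))  # ~1.2e-05, snapped to its bin
```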
-`get_frequency_dict(lang, wordlist='best')` returns all the frequencies in
-a wordlist as a dictionary, for cases where you'll want to look up a lot of
-words and don't need the wrapper that `word_frequency` provides.
-
-`supported_languages(wordlist='best')` returns a dictionary whose keys are
-language codes, and whose values are the data file that will be loaded to
-provide the requested wordlist in each language.

+## The figure-skating metric

-`random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12)`
-returns a selection of random words, separated by spaces. `bits_per_word=n`
-will select each random word from 2^n words.

+We combine word frequencies from different sources in a way that's designed
+to minimize the impact of outliers. The method reminds me of the scoring system
+in Olympic figure skating:

-If you happen to want an easy way to get [a memorable, xkcd-style
-password][xkcd936] with 60 bits of entropy, this function will almost do the
-job. In this case, you should actually run the similar function
-`random_ascii_words`, limiting the selection to words that can be typed in
-ASCII. But maybe you should just use [xkpa][].
-
-[xkcd936]: https://xkcd.com/936/
-[xkpa]: https://github.com/beala/xkcd-password

+- Find the frequency of each word according to each data source.
+- For each word, drop the sources that give it the highest and lowest frequency.
+- Average the remaining frequencies.
+- Rescale the resulting frequency list to add up to 1.
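A sketch of the trimming scheme in the list above, under the simplifying assumption that all sources are weighted equally (the real combination happens in the Exquisite Corpus build pipeline, not in this library):

```python
def combine_sources(per_source_freqs):
    """Drop the highest and lowest estimate for one word, then average the rest."""
    if len(per_source_freqs) <= 2:
        return sum(per_source_freqs) / len(per_source_freqs)
    trimmed = sorted(per_source_freqs)[1:-1]
    return sum(trimmed) / len(trimmed)

# The outlier estimate 0.009 is discarded instead of skewing the average;
# the combined values for all words would then be rescaled to sum to 1.
print(combine_sources([0.0010, 0.0012, 0.0011, 0.009]))  # 0.00115
```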
## Sources and supported languages
@@ -155,14 +150,16 @@ Exquisite Corpus compiles 8 different domains of text, some of which themselves
come from multiple sources:
- **Wikipedia**, representing encyclopedic text
-- **Subtitles**, from OPUS OpenSubtitles 2016 and SUBTLEX
+- **Subtitles**, from OPUS OpenSubtitles 2018 and SUBTLEX
- **News**, from NewsCrawl 2014 and GlobalVoices
- **Books**, from Google Books Ngrams 2012
-- **Web** text, from the Leeds Internet Corpus and the MOKK Hungarian Webcorpus
+- **Web** text, from ParaCrawl, the Leeds Internet Corpus, and the MOKK
+  Hungarian Webcorpus
- **Twitter**, representing short-form social media
- **Reddit**, representing potentially longer Internet comments
- **Miscellaneous** word frequencies: in Chinese, we import a free wordlist
-that comes with the Jieba word segmenter, whose provenance we don't really know
+that comes with the Jieba word segmenter, whose provenance we don't really
+know
The following languages are supported, with reasonable tokenization and at
least 3 different sources of word frequencies:
@@ -224,6 +221,52 @@ between 1.0 and 3.0. These are available in 14 languages that are covered by
enough data sources.
+## Other functions
+
+`tokenize(text, lang)` splits text in the given language into words, in the same
+way that the words in wordfreq's data were counted in the first place. See
+*Tokenization*.
+
+`top_n_list(lang, n, wordlist='best')` returns the most common *n* words in
+the list, in descending frequency order.
+
+    >>> from wordfreq import top_n_list
+    >>> top_n_list('en', 10)
+    ['the', 'of', 'to', 'and', 'a', 'in', 'i', 'is', 'that', 'for']
+
+    >>> top_n_list('es', 10)
+    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']
+
+`iter_wordlist(lang, wordlist='best')` iterates through all the words in a
+wordlist, in descending frequency order.
+
+`get_frequency_dict(lang, wordlist='best')` returns all the frequencies in
+a wordlist as a dictionary, for cases where you'll want to look up a lot of
+words and don't need the wrapper that `word_frequency` provides.
+
+`available_languages(wordlist='best')` returns a dictionary whose keys are
+language codes, and whose values are the data file that will be loaded to
+provide the requested wordlist in each language.
+
+`get_language_info(lang)` returns a dictionary of information about how we
+preprocess text in this language, such as what script we expect it to be
+written in, which characters we normalize together, and how we tokenize it.
+See its docstring for more information.
+
+`random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12)`
+returns a selection of random words, separated by spaces. `bits_per_word=n`
+will select each random word from 2^n words.
+
+If you happen to want an easy way to get [a memorable, xkcd-style
+password][xkcd936] with 60 bits of entropy, this function will almost do the
+job. In this case, you should actually run the similar function
+`random_ascii_words`, limiting the selection to words that can be typed in
+ASCII. But maybe you should just use [xkpa][].
+
+[xkcd936]: https://xkcd.com/936/
+[xkpa]: https://github.com/beala/xkcd-password
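A hypothetical invocation of the password use case described above, assuming `random_ascii_words` takes the same parameters as `random_words` (as the text implies): five words at 12 bits each gives the 60 bits of entropy mentioned.

```python
from wordfreq import random_ascii_words

# 5 words x 12 bits/word = 60 bits of entropy; output varies on every call
print(random_ascii_words(lang='en', nwords=5, bits_per_word=12))
```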
## Tokenization
wordfreq uses the Python package `regex`, which is a more advanced
@@ -255,9 +298,9 @@ also try to deal gracefully when you query it with texts that actually break
into multiple tokens:
    >>> zipf_frequency('New York', 'en')
-    5.35
+    5.28

    >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.54
+    3.57
The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be. In Chinese,
@@ -272,7 +315,7 @@ you give it an uncommon combination of tokens, it will hugely over-estimate
their frequency:
    >>> zipf_frequency('owl-flavored', 'en')
-    3.18
+    3.2
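The half-harmonic-mean combination mentioned above can be sketched as the reciprocal of the summed reciprocals, which is exactly the identity the tests later in this commit assert (this sketch ignores the `INFERRED_SPACE_FACTOR` penalty that the implementation also applies for languages written without spaces):

```python
def combined_frequency(token_freqs):
    """Estimate a phrase's frequency from its tokens' frequencies:
    1/f_phrase = 1/f_1 + 1/f_2 + ... (illustrative numbers below)."""
    return 1.0 / sum(1.0 / f for f in token_freqs)

# Always lower than either token's own frequency, as a phrase should be
print(combined_frequency([3.3e-05, 2.6e-05]))  # ~1.45e-05
```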
## Multi-script languages
@@ -411,14 +454,14 @@ sources:
- The Leeds Internet Corpus, from the University of Leeds Centre for Translation
Studies (http://corpus.leeds.ac.uk/list.html)
-- The OpenSubtitles Frequency Word Lists, compiled by Hermit Dave
-  (https://invokeit.wordpress.com/frequency-word-lists/)
- Wikipedia, the free encyclopedia (http://www.wikipedia.org)
+- ParaCrawl, a multilingual Web crawl (https://paracrawl.eu)
It contains data from OPUS OpenSubtitles 2018
(http://opus.nlpl.eu/OpenSubtitles.php), whose data originates from the
-OpenSubtitles project (http://www.opensubtitles.org/).
+OpenSubtitles project (http://www.opensubtitles.org/) and may be used with
+attribution to OpenSubtitles.
It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK,
SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al.
@@ -483,10 +526,6 @@ The same citation in BibTex format:
Methods, 41 (4), 977-990.
http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf
-- Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A. M., Bölte, J., & Böhl, A.
-  (2015). The word frequency effect. Experimental Psychology.
-  http://econtent.hogrefe.com/doi/abs/10.1027/1618-3169/a000123?journalCode=zea
- Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A.
(2011). The word frequency effect: A review of recent developments and
implications for the choice of frequency estimates in German. Experimental
@@ -496,9 +535,6 @@ The same citation in BibTex format:
frequencies based on film subtitles. PLoS One, 5(6), e10729.
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729
-- Dave, H. (2011). Frequency word lists.
-  https://invokeit.wordpress.com/frequency-word-lists/
- Davis, M. (2012). Unicode text segmentation. Unicode Standard Annex, 29.
http://unicode.org/reports/tr29/
@@ -516,11 +552,19 @@ The same citation in BibTex format:
analyzer.
http://mecab.sourceforge.net/
+- Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W., and Petrov,
+  S. (2012). Syntactic annotations for the Google Books Ngram Corpus.
+  Proceedings of the ACL 2012 system demonstrations, 169-174.
+  http://aclweb.org/anthology/P12-3029
- Lison, P. and Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large
Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th
International Conference on Language Resources and Evaluation (LREC 2016).
http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf
+- ParaCrawl (2018). Provision of Web-Scale Parallel Corpora for Official
+  European Languages. https://paracrawl.eu/
- van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014).
SUBTLEX-UK: A new and improved word frequency database for British English.
The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.


@@ -1,2 +1,2 @@
[pytest]
-addopts = --doctest-modules
+addopts = --doctest-modules --doctest-glob=README.md
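The added option makes pytest collect the interpreter-style examples in README.md as doctests, which is why every frequency shown in the README diff above had to be updated to the new rounded output. A standalone equivalent, for illustration:

```python
import doctest

# Runs the '>>>' examples in the README and compares their printed output
results = doctest.testfile('README.md', module_relative=False)
print(results)  # TestResults(failed=0, attempted=...) when everything matches
```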


@@ -162,7 +162,7 @@ def test_phrase_freq():
    ff = word_frequency("flip-flop", 'en')
    assert ff > 0
    phrase_freq = 1.0 / word_frequency('flip', 'en') + 1.0 / word_frequency('flop', 'en')
-    assert 1.0 / ff == pytest.approx(phrase_freq)
+    assert 1.0 / ff == pytest.approx(phrase_freq, rel=0.01)
def test_not_really_random():
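Why `rel=0.01` here and in the next two test files: with outputs rounded to three significant digits, identities like `1/f == 1/f1 + 1/f2` only hold to within roughly a percent. A small sketch of the `pytest.approx` tolerance semantics (hypothetical numbers):

```python
import pytest

# rel=0.01 accepts up to 1% relative deviation from the expected value
assert 100.0 == pytest.approx(100.9, rel=0.01)        # 0.9% off: passes
assert not (100.0 == pytest.approx(102.0, rel=0.01))  # 2% off: fails the bound
```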


@@ -49,11 +49,11 @@ def test_combination():
    gozai_freq = word_frequency('ござい', 'ja')
    masu_freq = word_frequency('ます', 'ja')

-    assert word_frequency('おはようおはよう', 'ja') == pytest.approx(ohayou_freq / 2)
+    assert word_frequency('おはようおはよう', 'ja') == pytest.approx(ohayou_freq / 2, rel=0.01)
    assert (
        1.0 / word_frequency('おはようございます', 'ja') ==
-        pytest.approx(1.0 / ohayou_freq + 1.0 / gozai_freq + 1.0 / masu_freq)
+        pytest.approx(1.0 / ohayou_freq + 1.0 / gozai_freq + 1.0 / masu_freq, rel=0.01)
    )


@@ -10,9 +10,9 @@ def test_combination():
    gamsa_freq = word_frequency('감사', 'ko')
    habnida_freq = word_frequency('합니다', 'ko')

-    assert word_frequency('감사감사', 'ko') == pytest.approx(gamsa_freq / 2)
+    assert word_frequency('감사감사', 'ko') == pytest.approx(gamsa_freq / 2, rel=0.01)
    assert (
        1.0 / word_frequency('감사합니다', 'ko') ==
-        pytest.approx(1.0 / gamsa_freq + 1.0 / habnida_freq)
+        pytest.approx(1.0 / gamsa_freq + 1.0 / habnida_freq, rel=0.01)
    )


@@ -249,7 +249,14 @@ def _word_frequency(word, lang, wordlist, minimum):
        # probability for each word break that was inferred.
        freq /= INFERRED_SPACE_FACTOR ** (len(tokens) - 1)
-    return max(freq, minimum)
+
+    # All our frequency data is only precise to within 1% anyway, so round
+    # it to 3 significant digits
+    unrounded = max(freq, minimum)
+    if unrounded == 0.:
+        return 0.
+    else:
+        leading_zeroes = math.floor(-math.log(unrounded, 10))
+        return round(unrounded, leading_zeroes + 3)
def word_frequency(word, lang, wordlist='best', minimum=0.):
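A worked example of the arithmetic in the added lines above (illustrative value; `math` is presumably imported at the top of the module, outside this hunk):

```python
import math

unrounded = 1.0745e-05
# -log10(1.0745e-05) is about 4.97, so floor() counts 4 leading zeroes
leading_zeroes = math.floor(-math.log(unrounded, 10))
# Rounding to 4 + 3 = 7 decimal places keeps 3 significant digits
print(round(unrounded, leading_zeroes + 3))  # 1.07e-05
```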