From c6552f923fbecbdb9748cd38abc9dfbe933064ea Mon Sep 17 00:00:00 2001
From: Robyn Speer
Date: Mon, 18 Jun 2018 15:15:07 -0400
Subject: [PATCH] update README and CHANGELOG

---
 CHANGELOG.md | 17 +++++++++
 README.md    | 99 ++++++++++++++++++++++++++++++++--------------------
 2 files changed, 79 insertions(+), 37 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index b61267d..b2d0fbe 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,3 +1,20 @@
+## Version 2.1 (2018-06-18)
+
+Data changes:
+
+- Updated to the data from the latest Exquisite Corpus, which adds the
+  ParaCrawl web crawl and updates the subtitle data to OpenSubtitles 2018
+- Added a small word list for Latvian
+- Added a large word list for Czech
+- The Dutch large word list once again has 5 data sources
+
+Library change:
+
+- The output of `word_frequency` is rounded to three significant digits. This
+  provides friendlier output, and it better reflects the precision of the
+  underlying data anyway.
+
+
 ## Version 2.0.1 (2018-05-01)
 
 Fixed edge cases that inserted spurious token boundaries when Japanese text is
diff --git a/README.md b/README.md
index ba8f3aa..912c0f4 100644
--- a/README.md
+++ b/README.md
@@ -1,4 +1,5 @@
-wordfreq is a Python library for looking up the frequencies of words in many languages, based on many sources of data.
+wordfreq is a Python library for looking up the frequencies of words in many
+languages, based on many sources of data.
 
 Author: Robyn Speer
 
@@ -22,7 +23,8 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
 ## Usage
 
 wordfreq provides access to estimates of the frequency with which a word is
-used, in 35 languages (see *Supported languages* below).
+used, in 36 languages (see *Supported languages* below). It uses many different
+data sources, not just one corpus.
 
 It provides both 'small' and 'large' wordlists:
 
@@ -39,21 +41,20 @@ The most straightforward function for looking up frequencies is:
     word_frequency(word, lang, wordlist='best', minimum=0.0)
 
 This function looks up a word's frequency in the given language, returning its
-frequency as a decimal between 0 and 1. In these examples, we'll multiply the
-frequencies by a million (1e6) to get more readable numbers:
+frequency as a decimal between 0 and 1.
 
     >>> from wordfreq import word_frequency
-    >>> word_frequency('cafe', 'en') * 1e6
-    11.748975549395302
+    >>> word_frequency('cafe', 'en')
+    1.07e-05
 
-    >>> word_frequency('café', 'en') * 1e6
-    3.890451449942805
+    >>> word_frequency('café', 'en')
+    5.89e-06
 
-    >>> word_frequency('cafe', 'fr') * 1e6
-    1.4454397707459279
+    >>> word_frequency('cafe', 'fr')
+    1.51e-06
 
-    >>> word_frequency('café', 'fr') * 1e6
-    53.70317963702532
+    >>> word_frequency('café', 'fr')
+    5.25e-05
 
 
 `zipf_frequency` is a variation on `word_frequency` that aims to return the
@@ -74,13 +75,13 @@ one occurrence per billion words.
     7.77
 
     >>> zipf_frequency('word', 'en')
-    5.32
+    5.29
 
     >>> zipf_frequency('frequency', 'en')
-    4.38
+    4.42
 
     >>> zipf_frequency('zipf', 'en')
-    1.32
+    1.55
 
     >>> zipf_frequency('zipf', 'en', wordlist='small')
     0.0
@@ -112,13 +113,29 @@ packing the words into frequency bins. In wordfreq, all words that have the
 same Zipf frequency rounded to the nearest hundredth have the same frequency.
 We don't store any more precision than that. So instead of having to store
 that the frequency of a word is
-.000011748975549395302, information that is mostly meaningless, we just store
-the 600 possible frequency bins and the words they contain.
+.000011748975549395302, where most of those digits are meaningless, we just
+store the frequency bins and the words they contain.
 
 Because the Zipf scale is a logarithmic scale, this preserves the same relative
 precision no matter how far down you are in the word list. The frequency of any
-word is precise to within 1%. (This is not a claim about _accuracy_, which it's
-unclear how you'd even measure, just about _precision_.)
+word is precise to within 1%.
+
+(This is not a claim about _accuracy_, but about _precision_. We believe that
+the way we use multiple data sources and discard outliers makes wordfreq a
+more accurate measure of the way these words are really used in written
+language, but it's unclear how one would measure this accuracy.)
+
+
+## The figure-skating metric
+
+We combine word frequencies from different sources in a way that's designed
+to minimize the impact of outliers. The method reminds me of the scoring
+system in Olympic figure skating (sketched in code after this list):
+
+- Find the frequency of each word according to each data source.
+- For each word, drop the sources that give it the highest and lowest frequency.
+- Average the remaining frequencies.
+- Rescale the resulting frequency list so it adds up to 1.
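+
+As an illustration, here is a minimal sketch of that procedure in Python. It
+is not wordfreq's actual implementation; it assumes each data source is just
+a dictionary that maps words to frequencies:
+
+    # A sketch of the figure-skating combination, not wordfreq's real code.
+    # `sources` is a list of {word: frequency} dicts, one per data source.
+    def combine_sources(sources):
+        combined = {}
+        for word in set().union(*sources):
+            freqs = sorted(s[word] for s in sources if word in s)
+            if len(freqs) > 2:
+                freqs = freqs[1:-1]  # drop the highest and lowest estimates
+            combined[word] = sum(freqs) / len(freqs)
+        # Rescale so that the combined frequencies add up to 1
+        total = sum(combined.values())
+        return {word: freq / total for word, freq in combined.items()}
+
+Dropping each word's highest and lowest estimates is what keeps one unusual
+data source from skewing the result.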
 
 
 ## Sources and supported languages
 
@@ -133,14 +150,16 @@ Exquisite Corpus compiles 8 different domains of text, some of which
 themselves come from multiple sources:
 
 - **Wikipedia**, representing encyclopedic text
-- **Subtitles**, from OPUS OpenSubtitles 2016 and SUBTLEX
+- **Subtitles**, from OPUS OpenSubtitles 2018 and SUBTLEX
 - **News**, from NewsCrawl 2014 and GlobalVoices
 - **Books**, from Google Books Ngrams 2012
-- **Web** text, from the Leeds Internet Corpus and the MOKK Hungarian Webcorpus
+- **Web** text, from ParaCrawl, the Leeds Internet Corpus, and the MOKK
+  Hungarian Webcorpus
 - **Twitter**, representing short-form social media
 - **Reddit**, representing potentially longer Internet comments
 - **Miscellaneous** word frequencies: in Chinese, we import a free wordlist
-  that comes with the Jieba word segmenter, whose provenance we don't really know
+  that comes with the Jieba word segmenter, whose provenance we don't really
+  know
 
 The following languages are supported, with reasonable tokenization and at
 least 3 different sources of word frequencies:
@@ -225,10 +244,15 @@ wordlist, in descending frequency order.
 a wordlist as a dictionary, for cases where you'll want to look up a lot of
 words and don't need the wrapper that `word_frequency` provides.
 
-`supported_languages(wordlist='best')` returns a dictionary whose keys are
+`available_languages(wordlist='best')` returns a dictionary whose keys are
 language codes, and whose values are the data file that will be loaded to
 provide the requested wordlist in each language.
 
+`get_language_info(lang)` returns a dictionary of information about how we
+preprocess text in that language, such as what script we expect it to be
+written in, which characters we normalize together, and how we tokenize it.
+See its docstring for more information.
+
 `random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12)` returns
 a selection of random words, separated by spaces. `bits_per_word=n` will
 select each random word from 2^n words.
@@ -274,9 +298,9 @@ also try to deal gracefully when you query it with texts that actually break
 into multiple tokens:
 
     >>> zipf_frequency('New York', 'en')
-    5.35
+    5.28
 
     >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.54
+    3.57
 
 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be. In Chinese,
@@ -291,7 +315,7 @@ you give it an uncommon combination of tokens, it will hugely over-estimate
 their frequency:
 
     >>> zipf_frequency('owl-flavored', 'en')
-    3.18
+    3.2
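+
+To make that concrete, here is a sketch of the half-harmonic-mean combination
+described above. It is not wordfreq's exact code, and it assumes that the
+generalization to n tokens is the reciprocal of the sum of reciprocals (for
+two tokens, that works out to half their harmonic mean, `f1 * f2 / (f1 + f2)`):
+
+    from wordfreq import word_frequency
+
+    # A sketch of half-harmonic-mean combination, not wordfreq's real code.
+    # Assumes every token has a nonzero frequency.
+    def combined_frequency(token_freqs):
+        return 1.0 / sum(1.0 / f for f in token_freqs)
+
+    # 'owl' and 'flavored' are each fairly common, so this estimate comes
+    # out much higher than the true frequency of 'owl-flavored'.
+    estimate = combined_frequency([
+        word_frequency('owl', 'en'),
+        word_frequency('flavored', 'en'),
+    ])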
 
 
 ## Multi-script languages
@@ -430,14 +454,14 @@ sources:
 - The Leeds Internet Corpus, from the University of Leeds Centre for
   Translation Studies (http://corpus.leeds.ac.uk/list.html)
 
-- The OpenSubtitles Frequency Word Lists, compiled by Hermit Dave
-  (https://invokeit.wordpress.com/frequency-word-lists/)
-
 - Wikipedia, the free encyclopedia (http://www.wikipedia.org)
 
+- ParaCrawl, a multilingual Web crawl (https://paracrawl.eu)
+
 It contains data from OPUS OpenSubtitles 2018
 (http://opus.nlpl.eu/OpenSubtitles.php), whose data originates from the
-OpenSubtitles project (http://www.opensubtitles.org/).
+OpenSubtitles project (http://www.opensubtitles.org/) and may be used with
+attribution to OpenSubtitles.
 
 It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK,
 SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al.
@@ -502,10 +526,6 @@ The same citation in BibTex format:
   Methods, 41 (4), 977-990.
   http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf
 
-- Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A. M., Bölte, J., & Böhl, A.
-  (2015). The word frequency effect. Experimental Psychology.
-  http://econtent.hogrefe.com/doi/abs/10.1027/1618-3169/a000123?journalCode=zea
-
 - Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A.
   (2011). The word frequency effect: A review of recent developments and
   implications for the choice of frequency estimates in German. Experimental
@@ -515,9 +535,6 @@ The same citation in BibTex format:
   frequencies based on film subtitles. PLoS One, 5(6), e10729.
   http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729
 
-- Dave, H. (2011). Frequency word lists.
-  https://invokeit.wordpress.com/frequency-word-lists/
-
 - Davis, M. (2012). Unicode text segmentation. Unicode Standard Annex, 29.
   http://unicode.org/reports/tr29/
@@ -535,11 +552,19 @@ The same citation in BibTex format:
   analyzer. http://mecab.sourceforge.net/
 
+- Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W., and Petrov,
+  S. (2012). Syntactic annotations for the Google Books Ngram Corpus.
+  Proceedings of the ACL 2012 system demonstrations, 169-174.
+  http://aclweb.org/anthology/P12-3029
+
 - Lison, P. and Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large
   Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th
   International Conference on Language Resources and Evaluation (LREC 2016).
   http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf
 
+- ParaCrawl (2018). Provision of Web-Scale Parallel Corpora for Official
+  European Languages. https://paracrawl.eu/
+
 - van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014).
   SUBTLEX-UK: A new and improved word frequency database for British English.
   The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.