diff --git a/CHANGELOG.md b/CHANGELOG.md index 2fb7188..dd5cbf3 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,9 +1,34 @@ +# Changelog + +## Version 3.0 (2022-03-10) + +This is the "handle numbers better" release. + +Previously, wordfreq would group all digit sequences of the same 'shape', +with length 2 or more, into a single token and return the frequency of that +token, which would be a vast overestimate. + +Now it distributes the frequency over all numbers of that shape, with an +estimated distribution that allows for Benford's law (lower numbers are more +frequent) and a special frequency distribution for 4-digit numbers that look +like years (2010 is more frequent than 1020). + +Relatedly: + +- Functions such as `iter_wordlist` and `top_n_list` no longer return + multi-digit numbers (they used to return them in their "smashed" form, such + as "0000"). + +- `lossy_tokenize` no longer replaces digit sequences with 0s. That happens + instead in a place that's internal to the `word_frequency` function, so we can + look at the values of the digits before they're replaced. + ## Version 2.5.1 (2021-09-02) - Import ftfy and use its `uncurl_quotes` method to turn curly quotes into straight ones, providing consistency with multiple forms of apostrophes. -- Set minimum version requierements on `regex`, `jieba`, and `langcodes` +- Set minimum version requirements on `regex`, `jieba`, and `langcodes` so that tokenization will give consistent results. - Workaround an inconsistency in the `msgpack` API around @@ -83,7 +108,6 @@ Library changes: - Fixed calling `msgpack.load` with a deprecated parameter. - ## Version 2.2 (2018-07-24) Library change: @@ -104,7 +128,6 @@ Data changes: - The input data includes the change to tokenization described above, giving us word frequencies for words such as "l@s". 
- ## Version 2.1 (2018-06-18) Data changes: @@ -125,7 +148,6 @@ Library changes: in `/usr/lib/x86_64-linux-gnu/mecab`, which is where Ubuntu 18.04 puts them when they are installed from source. - ## Version 2.0.1 (2018-05-01) Fixed edge cases that inserted spurious token boundaries when Japanese text is @@ -148,8 +170,6 @@ use the iteration mark 々. This change does not affect any word frequencies. (The Japanese word list uses `wordfreq.mecab` for tokenization, not `simple_tokenize`.) - - ## Version 2.0 (2018-03-14) The big change in this version is that text preprocessing, tokenization, and @@ -212,7 +232,6 @@ Nitty gritty dependency changes: [exquisite-corpus]: https://github.com/LuminosoInsight/exquisite-corpus - ## Version 1.7.0 (2017-08-25) - Tokenization will always keep Unicode graphemes together, including @@ -223,7 +242,6 @@ Nitty gritty dependency changes: - Support Bengali and Macedonian, which passed the threshold of having enough source data to be included - ## Version 1.6.1 (2017-05-10) - Depend on langcodes 1.4, with a new language-matching system that does not @@ -232,13 +250,12 @@ Nitty gritty dependency changes: This prevents silly conflicts where langcodes' SQLite connection was preventing langcodes from being used in threads. 
- ## Version 1.6.0 (2017-01-05) - Support Czech, Persian, Ukrainian, and Croatian/Bosnian/Serbian - Add large lists in Chinese, Finnish, Japanese, and Polish - Data is now collected and built using Exquisite Corpus - (https://github.com/LuminosoInsight/exquisite-corpus) + (<https://github.com/LuminosoInsight/exquisite-corpus>) - Add word frequencies from OPUS OpenSubtitles 2016 - Add word frequencies from the MOKK Hungarian Webcorpus - Expand Google Books Ngrams data to cover 8 languages @@ -255,13 +272,11 @@ Nitty gritty dependency changes: - Another new frequency-merging strategy (drop the highest and lowest, average the rest) - ## Version 1.5.1 (2016-08-19) - Bug fix: Made it possible to load the Japanese or Korean dictionary when the other one is not available - ## Version 1.5.0 (2016-08-08) - Include word frequencies learned from the Common Crawl @@ -280,7 +295,6 @@ Nitty gritty dependency changes: [Announcement blog post](https://blog.conceptnet.io/2016/08/22/wordfreq-1-5-more-data-more-languages-more-accuracy) - ## Version 1.4 (2016-06-02) - Add large lists in English, German, Spanish, French, and Portuguese @@ -288,12 +302,10 @@ Nitty gritty dependency changes: [Announcement blog post](https://blog.conceptnet.io/2016/06/02/wordfreq-1-4-more-words-plus-word-frequencies-from-reddit/) - ## Version 1.3 (2016-01-14) - Add Reddit comments as an English source - ## Version 1.2 (2015-10-29) - Add SUBTLEX data @@ -307,14 +319,12 @@ Nitty gritty dependency changes: [Announcement blog post](https://blog.luminoso.com/2015/10/29/wordfreq-1-2-is-better-at-chinese-english-greek-polish-swedish-and-turkish/) - ## Version 1.1 (2015-08-25) - Use the 'regex' package to implement Unicode tokenization that's mostly consistent across languages - Use NFKC normalization in Japanese and Arabic - ## Version 1.0 (2015-07-28) - Create compact word frequency lists in English, Arabic, German, Spanish, @@ -322,4 +332,3 @@ Nitty gritty dependency changes: - Marginal support for Greek, Korean, Chinese - Fresh start, dropping compatibility 
with wordfreq 0.x and its unreasonably large downloads - diff --git a/README.md b/README.md index c916ede..8599188 100644 --- a/README.md +++ b/README.md @@ -3,7 +3,6 @@ languages, based on many sources of data. Author: Robyn Speer - ## Installation wordfreq requires Python 3 and depends on a few other Python modules @@ -19,7 +18,6 @@ or by getting the repository and running its setup.py: See [Additional CJK installation](#additional-cjk-installation) for extra steps that are necessary to get Chinese, Japanese, and Korean word frequencies. - ## Usage wordfreq provides access to estimates of the frequency with which a word is @@ -56,7 +54,6 @@ frequency as a decimal between 0 and 1. >>> word_frequency('café', 'fr') 5.75e-05 - `zipf_frequency` is a variation on `word_frequency` that aims to return the word frequency on a human-friendly logarithmic scale. The Zipf scale was proposed by Marc Brysbaert, who created the SUBTLEX lists. The Zipf frequency @@ -86,7 +83,6 @@ one occurrence per billion words. >>> zipf_frequency('zipf', 'en', wordlist='small') 0.0 - The parameters to `word_frequency` and `zipf_frequency` are: - `word`: a Unicode string containing the word to look up. Ideally the word @@ -103,7 +99,6 @@ The parameters to `word_frequency` and `zipf_frequency` are: value contained in the wordlist, to avoid a discontinuity where the wordlist ends. - ## Frequency bins wordfreq's wordlists are designed to load quickly and take up little space in @@ -120,12 +115,11 @@ Because the Zipf scale is a logarithmic scale, this preserves the same relative precision no matter how far down you are in the word list. The frequency of any word is precise to within 1%. -(This is not a claim about _accuracy_, but about _precision_. We believe that +(This is not a claim about *accuracy*, but about *precision*. 
We believe that the way we use multiple data sources and discard outliers makes wordfreq a more accurate measurement of the way these words are really used in written language, but it's unclear how one would measure this accuracy.) - ## The figure-skating metric We combine word frequencies from different sources in a way that's designed @@ -137,6 +131,68 @@ in Olympic figure skating: - Average the remaining frequencies. - Rescale the resulting frequency list to add up to 1. +## Numbers + +These wordlists would be enormous if they stored a separate frequency for every +number, such as if we separately stored the frequencies of 484977 and 484978 +and 98.371 and every other 6-character sequence that could be considered a number. + +Instead, we have a frequency-bin entry for every number of the same "shape", such +as `##` or `####` or `#.#####`, with `#` standing in for digits. (For compatibility +with earlier versions of wordfreq, our stand-in character is actually `0`.) This +is the same form of aggregation that the word2vec vocabulary does. + +Single-digit numbers are unaffected by this "binning" process; "0" through "9" have +their own entries in each language's wordlist. + +When asked for the frequency of a token containing multiple digits, we multiply +the frequency of that aggregated entry by a distribution estimating the frequency +of those digits. The distribution only looks at two things: + +- The value of the first digit +- Whether it is a 4-digit sequence that's likely to represent a year + +The first digits are assigned probabilities by Benford's law, and years are assigned +probabilities from a distribution that peaks at the "present". I explored this in +a Twitter thread at . + +The part of this distribution representing the "present" is not strictly a peak; +it's a 20-year-long plateau from 2019 to 2039. (2019 is the last time Google Books +Ngrams was updated, and 2039 is a time by which I will probably have figured out +a new distribution.) 
+ +Some examples: + + >>> word_frequency("2022", "en") + 5.15e-05 + >>> word_frequency("1922", "en") + 8.19e-06 + >>> word_frequency("1022", "en") + 1.28e-07 + +Aside from years, the distribution does **not** care about the meaning of the numbers: + + >>> word_frequency("90210", "en") + 3.34e-10 + >>> word_frequency("92222", "en") + 3.34e-10 + >>> word_frequency("802.11n", "en") + 9.04e-13 + >>> word_frequency("899.19n", "en") + 9.04e-13 + +The digit rule applies to other systems of digits, and only cares about the numeric +value of the digits: + + >>> word_frequency("٥٤", "ar") + 6.64e-05 + >>> word_frequency("54", "ar") + 6.64e-05 + +It doesn't know which language uses which writing system for digits: + + >>> word_frequency("٥٤", "en") + 5.4e-05 ## Sources and supported languages @@ -227,7 +283,6 @@ Some languages provide 'large' wordlists, including words with a Zipf frequency between 1.0 and 3.0. These are available in 14 languages that are covered by enough data sources. - ## Other functions `tokenize(text, lang)` splits text in the given language into words, in the same @@ -273,7 +328,6 @@ ASCII. But maybe you should just use [xkpa][]. [xkcd936]: https://xkcd.com/936/ [xkpa]: https://github.com/beala/xkcd-password - ## Tokenization wordfreq uses the Python package `regex`, which is a more advanced @@ -335,7 +389,6 @@ their frequency: >>> zipf_frequency('owl-flavored', 'en') 3.3 - ## Multi-script languages Two of the languages we support, Serbian and Chinese, are written in multiple @@ -358,7 +411,6 @@ Enumerating the Chinese wordlist will produce some unfamiliar words, because people don't actually write in Oversimplified Chinese, and because in practice Traditional and Simplified Chinese also have different word usage. - ## Similar, overlapping, and varying languages As much as we would like to give each language its own distinct code and its @@ -384,7 +436,6 @@ module to find the best match for a language code. 
If you ask for word frequencies in `cmn-Hans` (the fully specific language code for Mandarin in Simplified Chinese), you will get the `zh` wordlist, for example. - ## Additional CJK installation Chinese, Japanese, and Korean have additional external dependencies so that @@ -399,17 +450,16 @@ and `mecab-ko-dic`. As of version 2.4.2, you no longer have to install dictionaries separately. - ## License `wordfreq` is freely redistributable under the MIT license (see `MIT-LICENSE.txt`), and it includes data files that may be redistributed under a Creative Commons Attribution-ShareAlike 4.0 -license (https://creativecommons.org/licenses/by-sa/4.0/). +license (<https://creativecommons.org/licenses/by-sa/4.0/>). `wordfreq` contains data extracted from Google Books Ngrams -(http://books.google.com/ngrams) and Google Books Syntactic Ngrams -(http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html). +(<http://books.google.com/ngrams>) and Google Books Syntactic Ngrams +(<http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html>). The terms of use of this data are: Ngram Viewer graphs and data may be freely used for any purpose, although @@ -420,21 +470,21 @@ The terms of use of this data are: sources: - The Leeds Internet Corpus, from the University of Leeds Centre for Translation - Studies (http://corpus.leeds.ac.uk/list.html) + Studies (<http://corpus.leeds.ac.uk/list.html>) -- Wikipedia, the free encyclopedia (http://www.wikipedia.org) +- Wikipedia, the free encyclopedia (<http://www.wikipedia.org>) -- ParaCrawl, a multilingual Web crawl (https://paracrawl.eu) +- ParaCrawl, a multilingual Web crawl (<https://paracrawl.eu>) It contains data from OPUS OpenSubtitles 2018 -(http://opus.nlpl.eu/OpenSubtitles.php), whose data originates from the -OpenSubtitles project (http://www.opensubtitles.org/) and may be used with +(<http://opus.nlpl.eu/OpenSubtitles.php>), whose data originates from the +OpenSubtitles project (<http://www.opensubtitles.org/>) and may be used with attribution to OpenSubtitles. It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK, SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al. (see citations below) and available at -http://crr.ugent.be/programs-data/subtitle-frequencies. +<http://crr.ugent.be/programs-data/subtitle-frequencies>. 
I (Robyn Speer) have obtained permission by e-mail from Marc Brysbaert to distribute these wordlists in wordfreq, to be used for any purpose, not just @@ -450,7 +500,6 @@ streaming Twitter API, in accordance with Twitter's Developer Agreement & Policy. This software gives statistics about words that are commonly used on Twitter; it does not display or republish any Twitter content. - ## Citing wordfreq If you use wordfreq in your research, please cite it! We publish the code @@ -459,8 +508,7 @@ citation is: > Robyn Speer, Joshua Chin, Andrew Lin, Sara Jewett, & Lance Nathan. > (2018, October 3). LuminosoInsight/wordfreq: v2.2. Zenodo. -> https://doi.org/10.5281/zenodo.1443582 - +> <https://doi.org/10.5281/zenodo.1443582> The same citation in BibTex format: @@ -479,20 +527,19 @@ The same citation in BibTex format: } ``` - ## Citations to work that wordfreq is built on - Bojar, O., Chatterjee, R., Federmann, C., Haddow, B., Huck, M., Hokamp, C., Koehn, P., Logacheva, V., Monz, C., Negri, M., Post, M., Scarton, C., Specia, L., & Turchi, M. (2015). Findings of the 2015 Workshop on Statistical Machine Translation. - http://www.statmt.org/wmt15/results.html + <http://www.statmt.org/wmt15/results.html> - Brysbaert, M. & New, B. (2009). Moving beyond Kucera and Francis: A Critical Evaluation of Current Word Frequency Norms and the Introduction of a New and Improved Word Frequency Measure for American English. Behavior Research Methods, 41 (4), 977-990. - http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf + <http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf> - Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A. (2011). The word frequency effect: A review of recent developments and @@ -501,45 +548,45 @@ The same citation in BibTex format: - Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PLoS One, 5(6), e10729. - http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729 + <http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729> - Davis, M. (2012). Unicode text segmentation. Unicode Standard Annex, 29. 
- http://unicode.org/reports/tr29/ + <http://unicode.org/reports/tr29/> - Halácsy, P., Kornai, A., Németh, L., Rung, A., Szakadát, I., & Trón, V. (2004). Creating open language resources for Hungarian. In Proceedings of the 4th international conference on Language Resources and Evaluation (LREC2004). - http://mokk.bme.hu/resources/webcorpus/ + <http://mokk.bme.hu/resources/webcorpus/> - Keuleers, E., Brysbaert, M. & New, B. (2010). SUBTLEX-NL: A new frequency measure for Dutch words based on film subtitles. Behavior Research Methods, 42(3), 643-650. - http://crr.ugent.be/papers/SUBTLEX-NL_BRM.pdf + <http://crr.ugent.be/papers/SUBTLEX-NL_BRM.pdf> - Kudo, T. (2005). Mecab: Yet another part-of-speech and morphological analyzer. - http://mecab.sourceforge.net/ + <http://mecab.sourceforge.net/> - Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W., and Petrov, S. (2012). Syntactic annotations for the Google Books Ngram Corpus. Proceedings of the ACL 2012 system demonstrations, 169-174. - http://aclweb.org/anthology/P12-3029 + <http://aclweb.org/anthology/P12-3029> - Lison, P. and Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). - http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf + <http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf> - Ortiz Suárez, P. J., Sagot, B., and Romary, L. (2019). Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. In Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019. - https://oscar-corpus.com/publication/2019/clmc7/asynchronous/ + <https://oscar-corpus.com/publication/2019/clmc7/asynchronous/> - ParaCrawl (2018). Provision of Web-Scale Parallel Corpora for Official - European Languages. https://paracrawl.eu/ + European Languages. <https://paracrawl.eu/> - van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190. 
- http://www.tandfonline.com/doi/pdf/10.1080/17470218.2013.850521 + <http://www.tandfonline.com/doi/pdf/10.1080/17470218.2013.850521> diff --git a/pyproject.toml b/pyproject.toml index 1ae9bde..b83d9ac 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,6 +1,6 @@ [tool.poetry] name = "wordfreq" -version = "2.6.0" +version = "3.0.0" description = "Look up the frequencies of words in many languages, based on many sources of data." authors = ["Robyn Speer <rspeer@arborelia.net>"] license = "MIT" diff --git a/setup.py b/setup.py index 539b4ae..9f3c24b 100755 --- a/setup.py +++ b/setup.py @@ -33,7 +33,7 @@ dependencies = [ setup( name="wordfreq", - version='2.6.0', + version='3.0.0', maintainer='Robyn Speer', maintainer_email='rspeer@arborelia.net', url='http://github.com/rspeer/wordfreq/', diff --git a/tests/test_numbers.py b/tests/test_numbers.py index 6b106ef..339fbc8 100644 --- a/tests/test_numbers.py +++ b/tests/test_numbers.py @@ -16,6 +16,11 @@ def test_decimals(): assert word_frequency("3,14", "de") == word_frequency("3,15", "de") +def test_eastern_arabic(): + assert word_frequency("٥٤", "ar") == word_frequency("٥٣", "ar") + assert word_frequency("٤٣", "ar") > word_frequency("٥٤", "ar") + + def test_year_distribution(): assert word_frequency("2010", "en") > word_frequency("1010", "en") assert word_frequency("2010", "en") > word_frequency("3010", "en")
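
Note on the digit distribution exercised by `tests/test_numbers.py`: the README text in this diff says the frequency of a "shape" bin is multiplied by a distribution that looks at the first digit (Benford's law) and at whether the token looks like a year. The sketch below is only an illustration of the Benford part, not wordfreq's actual code. The names `benford_probability` and `estimate_number_frequency` are hypothetical, the year distribution is left out entirely, and treating every non-leading digit as uniform is an assumption made here to keep the example small.

```python
import math


def benford_probability(first_digit: int) -> float:
    """Benford's law: P(leading digit = d) = log10(1 + 1/d) for d in 1..9."""
    if not 1 <= first_digit <= 9:
        raise ValueError("Benford's law covers leading digits 1 through 9")
    return math.log10(1 + 1 / first_digit)


def estimate_number_frequency(token: str, bin_frequency: float) -> float:
    """Share of a shape bin's frequency given to one token (hypothetical helper).

    `bin_frequency` stands for the aggregated frequency of every token with
    the same digit shape, e.g. all tokens matching "00".
    """
    # int() accepts any Unicode decimal digit, so "٥٤" yields [5, 4].
    digits = [int(ch) for ch in token if ch.isdecimal()]
    if len(digits) < 2:
        raise ValueError("expected a token with at least two digits")
    # Assumption: the leading digit follows Benford's law and each remaining
    # digit is uniform, contributing a factor of 1/10.
    return bin_frequency * benford_probability(digits[0]) / 10 ** (len(digits) - 1)
```

Under these assumptions the sketch reproduces the relationships the new test asserts: "٥٤" and "٥٣" come out equal because they share a leading digit, and "٤٣" comes out higher than "٥٤" because a leading 4 is more probable than a leading 5. Tokens with a leading zero are rejected here; how wordfreq itself treats them is not stated in this diff.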