update version and documentation

This commit is contained in:
Elia Robyn Lake 2022-03-10 19:12:45 -05:00
parent bf05b1b1dc
commit ed7dccbf8b
5 changed files with 119 additions and 58 deletions


@ -1,9 +1,34 @@
# Changelog
## Version 3.0 (2022-03-10)
This is the "handle numbers better" release.
Previously, wordfreq would group all digit sequences of the same 'shape',
with length 2 or more, into a single token and return the frequency of that
token, which would be a vast overestimate.
Now it distributes the frequency over all numbers of that shape, with an
estimated distribution that allows for Benford's law (lower numbers are more
frequent) and a special frequency distribution for 4-digit numbers that look
like years (2010 is more frequent than 1020).
Relatedly:
- Functions such as `iter_wordlist` and `top_n_list` no longer return
multi-digit numbers (they used to return them in their "smashed" form, such
as "0000").
- `lossy_tokenize` no longer replaces digit sequences with 0s. That happens
instead in a place that's internal to the `word_frequency` function, so we can
look at the values of the digits before they're replaced.
## Version 2.5.1 (2021-09-02)
- Import ftfy and use its `uncurl_quotes` method to turn curly quotes into
straight ones, providing consistency with multiple forms of apostrophes.
- Set minimum version requirements on `regex`, `jieba`, and `langcodes`
so that tokenization will give consistent results.
- Work around an inconsistency in the `msgpack` API around
@ -83,7 +108,6 @@ Library changes:
- Fixed calling `msgpack.load` with a deprecated parameter.
## Version 2.2 (2018-07-24)
Library change:
@ -104,7 +128,6 @@ Data changes:
- The input data includes the change to tokenization described above, giving
us word frequencies for words such as "l@s".
## Version 2.1 (2018-06-18)
Data changes:
@ -125,7 +148,6 @@ Library changes:
in `/usr/lib/x86_64-linux-gnu/mecab`, which is where Ubuntu 18.04 puts them
when they are installed from source.
## Version 2.0.1 (2018-05-01)
Fixed edge cases that inserted spurious token boundaries when Japanese text is
@ -148,8 +170,6 @@ use the iteration mark 々.
This change does not affect any word frequencies. (The Japanese word list uses
`wordfreq.mecab` for tokenization, not `simple_tokenize`.)
## Version 2.0 (2018-03-14)
The big change in this version is that text preprocessing, tokenization, and
@ -212,7 +232,6 @@ Nitty gritty dependency changes:
[exquisite-corpus]: https://github.com/LuminosoInsight/exquisite-corpus
## Version 1.7.0 (2017-08-25)
- Tokenization will always keep Unicode graphemes together, including
@ -223,7 +242,6 @@ Nitty gritty dependency changes:
- Support Bengali and Macedonian, which passed the threshold of having enough
source data to be included
## Version 1.6.1 (2017-05-10)
- Depend on langcodes 1.4, with a new language-matching system that does not
@ -232,13 +250,12 @@ Nitty gritty dependency changes:
This prevents silly conflicts where langcodes' SQLite connection was
preventing langcodes from being used in threads.
## Version 1.6.0 (2017-01-05)
- Support Czech, Persian, Ukrainian, and Croatian/Bosnian/Serbian
- Add large lists in Chinese, Finnish, Japanese, and Polish
- Data is now collected and built using Exquisite Corpus
(<https://github.com/LuminosoInsight/exquisite-corpus>)
- Add word frequencies from OPUS OpenSubtitles 2016
- Add word frequencies from the MOKK Hungarian Webcorpus
- Expand Google Books Ngrams data to cover 8 languages
@ -255,13 +272,11 @@ Nitty gritty dependency changes:
- Another new frequency-merging strategy (drop the highest and lowest,
average the rest)
## Version 1.5.1 (2016-08-19)
- Bug fix: Made it possible to load the Japanese or Korean dictionary when the
other one is not available
## Version 1.5.0 (2016-08-08)
- Include word frequencies learned from the Common Crawl
@ -280,7 +295,6 @@ Nitty gritty dependency changes:
[Announcement blog post](https://blog.conceptnet.io/2016/08/22/wordfreq-1-5-more-data-more-languages-more-accuracy)
## Version 1.4 (2016-06-02)
- Add large lists in English, German, Spanish, French, and Portuguese
@ -288,12 +302,10 @@ Nitty gritty dependency changes:
[Announcement blog post](https://blog.conceptnet.io/2016/06/02/wordfreq-1-4-more-words-plus-word-frequencies-from-reddit/)
## Version 1.3 (2016-01-14)
- Add Reddit comments as an English source
## Version 1.2 (2015-10-29)
- Add SUBTLEX data
@ -307,14 +319,12 @@ Nitty gritty dependency changes:
[Announcement blog post](https://blog.luminoso.com/2015/10/29/wordfreq-1-2-is-better-at-chinese-english-greek-polish-swedish-and-turkish/)
## Version 1.1 (2015-08-25)
- Use the 'regex' package to implement Unicode tokenization that's mostly
consistent across languages
- Use NFKC normalization in Japanese and Arabic
## Version 1.0 (2015-07-28)
- Create compact word frequency lists in English, Arabic, German, Spanish,
@ -322,4 +332,3 @@ Nitty gritty dependency changes:
- Marginal support for Greek, Korean, Chinese
- Fresh start, dropping compatibility with wordfreq 0.x and its unreasonably
large downloads

README.md

@ -3,7 +3,6 @@ languages, based on many sources of data.
Author: Robyn Speer
## Installation
wordfreq requires Python 3 and depends on a few other Python modules
@ -19,7 +18,6 @@ or by getting the repository and running its setup.py:
See [Additional CJK installation](#additional-cjk-installation) for extra
steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
## Usage
wordfreq provides access to estimates of the frequency with which a word is
@ -56,7 +54,6 @@ frequency as a decimal between 0 and 1.
>>> word_frequency('café', 'fr')
5.75e-05
`zipf_frequency` is a variation on `word_frequency` that aims to return the
word frequency on a human-friendly logarithmic scale. The Zipf scale was
proposed by Marc Brysbaert, who created the SUBTLEX lists. The Zipf frequency
@ -86,7 +83,6 @@ one occurrence per billion words.
>>> zipf_frequency('zipf', 'en', wordlist='small')
0.0
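The conversion behind this scale is simple enough to sketch: a Zipf value is the base-10 logarithm of the word's frequency per billion words. The helper below shows the arithmetic (`zipf_from_frequency` is a name invented here for illustration, not part of wordfreq's API):

```python
import math

def zipf_from_frequency(freq):
    # Zipf value = log10 of the frequency per billion words,
    # rounded to two decimal places the way wordfreq reports it.
    # A frequency of 0 (word not found) maps to a Zipf value of 0.0.
    if freq <= 0:
        return 0.0
    return round(math.log10(freq * 1e9), 2)
```

For example, the frequency 5.75e-05 shown earlier for "café" corresponds to a Zipf value of about 4.76.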
The parameters to `word_frequency` and `zipf_frequency` are:
- `word`: a Unicode string containing the word to look up. Ideally the word
@ -103,7 +99,6 @@ The parameters to `word_frequency` and `zipf_frequency` are:
value contained in the wordlist, to avoid a discontinuity where the wordlist
ends.
## Frequency bins
wordfreq's wordlists are designed to load quickly and take up little space in
@ -120,12 +115,11 @@ Because the Zipf scale is a logarithmic scale, this preserves the same relative
precision no matter how far down you are in the word list. The frequency of any
word is precise to within 1%.
(This is not a claim about *accuracy*, but about *precision*. We believe that
the way we use multiple data sources and discard outliers makes wordfreq a
more accurate measurement of the way these words are really used in written
language, but it's unclear how one would measure this accuracy.)
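As a sketch of what this precision claim means: rounding a frequency to a 2-decimal Zipf value and converting back changes it by roughly 1% at most, regardless of how rare the word is. (`bin_frequency` is an illustrative name, not wordfreq's internal function.)

```python
import math

def bin_frequency(freq):
    # Round the frequency to a Zipf value with 2 decimal places,
    # then convert back. Because the Zipf scale is logarithmic, the
    # relative error is the same anywhere in the word list.
    zipf = round(math.log10(freq * 1e9), 2)
    return 10 ** zipf / 1e9
```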
## The figure-skating metric
We combine word frequencies from different sources in a way that's designed
@ -137,6 +131,68 @@ in Olympic figure skating:
- Average the remaining frequencies.
- Rescale the resulting frequency list to add up to 1.
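A minimal sketch of the combination step, assuming each source's estimate has already been put in a comparable unit (`combine_frequencies` is an illustrative name, not wordfreq's API):

```python
def combine_frequencies(estimates):
    # Figure-skating-style combination: drop the single highest and
    # lowest estimates, then average what remains. The final
    # rescale-to-sum-1 step applies to the whole wordlist, not to
    # one word's estimates, so it is omitted here.
    if len(estimates) <= 2:
        return sum(estimates) / len(estimates)
    trimmed = sorted(estimates)[1:-1]
    return sum(trimmed) / len(trimmed)
```

Dropping the extremes keeps a single outlier source (say, one corpus where a word is bizarrely overrepresented) from skewing the result.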
## Numbers
These wordlists would be enormous if they stored a separate frequency for every
number: imagine separately storing the frequencies of 484977 and 484978
and 98.371 and every other 6-character sequence that could be considered a number.
Instead, we have a frequency-bin entry for every number of the same "shape", such
as `##` or `####` or `#.#####`, with `#` standing in for digits. (For compatibility
with earlier versions of wordfreq, our stand-in character is actually `0`.) This
is the same form of aggregation that the word2vec vocabulary uses.
Single-digit numbers are unaffected by this "binning" process; "0" through "9" have
their own entries in each language's wordlist.
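The "smashing" step can be sketched in a few lines. This is a simplification, not wordfreq's exact implementation, which also handles digit sequences containing separators such as `.` and `,` as a unit:

```python
import re

MULTI_DIGIT_RE = re.compile(r"\d\d+")

def smash_numbers(token):
    # Replace each run of 2 or more digits with 0s of the same length,
    # e.g. "484977" -> "000000". Single digits like "3" are untouched.
    return MULTI_DIGIT_RE.sub(lambda m: "0" * len(m.group()), token)
```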
When asked for the frequency of a token containing multiple digits, we multiply
the frequency of that aggregated entry by a distribution estimating the frequency
of those digits. The distribution only looks at two things:
- The value of the first digit
- Whether it is a 4-digit sequence that's likely to represent a year
The first digits are assigned probabilities by Benford's law, and years are assigned
probabilities from a distribution that peaks at the "present". I explored this in
a Twitter thread at <https://twitter.com/r_speer/status/1493715982887571456>.
The part of this distribution representing the "present" is not strictly a peak;
it's a 20-year-long plateau from 2019 to 2039. (2019 is the last time Google Books
Ngrams was updated, and 2039 is a time by which I will probably have figured out
a new distribution.)
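The Benford part of the distribution is easy to state: the probability that a number's first digit is *d* is log10(1 + 1/*d*), which makes a leading 1 about 6.5 times as likely as a leading 9. A sketch of just that part, with the year handling described above omitted:

```python
import math

def benford_first_digit_prob(d):
    # Benford's law: P(first digit = d) = log10(1 + 1/d), for d in 1..9.
    # These probabilities sum to exactly 1 over the digits 1 through 9.
    return math.log10(1 + 1 / d)
```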
Some examples:
>>> word_frequency("2022", "en")
5.15e-05
>>> word_frequency("1922", "en")
8.19e-06
>>> word_frequency("1022", "en")
1.28e-07
Aside from years, the distribution does **not** care about the meaning of the numbers:
>>> word_frequency("90210", "en")
3.34e-10
>>> word_frequency("92222", "en")
3.34e-10
>>> word_frequency("802.11n", "en")
9.04e-13
>>> word_frequency("899.19n", "en")
9.04e-13
The digit rule applies to other systems of digits, and only cares about the numeric
value of the digits:
>>> word_frequency("٥٤", "ar")
6.64e-05
>>> word_frequency("54", "ar")
6.64e-05
It doesn't know which language uses which writing system for digits:
>>> word_frequency("٥٤", "en")
5.4e-05
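The "numeric value" comparison is possible because Unicode assigns decimal values to digits in every script; Python's `unicodedata` module exposes them, as this sketch shows (`ascii_digits` is a name made up here for illustration, not a wordfreq function):

```python
import unicodedata

def ascii_digits(token):
    # Map decimal digits from any script (e.g. Eastern Arabic "٥٤")
    # to their ASCII equivalents, so "٥٤" and "54" compare equal.
    return "".join(
        str(unicodedata.digit(ch)) if ch.isdigit() else ch
        for ch in token
    )
```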
## Sources and supported languages
@ -227,7 +283,6 @@ Some languages provide 'large' wordlists, including words with a Zipf frequency
between 1.0 and 3.0. These are available in 14 languages that are covered by
enough data sources.
## Other functions
`tokenize(text, lang)` splits text in the given language into words, in the same
@ -273,7 +328,6 @@ ASCII. But maybe you should just use [xkpa][].
[xkcd936]: https://xkcd.com/936/
[xkpa]: https://github.com/beala/xkcd-password
## Tokenization
wordfreq uses the Python package `regex`, which is a more advanced
@ -335,7 +389,6 @@ their frequency:
>>> zipf_frequency('owl-flavored', 'en')
3.3
## Multi-script languages
Two of the languages we support, Serbian and Chinese, are written in multiple
@ -358,7 +411,6 @@ Enumerating the Chinese wordlist will produce some unfamiliar words, because
people don't actually write in Oversimplified Chinese, and because in
practice Traditional and Simplified Chinese also have different word usage.
## Similar, overlapping, and varying languages
As much as we would like to give each language its own distinct code and its
@ -384,7 +436,6 @@ module to find the best match for a language code. If you ask for word
frequencies in `cmn-Hans` (the fully specific language code for Mandarin in
Simplified Chinese), you will get the `zh` wordlist, for example.
## Additional CJK installation
Chinese, Japanese, and Korean have additional external dependencies so that
@ -399,17 +450,16 @@ and `mecab-ko-dic`.
As of version 2.4.2, you no longer have to install dictionaries separately.
## License
`wordfreq` is freely redistributable under the MIT license (see
`MIT-LICENSE.txt`), and it includes data files that may be
redistributed under a Creative Commons Attribution-ShareAlike 4.0
license (<https://creativecommons.org/licenses/by-sa/4.0/>).
`wordfreq` contains data extracted from Google Books Ngrams
(<http://books.google.com/ngrams>) and Google Books Syntactic Ngrams
(<http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html>).
The terms of use of this data are:
Ngram Viewer graphs and data may be freely used for any purpose, although
@ -420,21 +470,21 @@ The terms of use of this data are:
sources:
- The Leeds Internet Corpus, from the University of Leeds Centre for Translation
Studies (<http://corpus.leeds.ac.uk/list.html>)
- Wikipedia, the free encyclopedia (<http://www.wikipedia.org>)
- ParaCrawl, a multilingual Web crawl (<https://paracrawl.eu>)
It contains data from OPUS OpenSubtitles 2018
(<http://opus.nlpl.eu/OpenSubtitles.php>), whose data originates from the
OpenSubtitles project (<http://www.opensubtitles.org/>) and may be used with
attribution to OpenSubtitles.
It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK,
SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al.
(see citations below) and available at
<http://crr.ugent.be/programs-data/subtitle-frequencies>.
I (Robyn Speer) have obtained permission by e-mail from Marc Brysbaert to
distribute these wordlists in wordfreq, to be used for any purpose, not just
@ -450,7 +500,6 @@ streaming Twitter API, in accordance with Twitter's Developer Agreement &
Policy. This software gives statistics about words that are commonly used on
Twitter; it does not display or republish any Twitter content.
## Citing wordfreq
If you use wordfreq in your research, please cite it! We publish the code
@ -459,8 +508,7 @@ citation is:
> Robyn Speer, Joshua Chin, Andrew Lin, Sara Jewett, & Lance Nathan.
> (2018, October 3). LuminosoInsight/wordfreq: v2.2. Zenodo.
> <https://doi.org/10.5281/zenodo.1443582>
The same citation in BibTex format:
@ -479,20 +527,19 @@ The same citation in BibTex format:
}
```
## Citations to work that wordfreq is built on
- Bojar, O., Chatterjee, R., Federmann, C., Haddow, B., Huck, M., Hokamp, C.,
Koehn, P., Logacheva, V., Monz, C., Negri, M., Post, M., Scarton, C.,
Specia, L., & Turchi, M. (2015). Findings of the 2015 Workshop on Statistical
Machine Translation.
<http://www.statmt.org/wmt15/results.html>
- Brysbaert, M. & New, B. (2009). Moving beyond Kucera and Francis: A Critical
Evaluation of Current Word Frequency Norms and the Introduction of a New and
Improved Word Frequency Measure for American English. Behavior Research
Methods, 41 (4), 977-990.
<http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf>
- Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A.
(2011). The word frequency effect: A review of recent developments and
@ -501,45 +548,45 @@ The same citation in BibTex format:
- Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character
frequencies based on film subtitles. PLoS One, 5(6), e10729.
<http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729>
- Davis, M. (2012). Unicode text segmentation. Unicode Standard Annex, 29.
<http://unicode.org/reports/tr29/>
- Halácsy, P., Kornai, A., Németh, L., Rung, A., Szakadát, I., & Trón, V.
(2004). Creating open language resources for Hungarian. In Proceedings of the
4th international conference on Language Resources and Evaluation (LREC2004).
<http://mokk.bme.hu/resources/webcorpus/>
- Keuleers, E., Brysbaert, M. & New, B. (2010). SUBTLEX-NL: A new frequency
measure for Dutch words based on film subtitles. Behavior Research Methods,
42(3), 643-650.
<http://crr.ugent.be/papers/SUBTLEX-NL_BRM.pdf>
- Kudo, T. (2005). Mecab: Yet another part-of-speech and morphological
analyzer.
<http://mecab.sourceforge.net/>
- Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W., and Petrov,
S. (2012). Syntactic annotations for the Google Books Ngram Corpus.
Proceedings of the ACL 2012 system demonstrations, 169-174.
<http://aclweb.org/anthology/P12-3029>
- Lison, P. and Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large
Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th
International Conference on Language Resources and Evaluation (LREC 2016).
<http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf>
- Ortiz Suárez, P. J., Sagot, B., and Romary, L. (2019). Asynchronous pipelines
for processing huge corpora on medium to low resource infrastructures. In
Proceedings of the Workshop on Challenges in the Management of Large Corpora
(CMLC-7) 2019.
<https://oscar-corpus.com/publication/2019/clmc7/asynchronous/>
- ParaCrawl (2018). Provision of Web-Scale Parallel Corpora for Official
European Languages. <https://paracrawl.eu/>
- van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014).
SUBTLEX-UK: A new and improved word frequency database for British English.
The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.
<http://www.tandfonline.com/doi/pdf/10.1080/17470218.2013.850521>


@ -1,6 +1,6 @@
[tool.poetry]
name = "wordfreq"
version = "3.0.0"
description = "Look up the frequencies of words in many languages, based on many sources of data."
authors = ["Robyn Speer <rspeer@arborelia.net>"]
license = "MIT"


@ -33,7 +33,7 @@ dependencies = [
setup(
    name="wordfreq",
    version='3.0.0',
    maintainer='Robyn Speer',
    maintainer_email='rspeer@arborelia.net',
    url='http://github.com/rspeer/wordfreq/',


@ -16,6 +16,11 @@ def test_decimals():
    assert word_frequency("3,14", "de") == word_frequency("3,15", "de")

def test_eastern_arabic():
    assert word_frequency("٥٤", "ar") == word_frequency("٥٣", "ar")
    assert word_frequency("٤٣", "ar") > word_frequency("٥٤", "ar")

def test_year_distribution():
    assert word_frequency("2010", "en") > word_frequency("1010", "en")
    assert word_frequency("2010", "en") > word_frequency("3010", "en")