update version and documentation

Elia Robyn Lake 2022-03-10 19:12:45 -05:00
parent bf05b1b1dc
commit ed7dccbf8b
5 changed files with 119 additions and 58 deletions

CHANGELOG.md

@@ -1,9 +1,34 @@
# Changelog
## Version 3.0 (2022-03-10)
This is the "handle numbers better" release.
Previously, wordfreq would group all digit sequences of the same 'shape',
with length 2 or more, into a single token and return the frequency of that
token, which would be a vast overestimate.
Now it distributes the frequency over all numbers of that shape, with an
estimated distribution that allows for Benford's law (lower numbers are more
frequent) and a special frequency distribution for 4-digit numbers that look
like years (2010 is more frequent than 1020).
Relatedly:
- Functions such as `iter_wordlist` and `top_n_list` no longer return
multi-digit numbers (they used to return them in their "smashed" form, such
as "0000").
- `lossy_tokenize` no longer replaces digit sequences with 0s. That happens
instead in a place that's internal to the `word_frequency` function, so we can
look at the values of the digits before they're replaced.
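For example, the new estimates make 2010 far more frequent than 1010, mirroring the tests added at the bottom of this commit:

```python
>>> from wordfreq import word_frequency
>>> word_frequency("2010", "en") > word_frequency("1010", "en")
True
```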

## Version 2.5.1 (2021-09-02)
- Import ftfy and use its `uncurl_quotes` method to turn curly quotes into
  straight ones, providing consistency with multiple forms of apostrophes.
- Set minimum version requirements on `regex`, `jieba`, and `langcodes`
  so that tokenization will give consistent results.
- Work around an inconsistency in the `msgpack` API around
@@ -83,7 +108,6 @@ Library changes:
- Fixed calling `msgpack.load` with a deprecated parameter.

## Version 2.2 (2018-07-24)
Library change:
@@ -104,7 +128,6 @@ Data changes:
- The input data includes the change to tokenization described above, giving
  us word frequencies for words such as "l@s".

## Version 2.1 (2018-06-18)
Data changes:
@@ -125,7 +148,6 @@ Library changes:
  in `/usr/lib/x86_64-linux-gnu/mecab`, which is where Ubuntu 18.04 puts them
  when they are installed from source.

## Version 2.0.1 (2018-05-01)
Fixed edge cases that inserted spurious token boundaries when Japanese text is
@@ -148,8 +170,6 @@ use the iteration mark 々.
This change does not affect any word frequencies. (The Japanese word list uses
`wordfreq.mecab` for tokenization, not `simple_tokenize`.)

## Version 2.0 (2018-03-14)
The big change in this version is that text preprocessing, tokenization, and
@@ -212,7 +232,6 @@ Nitty gritty dependency changes:
[exquisite-corpus]: https://github.com/LuminosoInsight/exquisite-corpus

## Version 1.7.0 (2017-08-25)
- Tokenization will always keep Unicode graphemes together, including
@@ -223,7 +242,6 @@ Nitty gritty dependency changes:
- Support Bengali and Macedonian, which passed the threshold of having enough
  source data to be included

## Version 1.6.1 (2017-05-10)
- Depend on langcodes 1.4, with a new language-matching system that does not
@@ -232,13 +250,12 @@ Nitty gritty dependency changes:
  This prevents silly conflicts where langcodes' SQLite connection was
  preventing langcodes from being used in threads.

## Version 1.6.0 (2017-01-05)
- Support Czech, Persian, Ukrainian, and Croatian/Bosnian/Serbian
- Add large lists in Chinese, Finnish, Japanese, and Polish
- Data is now collected and built using Exquisite Corpus
  (<https://github.com/LuminosoInsight/exquisite-corpus>)
- Add word frequencies from OPUS OpenSubtitles 2016
- Add word frequencies from the MOKK Hungarian Webcorpus
- Expand Google Books Ngrams data to cover 8 languages
@@ -255,13 +272,11 @@ Nitty gritty dependency changes:
- Another new frequency-merging strategy (drop the highest and lowest,
  average the rest)

## Version 1.5.1 (2016-08-19)
- Bug fix: Made it possible to load the Japanese or Korean dictionary when the
  other one is not available

## Version 1.5.0 (2016-08-08)
- Include word frequencies learned from the Common Crawl
@@ -280,7 +295,6 @@ Nitty gritty dependency changes:
[Announcement blog post](https://blog.conceptnet.io/2016/08/22/wordfreq-1-5-more-data-more-languages-more-accuracy)

## Version 1.4 (2016-06-02)
- Add large lists in English, German, Spanish, French, and Portuguese
@@ -288,12 +302,10 @@ Nitty gritty dependency changes:
[Announcement blog post](https://blog.conceptnet.io/2016/06/02/wordfreq-1-4-more-words-plus-word-frequencies-from-reddit/)

## Version 1.3 (2016-01-14)
- Add Reddit comments as an English source

## Version 1.2 (2015-10-29)
- Add SUBTLEX data
@@ -307,14 +319,12 @@ Nitty gritty dependency changes:
[Announcement blog post](https://blog.luminoso.com/2015/10/29/wordfreq-1-2-is-better-at-chinese-english-greek-polish-swedish-and-turkish/)

## Version 1.1 (2015-08-25)
- Use the 'regex' package to implement Unicode tokenization that's mostly
  consistent across languages
- Use NFKC normalization in Japanese and Arabic

## Version 1.0 (2015-07-28)
- Create compact word frequency lists in English, Arabic, German, Spanish,
@@ -322,4 +332,3 @@ Nitty gritty dependency changes:
- Marginal support for Greek, Korean, Chinese
- Fresh start, dropping compatibility with wordfreq 0.x and its unreasonably
  large downloads

README.md

@@ -3,7 +3,6 @@ languages, based on many sources of data.
Author: Robyn Speer

## Installation
wordfreq requires Python 3 and depends on a few other Python modules
@@ -19,7 +18,6 @@ or by getting the repository and running its setup.py:
See [Additional CJK installation](#additional-cjk-installation) for extra
steps that are necessary to get Chinese, Japanese, and Korean word frequencies.

## Usage
wordfreq provides access to estimates of the frequency with which a word is
@@ -56,7 +54,6 @@ frequency as a decimal between 0 and 1.
>>> word_frequency('café', 'fr')
5.75e-05

`zipf_frequency` is a variation on `word_frequency` that aims to return the
word frequency on a human-friendly logarithmic scale. The Zipf scale was
proposed by Marc Brysbaert, who created the SUBTLEX lists. The Zipf frequency
@@ -86,7 +83,6 @@ one occurrence per billion words.
>>> zipf_frequency('zipf', 'en', wordlist='small')
0.0

The parameters to `word_frequency` and `zipf_frequency` are:
- `word`: a Unicode string containing the word to look up. Ideally the word
@@ -103,7 +99,6 @@ The parameters to `word_frequency` and `zipf_frequency` are:
  value contained in the wordlist, to avoid a discontinuity where the wordlist
  ends.

## Frequency bins
wordfreq's wordlists are designed to load quickly and take up little space in
@@ -120,12 +115,11 @@ Because the Zipf scale is a logarithmic scale, this preserves the same relative
precision no matter how far down you are in the word list. The frequency of any
word is precise to within 1%.

(This is not a claim about *accuracy*, but about *precision*. We believe that
the way we use multiple data sources and discard outliers makes wordfreq a
more accurate measurement of the way these words are really used in written
language, but it's unclear how one would measure this accuracy.)
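To make the precision claim concrete, here is a minimal sketch of rounding on the Zipf scale. The 0.01 bin width is an assumption for illustration; the point is that a fixed step on a log scale gives fixed *relative* precision:

```python
import math

def zipf_bin(freq: float, step: float = 0.01) -> float:
    """Round a frequency to the nearest `step` on the Zipf scale
    (the base-10 log of frequency per billion words)."""
    zipf = math.log10(freq) + 9.0              # frequency -> Zipf scale
    return 10 ** (round(zipf / step) * step - 9.0)

# Worst case is half a step off: 10 ** 0.005 is about 1.0116, so every
# stored frequency stays within roughly 1% of its unrounded value.
```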

## The figure-skating metric
We combine word frequencies from different sources in a way that's designed
@@ -137,6 +131,68 @@ in Olympic figure skating:
- Average the remaining frequencies.
- Rescale the resulting frequency list to add up to 1.
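A minimal sketch of that merge for a single word's estimates (a hypothetical helper, not wordfreq's internal code):

```python
def merge_sources(freqs: list[float]) -> float:
    """Combine one word's frequency estimates from multiple sources,
    figure-skating style: drop the highest and lowest, average the rest."""
    if len(freqs) <= 2:
        # Assumption: with two or fewer sources, there is nothing to drop.
        return sum(freqs) / len(freqs)
    trimmed = sorted(freqs)[1:-1]
    return sum(trimmed) / len(trimmed)
```

The final rescaling so that everything adds up to 1 happens over the whole wordlist, not per word.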
## Numbers
These wordlists would be enormous if they stored a separate frequency for every
number: separate frequencies for 484977, for 484978, for 98.371, and for every
other 6-character sequence that could be considered a number.
Instead, we have a frequency-bin entry for every number of the same "shape", such
as `##` or `####` or `#.#####`, with `#` standing in for digits. (For compatibility
with earlier versions of wordfreq, our stand-in character is actually `0`.) This
is the same form of aggregation that the word2vec vocabulary uses.
Single-digit numbers are unaffected by this "binning" process; "0" through "9" have
their own entries in each language's wordlist.
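A sketch of this binning step, using `0` as the stand-in character as described above (a hypothetical helper; wordfreq's internal function may differ in its details):

```python
import re

def digit_shape(token: str) -> str:
    """Collapse the digits of a multi-digit token to the stand-in '0',
    so 484977, 484978, and every other six-digit number share the bin
    '000000', and 98.371 falls into the bin '00.000'."""
    if sum(ch.isdigit() for ch in token) >= 2:
        return re.sub(r"\d", "0", token)
    return token  # single digits keep their own entries
```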
When asked for the frequency of a token containing multiple digits, we multiply
the frequency of that aggregated entry by a distribution estimating the frequency
of those digits. The distribution only looks at two things:
- The value of the first digit
- Whether it is a 4-digit sequence that's likely to represent a year
The first digits are assigned probabilities by Benford's law, and years are assigned
probabilities from a distribution that peaks at the "present". I explored this in
a Twitter thread at <https://twitter.com/r_speer/status/1493715982887571456>.
The part of this distribution representing the "present" is not strictly a peak;
it's a 20-year-long plateau from 2019 to 2039. (2019 is the last time Google Books
Ngrams was updated, and 2039 is a time by which I will probably have figured out
a new distribution.)
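The two factors can be sketched like this. `benford_weight` is Benford's law itself; `year_weight` is a hypothetical rendering of the year distribution, with the flat 2019-2039 plateau described above and purely illustrative decay rates on either side (wordfreq's actual constants may differ):

```python
import math

def benford_weight(token: str) -> float:
    """Benford's law: a number's first digit is d with probability
    log10(1 + 1/d), making a leading 1 about 6.6x as likely as a leading 9."""
    d = int(token[0])
    if d == 0:
        return 0.0  # assumption: no weight on a leading zero
    return math.log10(1 + 1 / d)

def year_weight(year: int) -> float:
    """Hypothetical year distribution: flat on the 2019-2039 plateau,
    decaying into the past and decaying faster into the future."""
    if 2019 <= year <= 2039:
        return 1.0
    if year < 2019:
        return 0.5 ** ((2019 - year) / 30)  # illustrative half-life
    return 0.5 ** ((year - 2039) / 5)       # illustrative half-life
```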
Some examples:
>>> word_frequency("2022", "en")
5.15e-05
>>> word_frequency("1922", "en")
8.19e-06
>>> word_frequency("1022", "en")
1.28e-07
Aside from years, the distribution does **not** care about the meaning of the numbers:
>>> word_frequency("90210", "en")
3.34e-10
>>> word_frequency("92222", "en")
3.34e-10
>>> word_frequency("802.11n", "en")
9.04e-13
>>> word_frequency("899.19n", "en")
9.04e-13
The digit rule applies to other systems of digits, and only cares about the numeric
value of the digits:
>>> word_frequency("٥٤", "ar")
6.64e-05
>>> word_frequency("54", "ar")
6.64e-05
It doesn't know which language uses which writing system for digits:
>>> word_frequency("٥٤", "en")
5.4e-05
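One way to sketch that behavior: every Unicode decimal digit has a numeric value, and only that value matters for the lookup (a hypothetical helper; the real logic is internal to `word_frequency`):

```python
import unicodedata

def ascii_digits(token: str) -> str:
    """Map each Unicode digit to its ASCII equivalent, so the Eastern
    Arabic numeral string '٥٤' is treated exactly like '54'."""
    return "".join(
        str(unicodedata.digit(ch)) if ch.isdigit() else ch for ch in token
    )

print(ascii_digits("٥٤"))  # -> 54
```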

## Sources and supported languages
@@ -227,7 +283,6 @@ Some languages provide 'large' wordlists, including words with a Zipf frequency
between 1.0 and 3.0. These are available in 14 languages that are covered by
enough data sources.

## Other functions
`tokenize(text, lang)` splits text in the given language into words, in the same
@@ -273,7 +328,6 @@ ASCII. But maybe you should just use [xkpa][].
[xkcd936]: https://xkcd.com/936/
[xkpa]: https://github.com/beala/xkcd-password

## Tokenization
wordfreq uses the Python package `regex`, which is a more advanced
@@ -335,7 +389,6 @@ their frequency:
>>> zipf_frequency('owl-flavored', 'en')
3.3

## Multi-script languages
Two of the languages we support, Serbian and Chinese, are written in multiple
@@ -358,7 +411,6 @@ Enumerating the Chinese wordlist will produce some unfamiliar words, because
people don't actually write in Oversimplified Chinese, and because in
practice Traditional and Simplified Chinese also have different word usage.

## Similar, overlapping, and varying languages
As much as we would like to give each language its own distinct code and its
@@ -384,7 +436,6 @@ module to find the best match for a language code. If you ask for word
frequencies in `cmn-Hans` (the fully specific language code for Mandarin in
Simplified Chinese), you will get the `zh` wordlist, for example.

## Additional CJK installation
Chinese, Japanese, and Korean have additional external dependencies so that
@@ -399,17 +450,16 @@ and `mecab-ko-dic`.
As of version 2.4.2, you no longer have to install dictionaries separately.

## License
`wordfreq` is freely redistributable under the MIT license (see
`MIT-LICENSE.txt`), and it includes data files that may be
redistributed under a Creative Commons Attribution-ShareAlike 4.0
license (<https://creativecommons.org/licenses/by-sa/4.0/>).

`wordfreq` contains data extracted from Google Books Ngrams
(<http://books.google.com/ngrams>) and Google Books Syntactic Ngrams
(<http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html>).
The terms of use of this data are:

Ngram Viewer graphs and data may be freely used for any purpose, although
@@ -420,21 +470,21 @@ The terms of use of this data are:
sources:

- The Leeds Internet Corpus, from the University of Leeds Centre for Translation
  Studies (<http://corpus.leeds.ac.uk/list.html>)
- Wikipedia, the free encyclopedia (<http://www.wikipedia.org>)
- ParaCrawl, a multilingual Web crawl (<https://paracrawl.eu>)

It contains data from OPUS OpenSubtitles 2018
(<http://opus.nlpl.eu/OpenSubtitles.php>), whose data originates from the
OpenSubtitles project (<http://www.opensubtitles.org/>) and may be used with
attribution to OpenSubtitles.

It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK,
SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al.
(see citations below) and available at
<http://crr.ugent.be/programs-data/subtitle-frequencies>.

I (Robyn Speer) have obtained permission by e-mail from Marc Brysbaert to
distribute these wordlists in wordfreq, to be used for any purpose, not just
@@ -450,7 +500,6 @@ streaming Twitter API, in accordance with Twitter's Developer Agreement &
Policy. This software gives statistics about words that are commonly used on
Twitter; it does not display or republish any Twitter content.

## Citing wordfreq
If you use wordfreq in your research, please cite it! We publish the code
@@ -459,8 +508,7 @@ citation is:
> Robyn Speer, Joshua Chin, Andrew Lin, Sara Jewett, & Lance Nathan.
> (2018, October 3). LuminosoInsight/wordfreq: v2.2. Zenodo.
> <https://doi.org/10.5281/zenodo.1443582>
The same citation in BibTeX format:
@@ -479,20 +527,19 @@ The same citation in BibTeX format:
}
```

## Citations to work that wordfreq is built on
- Bojar, O., Chatterjee, R., Federmann, C., Haddow, B., Huck, M., Hokamp, C.,
  Koehn, P., Logacheva, V., Monz, C., Negri, M., Post, M., Scarton, C.,
  Specia, L., & Turchi, M. (2015). Findings of the 2015 Workshop on Statistical
  Machine Translation.
  <http://www.statmt.org/wmt15/results.html>

- Brysbaert, M. & New, B. (2009). Moving beyond Kucera and Francis: A Critical
  Evaluation of Current Word Frequency Norms and the Introduction of a New and
  Improved Word Frequency Measure for American English. Behavior Research
  Methods, 41 (4), 977-990.
  <http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf>

- Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A.
  (2011). The word frequency effect: A review of recent developments and
@@ -501,45 +548,45 @@ The same citation in BibTeX format:
- Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character
  frequencies based on film subtitles. PLoS One, 5(6), e10729.
  <http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729>

- Davis, M. (2012). Unicode text segmentation. Unicode Standard Annex, 29.
  <http://unicode.org/reports/tr29/>

- Halácsy, P., Kornai, A., Németh, L., Rung, A., Szakadát, I., & Trón, V.
  (2004). Creating open language resources for Hungarian. In Proceedings of the
  4th international conference on Language Resources and Evaluation (LREC2004).
  <http://mokk.bme.hu/resources/webcorpus/>

- Keuleers, E., Brysbaert, M. & New, B. (2010). SUBTLEX-NL: A new frequency
  measure for Dutch words based on film subtitles. Behavior Research Methods,
  42(3), 643-650.
  <http://crr.ugent.be/papers/SUBTLEX-NL_BRM.pdf>

- Kudo, T. (2005). Mecab: Yet another part-of-speech and morphological
  analyzer.
  <http://mecab.sourceforge.net/>

- Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W., and Petrov,
  S. (2012). Syntactic annotations for the Google Books Ngram Corpus.
  Proceedings of the ACL 2012 system demonstrations, 169-174.
  <http://aclweb.org/anthology/P12-3029>

- Lison, P. and Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large
  Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th
  International Conference on Language Resources and Evaluation (LREC 2016).
  <http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf>

- Ortiz Suárez, P. J., Sagot, B., and Romary, L. (2019). Asynchronous pipelines
  for processing huge corpora on medium to low resource infrastructures. In
  Proceedings of the Workshop on Challenges in the Management of Large Corpora
  (CMLC-7) 2019.
  <https://oscar-corpus.com/publication/2019/clmc7/asynchronous/>

- ParaCrawl (2018). Provision of Web-Scale Parallel Corpora for Official
  European Languages. <https://paracrawl.eu/>

- van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014).
  SUBTLEX-UK: A new and improved word frequency database for British English.
  The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.
  <http://www.tandfonline.com/doi/pdf/10.1080/17470218.2013.850521>

pyproject.toml

@@ -1,6 +1,6 @@
[tool.poetry]
name = "wordfreq"
version = "3.0.0"
description = "Look up the frequencies of words in many languages, based on many sources of data."
authors = ["Robyn Speer <rspeer@arborelia.net>"]
license = "MIT"

setup.py

@@ -33,7 +33,7 @@ dependencies = [
setup(
    name="wordfreq",
    version='3.0.0',
    maintainer='Robyn Speer',
    maintainer_email='rspeer@arborelia.net',
    url='http://github.com/rspeer/wordfreq/',

(test file)

@@ -16,6 +16,11 @@ def test_decimals():
    assert word_frequency("3,14", "de") == word_frequency("3,15", "de")

def test_eastern_arabic():
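    # The estimate only looks at the first digit's value: ٥٤ and ٥٣ tie
    # (both start with ٥ = 5), while ٤٣ (leading 4) ranks above ٥٤
    # (leading 5), as Benford's law predicts.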
    assert word_frequency("٥٤", "ar") == word_frequency("٥٣", "ar")
    assert word_frequency("٤٣", "ar") > word_frequency("٥٤", "ar")

def test_year_distribution():
    assert word_frequency("2010", "en") > word_frequency("1010", "en")
    assert word_frequency("2010", "en") > word_frequency("3010", "en")