Mirror of https://github.com/rspeer/wordfreq.git, synced 2024-12-24 01:41:39 +00:00

Commit 2563eb8d72 (parent 5d6a41499b): update version and documentation

CHANGELOG.md (45 lines changed)

@@ -1,9 +1,34 @@
# Changelog

## Version 3.0 (2022-03-10)

This is the "handle numbers better" release.

Previously, wordfreq would group all digit sequences of the same 'shape',
with length 2 or more, into a single token and return the frequency of that
token, which would be a vast overestimate.

Now it distributes the frequency over all numbers of that shape, with an
estimated distribution that allows for Benford's law (lower numbers are more
frequent) and a special frequency distribution for 4-digit numbers that look
like years (2010 is more frequent than 1020).

Relatedly:

- Functions such as `iter_wordlist` and `top_n_list` no longer return
  multi-digit numbers (they used to return them in their "smashed" form, such
  as "0000").

- `lossy_tokenize` no longer replaces digit sequences with 0s. That happens
  instead in a place that's internal to the `word_frequency` function, so we
  can look at the values of the digits before they're replaced.

## Version 2.5.1 (2021-09-02)

- Import ftfy and use its `uncurl_quotes` method to turn curly quotes into
  straight ones, providing consistency with multiple forms of apostrophes.

- Set minimum version requirements on `regex`, `jieba`, and `langcodes`
  so that tokenization will give consistent results.

- Workaround an inconsistency in the `msgpack` API around

@@ -83,7 +108,6 @@ Library changes:

- Fixed calling `msgpack.load` with a deprecated parameter.

## Version 2.2 (2018-07-24)

Library change:

@@ -104,7 +128,6 @@ Data changes:

- The input data includes the change to tokenization described above, giving
  us word frequencies for words such as "l@s".

## Version 2.1 (2018-06-18)

Data changes:

@@ -125,7 +148,6 @@ Library changes:

  in `/usr/lib/x86_64-linux-gnu/mecab`, which is where Ubuntu 18.04 puts them
  when they are installed from source.

## Version 2.0.1 (2018-05-01)

Fixed edge cases that inserted spurious token boundaries when Japanese text is

@@ -148,8 +170,6 @@ use the iteration mark 々.

This change does not affect any word frequencies. (The Japanese word list uses
`wordfreq.mecab` for tokenization, not `simple_tokenize`.)

## Version 2.0 (2018-03-14)

The big change in this version is that text preprocessing, tokenization, and

@@ -212,7 +232,6 @@ Nitty gritty dependency changes:

[exquisite-corpus]: https://github.com/LuminosoInsight/exquisite-corpus

## Version 1.7.0 (2017-08-25)

- Tokenization will always keep Unicode graphemes together, including

@@ -223,7 +242,6 @@ Nitty gritty dependency changes:

- Support Bengali and Macedonian, which passed the threshold of having enough
  source data to be included

## Version 1.6.1 (2017-05-10)

- Depend on langcodes 1.4, with a new language-matching system that does not

@@ -232,13 +250,12 @@ Nitty gritty dependency changes:

  This prevents silly conflicts where langcodes' SQLite connection was
  preventing langcodes from being used in threads.

## Version 1.6.0 (2017-01-05)

- Support Czech, Persian, Ukrainian, and Croatian/Bosnian/Serbian
- Add large lists in Chinese, Finnish, Japanese, and Polish
- Data is now collected and built using Exquisite Corpus
  (<https://github.com/LuminosoInsight/exquisite-corpus>)
- Add word frequencies from OPUS OpenSubtitles 2016
- Add word frequencies from the MOKK Hungarian Webcorpus
- Expand Google Books Ngrams data to cover 8 languages

@@ -255,13 +272,11 @@ Nitty gritty dependency changes:

- Another new frequency-merging strategy (drop the highest and lowest,
  average the rest)

## Version 1.5.1 (2016-08-19)

- Bug fix: Made it possible to load the Japanese or Korean dictionary when the
  other one is not available

## Version 1.5.0 (2016-08-08)

- Include word frequencies learned from the Common Crawl

@@ -280,7 +295,6 @@ Nitty gritty dependency changes:

[Announcement blog post](https://blog.conceptnet.io/2016/08/22/wordfreq-1-5-more-data-more-languages-more-accuracy)

## Version 1.4 (2016-06-02)

- Add large lists in English, German, Spanish, French, and Portuguese

@@ -288,12 +302,10 @@ Nitty gritty dependency changes:

[Announcement blog post](https://blog.conceptnet.io/2016/06/02/wordfreq-1-4-more-words-plus-word-frequencies-from-reddit/)

## Version 1.3 (2016-01-14)

- Add Reddit comments as an English source

## Version 1.2 (2015-10-29)

- Add SUBTLEX data

@@ -307,14 +319,12 @@ Nitty gritty dependency changes:

[Announcement blog post](https://blog.luminoso.com/2015/10/29/wordfreq-1-2-is-better-at-chinese-english-greek-polish-swedish-and-turkish/)

## Version 1.1 (2015-08-25)

- Use the 'regex' package to implement Unicode tokenization that's mostly
  consistent across languages
- Use NFKC normalization in Japanese and Arabic

## Version 1.0 (2015-07-28)

- Create compact word frequency lists in English, Arabic, German, Spanish,

@@ -322,4 +332,3 @@ Nitty gritty dependency changes:

- Marginal support for Greek, Korean, Chinese
- Fresh start, dropping compatibility with wordfreq 0.x and its unreasonably
  large downloads

README.md (123 lines changed)

@@ -3,7 +3,6 @@ languages, based on many sources of data.

Author: Robyn Speer

## Installation

wordfreq requires Python 3 and depends on a few other Python modules

@@ -19,7 +18,6 @@ or by getting the repository and running its setup.py:

See [Additional CJK installation](#additional-cjk-installation) for extra
steps that are necessary to get Chinese, Japanese, and Korean word frequencies.

## Usage

wordfreq provides access to estimates of the frequency with which a word is

@@ -56,7 +54,6 @@ frequency as a decimal between 0 and 1.

    >>> word_frequency('café', 'fr')
    5.75e-05

`zipf_frequency` is a variation on `word_frequency` that aims to return the
word frequency on a human-friendly logarithmic scale. The Zipf scale was
proposed by Marc Brysbaert, who created the SUBTLEX lists. The Zipf frequency

@@ -86,7 +83,6 @@ one occurrence per billion words.

    >>> zipf_frequency('zipf', 'en', wordlist='small')
    0.0

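As a rough sketch of that relationship (an illustration, not the library's
implementation, assuming a Zipf value is the base-10 logarithm of a word's
frequency per billion words), you could approximate `zipf_frequency` from
`word_frequency`:

    import math
    from wordfreq import word_frequency, zipf_frequency

    def approx_zipf(word, lang):
        # A Zipf value is roughly the base-10 log of the word's frequency per
        # billion words; the real zipf_frequency also rounds its result and
        # never returns less than 0.
        freq = word_frequency(word, lang)
        if freq == 0:
            return 0.0
        return max(round(math.log10(freq * 1e9), 2), 0.0)

    print(approx_zipf('café', 'fr'))      # expected to be close to...
    print(zipf_frequency('café', 'fr'))   # ...the value wordfreq reports
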
The parameters to `word_frequency` and `zipf_frequency` are:

- `word`: a Unicode string containing the word to look up. Ideally the word

@@ -103,7 +99,6 @@ The parameters to `word_frequency` and `zipf_frequency` are:

  value contained in the wordlist, to avoid a discontinuity where the wordlist
  ends.

## Frequency bins

wordfreq's wordlists are designed to load quickly and take up little space in

@@ -120,12 +115,11 @@ Because the Zipf scale is a logarithmic scale, this preserves the same relative

precision no matter how far down you are in the word list. The frequency of any
word is precise to within 1%.

(This is not a claim about *accuracy*, but about *precision*. We believe that
the way we use multiple data sources and discard outliers makes wordfreq a
more accurate measurement of the way these words are really used in written
language, but it's unclear how one would measure this accuracy.)

## The figure-skating metric

We combine word frequencies from different sources in a way that's designed

@@ -137,6 +131,68 @@ in Olympic figure skating:

- Average the remaining frequencies.
- Rescale the resulting frequency list to add up to 1.

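As a minimal sketch of that procedure (an illustration of the steps above, not
wordfreq's actual build code; `estimates` here is a hypothetical mapping from
each word to its per-source frequency estimates):

    def merge_frequencies(estimates):
        # Drop the highest and lowest estimate for each word (when there are
        # more than two), average the rest, then rescale so the list sums to 1.
        merged = {}
        for word, freqs in estimates.items():
            freqs = sorted(freqs)
            if len(freqs) > 2:
                freqs = freqs[1:-1]
            merged[word] = sum(freqs) / len(freqs)
        total = sum(merged.values())
        return {word: freq / total for word, freq in merged.items()}
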
## Numbers

These wordlists would be enormous if they stored a separate frequency for every
number, such as if we separately stored the frequencies of 484977 and 484978
and 98.371 and every other 6-character sequence that could be considered a number.

Instead, we have a frequency-bin entry for every number of the same "shape", such
as `##` or `####` or `#.#####`, with `#` standing in for digits. (For compatibility
with earlier versions of wordfreq, our stand-in character is actually `0`.) This
is the same form of aggregation that the word2vec vocabulary does.

Single-digit numbers are unaffected by this "binning" process; "0" through "9" have
their own entries in each language's wordlist.

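As a hypothetical sketch of that binning (not the code wordfreq actually uses),
the "shape" of a token can be computed by collapsing its digits to the stand-in
character whenever it contains a multi-digit sequence:

    import re

    def number_shape(token):
        # Tokens containing a run of two or more digits collapse to a shape;
        # single digits like "7" keep their own wordlist entries.
        if re.search(r"\d\d", token):
            return re.sub(r"\d", "0", token)
        return token

    print(number_shape("484977"))   # 000000 -- shared with 484978 and the rest
    print(number_shape("98.371"))   # 00.000
    print(number_shape("7"))        # 7
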
When asked for the frequency of a token containing multiple digits, we multiply
the frequency of that aggregated entry by a distribution estimating the frequency
of those digits. The distribution only looks at two things:

- The value of the first digit
- Whether it is a 4-digit sequence that's likely to represent a year

The first digits are assigned probabilities by Benford's law, and years are assigned
probabilities from a distribution that peaks at the "present". I explored this in
a Twitter thread at <https://twitter.com/r_speer/status/1493715982887571456>.

The part of this distribution representing the "present" is not strictly a peak;
it's a 20-year-long plateau from 2019 to 2039. (2019 is the last time Google Books
Ngrams was updated, and 2039 is a time by which I will probably have figured out
a new distribution.)
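
For intuition, the Benford's-law factor mentioned above is simple to state. The
sketch below is only an illustration (not wordfreq's actual distribution), and
the year range in `looks_like_year` is an assumption made for the example, not
the library's real rule:

    import math

    def benford_first_digit(d):
        # Benford's law: P(first digit = d) = log10(1 + 1/d)
        return math.log10(1 + 1 / d)

    def looks_like_year(digits):
        # Hypothetical rule: treat 4-digit tokens in a plausible range as years.
        return len(digits) == 4 and "1000" <= digits <= "2039"

    print(benford_first_digit(1))   # ~0.301: a leading 1 is most common
    print(benford_first_digit(9))   # ~0.046: a leading 9 is least common
    print(looks_like_year("2022"), looks_like_year("90210"))   # True False
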
Some examples:

    >>> word_frequency("2022", "en")
    5.15e-05
    >>> word_frequency("1922", "en")
    8.19e-06
    >>> word_frequency("1022", "en")
    1.28e-07

Aside from years, the distribution does **not** care about the meaning of the numbers:

    >>> word_frequency("90210", "en")
    3.34e-10
    >>> word_frequency("92222", "en")
    3.34e-10
    >>> word_frequency("802.11n", "en")
    9.04e-13
    >>> word_frequency("899.19n", "en")
    9.04e-13

The digit rule applies to other systems of digits, and only cares about the numeric
value of the digits:

    >>> word_frequency("٥٤", "ar")
    6.64e-05
    >>> word_frequency("54", "ar")
    6.64e-05

It doesn't know which language uses which writing system for digits:

    >>> word_frequency("٥٤", "en")
    5.4e-05

## Sources and supported languages

@@ -227,7 +283,6 @@ Some languages provide 'large' wordlists, including words with a Zipf frequency

between 1.0 and 3.0. These are available in 14 languages that are covered by
enough data sources.

## Other functions

`tokenize(text, lang)` splits text in the given language into words, in the same

@@ -273,7 +328,6 @@ ASCII. But maybe you should just use [xkpa][].

[xkcd936]: https://xkcd.com/936/
[xkpa]: https://github.com/beala/xkcd-password

## Tokenization

wordfreq uses the Python package `regex`, which is a more advanced

@@ -335,7 +389,6 @@ their frequency:

    >>> zipf_frequency('owl-flavored', 'en')
    3.3

## Multi-script languages

Two of the languages we support, Serbian and Chinese, are written in multiple

@@ -358,7 +411,6 @@ Enumerating the Chinese wordlist will produce some unfamiliar words, because

people don't actually write in Oversimplified Chinese, and because in
practice Traditional and Simplified Chinese also have different word usage.

## Similar, overlapping, and varying languages

As much as we would like to give each language its own distinct code and its

@@ -384,7 +436,6 @@ module to find the best match for a language code. If you ask for word

frequencies in `cmn-Hans` (the fully specific language code for Mandarin in
Simplified Chinese), you will get the `zh` wordlist, for example.

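As an illustration of that matching (an assumption drawn from the description
above rather than a documented guarantee; Chinese lookups also need the extras
described under "Additional CJK installation" below):

    from wordfreq import word_frequency

    # Both language codes should resolve to the same Chinese ('zh') wordlist.
    print(word_frequency("谢谢", "cmn-Hans") == word_frequency("谢谢", "zh"))  # expected: True
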
## Additional CJK installation

Chinese, Japanese, and Korean have additional external dependencies so that

@@ -399,17 +450,16 @@ and `mecab-ko-dic`.

As of version 2.4.2, you no longer have to install dictionaries separately.

## License

`wordfreq` is freely redistributable under the MIT license (see
`MIT-LICENSE.txt`), and it includes data files that may be
redistributed under a Creative Commons Attribution-ShareAlike 4.0
license (<https://creativecommons.org/licenses/by-sa/4.0/>).

`wordfreq` contains data extracted from Google Books Ngrams
(<http://books.google.com/ngrams>) and Google Books Syntactic Ngrams
(<http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html>).
The terms of use of this data are:

Ngram Viewer graphs and data may be freely used for any purpose, although

@@ -420,21 +470,21 @@ The terms of use of this data are:

sources:

- The Leeds Internet Corpus, from the University of Leeds Centre for Translation
  Studies (<http://corpus.leeds.ac.uk/list.html>)

- Wikipedia, the free encyclopedia (<http://www.wikipedia.org>)

- ParaCrawl, a multilingual Web crawl (<https://paracrawl.eu>)

It contains data from OPUS OpenSubtitles 2018
(<http://opus.nlpl.eu/OpenSubtitles.php>), whose data originates from the
OpenSubtitles project (<http://www.opensubtitles.org/>) and may be used with
attribution to OpenSubtitles.

It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK,
SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al.
(see citations below) and available at
<http://crr.ugent.be/programs-data/subtitle-frequencies>.

I (Robyn Speer) have obtained permission by e-mail from Marc Brysbaert to
distribute these wordlists in wordfreq, to be used for any purpose, not just

@@ -450,7 +500,6 @@ streaming Twitter API, in accordance with Twitter's Developer Agreement &

Policy. This software gives statistics about words that are commonly used on
Twitter; it does not display or republish any Twitter content.

## Citing wordfreq

If you use wordfreq in your research, please cite it! We publish the code

@@ -459,8 +508,7 @@ citation is:

> Robyn Speer, Joshua Chin, Andrew Lin, Sara Jewett, & Lance Nathan.
> (2018, October 3). LuminosoInsight/wordfreq: v2.2. Zenodo.
> <https://doi.org/10.5281/zenodo.1443582>

The same citation in BibTex format:

@@ -479,20 +527,19 @@ The same citation in BibTex format:

}
```

## Citations to work that wordfreq is built on

- Bojar, O., Chatterjee, R., Federmann, C., Haddow, B., Huck, M., Hokamp, C.,
  Koehn, P., Logacheva, V., Monz, C., Negri, M., Post, M., Scarton, C.,
  Specia, L., & Turchi, M. (2015). Findings of the 2015 Workshop on Statistical
  Machine Translation.
  <http://www.statmt.org/wmt15/results.html>

- Brysbaert, M. & New, B. (2009). Moving beyond Kucera and Francis: A Critical
  Evaluation of Current Word Frequency Norms and the Introduction of a New and
  Improved Word Frequency Measure for American English. Behavior Research
  Methods, 41 (4), 977-990.
  <http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf>

- Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A.
  (2011). The word frequency effect: A review of recent developments and

@@ -501,45 +548,45 @@ The same citation in BibTex format:

- Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character
  frequencies based on film subtitles. PLoS One, 5(6), e10729.
  <http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729>

- Davis, M. (2012). Unicode text segmentation. Unicode Standard Annex, 29.
  <http://unicode.org/reports/tr29/>

- Halácsy, P., Kornai, A., Németh, L., Rung, A., Szakadát, I., & Trón, V.
  (2004). Creating open language resources for Hungarian. In Proceedings of the
  4th international conference on Language Resources and Evaluation (LREC2004).
  <http://mokk.bme.hu/resources/webcorpus/>

- Keuleers, E., Brysbaert, M. & New, B. (2010). SUBTLEX-NL: A new frequency
  measure for Dutch words based on film subtitles. Behavior Research Methods,
  42(3), 643-650.
  <http://crr.ugent.be/papers/SUBTLEX-NL_BRM.pdf>

- Kudo, T. (2005). Mecab: Yet another part-of-speech and morphological
  analyzer.
  <http://mecab.sourceforge.net/>

- Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W., and Petrov,
  S. (2012). Syntactic annotations for the Google Books Ngram Corpus.
  Proceedings of the ACL 2012 system demonstrations, 169-174.
  <http://aclweb.org/anthology/P12-3029>

- Lison, P. and Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large
  Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th
  International Conference on Language Resources and Evaluation (LREC 2016).
  <http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf>

- Ortiz Suárez, P. J., Sagot, B., and Romary, L. (2019). Asynchronous pipelines
  for processing huge corpora on medium to low resource infrastructures. In
  Proceedings of the Workshop on Challenges in the Management of Large Corpora
  (CMLC-7) 2019.
  <https://oscar-corpus.com/publication/2019/clmc7/asynchronous/>

- ParaCrawl (2018). Provision of Web-Scale Parallel Corpora for Official
  European Languages. <https://paracrawl.eu/>

- van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014).
  SUBTLEX-UK: A new and improved word frequency database for British English.
  The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.
  <http://www.tandfonline.com/doi/pdf/10.1080/17470218.2013.850521>

pyproject.toml

@@ -1,6 +1,6 @@
[tool.poetry]
name = "wordfreq"
version = "3.0.0"
description = "Look up the frequencies of words in many languages, based on many sources of data."
authors = ["Robyn Speer <rspeer@arborelia.net>"]
license = "MIT"

setup.py (2 lines changed)

@@ -33,7 +33,7 @@ dependencies = [
setup(
    name="wordfreq",
    version='3.0.0',
    maintainer='Robyn Speer',
    maintainer_email='rspeer@arborelia.net',
    url='http://github.com/rspeer/wordfreq/',

@@ -16,6 +16,11 @@ def test_decimals():
    assert word_frequency("3,14", "de") == word_frequency("3,15", "de")


def test_eastern_arabic():
    assert word_frequency("٥٤", "ar") == word_frequency("٥٣", "ar")
    assert word_frequency("٤٣", "ar") > word_frequency("٥٤", "ar")


def test_year_distribution():
    assert word_frequency("2010", "en") > word_frequency("1010", "en")
    assert word_frequency("2010", "en") > word_frequency("3010", "en")