wordfreq is a Python library for looking up the frequencies of words in many
languages, based on many sources of data.

The word frequencies are a snapshot of language usage through about 2021. I may
continue to make packaging updates, but the data is unlikely to be updated again.
The world where I had a reasonable way to collect reliable word frequencies is
not the world we live in now. See [SUNSET.md](./SUNSET.md) for more information.

Author: Robyn Speer

## Installation

wordfreq requires Python 3 and depends on a few other Python modules
(msgpack, langcodes, and regex). You can install it and its dependencies
in the usual way, either by getting it from pip:

    pip3 install wordfreq

or by getting the repository and installing it for development, using [poetry][]:

    poetry install

[poetry]: https://python-poetry.org/

See [Additional CJK installation](#additional-cjk-installation) for extra
steps that are necessary to get Chinese, Japanese, and Korean word frequencies.

## Usage
|
2015-08-28 21:45:50 +00:00
|
|
|
|
|
|
|
wordfreq provides access to estimates of the frequency with which a word is
|
2022-03-11 15:43:37 +00:00
|
|
|
used, in over 40 languages (see *Supported languages* below). It uses many
|
|
|
|
different data sources, not just one corpus.
|
2015-08-28 21:45:50 +00:00
|
|
|
|
2018-03-08 23:16:15 +00:00
|
|
|
It provides both 'small' and 'large' wordlists:
|
2016-01-22 21:23:43 +00:00
|
|
|
|
2018-03-08 23:16:15 +00:00
|
|
|
- The 'small' lists take up very little memory and cover words that appear at
|
|
|
|
least once per million words.
|
|
|
|
- The 'large' lists cover words that appear at least once per 100 million
|
|
|
|
words.
|
2016-01-22 21:23:43 +00:00
|
|
|
|
2018-03-08 23:16:15 +00:00
|
|
|
The default list is 'best', which uses 'large' if it's available for the
|
|
|
|
language, and 'small' otherwise.
|
2015-08-28 21:45:50 +00:00
|
|
|
|
2018-03-08 23:16:15 +00:00
|
|
|
The most straightforward function for looking up frequencies is:
|
|
|
|
|
|
|
|
word_frequency(word, lang, wordlist='best', minimum=0.0)
|
2015-08-28 21:45:50 +00:00
|
|
|
|
|
|
|
This function looks up a word's frequency in the given language, returning its
|
2018-06-18 19:15:07 +00:00
|
|
|
frequency as a decimal between 0 and 1.
|
2015-08-28 21:45:50 +00:00
|
|
|
|
|
|
|
>>> from wordfreq import word_frequency
|
2018-06-18 19:15:07 +00:00
|
|
|
>>> word_frequency('cafe', 'en')
|
2021-03-29 20:18:08 +00:00
|
|
|
1.23e-05
|
2015-08-28 21:45:50 +00:00
|
|
|
|
2018-06-18 19:15:07 +00:00
|
|
|
>>> word_frequency('café', 'en')
|
2020-10-01 20:05:43 +00:00
|
|
|
5.62e-06
|
2015-08-28 21:45:50 +00:00
|
|
|
|
2018-06-18 19:15:07 +00:00
|
|
|
>>> word_frequency('cafe', 'fr')
|
2021-03-29 20:18:08 +00:00
|
|
|
1.51e-06
|
2015-08-28 21:45:50 +00:00
|
|
|
|
2018-06-18 19:15:07 +00:00
|
|
|
>>> word_frequency('café', 'fr')
|
2021-03-29 20:18:08 +00:00
|
|
|
5.75e-05
|
2015-08-28 21:45:50 +00:00
|
|
|
|

`zipf_frequency` is a variation on `word_frequency` that aims to return the
word frequency on a human-friendly logarithmic scale. The Zipf scale was
proposed by Marc Brysbaert, who created the SUBTLEX lists. The Zipf frequency
of a word is the base-10 logarithm of the number of times it appears per
billion words. A word with Zipf value 6 appears once per thousand words, for
example, and a word with Zipf value 3 appears once per million words.

Reasonable Zipf values are between 0 and 8, but because of the cutoffs
described above, the minimum Zipf value appearing in these lists is 1.0 for the
'large' wordlists and 3.0 for 'small'. We use 0 as the default Zipf value
for words that do not appear in the given wordlist, although it should mean
one occurrence per billion words.

    >>> from wordfreq import zipf_frequency
    >>> zipf_frequency('the', 'en')
    7.73

    >>> zipf_frequency('word', 'en')
    5.26

    >>> zipf_frequency('frequency', 'en')
    4.36

    >>> zipf_frequency('zipf', 'en')
    1.49

    >>> zipf_frequency('zipf', 'en', wordlist='small')
    0.0
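
Since a Zipf value is just the base-10 logarithm of the number of occurrences
per billion words, converting between the two scales is simple arithmetic.
Here is a minimal sketch of that conversion, following the definition above
(not wordfreq's internal code):

    import math

    def zipf_from_frequency(freq):
        # Zipf value = log10 of occurrences per billion words. A frequency
        # of 1e-6 (once per million words) gives a Zipf value of 3.0.
        return math.log10(freq * 1e9)

    def frequency_from_zipf(zipf):
        # The inverse: convert a Zipf value back to a 0-1 frequency.
        return 10 ** zipf / 1e9

    print(zipf_from_frequency(1e-6))  # 3.0
    print(frequency_from_zipf(6))     # 0.001, once per thousand words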

The parameters to `word_frequency` and `zipf_frequency` are:

- `word`: a Unicode string containing the word to look up. Ideally the word
  is a single token according to our tokenizer, but if not, there is still
  hope -- see *Tokenization* below.

- `lang`: the BCP 47 or ISO 639 code of the language to use, such as 'en'.

- `wordlist`: which set of word frequencies to use. Current options are
  'small', 'large', and 'best'.

- `minimum`: if the word is not in the list or has a frequency lower than
  `minimum`, return `minimum` instead. You may want to set this to the minimum
  value contained in the wordlist, to avoid a discontinuity where the wordlist
  ends, as in the example below.
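
For example, a 'large' wordlist bottoms out at Zipf 1.0, which corresponds to
a frequency of 1e-8. Passing that as `minimum` keeps very rare words from
dropping abruptly to 0 (a sketch; 'zyzzyva' stands in for any word rare enough
to be missing from the list):

    >>> from wordfreq import word_frequency
    >>> word_frequency('zyzzyva', 'en', minimum=1e-8)
    1e-08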

## Frequency bins

wordfreq's wordlists are designed to load quickly and take up little space in
the repository. We accomplish this by avoiding meaningless precision and
packing the words into frequency bins.

In wordfreq, all words that have the same Zipf frequency rounded to the nearest
hundredth have the same frequency. We don't store any more precision than that.
So instead of having to store that the frequency of a word is
.000011748975549395302, where most of those digits are meaningless, we just store
the frequency bins and the words they contain.
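
The rounding this implies looks like the following sketch (illustrative only,
not wordfreq's storage code):

    import math

    def binned_frequency(freq):
        # Round the Zipf value to the nearest hundredth, then convert back.
        # Every word in the same bin reports this same frequency.
        zipf = round(math.log10(freq * 1e9), 2)
        return 10 ** zipf / 1e9

    print(binned_frequency(0.000011748975549395302))
    # ~1.1749e-05: the bin at Zipf 4.07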

Because the Zipf scale is a logarithmic scale, this preserves the same relative
precision no matter how far down you are in the word list. The frequency of any
word is precise to within 1%.

(This is not a claim about *accuracy*, but about *precision*. We believe that
the way we use multiple data sources and discard outliers makes wordfreq a
more accurate measurement of the way these words are really used in written
language, but it's unclear how one would measure this accuracy.)

## The figure-skating metric

We combine word frequencies from different sources in a way that's designed
to minimize the impact of outliers. The method reminds me of the scoring system
in Olympic figure skating:

- Find the frequency of each word according to each data source.
- For each word, drop the sources that give it the highest and lowest frequency.
- Average the remaining frequencies.
- Rescale the resulting frequency list to add up to 1.
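
In code, the procedure amounts to something like this sketch (a simplification
of the idea above, not the pipeline Exquisite Corpus actually runs):

    def combine_sources(freqs_per_source):
        # freqs_per_source maps each word to a list of its frequencies,
        # one per data source that contains the word.
        combined = {}
        for word, freqs in freqs_per_source.items():
            freqs = sorted(freqs)
            if len(freqs) > 2:
                freqs = freqs[1:-1]  # drop the highest and lowest scores
            combined[word] = sum(freqs) / len(freqs)
        # Rescale the resulting frequencies to add up to 1.
        total = sum(combined.values())
        return {word: freq / total for word, freq in combined.items()}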

## Numbers

These wordlists would be enormous if they stored a separate frequency for every
number, such as if we separately stored the frequencies of 484977 and 484978
and 98.371 and every other 6-character sequence that could be considered a number.

Instead, we have a frequency-bin entry for every number of the same "shape", such
as `##` or `####` or `#.#####`, with `#` standing in for digits. (For compatibility
with earlier versions of wordfreq, our stand-in character is actually `0`.) This
is the same form of aggregation that the word2vec vocabulary uses.

Single-digit numbers are unaffected by this process; "0" through "9" have their own
entries in each language's wordlist.
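
Collapsing a number to its shape is a one-line transformation, sketched here
(illustrative; `\d` in the `regex` library matches digits in any script):

    import regex

    def number_shape(token):
        # Replace every digit with the stand-in character '0', so that
        # '484977' and '484978' both collapse to the same entry.
        return regex.sub(r'\d', '0', token)

    print(number_shape('484977'))  # 000000
    print(number_shape('98.371'))  # 00.000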

When asked for the frequency of a token containing multiple digits, we multiply
the frequency of that aggregated entry by a distribution estimating the frequency
of those digits. The distribution only looks at two things:

- The value of the first digit
- Whether it is a 4-digit sequence that's likely to represent a year

The first digits are assigned probabilities by Benford's law, and years are assigned
probabilities from a distribution that peaks at the "present". I explored this in
a Twitter thread at <https://twitter.com/r_speer/status/1493715982887571456>.
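
Benford's law gives the first digit *d* a probability of log10(1 + 1/d), which
makes 1 by far the most common leading digit. A quick illustration (not
wordfreq's internal code):

    import math

    def benford(d):
        # Probability that a number's first significant digit is d.
        return math.log10(1 + 1 / d)

    for d in range(1, 10):
        print(d, round(benford(d), 3))
    # 1 leads about 30.1% of numbers; 9 leads only about 4.6%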

The part of this distribution representing the "present" is not strictly a peak and
doesn't move forward with time as the present does. Instead, it's a 20-year-long
plateau from 2019 to 2039. (2019 is the last time Google Books Ngrams was updated,
and 2039 is a time by which I will probably have figured out a new distribution.)

Some examples:

    >>> word_frequency("2022", "en")
    5.15e-05
    >>> word_frequency("1922", "en")
    8.19e-06
    >>> word_frequency("1022", "en")
    1.28e-07

Aside from years, the distribution does not care about the meaning of the numbers:

    >>> word_frequency("90210", "en")
    3.34e-10
    >>> word_frequency("92222", "en")
    3.34e-10
    >>> word_frequency("802.11n", "en")
    9.04e-13
    >>> word_frequency("899.19n", "en")
    9.04e-13

The digit rule applies to other systems of digits, and only cares about the numeric
value of the digits:

    >>> word_frequency("٥٤", "ar")
    6.64e-05
    >>> word_frequency("54", "ar")
    6.64e-05

It doesn't know which language uses which writing system for digits:

    >>> word_frequency("٥٤", "en")
    5.4e-05

## Sources and supported languages

This data comes from a Luminoso project called [Exquisite Corpus][xc], whose
goal is to download good, varied, multilingual corpus data, process it
appropriately, and combine it into unified resources such as wordfreq.

[xc]: https://github.com/LuminosoInsight/exquisite-corpus

Exquisite Corpus compiles 8 different domains of text, some of which themselves
come from multiple sources:

- **Wikipedia**, representing encyclopedic text
- **Subtitles**, from OPUS OpenSubtitles 2018 and SUBTLEX
- **News**, from NewsCrawl 2014 and GlobalVoices
- **Books**, from Google Books Ngrams 2012
- **Web** text, from OSCAR
- **Twitter**, representing short-form social media
- **Reddit**, representing potentially longer Internet comments
- **Miscellaneous** word frequencies: in Chinese, we import a free wordlist
  that comes with the Jieba word segmenter, whose provenance we don't really
  know.

The following languages are supported, with reasonable tokenization and at
least 3 different sources of word frequencies:

    Language    Code    #  Large?   WP    Subs  News  Books Web   Twit. Redd. Misc.
    ──────────────────────────────┼────────────────────────────────────────────────
    Arabic      ar      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
    Bangla      bn      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
    Bosnian     bs [1]  3  -      │ Yes   Yes   -     -     -     Yes   -     -
    Bulgarian   bg      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -
    Catalan     ca      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
    Chinese     zh [3]  7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   -     Jieba
    Croatian    hr [1]  3  -      │ Yes   Yes   -     -     -     Yes   -     -
    Czech       cs      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
    Danish      da      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -
    Dutch       nl      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
    English     en      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
    Finnish     fi      6  Yes    │ Yes   Yes   Yes   -     Yes   Yes   Yes   -
    French      fr      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
    German      de      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
    Greek       el      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -
    Hebrew      he      5  Yes    │ Yes   Yes   -     Yes   Yes   Yes   -     -
    Hindi       hi      4  Yes    │ Yes   -     -     -     Yes   Yes   Yes   -
    Hungarian   hu      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -
    Icelandic   is      3  -      │ Yes   Yes   -     -     Yes   -     -     -
    Indonesian  id      3  -      │ Yes   Yes   -     -     -     Yes   -     -
    Italian     it      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
    Japanese    ja      5  Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -
    Korean      ko      4  -      │ Yes   Yes   -     -     -     Yes   Yes   -
    Latvian     lv      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -
    Lithuanian  lt      3  -      │ Yes   Yes   -     -     Yes   -     -     -
    Macedonian  mk      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
    Malay       ms      3  -      │ Yes   Yes   -     -     -     Yes   -     -
    Norwegian   nb [2]  5  Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -
    Persian     fa      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -
    Polish      pl      6  Yes    │ Yes   Yes   Yes   -     Yes   Yes   Yes   -
    Portuguese  pt      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
    Romanian    ro      3  -      │ Yes   Yes   -     -     Yes   -     -     -
    Russian     ru      5  Yes    │ Yes   Yes   Yes   Yes   -     Yes   -     -
    Slovak      sk      3  -      │ Yes   Yes   -     -     Yes   -     -     -
    Slovenian   sl      3  -      │ Yes   Yes   -     -     Yes   -     -     -
    Serbian     sr [1]  3  -      │ Yes   Yes   -     -     -     Yes   -     -
    Spanish     es      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
    Swedish     sv      5  Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -
    Tagalog     fil     3  -      │ Yes   Yes   -     -     Yes   -     -     -
    Tamil       ta      3  -      │ Yes   -     -     -     Yes   Yes   -     -
    Turkish     tr      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -
    Ukrainian   uk      5  Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -
    Urdu        ur      3  -      │ Yes   -     -     -     Yes   Yes   -     -
    Vietnamese  vi      3  -      │ Yes   Yes   -     -     Yes   -     -     -

[1] Bosnian, Croatian, and Serbian use the same underlying word list, because
they share most of their vocabulary and grammar, they were once considered the
same language, and language detection cannot distinguish them. This word list
can also be accessed with the language code `sh`.

[2] The Norwegian text we have is specifically written in Norwegian Bokmål, so
we give it the language code 'nb' instead of the vaguer code 'no'. We would use
'nn' for Nynorsk, but there isn't enough data to include it in wordfreq.

[3] This data represents text written in both Simplified and Traditional
Chinese, with primarily Mandarin Chinese vocabulary. See "Multi-script
languages" below.

Some languages provide 'large' wordlists, including words with a Zipf frequency
between 1.0 and 3.0. These are available in the 22 languages that are covered by
enough data sources, marked "Yes" in the Large? column above.

## Other functions

`tokenize(text, lang)` splits text in the given language into words, in the same
way that the words in wordfreq's data were counted in the first place. See
*Tokenization*.

`top_n_list(lang, n, wordlist='best')` returns the most common *n* words in
the list, in descending frequency order.

    >>> from wordfreq import top_n_list
    >>> top_n_list('en', 10)
    ['the', 'to', 'and', 'of', 'a', 'in', 'i', 'is', 'for', 'that']

    >>> top_n_list('es', 10)
    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'un']

`iter_wordlist(lang, wordlist='best')` iterates through all the words in a
wordlist, in descending frequency order.

`get_frequency_dict(lang, wordlist='best')` returns all the frequencies in
a wordlist as a dictionary, for cases where you'll want to look up a lot of
words and don't need the wrapper that `word_frequency` provides.
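
For example (a sketch; the rounded value is consistent with the Zipf value of
7.73 for 'the' shown earlier, since 10 ** (7.73 - 9) ≈ 0.0537):

    >>> from wordfreq import get_frequency_dict
    >>> freqs = get_frequency_dict('en')
    >>> round(freqs['the'], 4)
    0.0537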

`available_languages(wordlist='best')` returns a dictionary whose keys are
language codes, and whose values are the data file that will be loaded to
provide the requested wordlist in each language.

`get_language_info(lang)` returns a dictionary of information about how we
preprocess text in this language, such as what script we expect it to be
written in, which characters we normalize together, and how we tokenize it.
See its docstring for more information.

`random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12)`
returns a selection of random words, separated by spaces. `bits_per_word=n`
will select each random word from 2^n words.

If you happen to want an easy way to get [a memorable, xkcd-style
password][xkcd936] with 60 bits of entropy, this function will almost do the
job. In this case, you should actually run the similar function
`random_ascii_words`, limiting the selection to words that can be typed in
ASCII. But maybe you should just use [xkpa][].

[xkcd936]: https://xkcd.com/936/
[xkpa]: https://github.com/beala/xkcd-password
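
With the defaults of `nwords=5` and `bits_per_word=12`, that comes to
5 × 12 = 60 bits of entropy. A usage sketch (the output is random, so yours
will differ):

    >>> from wordfreq import random_ascii_words
    >>> random_ascii_words(lang='en')
    'spite pose nephew dock unrest'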

## Tokenization

wordfreq uses the Python package `regex`, which is a more advanced
implementation of regular expressions than the standard library, to
separate text into tokens that can be counted consistently. `regex`
produces tokens that follow the recommendations in [Unicode
Annex #29, Text Segmentation][uax29], including the optional rule that
splits words between apostrophes and vowels.

There are exceptions where we change the tokenization to work better
with certain languages:

- In Arabic and Hebrew, it additionally normalizes ligatures and removes
  combining marks.

- In Japanese and Korean, instead of using the regex library, it uses the
  external library `mecab-python3`. This is an optional dependency of wordfreq,
  and compiling it requires the `libmecab-dev` system package to be installed.

- In Chinese, it uses the external Python library `jieba`, another optional
  dependency.

- While the @ sign is usually considered a symbol and not part of a word,
  wordfreq will allow a word to end with "@" or "@s". This is one way of
  writing gender-neutral words in Spanish and Portuguese.

[uax29]: http://unicode.org/reports/tr29/

When wordfreq's frequency lists are built in the first place, the words are
tokenized according to this function.

    >>> from wordfreq import tokenize
    >>> tokenize('l@s niñ@s', 'es')
    ['l@s', 'niñ@s']
    >>> zipf_frequency('l@s', 'es')
    3.03

Because tokenization in the real world is far from consistent, wordfreq will
also try to deal gracefully when you query it with texts that actually break
into multiple tokens:

    >>> zipf_frequency('New York', 'en')
    5.32
    >>> zipf_frequency('北京地铁', 'zh') # "Beijing Subway"
    3.29

The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be. In Chinese,
where the word breaks must be inferred from the frequency of the resulting
words, there is also a penalty to the word frequency for each word break that
must be inferred.
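
For two tokens with frequencies `a` and `b`, the half-harmonic-mean is
`a * b / (a + b)`: half of the usual harmonic mean, and always smaller than
either input. A sketch of the idea (wordfreq's real implementation also
applies the word-break penalty described above):

    def half_harmonic_mean(a, b):
        # Half of the harmonic mean 2ab / (a + b).
        return (a * b) / (a + b)

    # Two tokens that each appear once per million words combine to an
    # estimated once per two million words:
    print(half_harmonic_mean(1e-6, 1e-6))  # 5e-07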

This method of combining word frequencies implicitly assumes that you're asking
about words that frequently appear together. It's not multiplying the
frequencies, because that would assume they are statistically unrelated. So if
you give it an uncommon combination of tokens, it will hugely over-estimate
their frequency:

    >>> zipf_frequency('owl-flavored', 'en')
    3.3

## Multi-script languages

Two of the languages we support, Serbian and Chinese, are written in multiple
scripts. To avoid spurious differences in word frequencies, we automatically
transliterate the characters in these languages when looking up their words.

Serbian text written in Cyrillic letters is automatically converted to Latin
letters, using standard Serbian transliteration, when the requested language is
`sr` or `sh`. If you request the word list as `hr` (Croatian) or `bs`
(Bosnian), no transliteration will occur.
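
For example (a sketch; this assumes the tokenizer's usual lowercasing in
addition to the transliteration described above):

    >>> from wordfreq import tokenize
    >>> tokenize('Београд', 'sr')
    ['beograd']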

Chinese text is converted internally to a representation we call
"Oversimplified Chinese", where all Traditional Chinese characters are replaced
with their Simplified Chinese equivalent, *even if* they would not be written
that way in context. This representation lets us use a straightforward mapping
that matches both Traditional and Simplified words, unifying their frequencies
when appropriate, and does not appear to create clashes between unrelated words.

Enumerating the Chinese wordlist will produce some unfamiliar words, because
people don't actually write in Oversimplified Chinese, and because in
practice Traditional and Simplified Chinese also have different word usage.

## Similar, overlapping, and varying languages

As much as we would like to give each language its own distinct code and its
own distinct word list with distinct source data, there aren't actually sharp
boundaries between languages.

Sometimes, it's convenient to pretend that the boundaries between languages
coincide with national borders, following the maxim that "a language is a
dialect with an army and a navy" (Max Weinreich). This gets complicated when the
linguistic situation and the political situation diverge. Moreover, some of our
data sources rely on language detection, which of course has no idea which
country the writer of the text belongs to.

So we've had to make some arbitrary decisions about how to represent the
fuzzier language boundaries, such as those within Chinese, Malay, and
Croatian/Bosnian/Serbian.

Smoothing over our arbitrary decisions is the fact that we use the `langcodes`
module to find the best match for a language code. If you ask for word
frequencies in `cmn-Hans` (the fully specific language code for Mandarin in
Simplified Chinese), you will get the `zh` wordlist, for example.
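
For example (assuming the mapping just described):

    >>> from wordfreq import zipf_frequency
    >>> zipf_frequency('谢谢', 'cmn-Hans') == zipf_frequency('谢谢', 'zh')
    True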

## Additional CJK installation

Chinese, Japanese, and Korean have additional external dependencies so that
they can be tokenized correctly. They can all be installed at once by requesting
the 'cjk' feature:

    pip install wordfreq[cjk]

You can put `wordfreq[cjk]` in a list of dependencies, such as the
`[tool.poetry.dependencies]` list of your own project.

Tokenizing Chinese depends on the `jieba` package, tokenizing Japanese depends
on `mecab-python3` and `ipadic`, and tokenizing Korean depends on `mecab-python3`
and `mecab-ko-dic`.

As of version 2.4.2, you no longer have to install dictionaries separately.

## License

`wordfreq` is freely redistributable under the Apache license (see
`LICENSE.txt`), and it includes data files that may be
redistributed under a Creative Commons Attribution-ShareAlike 4.0
license (<https://creativecommons.org/licenses/by-sa/4.0/>).

`wordfreq` contains data extracted from Google Books Ngrams
(<http://books.google.com/ngrams>) and Google Books Syntactic Ngrams
(<http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html>).
The terms of use of this data are:

    Ngram Viewer graphs and data may be freely used for any purpose, although
    acknowledgement of Google Books Ngram Viewer as the source, and inclusion
    of a link to http://books.google.com/ngrams, would be appreciated.

`wordfreq` also contains data derived from the following Creative Commons-licensed
sources:

- The Leeds Internet Corpus, from the University of Leeds Centre for Translation
  Studies (<http://corpus.leeds.ac.uk/list.html>)

- Wikipedia, the free encyclopedia (<http://www.wikipedia.org>)

- ParaCrawl, a multilingual Web crawl (<https://paracrawl.eu>)

It contains data from OPUS OpenSubtitles 2018
(<http://opus.nlpl.eu/OpenSubtitles.php>), whose data originates from the
OpenSubtitles project (<http://www.opensubtitles.org/>) and may be used with
attribution to OpenSubtitles.

It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK,
SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al.
(see citations below) and available at
<http://crr.ugent.be/programs-data/subtitle-frequencies>.

I (Robyn Speer) have obtained permission by e-mail from Marc Brysbaert to
distribute these wordlists in wordfreq, to be used for any purpose, not just
for academic use, under these conditions:

- Wordfreq and code derived from it must credit the SUBTLEX authors.
- It must remain clear that SUBTLEX is freely available data.

These terms are similar to the Creative Commons Attribution-ShareAlike license.

Some additional data was collected by a custom application that watches the
streaming Twitter API, in accordance with Twitter's Developer Agreement &
Policy. This software gives statistics about words that are commonly used on
Twitter; it does not display or republish any Twitter content.

## Can I convert wordfreq to a more convenient form for my purposes, like a CSV file?

No. The CSV format does not have any space for attribution or license
information, and therefore does not follow the CC-By-SA license. Even if you
tried to include the proper attribution in a header or in another file, someone
would likely just strip it out.

wordfreq isn't particularly separable from its code, anyway. It depends on its
normalization and word segmentation process, which is implemented in Python
code, to give appropriate results.

A reasonable way to transform wordfreq would be to port the library to another
programming language, with all credits included and packaged in the usual way
for that language.

## Citing wordfreq

If you use wordfreq in your research, please cite it! We publish the code
through Zenodo so that it can be reliably cited using a DOI. The current
citation is:

> Robyn Speer. (2022). rspeer/wordfreq: v3.0 (v3.0.2). Zenodo. https://doi.org/10.5281/zenodo.7199437

The same citation in BibTeX format:

```
@software{robyn_speer_2022_7199437,
  author    = {Robyn Speer},
  title     = {rspeer/wordfreq: v3.0},
  month     = sep,
  year      = 2022,
  publisher = {Zenodo},
  version   = {v3.0.2},
  doi       = {10.5281/zenodo.7199437},
  url       = {https://doi.org/10.5281/zenodo.7199437}
}
```

## Citations to work that wordfreq is built on

- Bojar, O., Chatterjee, R., Federmann, C., Haddow, B., Huck, M., Hokamp, C.,
  Koehn, P., Logacheva, V., Monz, C., Negri, M., Post, M., Scarton, C.,
  Specia, L., & Turchi, M. (2015). Findings of the 2015 Workshop on Statistical
  Machine Translation. <http://www.statmt.org/wmt15/results.html>

- Brysbaert, M. & New, B. (2009). Moving beyond Kucera and Francis: A Critical
  Evaluation of Current Word Frequency Norms and the Introduction of a New and
  Improved Word Frequency Measure for American English. Behavior Research
  Methods, 41(4), 977-990.
  <http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf>

- Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A.
  (2011). The word frequency effect: A review of recent developments and
  implications for the choice of frequency estimates in German. Experimental
  Psychology, 58, 412-424.

- Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character
  frequencies based on film subtitles. PLoS One, 5(6), e10729.
  <http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729>

- Davis, M. (2012). Unicode text segmentation. Unicode Standard Annex, 29.
  <http://unicode.org/reports/tr29/>

- Halácsy, P., Kornai, A., Németh, L., Rung, A., Szakadát, I., & Trón, V.
  (2004). Creating open language resources for Hungarian. In Proceedings of the
  4th International Conference on Language Resources and Evaluation (LREC 2004).
  <http://mokk.bme.hu/resources/webcorpus/>

- Keuleers, E., Brysbaert, M. & New, B. (2010). SUBTLEX-NL: A new frequency
  measure for Dutch words based on film subtitles. Behavior Research Methods,
  42(3), 643-650. <http://crr.ugent.be/papers/SUBTLEX-NL_BRM.pdf>

- Kudo, T. (2005). MeCab: Yet another part-of-speech and morphological
  analyzer. <http://mecab.sourceforge.net/>

- Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W., and Petrov,
  S. (2012). Syntactic annotations for the Google Books Ngram Corpus.
  Proceedings of the ACL 2012 System Demonstrations, 169-174.
  <http://aclweb.org/anthology/P12-3029>

- Lison, P. and Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large
  Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th
  International Conference on Language Resources and Evaluation (LREC 2016).
  <http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf>

- Ortiz Suárez, P. J., Sagot, B., and Romary, L. (2019). Asynchronous pipelines
  for processing huge corpora on medium to low resource infrastructures. In
  Proceedings of the Workshop on Challenges in the Management of Large Corpora
  (CMLC-7) 2019. <https://oscar-corpus.com/publication/2019/clmc7/asynchronous/>

- ParaCrawl (2018). Provision of Web-Scale Parallel Corpora for Official
  European Languages. <https://paracrawl.eu/>

- van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014).
  SUBTLEX-UK: A new and improved word frequency database for British English.
  The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.
  <http://www.tandfonline.com/doi/pdf/10.1080/17470218.2013.850521>