Access a database of word frequencies, in various natural languages.

Tools for working with word frequencies from various corpora.

Author: Rob Speer

Installation

wordfreq requires Python 3 and depends on a few other Python modules (msgpack-python, langcodes, and ftfy). You can install it and its dependencies in the usual way, either by getting it from pip:

pip3 install wordfreq

or by getting the repository and running its setup.py:

python3 setup.py install

To handle word frequency lookups in Japanese, you need to additionally install mecab-python3, which itself depends on libmecab-dev. These commands will install them on Ubuntu:

sudo apt-get install mecab-ipadic-utf8 libmecab-dev
pip3 install mecab-python3

Usage

wordfreq provides access to estimates of the frequency with which a word is used, in 16 languages (see Supported languages below). It loads efficiently packed data structures that contain all words that appear at least once per million words.

The most useful function is:

word_frequency(word, lang, wordlist='combined', minimum=0.0)

This function looks up a word's frequency in the given language, returning its frequency as a decimal between 0 and 1. In these examples, we'll multiply the frequencies by a million (1e6) to get more readable numbers:

>>> from wordfreq import word_frequency
>>> word_frequency('cafe', 'en') * 1e6
14.45439770745928

>>> word_frequency('café', 'en') * 1e6
4.7863009232263805

>>> word_frequency('cafe', 'fr') * 1e6
2.0417379446695274

>>> word_frequency('café', 'fr') * 1e6
77.62471166286912

The parameters are:

  • word: a Unicode string containing the word to look up. Ideally the word is a single token according to our tokenizer, but if not, there is still hope -- see Tokenization below.

  • lang: the BCP 47 or ISO 639 code of the language to use, such as 'en'.

  • wordlist: which set of word frequencies to use. Current options are 'combined', which combines up to six different sources, and 'twitter', which returns frequencies observed on Twitter alone.

  • minimum: If the word is not in the list or has a frequency lower than minimum, return minimum instead. In some applications, you'll want to set minimum=1e-6 to avoid a discontinuity where the list ends, because a frequency of 1e-6 (1 per million) is the threshold for being included in the list at all.
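The effect of the minimum parameter can be illustrated with a small stand-alone sketch. It does not call wordfreq itself: the helper name lookup_with_minimum and the toy frequency values are hypothetical, and the function only mirrors the behavior described above (return the stored frequency, but never less than minimum).

```python
def lookup_with_minimum(freqs, word, minimum=0.0):
    """Return the word's frequency from `freqs`, but never less than `minimum`.

    Hypothetical helper: it only mirrors the documented behavior of the
    `minimum` parameter, not wordfreq's actual implementation.
    """
    return max(freqs.get(word, 0.0), minimum)


# Toy frequencies, made up for illustration (not real wordfreq data).
toy_freqs = {'the': 0.05, 'cafe': 1.4e-5}

print(lookup_with_minimum(toy_freqs, 'the', minimum=1e-6))
print(lookup_with_minimum(toy_freqs, 'zyzzyva', minimum=1e-6))
print(lookup_with_minimum(toy_freqs, 'zyzzyva'))
```

With minimum=1e-6, a word that is missing from the list gets the same value as a word sitting exactly at the inclusion threshold, which is what removes the discontinuity.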

Other functions:

tokenize(text, lang) splits text in the given language into words, in the same way that the words in wordfreq's data were counted in the first place. See Tokenization. Tokenizing Japanese requires the optional dependency mecab-python3 to be installed.

top_n_list(lang, n, wordlist='combined') returns the most common n words in the list, in descending frequency order.

>>> from wordfreq import top_n_list
>>> top_n_list('en', 10)
['the', 'of', 'to', 'in', 'and', 'a', 'i', 'you', 'is', 'it']

>>> top_n_list('es', 10)
['de', 'la', 'que', 'el', 'en', 'y', 'a', 'no', 'los', 'es']

iter_wordlist(lang, wordlist='combined') iterates through all the words in a wordlist, in descending frequency order.

get_frequency_dict(lang, wordlist='combined') returns all the frequencies in a wordlist as a dictionary, for cases where you'll want to look up a lot of words and don't need the wrapper that word_frequency provides.

supported_languages(wordlist='combined') returns a dictionary whose keys are language codes, and whose values are the data file that will be loaded to provide the requested wordlist in each language.

random_words(lang='en', wordlist='combined', nwords=5, bits_per_word=12) returns a selection of random words, separated by spaces. bits_per_word=n will select each random word from 2^n words.

If you happen to want an easy way to get a memorable, xkcd-style password with 60 bits of entropy, this function will almost do the job: the defaults give 5 words of 12 bits each. For that purpose, you should actually use the similar function random_ascii_words, which limits the selection to words that can be typed in ASCII.
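The arithmetic behind the 60-bit figure can be sketched as follows. This is not wordfreq's implementation: random_words_sketch and entropy_bits are hypothetical names, the word list is a stand-in, and drawing with the standard library's secrets module is an assumption made here because the example is about passwords.

```python
import secrets


def entropy_bits(nwords=5, bits_per_word=12):
    """Entropy of a phrase where each word is drawn uniformly from 2**bits_per_word words."""
    return nwords * bits_per_word


def random_words_sketch(ranked_words, nwords=5, bits_per_word=12):
    """Pick `nwords` words uniformly from the top 2**bits_per_word entries.

    `ranked_words` is assumed to be sorted by descending frequency.
    """
    pool_size = 2 ** bits_per_word
    if len(ranked_words) < pool_size:
        raise ValueError('need at least %d words' % pool_size)
    pool = ranked_words[:pool_size]
    return ' '.join(secrets.choice(pool) for _ in range(nwords))


# Stand-in word list; a real one would come from a frequency-ranked wordlist.
toy_words = ['w%d' % i for i in range(16)]
print(entropy_bits())  # 5 * 12
print(random_words_sketch(toy_words, nwords=4, bits_per_word=3))
```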

Sources and supported languages

We compiled word frequencies from six different sources, providing us with examples of word usage on different topics at different levels of formality. The sources (and the abbreviations we'll use for them) are:

  • GBooks: Google Books Ngrams 2013
  • LeedsIC: The Leeds Internet Corpus
  • OpenSub: OpenSubtitles
  • SUBTLEX: The SUBTLEX word frequency lists
  • Twitter: Messages sampled from Twitter's public stream
  • Wikipedia: The full text of Wikipedia in 2015

The following 14 languages are well-supported, with reasonable tokenization and at least 3 different sources of word frequencies:

Language    Code    GBooks  SUBTLEX LeedsIC OpenSub Twitter Wikipedia
──────────────────┼──────────────────────────────────────────────────
Arabic      ar    │ -       -       Yes     Yes     Yes     Yes
German      de    │ -       Yes     Yes     Yes     Yes[1]  Yes
Greek       el    │ -       -       Yes     Yes     Yes     Yes
English     en    │ Yes     Yes     Yes     Yes     Yes     Yes
Spanish     es    │ -       -       Yes     Yes     Yes     Yes
French      fr    │ -       -       Yes     Yes     Yes     Yes
Indonesian  id    │ -       -       -       Yes     Yes     Yes
Italian     it    │ -       -       Yes     Yes     Yes     Yes
Japanese    ja    │ -       -       Yes     -       Yes     Yes
Malay       ms    │ -       -       -       Yes     Yes     Yes
Dutch       nl    │ -       Yes     -       Yes     Yes     Yes
Portuguese  pt    │ -       -       Yes     Yes     Yes     Yes
Russian     ru    │ -       -       Yes     Yes     Yes     Yes
Turkish     tr    │ -       -       -       Yes     Yes     Yes

The following languages are only marginally supported so far. We have too few data sources for Korean (feel free to suggest some), and we lack tokenization support for Chinese.

Language    Code    GBooks  SUBTLEX LeedsIC OpenSub Twitter Wikipedia
──────────────────┼──────────────────────────────────────────────────
Korean      ko    │ -       -       -       -       Yes     Yes
Chinese     zh    │ -       Yes     Yes     Yes     -       -

[1] We've counted the frequencies from tweets in German, such as they are, but you should be aware that German is not a frequently-used language on Twitter. Germans just don't tweet that much.

Tokenization

wordfreq uses the Python package regex, which is a more advanced implementation of regular expressions than the standard library, to separate text into tokens that can be counted consistently. regex produces tokens that follow the recommendations in Unicode Annex #29, Text Segmentation.

There are language-specific exceptions:

  • In Arabic, it additionally normalizes ligatures and removes combining marks.
  • In Japanese, instead of using the regex library, it uses the external library mecab-python3. This is an optional dependency of wordfreq, and compiling it requires the libmecab-dev system package to be installed.
  • It does not yet attempt to tokenize Chinese ideograms.
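As a rough illustration of the Arabic exception, the sketch below normalizes ligatures and strips combining marks using only the standard library. It is an assumption, not wordfreq's actual code: the function name is hypothetical, NFKC is one way to expand presentation-form ligatures, and removing every character in Unicode category Mn may be broader or narrower than what wordfreq really removes.

```python
import unicodedata


def normalize_arabic_sketch(text):
    """Expand ligatures with NFKC, then drop nonspacing combining marks.

    Hypothetical helper for illustration; wordfreq's own rules may differ.
    """
    expanded = unicodedata.normalize('NFKC', text)
    return ''.join(ch for ch in expanded if unicodedata.category(ch) != 'Mn')


# U+FEFB is the lam-alef ligature; NFKC expands it to lam (U+0644) + alef (U+0627).
ligature = '\ufefb'
# kaf, ta, ba, each followed by the vowel mark fatha (U+064E).
vocalized = '\u0643\u064e\u062a\u064e\u0628\u064e'

print(normalize_arabic_sketch(ligature) == '\u0644\u0627')
print(normalize_arabic_sketch(vocalized) == '\u0643\u062a\u0628')
```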

When wordfreq's frequency lists are built in the first place, the words are tokenized in this same way, using the tokenize function.

Because tokenization in the real world is far from consistent, wordfreq will also try to deal gracefully with queries that actually break into multiple tokens:

>>> word_frequency('New York', 'en')
0.0002632772081925718

The word frequencies are combined with the half-harmonic-mean function in order to provide an estimate of what their combined frequency would be.

This implicitly assumes that you're asking about words that frequently appear together. It's not multiplying the frequencies, because that would assume they are statistically unrelated. So if you give it an uncommon combination of tokens, it will hugely over-estimate their frequency:

>>> word_frequency('owl-flavored', 'en')
1.3557098723512335e-06
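The half-harmonic-mean combination described above can be sketched in a few lines. For two frequencies a and b, half of the harmonic mean 2ab/(a+b) is ab/(a+b). The generalization to more than two tokens via summed reciprocals, the function names, and the toy frequencies are assumptions made for illustration.

```python
def half_harmonic_mean(a, b):
    """Half of the harmonic mean of two positive frequencies: a*b / (a + b)."""
    return (a * b) / (a + b)


def combine_token_frequencies(freqs):
    """Assumed generalization to any number of tokens: 1/f = 1/f1 + 1/f2 + ..."""
    return 1.0 / sum(1.0 / f for f in freqs)


# Toy per-token frequencies, made up for illustration (not real wordfreq data).
f_first, f_second = 1e-3, 4e-4

combined = half_harmonic_mean(f_first, f_second)
product = f_first * f_second

# The combined estimate is lower than either token's frequency,
# but far higher than the product you would get by assuming independence.
print(combined < min(f_first, f_second))
print(combined > product)
```

Because the result stays within a factor of two of the rarer token's frequency, a rare word paired with a common one (as in 'owl-flavored') still gets an estimate close to the rare word alone, which is why unusual combinations are over-estimated.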

License

wordfreq is freely redistributable under the MIT license (see MIT-LICENSE.txt), and it includes data files that may be redistributed under a Creative Commons Attribution-ShareAlike 4.0 license (https://creativecommons.org/licenses/by-sa/4.0/).

wordfreq contains data extracted from Google Books Ngrams (http://books.google.com/ngrams) and Google Books Syntactic Ngrams (http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html). The terms of use of this data are:

Ngram Viewer graphs and data may be freely used for any purpose, although
acknowledgement of Google Books Ngram Viewer as the source, and inclusion
of a link to http://books.google.com/ngrams, would be appreciated.

It also contains data derived from the following Creative Commons-licensed sources:

It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK, and SUBTLEX-CH, created by Marc Brysbaert et al. and available at http://crr.ugent.be/programs-data/subtitle-frequencies. SUBTLEX was first published in this paper:

I (Rob Speer) have obtained permission by e-mail from Marc Brysbaert to distribute these wordlists in wordfreq, to be used for any purpose, not just for academic use, under these conditions:

  • Wordfreq and code derived from it must credit the SUBTLEX authors.
  • It must remain clear that SUBTLEX is freely available data.

These terms are similar to the Creative Commons Attribution-ShareAlike license.

Some additional data was collected by a custom application that watches the streaming Twitter API, in accordance with Twitter's Developer Agreement & Policy. This software gives statistics about words that are commonly used on Twitter; it does not display or republish any Twitter content.

Citations to work that wordfreq is built on