Access a database of word frequencies, in various natural languages.

Tools for working with word frequencies from various corpora.

Author: Rob Speer

Installation

wordfreq requires Python 3 and depends on a few other Python modules (msgpack-python, langcodes, and ftfy). You can install it and its dependencies in the usual way, either by getting it from pip:

pip3 install wordfreq

or by getting the repository and running its setup.py:

python3 setup.py install

To handle word frequency lookups in Japanese, you also need to install mecab-python3, which itself depends on libmecab-dev. These commands will install them on Ubuntu:

sudo apt-get install mecab-ipadic-utf8 libmecab-dev
pip3 install mecab-python3
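After installing, it can be useful to confirm that the optional Japanese dependency is actually importable. The following is a minimal sketch, not part of wordfreq itself: the helper names `has_module` and `japanese_support_available` are hypothetical, and it relies only on the fact that mecab-python3 installs a module named `MeCab`.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


def japanese_support_available():
    """Hypothetical helper: mecab-python3 provides the 'MeCab' module."""
    return has_module('MeCab')


if __name__ == '__main__':
    if japanese_support_available():
        print('MeCab is importable; Japanese lookups should be possible.')
    else:
        print('MeCab not found; install mecab-python3 for Japanese support.')
```

This only checks that the Python module can be located. It does not verify that libmecab or a dictionary such as mecab-ipadic-utf8 is installed correctly.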

Unicode data

The tokenizers used to split non-Japanese phrases use regexes built using the unicodedata module from Python 3.4, which uses Unicode version 6.3.0. To update these regexes, run scripts/gen_regex.py.
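To illustrate why the Unicode version matters, here is a rough sketch of how a regex character class can be derived from `unicodedata`. It is not the contents of scripts/gen_regex.py: the function names, the `max_codepoint` limit, and the choice of the "M" (combining mark) categories are assumptions made for the example. The resulting pattern depends on the Unicode tables of the interpreter that runs it, which `unicodedata.unidata_version` reports.

```python
import re
import unicodedata


def category_ranges(prefix, max_codepoint=0x3000):
    """Collect (start, end) codepoint ranges whose general category
    starts with `prefix`, scanning from 0 up to `max_codepoint`."""
    ranges = []
    start = None
    for cp in range(max_codepoint + 1):
        matches = unicodedata.category(chr(cp)).startswith(prefix)
        if matches and start is None:
            start = cp
        elif not matches and start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, max_codepoint))
    return ranges


def ranges_to_char_class(ranges):
    """Turn codepoint ranges into a regex character class, writing each
    codepoint as a \\UXXXXXXXX escape so no character needs quoting."""
    parts = []
    for lo, hi in ranges:
        if lo == hi:
            parts.append('\\U%08x' % lo)
        else:
            parts.append('\\U%08x-\\U%08x' % (lo, hi))
    return '[' + ''.join(parts) + ']'


if __name__ == '__main__':
    # The Unicode version of the running interpreter's tables.
    print('Unicode version:', unicodedata.unidata_version)
    mark_re = re.compile(ranges_to_char_class(category_ranges('M')))
    # Strip a combining acute accent (U+0301) from a decomposed "e".
    print(mark_re.sub('', 'e\u0301'))
```

Because the ranges are computed from whatever tables the interpreter ships, running a sketch like this under a newer Python would generally produce a different character class. That is the reason for recording the Unicode version alongside generated regexes.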

License

wordfreq is freely redistributable under the MIT license (see MIT-LICENSE.txt), and it includes data files that may be redistributed under a Creative Commons Attribution-ShareAlike 4.0 license (https://creativecommons.org/licenses/by-sa/4.0/).

wordfreq contains data extracted from Google Books Ngrams (http://books.google.com/ngrams) and Google Books Syntactic Ngrams (http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html). The terms of use of this data are:

Ngram Viewer graphs and data may be freely used for any purpose, although
acknowledgement of Google Books Ngram Viewer as the source, and inclusion
of a link to http://books.google.com/ngrams, would be appreciated.

It also contains data derived from the following Creative Commons-licensed sources:

Some additional data was collected by a custom application that watches the streaming Twitter API, in accordance with Twitter's Developer Agreement & Policy. This software only gives statistics about words that are very commonly used on Twitter; it does not display or republish any Twitter content.