Access a database of word frequencies, in various natural languages.
Go to file
Rob Speer 5a1fc00aaa Strip apostrophes from edges of tokens
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.
2015-08-25 12:41:48 -04:00
scripts remove obsolete gen_regex.py 2015-08-24 17:11:18 -04:00
tests Use the regex implementation of Unicode segmentation 2015-08-24 17:11:08 -04:00
wordfreq Strip apostrophes from edges of tokens 2015-08-25 12:41:48 -04:00
wordfreq_builder Strip apostrophes from edges of tokens 2015-08-25 12:41:48 -04:00
.gitignore Add wordfreq_data files. 2013-10-31 13:39:02 -04:00
MANIFEST.in removes combining marks from arabic words instead of treating them as punctuation 2015-06-25 12:36:41 -04:00
MIT-LICENSE.txt Update the copyright year in the license 2015-06-18 18:55:59 -04:00
README.md no use for use 2015-07-17 14:46:40 -04:00
setup.py Use the regex implementation of Unicode segmentation 2015-08-24 17:11:08 -04:00

Tools for working with word frequencies from various corpora.

Author: Rob Speer

Installation

wordfreq requires Python 3 and depends on a few other Python modules (msgpack-python, langcodes, and ftfy). You can install it and its dependencies in the usual way, either by getting it from pip:

pip3 install wordfreq

or by getting the repository and running its setup.py:

python3 setup.py install

To handle word frequency lookups in Japanese, you need to additionally install mecab-python3, which itself depends on libmecab-dev. These commands will install them on Ubuntu:

sudo apt-get install mecab-ipadic-utf8 libmecab-dev
pip3 install mecab-python3

Unicode data

The tokenizers that split non-Japanese phrases utilize regexes built using the unicodedata module from Python 3.4, which supports Unicode version 6.3.0. To update these regexes, run scripts/gen_regex.py.

License

wordfreq is freely redistributable under the MIT license (see MIT-LICENSE.txt), and it includes data files that may be redistributed under a Creative Commons Attribution-ShareAlike 4.0 license (https://creativecommons.org/licenses/by-sa/4.0/).

wordfreq contains data extracted from Google Books Ngrams (http://books.google.com/ngrams) and Google Books Syntactic Ngrams (http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html). The terms of use of this data are:

Ngram Viewer graphs and data may be freely used for any purpose, although
acknowledgement of Google Books Ngram Viewer as the source, and inclusion
of a link to http://books.google.com/ngrams, would be appreciated.

It also contains data derived from the following Creative Commons-licensed sources:

Some additional data was collected by a custom application that watches the streaming Twitter API, in accordance with Twitter's Developer Agreement & Policy. This software only gives statistics about words that are very commonly used on Twitter; it does not display or republish any Twitter content.