wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Access a database of word frequencies, in various natural languages.

Andrew Lin e6d9b36203 Merge pull request #22 from LuminosoInsight/standard-tokenizer Use a more standard Unicode tokenizer		2015-08-27 11:56:19 -04:00
scripts	remove obsolete gen_regex.py	2015-08-24 17:11:18 -04:00
tests	Use the regex implementation of Unicode segmentation	2015-08-24 17:11:08 -04:00
wordfreq	update data files	2015-08-27 03:58:54 -04:00
wordfreq_builder	fix URL expression	2015-08-26 15:00:46 -04:00
.gitignore	Add wordfreq_data files.	2013-10-31 13:39:02 -04:00
MANIFEST.in	removes combining marks from arabic words instead of treating them as punctuation	2015-06-25 12:36:41 -04:00
MIT-LICENSE.txt	Update the copyright year in the license	2015-06-18 18:55:59 -04:00
README.md	update the README	2015-08-25 17:44:34 -04:00
setup.py	bump to version 1.1	2015-08-25 17:44:52 -04:00

README.md

Tools for working with word frequencies from various corpora.

Author: Rob Speer

Installation

wordfreq requires Python 3 and depends on a few other Python modules (msgpack-python, langcodes, and ftfy). You can install it and its dependencies in the usual way, either by getting it from pip:

pip3 install wordfreq

or by getting the repository and running its setup.py:

python3 setup.py install

To handle word frequency lookups in Japanese, you also need to install mecab-python3, which in turn depends on the libmecab-dev system package. These commands will install them on Ubuntu:

sudo apt-get install mecab-ipadic-utf8 libmecab-dev
pip3 install mecab-python3
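
Because mecab-python3 is optional, code that relies on Japanese lookups may want to check for it before use. The following is a minimal sketch, assuming only that mecab-python3 installs a module importable as MeCab. The helper name japanese_support_available is hypothetical and is not part of wordfreq:

```python
import importlib.util


def japanese_support_available():
    """Return True if the MeCab module (from mecab-python3) can be found."""
    # find_spec returns None when no top-level module of that name is installed.
    return importlib.util.find_spec("MeCab") is not None


if not japanese_support_available():
    print("Japanese tokenization unavailable: install mecab-python3")
```

The check only looks the module up and does not import it, so it cannot tell whether the underlying libmecab library actually loads.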

Tokenization

wordfreq uses the Python package regex, which is a more advanced implementation of regular expressions than the standard library's re module, to separate text into tokens that can be counted consistently. regex produces tokens that follow the recommendations in Unicode Standard Annex #29, Unicode Text Segmentation.

There are language-specific exceptions:

In Arabic, wordfreq additionally normalizes ligatures and removes combining marks.
In Japanese, instead of using the regex library, wordfreq uses the external library mecab-python3. This is an optional dependency of wordfreq, and compiling it requires the libmecab-dev system package to be installed.
wordfreq does not yet attempt to tokenize Chinese ideograms.
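
The Arabic exception can be illustrated with the standard library alone. This is a sketch of the general technique, not wordfreq's actual implementation: it assumes that NFKC normalization is an acceptable way to expand ligatures and that "combining marks" means characters in Unicode category Mn. The function name normalize_arabic_sketch is hypothetical:

```python
import unicodedata


def normalize_arabic_sketch(text):
    """Expand ligatures and strip nonspacing combining marks."""
    # NFKC replaces compatibility characters, including Arabic
    # presentation-form ligatures, with their plain letter sequences.
    normalized = unicodedata.normalize("NFKC", text)
    # Drop nonspacing marks (category Mn), which includes Arabic
    # vowel marks such as fatha (U+064E).
    return "".join(
        ch for ch in normalized if unicodedata.category(ch) != "Mn"
    )


# The lam-alef ligature U+FEFB expands to lam (U+0644) + alef (U+0627).
ligature_result = normalize_arabic_sketch("\uFEFB")

# Kaf, teh, beh, each followed by fatha: the three fathas are removed.
vowelled_result = normalize_arabic_sketch(
    "\u0643\u064E\u062A\u064E\u0628\u064E"
)
```

With a normalization of this kind, vowelled and unvowelled spellings of a word are counted as the same token.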

License

wordfreq is freely redistributable under the MIT license (see MIT-LICENSE.txt), and it includes data files that may be redistributed under a Creative Commons Attribution-ShareAlike 4.0 license (https://creativecommons.org/licenses/by-sa/4.0/).

wordfreq contains data extracted from Google Books Ngrams (http://books.google.com/ngrams) and Google Books Syntactic Ngrams (http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html). The terms of use of this data are:

Ngram Viewer graphs and data may be freely used for any purpose, although
acknowledgement of Google Books Ngram Viewer as the source, and inclusion
of a link to http://books.google.com/ngrams, would be appreciated.

It also contains data derived from the following Creative Commons-licensed sources:

The Leeds Internet Corpus, from the University of Leeds Centre for Translation Studies (http://corpus.leeds.ac.uk/list.html)
The OpenSubtitles Frequency Word Lists, by Invoke IT Limited (https://invokeit.wordpress.com/frequency-word-lists/)
Wikipedia, the free encyclopedia (http://www.wikipedia.org)

Some additional data was collected by a custom application that watches the streaming Twitter API, in accordance with Twitter's Developer Agreement & Policy. This software gives statistics about words that are commonly used on Twitter; it does not display or republish any Twitter content.