Tools for working with word frequencies from various corpora.
Author: Rob Speer
## Installation
wordfreq requires Python 3 and depends on a few other Python modules
(msgpack-python, langcodes, and ftfy). You can install it and its dependencies
in the usual way, either by getting it from pip:
    pip3 install wordfreq
or by getting the repository and running its setup.py:
    python3 setup.py install
To handle word frequency lookups in Japanese, you will additionally need to
install mecab-python3, which itself depends on libmecab-dev. These commands
will install them on Ubuntu:
    sudo apt-get install mecab-ipadic-utf8 libmecab-dev
    pip3 install mecab-python3
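
Once wordfreq is installed, a quick sanity check from Python confirms that
lookups work. The snippet below is only a sketch: it assumes the
`word_frequency(word, lang)` lookup function that wordfreq provides, and the
exact numbers it prints depend on the wordlist data in your version.

    # Minimal sanity check: look up how common a couple of words are.
    # (Assumes word_frequency(word, lang); see wordfreq's own documentation
    # for the definitive API and the meaning of the returned values.)
    from wordfreq import word_frequency

    print(word_frequency('the', 'en'))     # a very common English word
    print(word_frequency('walrus', 'en'))  # a much rarer one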
## Unicode data
The tokenizers that split non-Japanese phrases rely on regexes built with the
`unicodedata` module from Python 3.4, which uses Unicode version 6.3.0. To
update these regexes, run `scripts/gen_regex.py`.
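
The general idea, sketched roughly below, is to scan codepoints with
`unicodedata.category` and compress the ones that match (for example,
everything whose category marks it as a letter) into the ranges of a regex
character class. This is only an illustration of the technique; the
authoritative code is `scripts/gen_regex.py`.

    import unicodedata

    def char_class(categories, max_codepoint=0xFFFF):
        """Build a regex character class covering every codepoint up to
        max_codepoint whose Unicode category starts with one of the given
        letters, e.g. {'L'} for all letters. Illustrative sketch only."""
        ranges = []
        start = None
        for cp in range(max_codepoint + 1):
            matched = unicodedata.category(chr(cp))[0] in categories
            if matched and start is None:
                start = cp
            elif not matched and start is not None:
                ranges.append((start, cp - 1))
                start = None
        if start is not None:
            ranges.append((start, max_codepoint))
        return '[' + ''.join('\\u%04x-\\u%04x' % r for r in ranges) + ']'

    # For example, a pattern matching runs of letters:
    #   import re
    #   letters = re.compile(char_class({'L'}) + '+')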
## License
`wordfreq` is freely redistributable under the MIT license (see
`MIT-LICENSE.txt`), and it includes data files that may be
redistributed under a Creative Commons Attribution-ShareAlike 4.0
license (https://creativecommons.org/licenses/by-sa/4.0/).
`wordfreq` contains data extracted from Google Books Ngrams
(http://books.google.com/ngrams) and Google Books Syntactic Ngrams
(http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html).
The terms of use of this data are:
    Ngram Viewer graphs and data may be freely used for any purpose, although
    acknowledgement of Google Books Ngram Viewer as the source, and inclusion
    of a link to http://books.google.com/ngrams, would be appreciated.
It also contains data derived from the following Creative Commons-licensed
sources:
- The Leeds Internet Corpus, from the University of Leeds Centre for Translation
  Studies (http://corpus.leeds.ac.uk/list.html)
- The OpenSubtitles Frequency Word Lists, by Invoke IT Limited
  (https://invokeit.wordpress.com/frequency-word-lists/)
- Wikipedia, the free encyclopedia (http://www.wikipedia.org)
Some additional data was collected by a custom application that watches the
streaming Twitter API, in accordance with Twitter's Developer Agreement &
Policy. This software only gives statistics about words that are very commonly
used on Twitter; it does not display or republish any Twitter content.