wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-24 09:51:38 +00:00

Access a database of word frequencies, in various natural languages.

README.txt

Tools for working with word frequencies from various corpora.

Author: Rob Speer

## License

`wordfreq` is freely redistributable under the MIT license (see
`MIT-LICENSE`), and it includes data files that may be
redistributed under a Creative Commons Attribution-ShareAlike 4.0
license (https://creativecommons.org/licenses/by-sa/4.0/).

`wordfreq` contains data extracted from Google Books Ngrams
(http://books.google.com/ngrams) and Google Books Syntactic Ngrams
(http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html).
The terms of use of this data are:

    Ngram Viewer graphs and data may be freely used for any purpose, although
    acknowledgement of Google Books Ngram Viewer as the source, and inclusion
    of a link to http://books.google.com/ngrams, would be appreciated.

It also contains data derived from the following Creative Commons-licensed
sources:

- The Leeds Internet Corpus, from the University of Leeds Centre for Translation
  Studies (http://corpus.leeds.ac.uk/list.html)

- The OpenSubtitles Frequency Word Lists, by Invoke IT Limited
  (https://invokeit.wordpress.com/frequency-word-lists/)

- Wikipedia, the free encyclopedia (http://www.wikipedia.org)

Some additional data was collected by a custom application that watches the
streaming Twitter API, in accordance with Twitter's Developer Agreement &
Policy. This software only gives statistics about words that are very commonly
used on Twitter; it does not display or republish any Twitter content.