Mirror of https://github.com/rspeer/wordfreq.git (synced 2024-12-23 17:31:41 +00:00)
Access a database of word frequencies, in various natural languages.
Tools for working with word frequencies from various corpora.

Author: Robyn Speer

## License

`wordfreq` is freely redistributable under the MIT license (see `MIT-LICENSE.txt`), and it includes data files that may be redistributed under a Creative Commons Attribution-ShareAlike 4.0 license (https://creativecommons.org/licenses/by-sa/4.0/).

`wordfreq` contains data extracted from Google Books Ngrams (http://books.google.com/ngrams) and Google Books Syntactic Ngrams (http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html). The terms of use of this data are:

Ngram Viewer graphs and data may be freely used for any purpose, although acknowledgement of Google Books Ngram Viewer as the source, and inclusion of a link to http://books.google.com/ngrams, would be appreciated.

It also contains data derived from the following Creative Commons-licensed sources:

- The Leeds Internet Corpus, from the University of Leeds Centre for Translation Studies (http://corpus.leeds.ac.uk/list.html)
- The OpenSubtitles Frequency Word Lists, by Invoke IT Limited (https://invokeit.wordpress.com/frequency-word-lists/)
- Wikipedia, the free encyclopedia (http://www.wikipedia.org)

Some additional data was collected by a custom application that watches the streaming Twitter API, in accordance with Twitter's Developer Agreement & Policy. This software only gives statistics about words that are very commonly used on Twitter; it does not display or republish any Twitter content.