Add wordfreq_data files.

Now the build process is repeatable from scratch, even if something goes
wrong with the download server.
Rob Speer 2013-10-31 13:39:02 -04:00
parent 2cf812a64e
commit 26c0d7dd28
19 changed files with 4152042 additions and 1 deletion

.gitignore

@@ -6,5 +6,4 @@ dist
 pip-log.txt
 .coverage
 *~
-wordfreq_data/
 wordfreq-data.tar.gz


@@ -0,0 +1,10 @@
This data was compiled from the Google Books Ngram Viewer data, particularly
the 2012 English dataset.
The data is available from https://books.google.com/ngrams. The terms of use of
this data are:
"Ngram Viewer graphs and data may be freely used for any purpose, although
acknowledgement of Google Books Ngram Viewer as the source, and inclusion of a
link to http://books.google.com/ngrams, would be appreciated."

File diff suppressed because it is too large


@@ -0,0 +1,5 @@
These wordlists come from the University of Leeds Centre for Translation
Studies, and are provided for free under a Creative Commons Attribution
license.
For more information, see: http://corpus.leeds.ac.uk/list.html

12 additional file diffs suppressed because they are too large


@@ -0,0 +1,17 @@
This directory contains two wordlists we've put together at Luminoso for our
own purposes. You might find them useful as well.
* `twitter-52M` collects the unigram word frequencies from 52 million tweets.
The words are not distinguished by language.
* `multi` combines various sources of data in different languages, including:
* Google Books, for English
* A smaller corpus of tweets that supposedly come from English speakers
(there's still a lot of non-English text in there)
* the Leeds corpora for various languages (see `../leeds/README.txt`)
We would like to release the tools that built `twitter-52M` once they are
less sloppy. `multi` is mainly here because it's the data we happen to already
be using, but it may be useful to others too.

File diff suppressed because it is too large

File diff suppressed because it is too large