Add wordfreq_data files.

Now the build process is repeatable from scratch, even if something goes
wrong with the download server.


Former-commit-id: 26c0d7dd28
Robyn Speer 2013-10-31 13:39:02 -04:00
parent 101e767ad9
commit 9163a67a9f
19 changed files with 595,110 additions and 1 deletion

.gitignore

@@ -6,5 +6,4 @@ dist
 pip-log.txt
 .coverage
 *~
-wordfreq_data/
 wordfreq-data.tar.gz


@@ -0,0 +1,10 @@
This data was compiled from the Google Books Ngram Viewer data, particularly
the 2012 English dataset.
The data is available from https://books.google.com/ngrams. The terms of use of
this data are:
"Ngram Viewer graphs and data may be freely used for any purpose, although
acknowledgement of Google Books Ngram Viewer as the source, and inclusion of a
link to http://books.google.com/ngrams, would be appreciated."


@@ -0,0 +1 @@
48b238cc5b3d359d0e8ac48f6321aca27c1ec098
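
This one-line file appears to contain a 40-character SHA-1-style hash, presumably standing in for a large data file that isn't stored in the diff. Assuming it is a plain SHA-1 checksum of the corresponding data file (an assumption; the page doesn't show either file's name, and `data.bin`/`data.sha1` below are hypothetical), a minimal verification sketch:

```python
# Minimal sketch: verify a data file against a SHA-1 stored in a sidecar
# file. Both file names are hypothetical; the commit page doesn't show them.
import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the hex SHA-1 of a file, reading it in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected = open("data.sha1").read().strip()   # e.g. "48b238cc5b3d..."
assert sha1_of_file("data.bin") == expected, "checksum mismatch"
```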


@@ -0,0 +1,5 @@
These wordlists come from the University of Leeds Centre for Translation
Studies, and are provided for free under a Creative Commons Attribution
license.
For more information, see: http://corpus.leeds.ac.uk/list.html

File diff suppressed because it is too large (12 files)


@@ -0,0 +1,17 @@
This directory contains two wordlists we've put together at Luminoso for our
own purposes. You might find them useful as well.

* `twitter-52M` collects the unigram word frequencies from 52 million tweets.
  The words are not distinguished by language.
* `multi` combines several sources of data in different languages, including:
  * Google Books, for English
  * a smaller corpus of tweets that supposedly come from English speakers
    (there's still a lot of non-English text in there)
  * the Leeds corpora for various languages (see `../leeds/README.txt`)

We would like to release the tools that built `twitter-52M` as soon as they
are less sloppy. `multi` is mainly relevant because it's the data we were
already using.
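
Since the README says the tools behind `twitter-52M` aren't released yet, here is, purely as an illustration, a minimal sketch of unigram frequency counting over a corpus with one tweet per line. The function name, the input file `tweets.txt`, and the crude lowercase tokenizer are all hypothetical; this is not the pipeline that actually built `twitter-52M`.

```python
# Hypothetical sketch of unigram frequency counting, one tweet per line.
# This is NOT the tool that built `twitter-52M`; the tokenization here is
# a simple lowercase word match, certainly cruder than the original.
import re
from collections import Counter

TOKEN_RE = re.compile(r"[\w']+")

def count_unigrams(path):
    """Return a Counter of lowercase word frequencies in the given file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for tweet in f:
            counts.update(TOKEN_RE.findall(tweet.lower()))
    return counts

if __name__ == "__main__":
    for word, freq in count_unigrams("tweets.txt").most_common(10):
        print(word, freq)
```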


@@ -0,0 +1 @@
f24577ba6807c884bca4464a8624beda68d8df79


@@ -0,0 +1 @@
4c5a66db8a4190a173814a4d7b31b925c5b131d1