wordfreq/wordfreq_data/luminoso/README.txt
Rob Speer 9fd9028d3c Add wordfreq_data files.
Now the build process is repeatable from scratch, even if something goes
wrong with the download server.


Former-commit-id: 26c0d7dd28
2013-10-31 13:39:02 -04:00


This directory contains two wordlists we've put together at Luminoso for our
own purposes. You might find them useful as well.
* `twitter-52M` collects the unigram word frequencies from 52 million tweets.
  The words are not distinguished by language.
* `multi` combines various sources of data in different languages, including:
  * Google Books, for English
  * A smaller corpus of tweets that supposedly come from English speakers
    (there's still a lot of non-English text in there)
  * the Leeds corpora for various languages (see `../leeds/README.txt`)
We would like to release the tools that built `twitter-52M` once they have been
cleaned up. `multi` is included mainly because it's the data we already happen
to be using, though others may find it useful too.
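
For readers unfamiliar with the term, a unigram frequency list is simply a count
of how often each individual word occurs in a corpus. The sketch below shows the
general idea in Python. It is not the code that built `twitter-52M`, which has
not been released; the tokenizer pattern, the `count_unigrams` function, and the
sample texts are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not the one used for twitter-52M):
# runs of word characters and apostrophes, applied to lowercased text.
TOKEN_RE = re.compile(r"[\w']+")


def count_unigrams(texts):
    """Count how often each token appears across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_RE.findall(text.lower()))
    return counts


# Made-up sample input standing in for a corpus of tweets.
tweets = [
    "the cat sat on the mat",
    "The dog sat too",
]
counts = count_unigrams(tweets)

# Print the two most frequent words with their counts.
for word, n in counts.most_common(2):
    print(word, n)
```

Like `twitter-52M`, this sketch makes no attempt to distinguish languages: every
token goes into the same counter regardless of the language of the text.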