wordfreq/wordfreq_data/luminoso/README.txt

This directory contains two wordlists we've put together at Luminoso for our
own purposes. You might find them useful as well.

* `twitter-52M` collects the unigram word frequencies from 52 million tweets.
  The words are not distinguished by language.

* `multi` combines various sources of data in different languages, including:

  * Google Books, for English
  * A smaller corpus of tweets that supposedly come from English speakers
    (there's still a lot of non-English text in there)
  * the Leeds corpora for various languages (see `../leeds/README.txt`)
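
As a rough illustration of what "unigram word frequencies" and "combining sources" mean here, the following is a minimal Python sketch. It is not the code that built these wordlists (those tools are not released). The function names `count_unigrams` and `combine_frequencies`, the whitespace tokenization, and the equal weighting of sources are all assumptions made for the example.

```python
from collections import Counter


def count_unigrams(texts):
    """Count single-token (unigram) occurrences across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # Assumption: lowercase and split on whitespace. Real tweet text
        # would need a tokenizer that handles punctuation, URLs, and so on.
        counts.update(text.lower().split())
    return counts


def combine_frequencies(sources):
    """Average the relative frequencies of several Counters.

    Assumption: every source gets equal weight. This is a hypothetical
    scheme and may differ from how `multi` is actually assembled.
    """
    combined = Counter()
    for counts in sources:
        total = sum(counts.values())
        if total == 0:
            continue  # an empty source contributes nothing
        for word, n in counts.items():
            combined[word] += (n / total) / len(sources)
    return combined


if __name__ == "__main__":
    tweets = count_unigrams(["the cat sat", "The cat ran"])
    books = count_unigrams(["the dog sat on the mat"])
    merged = combine_frequencies([tweets, books])
    for word, freq in merged.most_common(3):
        print(word, round(freq, 3))
```

Normalizing each source to relative frequencies before merging keeps a very large corpus from drowning out a small one, which matters when the sources differ in size as much as a book corpus and a tweet sample do.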

We would like to release the tools that built `twitter-52M` as soon as they are
less sloppy. `multi` is mainly relevant because it's the data we happen to
already be using, but you might find it useful as well.

Add wordfreq_data files. Now the build process is repeatable from scratch, even if something goes wrong with the download server. 2013-10-31 17:39:02 +00:00