mirror of
https://github.com/rspeer/wordfreq.git
synced 2024-12-28 19:38:51 +00:00
18 lines
795 B
Plaintext
This directory contains two wordlists we've put together at Luminoso for our
own purposes. You might find them useful as well.

* `twitter-52M` collects the unigram word frequencies from 52 million tweets.
  The words are not distinguished by language.

* `multi` combines various sources of data in different languages, including:

  * Google Books, for English
  * A smaller corpus of tweets that supposedly come from English speakers
    (there's still a lot of non-English text in there)
  * The Leeds corpora for various languages (see `../leeds/README.txt`)

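For readers unfamiliar with the term, here is a minimal sketch of what
"unigram word frequencies" means. It is not the code that built
`twitter-52M` (those tools are not released here). The tokenizer pattern,
the function names, and the sample strings are all assumptions made up for
illustration.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not the one used for these lists):
# runs of letters/digits, optionally joined by apostrophes, e.g. "don't".
TOKEN_RE = re.compile(r"[^\W_]+(?:'[^\W_]+)*")


def unigram_counts(texts):
    """Count how often each lowercased token occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_RE.findall(text.lower()))
    return counts


def relative_frequencies(counts):
    """Turn raw counts into proportions of the total token count."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


# Made-up sample input. Languages are mixed together on purpose, since
# the words are simply counted without being distinguished by language.
sample = ["the cat sat", "The cat ran", "le chat"]
counts = unigram_counts(sample)
freqs = relative_frequencies(counts)
```

Because nothing in this sketch looks at language, "le" and "chat" are counted
alongside the English words.
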
We would like to release the tools that built `twitter-52M` once they have been
cleaned up. `multi` is included mainly because it is the data we already use,
but you might find it useful as well.