mirror of
https://github.com/rspeer/wordfreq.git
synced 2024-12-27 10:58:52 +00:00
af5f65b328
So far, this wordlist is only in Dutch.

* multilingual.csv
* nl-combined-201503.csv
* README.txt
* twitter-52M.csv
* twitter-stems-2014-nl.csv
* twitter-stems-2014.csv
* twitter-surfaces-2014-nl.csv
* twitter-surfaces-2014.csv
This directory contains two wordlists we've put together at Luminoso for our own purposes. You might find them useful as well.

* `twitter-52M` collects the unigram word frequencies from 52 million tweets. The words are not distinguished by language.

* `multi` combines various sources of data in different languages, including:

  * Google Books, for English
  * A smaller corpus of tweets that supposedly come from English speakers (there's still a lot of non-English text in there)
  * the Leeds corpora for various languages (see `../leeds/README.txt`)

We would like to release the tools that built `twitter-52M` as soon as they are less sloppy. `multi` is a dataset that is mainly relevant because it's the data we happen to already be using, but you might find it useful as well.
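If you want to poke at one of these lists, something like the following Python sketch may be a starting point. It turns raw unigram counts into relative frequencies. Note that the two-column `word,count` layout is an assumption, not something this README documents, and `read_wordlist` and the sample rows are made up for illustration. Check the actual columns of the CSV you are reading before relying on it.

```python
import csv
import io


def read_wordlist(lines):
    """Return {word: relative frequency} from rows of `word,count`.

    ASSUMPTION: each row has exactly two columns, a word and an integer
    count. The real files in this directory may be laid out differently.
    """
    counts = {}
    for row in csv.reader(lines):
        if len(row) != 2:
            continue  # skip rows that do not match the assumed layout
        word, count = row
        try:
            counts[word] = counts.get(word, 0) + int(count)
        except ValueError:
            continue  # skip a header row or any non-integer count
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


# Made-up sample rows standing in for a real wordlist file.
sample = io.StringIO("the,50\nde,30\nhet,20\n")
freqs = read_wordlist(sample)
```

To try it on a real file, pass an open file object instead of the sample, for example `open(path, encoding="utf-8", newline="")`, where `path` points at one of the CSVs listed above.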