wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-26 10:28:52 +00:00

Rob Speer 873ace87db v0.7: make a proper Dutch 'surfaces' list		2015-04-30 13:01:24 -04:00
multilingual.csv	try to match the wordlist metanl actually uses	2013-10-31 15:13:22 -04:00
nl-combined-201503.csv	start a new multilingual wordlist called 'stems'	2015-03-31 15:59:30 -04:00
nl-combined-201504.csv	v0.7: make a proper Dutch 'surfaces' list	2015-04-30 13:01:24 -04:00
README.txt	Add wordfreq_data files.	2013-10-31 13:39:02 -04:00
twitter-52M.csv	Add wordfreq_data files.	2013-10-31 13:39:02 -04:00
twitter-stems-2014-nl.csv	start a new multilingual wordlist called 'stems'	2015-03-31 15:59:30 -04:00
twitter-stems-2014.csv	add twitter-stems-2014 wordlist data	2015-02-11 13:29:32 -05:00
twitter-surfaces-2014-nl.csv	start a new multilingual wordlist called 'stems'	2015-03-31 15:59:30 -04:00
twitter-surfaces-2014.csv	new Dutch data, bump version to 0.6	2015-03-03 15:54:45 -05:00

README.txt

This directory contains two wordlists we've put together at Luminoso for our
own purposes. You might find them useful as well.

* `twitter-52M` collects the unigram word frequencies from 52 million tweets.
  The words are not distinguished by language.

* `multi` combines various sources of data in different languages, including:

  * Google Books, for English
  * A smaller corpus of tweets that supposedly come from English speakers
    (there's still a lot of non-English text in there)
  * the Leeds corpora for various languages (see `../leeds/README.txt`)

We would like to release the tools that built `twitter-52M` once they are less
sloppy. `multi` is included mainly because it's the data we happen to already
be using.
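
For readers unfamiliar with the term, "unigram word frequencies" (as collected
in `twitter-52M`) are simply counts of how often each single token occurs
across a set of texts. The sketch below illustrates the idea only; it is not
the tool that built `twitter-52M`. The function name `count_unigrams`, the
regex tokenizer, and the sample texts are all assumptions made for this
example.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not the one used for twitter-52M):
# lowercase the text, then take runs of word characters and apostrophes.
TOKEN_RE = re.compile(r"[\w']+")


def count_unigrams(texts):
    """Return a Counter mapping each token to its number of occurrences."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_RE.findall(text.lower()))
    return counts


if __name__ == "__main__":
    # Made-up sample texts; no language filtering is applied, so words from
    # different languages end up in the same table.
    sample = [
        "Good morning Twitter",
        "good night twitter",
        "Goedemorgen allemaal",
    ]
    for word, count in count_unigrams(sample).most_common(3):
        print(word, count)
```

Because nothing here separates texts by language, the resulting counts mix
languages in one table, which matches the caveat above that the words in
`twitter-52M` are not distinguished by language.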