diff --git a/README.md b/README.md
index 1d57786..4191f7c 100644
--- a/README.md
+++ b/README.md
@@ -167,6 +167,7 @@ The sources (and the abbreviations we'll use for them) are:
 - **OpenSub**: Data derived from OpenSubtitles but not from SUBTLEX
 - **Twitter**: Messages sampled from Twitter's public stream
 - **Wpedia**: The full text of Wikipedia in 2015
+- **Reddit**: The corpus of Reddit comments through May 2015
 - **Other**: We get additional English frequencies from Google Books Syntactic
   Ngrams 2013, and Chinese frequencies from the frequency dictionary that
   comes with the Jieba tokenizer.
@@ -174,35 +175,37 @@ The sources (and the abbreviations we'll use for them) are:
 The following 17 languages are well-supported, with reasonable tokenization and
 at least 3 different sources of word frequencies:
 
-    Language    Code    SUBTLEX OpenSub LeedsIC Twitter Wpedia  Other
+    Language    Code    SUBTLEX OpenSub LeedsIC Twitter Wpedia  Reddit  Other
     ──────────────────┼─────────────────────────────────────────────────────
-    Arabic      ar    │ -       Yes     Yes     Yes     Yes     -
-    German      de    │ Yes     -       Yes     Yes[1]  Yes     -
-    Greek       el    │ -       Yes     Yes     Yes     Yes     -
-    English     en    │ Yes     Yes     Yes     Yes     Yes     Google Books
-    Spanish     es    │ -       Yes     Yes     Yes     Yes     -
-    French      fr    │ -       Yes     Yes     Yes     Yes     -
-    Indonesian  id    │ -       Yes     -       Yes     Yes     -
-    Italian     it    │ -       Yes     Yes     Yes     Yes     -
-    Japanese    ja    │ -       -       Yes     Yes     Yes     -
-    Malay       ms    │ -       Yes     -       Yes     Yes     -
-    Dutch       nl    │ Yes     Yes     -       Yes     Yes     -
-    Polish      pl    │ -       Yes     -       Yes     Yes     -
-    Portuguese  pt    │ -       Yes     Yes     Yes     Yes     -
-    Russian     ru    │ -       Yes     Yes     Yes     Yes     -
-    Swedish     sv    │ -       Yes     -       Yes     Yes     -
-    Turkish     tr    │ -       Yes     -       Yes     Yes     -
-    Chinese     zh    │ Yes     -       Yes     -       -       Jieba
+    Arabic      ar    │ -       Yes     Yes     Yes     Yes     -       -
+    German      de    │ Yes     -       Yes     Yes[1]  Yes     Yes     -
+    Greek       el    │ -       Yes     Yes     Yes     Yes     -       -
+    English     en    │ Yes     Yes     Yes     Yes     Yes     Yes     Google Books
+    Spanish     es    │ -       Yes     Yes     Yes     Yes     Yes     -
+    French      fr    │ -       Yes     Yes     Yes     Yes     -       -
+    Indonesian  id    │ -       Yes     -       Yes     Yes     -       -
+    Italian     it    │ -       Yes     Yes     Yes     Yes     -       -
+    Japanese    ja    │ -       -       Yes     Yes     Yes     -       -
+    Malay       ms    │ -       Yes     -       Yes     Yes     -       -
+    Dutch       nl    │ Yes     Yes     -       Yes     Yes     -       -
+    Polish      pl    │ -       Yes     -       Yes     Yes     -       -
+    Portuguese  pt    │ -       Yes     Yes     Yes     Yes     -       -
+    Russian     ru    │ -       Yes     Yes     Yes     Yes     -       -
+    Swedish     sv    │ -       Yes     -       Yes     Yes     Yes     -
+    Turkish     tr    │ -       Yes     -       Yes     Yes     -       -
+    Chinese     zh    │ Yes     -       Yes     -       -       -       Jieba
 
 
 Additionally, Korean is marginally supported. You can look up frequencies in
-it, but we have too few data sources for it so far:
+it, but it will be insufficiently tokenized into words, and we have too few
+data sources for it so far:
 
     Language    Code    SUBTLEX OpenSub LeedsIC Twitter Wpedia
     ──────────────────┼───────────────────────────────────────
     Korean      ko    │ -       -       -       Yes     Yes
 
-The 'large' wordlists are available in English, Spanish, French, and Portuguese.
+The 'large' wordlists are available in English, German, Spanish, French, and
+Portuguese.
 
 [1] We've counted the frequencies from tweets in German, such as they are, but
 you should be aware that German is not a frequently-used language on Twitter.
diff --git a/wordfreq/data/combined_ar.msgpack.gz b/wordfreq/data/combined_ar.msgpack.gz
index 1096472..001a31a 100644
Binary files a/wordfreq/data/combined_ar.msgpack.gz and b/wordfreq/data/combined_ar.msgpack.gz differ
diff --git a/wordfreq/data/combined_de.msgpack.gz b/wordfreq/data/combined_de.msgpack.gz
index 94af721..5e5f433 100644
Binary files a/wordfreq/data/combined_de.msgpack.gz and b/wordfreq/data/combined_de.msgpack.gz differ
diff --git a/wordfreq/data/combined_el.msgpack.gz b/wordfreq/data/combined_el.msgpack.gz
index 856abc1..742b104 100644
Binary files a/wordfreq/data/combined_el.msgpack.gz and b/wordfreq/data/combined_el.msgpack.gz differ
diff --git a/wordfreq/data/combined_en.msgpack.gz b/wordfreq/data/combined_en.msgpack.gz
index bbd0cc4..0cf0cbf 100644
Binary files a/wordfreq/data/combined_en.msgpack.gz and b/wordfreq/data/combined_en.msgpack.gz differ
diff --git a/wordfreq/data/combined_es.msgpack.gz b/wordfreq/data/combined_es.msgpack.gz
index 39f0eea..0f69a11 100644
Binary files a/wordfreq/data/combined_es.msgpack.gz and b/wordfreq/data/combined_es.msgpack.gz differ
diff --git a/wordfreq/data/combined_fr.msgpack.gz b/wordfreq/data/combined_fr.msgpack.gz
index 6faea92..2e489e1 100644
Binary files a/wordfreq/data/combined_fr.msgpack.gz and b/wordfreq/data/combined_fr.msgpack.gz differ
diff --git a/wordfreq/data/combined_id.msgpack.gz b/wordfreq/data/combined_id.msgpack.gz
index 9b33049..a0f1aaf 100644
Binary files a/wordfreq/data/combined_id.msgpack.gz and b/wordfreq/data/combined_id.msgpack.gz differ
diff --git a/wordfreq/data/combined_it.msgpack.gz b/wordfreq/data/combined_it.msgpack.gz
index 741f518..f5ba145 100644
Binary files a/wordfreq/data/combined_it.msgpack.gz and b/wordfreq/data/combined_it.msgpack.gz differ
diff --git a/wordfreq/data/combined_ja.msgpack.gz b/wordfreq/data/combined_ja.msgpack.gz
index f1c660d..9c21ccd 100644
Binary files a/wordfreq/data/combined_ja.msgpack.gz and b/wordfreq/data/combined_ja.msgpack.gz differ
diff --git a/wordfreq/data/combined_ko.msgpack.gz b/wordfreq/data/combined_ko.msgpack.gz
index 5dda29a..3d8f631 100644
Binary files a/wordfreq/data/combined_ko.msgpack.gz and b/wordfreq/data/combined_ko.msgpack.gz differ
diff --git a/wordfreq/data/combined_ms.msgpack.gz b/wordfreq/data/combined_ms.msgpack.gz
index d7f4ad7..8c35a94 100644
Binary files a/wordfreq/data/combined_ms.msgpack.gz and b/wordfreq/data/combined_ms.msgpack.gz differ
diff --git a/wordfreq/data/combined_nl.msgpack.gz b/wordfreq/data/combined_nl.msgpack.gz
index 48d681a..37dc82b 100644
Binary files a/wordfreq/data/combined_nl.msgpack.gz and b/wordfreq/data/combined_nl.msgpack.gz differ
diff --git a/wordfreq/data/combined_pl.msgpack.gz b/wordfreq/data/combined_pl.msgpack.gz
index 2d45b1a..1ccf2d3 100644
Binary files a/wordfreq/data/combined_pl.msgpack.gz and b/wordfreq/data/combined_pl.msgpack.gz differ
diff --git a/wordfreq/data/combined_pt.msgpack.gz b/wordfreq/data/combined_pt.msgpack.gz
index 7371866..0f98dc0 100644
Binary files a/wordfreq/data/combined_pt.msgpack.gz and b/wordfreq/data/combined_pt.msgpack.gz differ
diff --git a/wordfreq/data/combined_ru.msgpack.gz b/wordfreq/data/combined_ru.msgpack.gz
index 123eb54..b8f77ce 100644
Binary files a/wordfreq/data/combined_ru.msgpack.gz and b/wordfreq/data/combined_ru.msgpack.gz differ
diff --git a/wordfreq/data/combined_sv.msgpack.gz b/wordfreq/data/combined_sv.msgpack.gz
index 0cc1398..4aa7520 100644
Binary files a/wordfreq/data/combined_sv.msgpack.gz and b/wordfreq/data/combined_sv.msgpack.gz differ
diff --git a/wordfreq/data/combined_tr.msgpack.gz b/wordfreq/data/combined_tr.msgpack.gz
index 3f6063c..ffbeaa1 100644
Binary files a/wordfreq/data/combined_tr.msgpack.gz and b/wordfreq/data/combined_tr.msgpack.gz differ
diff --git a/wordfreq/data/combined_zh.msgpack.gz b/wordfreq/data/combined_zh.msgpack.gz
index 1205f84..4a783be 100644
Binary files a/wordfreq/data/combined_zh.msgpack.gz and b/wordfreq/data/combined_zh.msgpack.gz differ
diff --git a/wordfreq/data/large_en.msgpack.gz b/wordfreq/data/large_en.msgpack.gz
index ebb70f1..8a93bb9 100644
Binary files a/wordfreq/data/large_en.msgpack.gz and b/wordfreq/data/large_en.msgpack.gz differ
diff --git a/wordfreq/data/large_es.msgpack.gz b/wordfreq/data/large_es.msgpack.gz
index ea8e395..0fb0a11 100644
Binary files a/wordfreq/data/large_es.msgpack.gz and b/wordfreq/data/large_es.msgpack.gz differ
diff --git a/wordfreq/data/large_fr.msgpack.gz b/wordfreq/data/large_fr.msgpack.gz
index 8bae954..420ed0e 100644
Binary files a/wordfreq/data/large_fr.msgpack.gz and b/wordfreq/data/large_fr.msgpack.gz differ
diff --git a/wordfreq/data/large_pt.msgpack.gz b/wordfreq/data/large_pt.msgpack.gz
index 765c802..0517dd7 100644
Binary files a/wordfreq/data/large_pt.msgpack.gz and b/wordfreq/data/large_pt.msgpack.gz differ
diff --git a/wordfreq/data/twitter_ar.msgpack.gz b/wordfreq/data/twitter_ar.msgpack.gz
index d87307f..956a4a9 100644
Binary files a/wordfreq/data/twitter_ar.msgpack.gz and b/wordfreq/data/twitter_ar.msgpack.gz differ
diff --git a/wordfreq/data/twitter_de.msgpack.gz b/wordfreq/data/twitter_de.msgpack.gz
index 9422fb5..169ea53 100644
Binary files a/wordfreq/data/twitter_de.msgpack.gz and b/wordfreq/data/twitter_de.msgpack.gz differ
diff --git a/wordfreq/data/twitter_el.msgpack.gz b/wordfreq/data/twitter_el.msgpack.gz
index af1a0b1..7b8f654 100644
Binary files a/wordfreq/data/twitter_el.msgpack.gz and b/wordfreq/data/twitter_el.msgpack.gz differ
diff --git a/wordfreq/data/twitter_en.msgpack.gz b/wordfreq/data/twitter_en.msgpack.gz
index cb81d3d..331d651 100644
Binary files a/wordfreq/data/twitter_en.msgpack.gz and b/wordfreq/data/twitter_en.msgpack.gz differ
diff --git a/wordfreq/data/twitter_es.msgpack.gz b/wordfreq/data/twitter_es.msgpack.gz
index 9f80432..ecb90e6 100644
Binary files a/wordfreq/data/twitter_es.msgpack.gz and b/wordfreq/data/twitter_es.msgpack.gz differ
diff --git a/wordfreq/data/twitter_fr.msgpack.gz b/wordfreq/data/twitter_fr.msgpack.gz
index 05de393..534bf8f 100644
Binary files a/wordfreq/data/twitter_fr.msgpack.gz and b/wordfreq/data/twitter_fr.msgpack.gz differ
diff --git a/wordfreq/data/twitter_id.msgpack.gz b/wordfreq/data/twitter_id.msgpack.gz
index 579964a..e14513e 100644
Binary files a/wordfreq/data/twitter_id.msgpack.gz and b/wordfreq/data/twitter_id.msgpack.gz differ
diff --git a/wordfreq/data/twitter_it.msgpack.gz b/wordfreq/data/twitter_it.msgpack.gz
index 174235b..66939da 100644
Binary files a/wordfreq/data/twitter_it.msgpack.gz and b/wordfreq/data/twitter_it.msgpack.gz differ
diff --git a/wordfreq/data/twitter_ja.msgpack.gz b/wordfreq/data/twitter_ja.msgpack.gz
index 8f739f9..683a102 100644
Binary files a/wordfreq/data/twitter_ja.msgpack.gz and b/wordfreq/data/twitter_ja.msgpack.gz differ
diff --git a/wordfreq/data/twitter_ko.msgpack.gz b/wordfreq/data/twitter_ko.msgpack.gz
index 334a127..244448d 100644
Binary files a/wordfreq/data/twitter_ko.msgpack.gz and b/wordfreq/data/twitter_ko.msgpack.gz differ
diff --git a/wordfreq/data/twitter_ms.msgpack.gz b/wordfreq/data/twitter_ms.msgpack.gz
index 346bdaa..52a654f 100644
Binary files a/wordfreq/data/twitter_ms.msgpack.gz and b/wordfreq/data/twitter_ms.msgpack.gz differ
diff --git a/wordfreq/data/twitter_nl.msgpack.gz b/wordfreq/data/twitter_nl.msgpack.gz
index 7681324..95dec24 100644
Binary files a/wordfreq/data/twitter_nl.msgpack.gz and b/wordfreq/data/twitter_nl.msgpack.gz differ
diff --git a/wordfreq/data/twitter_pl.msgpack.gz b/wordfreq/data/twitter_pl.msgpack.gz
index 11b61eb..dfc59c0 100644
Binary files a/wordfreq/data/twitter_pl.msgpack.gz and b/wordfreq/data/twitter_pl.msgpack.gz differ
diff --git a/wordfreq/data/twitter_pt.msgpack.gz b/wordfreq/data/twitter_pt.msgpack.gz
index 0e845ab..c03bd18 100644
Binary files a/wordfreq/data/twitter_pt.msgpack.gz and b/wordfreq/data/twitter_pt.msgpack.gz differ
diff --git a/wordfreq/data/twitter_ru.msgpack.gz b/wordfreq/data/twitter_ru.msgpack.gz
index e426344..0cb120e 100644
Binary files a/wordfreq/data/twitter_ru.msgpack.gz and b/wordfreq/data/twitter_ru.msgpack.gz differ
diff --git a/wordfreq/data/twitter_sv.msgpack.gz b/wordfreq/data/twitter_sv.msgpack.gz
index ab1e956..cee9c39 100644
Binary files a/wordfreq/data/twitter_sv.msgpack.gz and b/wordfreq/data/twitter_sv.msgpack.gz differ
diff --git a/wordfreq/data/twitter_tr.msgpack.gz b/wordfreq/data/twitter_tr.msgpack.gz
index 28eefa6..5360927 100644
Binary files a/wordfreq/data/twitter_tr.msgpack.gz and b/wordfreq/data/twitter_tr.msgpack.gz differ