wordfreq/tests
Rob Speer 9758c69ff0 Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5. Changes in this update:

* Include Common Crawl; support 11 more languages

* New frequency-merging strategy

* New sources: Chinese from Wikipedia (mostly Traditional), Dutch big list

* Remove low-quality sources: Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is dropped as an available language, and Greek does not qualify as a 'large' language after all.

* Add Korean tokenization, and include MeCab files in data

* Remove marks from more languages

* Deal with commas and cedillas in Turkish and Romanian

Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
test_chinese.py   test_chinese: fix typo in comment                                         2015-09-24 13:41:11 -04:00
test_japanese.py  Revert a small syntax change introduced by a circular series of changes.  2015-09-24 13:24:11 -04:00
test_korean.py    Tokenization in Korean, plus abjad languages (#38)                        2016-07-15 15:10:25 -04:00
test.py           Add Common Crawl data and more languages (#39)                            2016-07-28 19:23:17 -04:00