33 Commits

Author SHA1 Message Date
Rob Speer
e6a8f028e3 Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5.  Things done in this update include:

* include Common Crawl; support 11 more languages

* new frequency-merging strategy

* New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list

* Remove low-quality sources, i.e. Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and in Greek not being a 'large' language after all.

* Add Korean tokenization, and include MeCab files in data

* Remove marks from more languages

* Deal with commas and cedillas in Turkish and Romanian
2016-07-28 19:23:17 -04:00
Rob Speer
fec6eddcc3 Tokenization in Korean, plus abjad languages (#38)
* Remove marks from more languages

* Add Korean tokenization, and include MeCab files in data

* add a Hebrew tokenization test

* fix terminology in docstrings about abjad scripts

* combine Japanese and Korean tokenization into the same function
2016-07-15 15:10:25 -04:00
Rob Speer
dcb77a552b fix to README: we're only using Reddit in English 2016-05-11 15:38:29 -04:00
Rob Speer
697842b3f9 fix table showing marginal Korean support 2016-03-30 15:11:13 -04:00
Rob Speer
ed32b278cc make an example clearer with wordlist='large' 2016-03-30 15:08:32 -04:00
Rob Speer
a10c1d7ac0 update wordlists for new builder settings 2016-03-28 12:26:47 -04:00
Rob Speer
d79ee37da9 Add and document large wordlists 2016-01-22 16:23:43 -05:00
Rob Speer
1793c1bb2e Merge branch 'master' into chinese-external-wordlist
Conflicts:
	wordfreq/chinese.py
2015-09-28 14:34:59 -04:00
Rob Speer
44b0c4f9ba Fix documentation and clean up, based on Sep 25 code review 2015-09-28 12:58:46 -04:00
Rob Speer
b460eef444 describe optional dependencies better in the README 2015-09-24 17:54:52 -04:00
Rob Speer
5b918e7bb0 fix README conflict 2015-09-22 14:23:55 -04:00
Rob Speer
3cb3061e06 Merge branch 'greek-and-turkish' into chinese-and-more
Conflicts:
	README.md
	wordfreq_builder/wordfreq_builder/ninja.py
2015-09-10 15:27:33 -04:00
Rob Speer
5c8c36f4e3 Lower the frequency of phrases with inferred token boundaries 2015-09-10 14:16:22 -04:00
Rob Speer
354555514f fixes based on code review notes 2015-09-09 13:10:18 -04:00
Rob Speer
6502f15e9b fix SUBTLEX citations 2015-09-08 17:45:25 -04:00
Rob Speer
d9c44d5fcc take out OpenSubtitles for Chinese 2015-09-08 17:25:05 -04:00
Rob Speer
d576e3294b update the README for Chinese 2015-09-05 03:42:54 -04:00
Rob Speer
7906a671ea WIP: Traditional Chinese 2015-09-04 18:52:37 -04:00
Rob Speer
3c3371a9ff add Polish and Swedish to README 2015-09-04 17:10:40 -04:00
Rob Speer
8196643509 add more citations 2015-09-04 15:57:40 -04:00
Rob Speer
77c60c29b0 Use SUBTLEX for German, but OpenSubtitles for Greek
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.
2015-09-04 15:52:21 -04:00
Rob Speer
81bbe663fb update README with additional SUBTLEX support 2015-09-04 13:23:33 -04:00
Rob Speer
d9a1c34d00 expand list of sources and supported languages 2015-09-04 01:03:36 -04:00
Rob Speer
d94428d454 support Turkish and more Greek; document more 2015-09-04 00:57:04 -04:00
Rob Speer
e6a2886a66 add SUBTLEX to the readme 2015-09-03 18:56:56 -04:00
Rob Speer
00a2812907 fix heading 2015-08-28 17:49:38 -04:00
Rob Speer
93f44683c5 fix list formatting 2015-08-28 17:49:07 -04:00
Rob Speer
2370287539 improve README with function documentation and examples 2015-08-28 17:45:50 -04:00
Rob Speer
573dd1ec79 update the README 2015-08-25 17:44:34 -04:00
Joshua Chin
b0a9a2980f no use for use 2015-07-17 14:46:40 -04:00
Andrew Lin
9f8464c2d1 Document the version of Unicode used to build the regexes. 2015-07-08 18:48:33 -04:00
Rob Speer
0f4ca80026 add installation instructions to the readme 2015-05-28 14:02:12 -04:00
Rob Speer
611a6a35de update Japanese data; test Japanese and token combining 2015-05-28 14:01:56 -04:00