wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-26 10:28:52 +00:00

Author	SHA1	Message	Date
Rob Speer	e6a8f028e3	Add Common Crawl data and more languages (#39) This changes the version from 1.4.2 to 1.5. Things done in this update include: * include Common Crawl; support 11 more languages * new frequency-merging strategy * New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list * Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all. * Add Korean tokenization, and include MeCab files in data * Remove marks from more languages * Deal with commas and cedillas in Turkish and Romanian	2016-07-28 19:23:17 -04:00
Rob Speer	fec6eddcc3	Tokenization in Korean, plus abjad languages (#38) * Remove marks from more languages * Add Korean tokenization, and include MeCab files in data * add a Hebrew tokenization test * fix terminology in docstrings about abjad scripts * combine Japanese and Korean tokenization into the same function	2016-07-15 15:10:25 -04:00
Rob Speer	dcb77a552b	fix to README: we're only using Reddit in English	2016-05-11 15:38:29 -04:00
Rob Speer	697842b3f9	fix table showing marginal Korean support	2016-03-30 15:11:13 -04:00
Rob Speer	ed32b278cc	make an example clearer with wordlist='large'	2016-03-30 15:08:32 -04:00
Rob Speer	a10c1d7ac0	update wordlists for new builder settings	2016-03-28 12:26:47 -04:00
Rob Speer	d79ee37da9	Add and document large wordlists	2016-01-22 16:23:43 -05:00
Rob Speer	1793c1bb2e	Merge branch 'master' into chinese-external-wordlist Conflicts: wordfreq/chinese.py	2015-09-28 14:34:59 -04:00
Rob Speer	44b0c4f9ba	Fix documentation and clean up, based on Sep 25 code review	2015-09-28 12:58:46 -04:00
Rob Speer	b460eef444	describe optional dependencies better in the README	2015-09-24 17:54:52 -04:00
Rob Speer	5b918e7bb0	fix README conflict	2015-09-22 14:23:55 -04:00
Rob Speer	3cb3061e06	Merge branch 'greek-and-turkish' into chinese-and-more Conflicts: README.md wordfreq_builder/wordfreq_builder/ninja.py	2015-09-10 15:27:33 -04:00
Rob Speer	5c8c36f4e3	Lower the frequency of phrases with inferred token boundaries	2015-09-10 14:16:22 -04:00
Rob Speer	354555514f	fixes based on code review notes	2015-09-09 13:10:18 -04:00
Rob Speer	6502f15e9b	fix SUBTLEX citations	2015-09-08 17:45:25 -04:00
Rob Speer	d9c44d5fcc	take out OpenSubtitles for Chinese	2015-09-08 17:25:05 -04:00
Rob Speer	d576e3294b	update the README for Chinese	2015-09-05 03:42:54 -04:00
Rob Speer	7906a671ea	WIP: Traditional Chinese	2015-09-04 18:52:37 -04:00
Rob Speer	3c3371a9ff	add Polish and Swedish to README	2015-09-04 17:10:40 -04:00
Rob Speer	8196643509	add more citations	2015-09-04 15:57:40 -04:00
Rob Speer	77c60c29b0	Use SUBTLEX for German, but OpenSubtitles for Greek. In German and Greek, SUBTLEX and Hermit Dave turn out to have been working from the same source data. I looked at the quality of how they processed the data, and chose SUBTLEX for German, and Dave's wordlist for Greek.	2015-09-04 15:52:21 -04:00
Rob Speer	81bbe663fb	update README with additional SUBTLEX support	2015-09-04 13:23:33 -04:00
Rob Speer	d9a1c34d00	expand list of sources and supported languages	2015-09-04 01:03:36 -04:00
Rob Speer	d94428d454	support Turkish and more Greek; document more	2015-09-04 00:57:04 -04:00
Rob Speer	e6a2886a66	add SUBTLEX to the readme	2015-09-03 18:56:56 -04:00
Rob Speer	00a2812907	fix heading	2015-08-28 17:49:38 -04:00
Rob Speer	93f44683c5	fix list formatting	2015-08-28 17:49:07 -04:00
Rob Speer	2370287539	improve README with function documentation and examples	2015-08-28 17:45:50 -04:00
Rob Speer	573dd1ec79	update the README	2015-08-25 17:44:34 -04:00
Joshua Chin	b0a9a2980f	no use for use	2015-07-17 14:46:40 -04:00
Andrew Lin	9f8464c2d1	Document the version of Unicode used to build the regexes.	2015-07-08 18:48:33 -04:00
Rob Speer	0f4ca80026	add installation instructions to the readme	2015-05-28 14:02:12 -04:00
Rob Speer	611a6a35de	update Japanese data; test Japanese and token combining	2015-05-28 14:01:56 -04:00

33 Commits