wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Robyn Speer	2a41d4dc5e	Add Common Crawl data and more languages (#39) This changes the version from 1.4.2 to 1.5. Things done in this update include: * include Common Crawl; support 11 more languages * new frequency-merging strategy * New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list * Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all. * Add Korean tokenization, and include MeCab files in data * Remove marks from more languages * Deal with commas and cedillas in Turkish and Romanian Former-commit-id: `e6a8f028e3`	2016-07-28 19:23:17 -04:00
Robyn Speer	0a2bfb2710	Tokenization in Korean, plus abjad languages (#38) * Remove marks from more languages * Add Korean tokenization, and include MeCab files in data * add a Hebrew tokenization test * fix terminology in docstrings about abjad scripts * combine Japanese and Korean tokenization into the same function Former-commit-id: `fec6eddcc3`	2016-07-15 15:10:25 -04:00
Robyn Speer	3155cf27e6	Fix tokenization of SE Asian and South Asian scripts (#37) Former-commit-id: `270f6c7ca6`	2016-07-01 18:00:57 -04:00
Robyn Speer	c72326e4c0	fix Arabic test, where 'lol' is no longer common Former-commit-id: `da79dfb247`	2016-05-11 17:01:47 -04:00
Robyn Speer	f25985379c	move Thai test to where it makes more sense Former-commit-id: `4ec6b56faa`	2016-03-10 11:56:15 -05:00
Robyn Speer	51e260b713	Leave Thai segments alone in the default regex Our regex already has a special case to leave Chinese and Japanese alone when an appropriate tokenizer for the language isn't being used, as Unicode's default segmentation would make every character into its own token. The same thing happens in Thai, and we don't even have an appropriate tokenizer for Thai, so I've added a similar fallback. Former-commit-id: `07f16e6f03`	2016-02-22 14:32:59 -05:00
Robyn Speer	9a007b9948	refactor the tokenizer, add `include_punctuation` option Former-commit-id: `e8e6e0a231`	2015-09-15 13:26:09 -04:00
Robyn Speer	a4554fb87c	tokenize Chinese using jieba and our own frequencies Former-commit-id: `2327f2e4d6`	2015-09-05 03:16:56 -04:00
Robyn Speer	4704131e13	add tests for Turkish Former-commit-id: `fc93c8dc9c`	2015-09-04 17:00:05 -04:00
Robyn Speer	8795525372	Use the regex implementation of Unicode segmentation Former-commit-id: `95998205ad`	2015-08-24 17:11:08 -04:00
Andrew Lin	e88cf3fdaf	Document the NFKC-normalized ligature in the Arabic test. Former-commit-id: `41e1dd41d8`	2015-08-03 11:09:44 -04:00
Andrew Lin	b0fac15f98	Switch to more explanatory Unicode escapes when testing NFKC normalization. Former-commit-id: `66c69e6fac`	2015-07-31 19:23:42 -04:00
Joshua Chin	af8050f1b8	ensure removal of tatweels (hopefully) Former-commit-id: `173278fdd3`	2015-07-20 16:48:36 -04:00
Joshua Chin	e8fa25cb73	updated comments Former-commit-id: `131b916c57`	2015-07-17 14:50:12 -04:00
Andrew Lin	5c72e68b7e	Express the combining of word frequencies in an explicitly associative and commutative way. Former-commit-id: `32b4033d63`	2015-07-09 15:29:05 -04:00
Joshua Chin	d4409a2214	removed unused imports Former-commit-id: `b9578ae21e`	2015-07-07 16:21:22 -04:00
Joshua Chin	4b398fac65	updated minimum Former-commit-id: `59c03e2411`	2015-07-07 15:46:33 -04:00
Joshua Chin	b3a008f992	added arabic tests Former-commit-id: `f83d31a357`	2015-07-07 15:10:59 -04:00
Joshua Chin	21c809416d	changed default to minimum for word_frequency Former-commit-id: `9aa773aa2b`	2015-07-07 15:03:26 -04:00
Joshua Chin	9c741bb341	updated tests Former-commit-id: `ca66a5f883`	2015-07-07 14:13:28 -04:00
Robyn Speer	9615b9f843	test and document new twitter wordlists Former-commit-id: `14cb408100`	2015-07-01 17:53:38 -04:00
Robyn Speer	a9b9b2f080	update data using new build Former-commit-id: `f9a9ee7a82`	2015-07-01 11:18:39 -04:00
Robyn Speer	4997d776b9	case-fold instead of just lowercasing tokens Former-commit-id: `638467f600`	2015-06-30 15:14:02 -04:00
Joshua Chin	fbd15947bb	revert changes to test_not_really_random Former-commit-id: `bbf7b9de34`	2015-06-30 11:29:14 -04:00
Joshua Chin	9b02abb5ea	changed english test to take random ascii words Former-commit-id: `a49b66880e`	2015-06-29 11:05:01 -04:00
Joshua Chin	d10109bb38	changed japanese test because the most common japanese ascii word keeps changing Former-commit-id: `5ed03b006c`	2015-06-29 11:04:19 -04:00
Joshua Chin	fa89956df3	Japanese people do not 'lol', they 'w' Former-commit-id: `17f11ebd26`	2015-06-29 11:01:13 -04:00
Joshua Chin	a0b7211451	updated tests for emoji splitting Former-commit-id: `3bcb3e84a1`	2015-06-25 11:25:51 -04:00
Robyn Speer	f3958d63ae	Switch to a more precise centibel scale. Former-commit-id: `7862a4d2b6`	2015-06-22 17:36:30 -04:00
Joshua Chin	4706a38c7a	updated test because the new tokenizer removes URLs Former-commit-id: `35f472fcf9`	2015-06-18 11:38:28 -04:00
Robyn Speer	26517c1b86	tests for new wordfreq with full coverage Former-commit-id: `df863a5169`	2015-05-21 20:34:17 -04:00

31 Commits
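
Two of the commits above mention text normalization steps: NFKC normalization (b0fac15f98, e88cf3fdaf) and case-folding instead of lowercasing (4997d776b9). The sketch below illustrates what those two operations do, using only the Python standard library. It is not taken from wordfreq's source; the helper name `normalize_token` and the order of the two steps are assumptions made for illustration.

```python
import unicodedata


def normalize_token(token: str) -> str:
    """Hypothetical helper (not wordfreq's API): NFKC-normalize, then case-fold."""
    # NFKC replaces compatibility characters, such as ligatures, with their
    # plain equivalents.
    normalized = unicodedata.normalize("NFKC", token)
    # casefold() is a more aggressive form of lower(), intended for
    # caseless matching; for example, it maps the German sharp s to "ss".
    return normalized.casefold()


# Plain lowercasing leaves the sharp s in place; case-folding expands it.
lowered = "Straße".lower()
folded = normalize_token("Straße")

# NFKC splits the single-character "fi" ligature into two letters.
ligature = normalize_token("\N{LATIN SMALL LIGATURE FI}n")

print(lowered, folded, ligature)
```

Case-folding matters when two spellings of a word should share one frequency entry: after folding, "Straße" and "STRASSE" compare equal, which plain lowercasing does not achieve.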