wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-25 10:15:23 +00:00

Author	SHA1	Message	Date
Rob Speer	963e0ff785	refactor the tokenizer, add `include_punctuation` option Former-commit-id: `e8e6e0a231`	2015-09-15 13:26:09 -04:00
Rob Speer	91cc82f76d	tokenize Chinese using jieba and our own frequencies Former-commit-id: `2327f2e4d6`	2015-09-05 03:16:56 -04:00
Rob Speer	63295fc397	add tests for Turkish Former-commit-id: `fc93c8dc9c`	2015-09-04 17:00:05 -04:00
Rob Speer	f4cf46ab9c	Use the regex implementation of Unicode segmentation Former-commit-id: `95998205ad`	2015-08-24 17:11:08 -04:00
Andrew Lin	10bddfe09f	Document the NFKC-normalized ligature in the Arabic test. Former-commit-id: `41e1dd41d8`	2015-08-03 11:09:44 -04:00
Andrew Lin	a5553676e4	Switch to more explanatory Unicode escapes when testing NFKC normalization. Former-commit-id: `66c69e6fac`	2015-07-31 19:23:42 -04:00
Joshua Chin	423b2d8443	ensure removal of tatweels (hopefully) Former-commit-id: `173278fdd3`	2015-07-20 16:48:36 -04:00
Joshua Chin	d0e0287d71	updated comments Former-commit-id: `131b916c57`	2015-07-17 14:50:12 -04:00
Andrew Lin	081fde93e3	Express the combining of word frequencies in an explicitly associative and commutative way. Former-commit-id: `32b4033d63`	2015-07-09 15:29:05 -04:00
Joshua Chin	b145e02ce4	removed unused imports Former-commit-id: `b9578ae21e`	2015-07-07 16:21:22 -04:00
Joshua Chin	927aaae920	updated minimum Former-commit-id: `59c03e2411`	2015-07-07 15:46:33 -04:00
Joshua Chin	53323f8ea7	added arabic tests Former-commit-id: `f83d31a357`	2015-07-07 15:10:59 -04:00
Joshua Chin	d88470df4e	changed default to minimum for word_frequency Former-commit-id: `9aa773aa2b`	2015-07-07 15:03:26 -04:00
Joshua Chin	54f66d49ee	updated tests Former-commit-id: `ca66a5f883`	2015-07-07 14:13:28 -04:00
Rob Speer	3bf59fec57	test and document new twitter wordlists Former-commit-id: `14cb408100`	2015-07-01 17:53:38 -04:00
Rob Speer	b84ba2bc2e	update data using new build Former-commit-id: `f9a9ee7a82`	2015-07-01 11:18:39 -04:00
Rob Speer	8cac81666a	case-fold instead of just lowercasing tokens Former-commit-id: `638467f600`	2015-06-30 15:14:02 -04:00
Joshua Chin	5cc3dce834	revert changes to test_not_really_random Former-commit-id: `bbf7b9de34`	2015-06-30 11:29:14 -04:00
Joshua Chin	53c558ca90	changed english test to take random ascii words Former-commit-id: `a49b66880e`	2015-06-29 11:05:01 -04:00
Joshua Chin	ea5470a85a	changed japanese test because the most common japanese ascii word keeps changing Former-commit-id: `5ed03b006c`	2015-06-29 11:04:19 -04:00
Joshua Chin	000491c7cc	Japanese people do not 'lol', they 'w' Former-commit-id: `17f11ebd26`	2015-06-29 11:01:13 -04:00
Joshua Chin	09966989fb	updated tests for emoji splitting Former-commit-id: `3bcb3e84a1`	2015-06-25 11:25:51 -04:00
Rob Speer	b4600c9bd1	Switch to a more precise centibel scale. Former-commit-id: `7862a4d2b6`	2015-06-22 17:36:30 -04:00
Joshua Chin	529aa9afde	updated test because the new tokenizer removes URLs Former-commit-id: `35f472fcf9`	2015-06-18 11:38:28 -04:00
Rob Speer	5b4107bd1d	tests for new wordfreq with full coverage Former-commit-id: `df863a5169`	2015-05-21 20:34:17 -04:00

25 Commits
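
Several of the commits listed above concern token normalization: case-folding instead of lowercasing, NFKC normalization of ligatures, and removal of tatweels. The sketch below shows roughly what those three steps look like using only Python's standard library. The helper name `normalize_token` and the order of the steps are assumptions made for illustration; they are not taken from wordfreq, whose actual implementation may differ.

```python
import unicodedata

# U+0640 ARABIC TATWEEL, a purely decorative elongation character.
TATWEEL = "\u0640"


def normalize_token(token: str) -> str:
    """Hypothetical helper (not wordfreq's API) that normalizes one token."""
    # NFKC replaces compatibility characters, such as the "fi" ligature
    # U+FB01, with their ordinary equivalents.
    text = unicodedata.normalize("NFKC", token)
    # Drop tatweels so that elongated and plain spellings match.
    text = text.replace(TATWEEL, "")
    # casefold() is more aggressive than lower(): it maps "ß" to "ss".
    return text.casefold()


if __name__ == "__main__":
    for raw in ["Stra\u00dfe", "\ufb01sh", "\u0643\u0640\u062a\u0627\u0628"]:
        print(repr(raw), "->", repr(normalize_token(raw)))
```

Case-folding matters for words such as German "Straße": `str.lower()` leaves "ß" unchanged, while `str.casefold()` yields "strasse", so both spellings end up under the same key.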