25 Commits

Author | SHA1 | Message | Date
Rob Speer | e8e6e0a231 | refactor the tokenizer, add include_punctuation option | 2015-09-15 13:26:09 -04:00
Rob Speer | 2327f2e4d6 | tokenize Chinese using jieba and our own frequencies | 2015-09-05 03:16:56 -04:00
Rob Speer | fc93c8dc9c | add tests for Turkish | 2015-09-04 17:00:05 -04:00
Rob Speer | 95998205ad | Use the regex implementation of Unicode segmentation | 2015-08-24 17:11:08 -04:00
Andrew Lin | 41e1dd41d8 | Document the NFKC-normalized ligature in the Arabic test. | 2015-08-03 11:09:44 -04:00
Andrew Lin | 66c69e6fac | Switch to more explanatory Unicode escapes when testing NFKC normalization. | 2015-07-31 19:23:42 -04:00
Joshua Chin | 173278fdd3 | ensure removal of tatweels (hopefully) | 2015-07-20 16:48:36 -04:00
Joshua Chin | 131b916c57 | updated comments | 2015-07-17 14:50:12 -04:00
Andrew Lin | 32b4033d63 | Express the combining of word frequencies in an explicitly associative and commutative way. | 2015-07-09 15:29:05 -04:00
Joshua Chin | b9578ae21e | removed unused imports | 2015-07-07 16:21:22 -04:00
Joshua Chin | 59c03e2411 | updated minimum | 2015-07-07 15:46:33 -04:00
Joshua Chin | f83d31a357 | added arabic tests | 2015-07-07 15:10:59 -04:00
Joshua Chin | 9aa773aa2b | changed default to minimum for word_frequency | 2015-07-07 15:03:26 -04:00
Joshua Chin | ca66a5f883 | updated tests | 2015-07-07 14:13:28 -04:00
Rob Speer | 14cb408100 | test and document new twitter wordlists | 2015-07-01 17:53:38 -04:00
Rob Speer | f9a9ee7a82 | update data using new build | 2015-07-01 11:18:39 -04:00
Rob Speer | 638467f600 | case-fold instead of just lowercasing tokens | 2015-06-30 15:14:02 -04:00
Joshua Chin | bbf7b9de34 | revert changes to test_not_really_random | 2015-06-30 11:29:14 -04:00
Joshua Chin | a49b66880e | changed english test to take random ascii words | 2015-06-29 11:05:01 -04:00
Joshua Chin | 5ed03b006c | changed japanese test because the most common japanese ascii word keeps changing | 2015-06-29 11:04:19 -04:00
Joshua Chin | 17f11ebd26 | Japanese people do not 'lol', they 'w' | 2015-06-29 11:01:13 -04:00
Joshua Chin | 3bcb3e84a1 | updated tests for emoji splitting | 2015-06-25 11:25:51 -04:00
Rob Speer | 7862a4d2b6 | Switch to a more precise centibel scale. | 2015-06-22 17:36:30 -04:00
Joshua Chin | 35f472fcf9 | updated test because the new tokenizer removes URLs | 2015-06-18 11:38:28 -04:00
Rob Speer | df863a5169 | tests for new wordfreq with full coverage | 2015-05-21 20:34:17 -04:00