Commit Graph

24 Commits

Author | SHA1 | Message | Former-commit-id | Date
Robyn Speer | a4554fb87c | tokenize Chinese using jieba and our own frequencies | 2327f2e4d6 | 2015-09-05 03:16:56 -04:00
Robyn Speer | 4704131e13 | add tests for Turkish | fc93c8dc9c | 2015-09-04 17:00:05 -04:00
Robyn Speer | 8795525372 | Use the regex implementation of Unicode segmentation | 95998205ad | 2015-08-24 17:11:08 -04:00
Andrew Lin | e88cf3fdaf | Document the NFKC-normalized ligature in the Arabic test. | 41e1dd41d8 | 2015-08-03 11:09:44 -04:00
Andrew Lin | b0fac15f98 | Switch to more explanatory Unicode escapes when testing NFKC normalization. | 66c69e6fac | 2015-07-31 19:23:42 -04:00
Joshua Chin | af8050f1b8 | ensure removal of tatweels (hopefully) | 173278fdd3 | 2015-07-20 16:48:36 -04:00
Joshua Chin | e8fa25cb73 | updated comments | 131b916c57 | 2015-07-17 14:50:12 -04:00
Andrew Lin | 5c72e68b7e | Express the combining of word frequencies in an explicitly associative and commutative way. | 32b4033d63 | 2015-07-09 15:29:05 -04:00
Joshua Chin | d4409a2214 | removed unused imports | b9578ae21e | 2015-07-07 16:21:22 -04:00
Joshua Chin | 4b398fac65 | updated minimum | 59c03e2411 | 2015-07-07 15:46:33 -04:00
Joshua Chin | b3a008f992 | added arabic tests | f83d31a357 | 2015-07-07 15:10:59 -04:00
Joshua Chin | 21c809416d | changed default to minimum for word_frequency | 9aa773aa2b | 2015-07-07 15:03:26 -04:00
Joshua Chin | 9c741bb341 | updated tests | ca66a5f883 | 2015-07-07 14:13:28 -04:00
Robyn Speer | 9615b9f843 | test and document new twitter wordlists | 14cb408100 | 2015-07-01 17:53:38 -04:00
Robyn Speer | a9b9b2f080 | update data using new build | f9a9ee7a82 | 2015-07-01 11:18:39 -04:00
Robyn Speer | 4997d776b9 | case-fold instead of just lowercasing tokens | 638467f600 | 2015-06-30 15:14:02 -04:00
Joshua Chin | fbd15947bb | revert changes to test_not_really_random | bbf7b9de34 | 2015-06-30 11:29:14 -04:00
Joshua Chin | 9b02abb5ea | changed english test to take random ascii words | a49b66880e | 2015-06-29 11:05:01 -04:00
Joshua Chin | d10109bb38 | changed japanese test because the most common japanese ascii word keeps changing | 5ed03b006c | 2015-06-29 11:04:19 -04:00
Joshua Chin | fa89956df3 | Japanese people do not 'lol', they 'w' | 17f11ebd26 | 2015-06-29 11:01:13 -04:00
Joshua Chin | a0b7211451 | updated tests for emoji splitting | 3bcb3e84a1 | 2015-06-25 11:25:51 -04:00
Robyn Speer | f3958d63ae | Switch to a more precise centibel scale. | 7862a4d2b6 | 2015-06-22 17:36:30 -04:00
Joshua Chin | 4706a38c7a | updated test because the new tokenizer removes URLs | 35f472fcf9 | 2015-06-18 11:38:28 -04:00
Robyn Speer | 26517c1b86 | tests for new wordfreq with full coverage | df863a5169 | 2015-05-21 20:34:17 -04:00