Robyn Speer
f25985379c
move Thai test to where it makes more sense
...
Former-commit-id: 4ec6b56faa
2016-03-10 11:56:15 -05:00
Robyn Speer
51e260b713
Leave Thai segments alone in the default regex
...
Our regex already has a special case to leave Chinese and Japanese alone
when an appropriate tokenizer for the language isn't being used, as
Unicode's default segmentation would make every character into its own
token.
The same thing happens in Thai, and we don't even *have* an appropriate
tokenizer for Thai, so I've added a similar fallback.
Former-commit-id: 07f16e6f03
2016-02-22 14:32:59 -05:00
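The Thai fallback described in commit 51e260b713 can be illustrated with a minimal sketch. This is not wordfreq's actual regex: the pattern, the `TOKEN_RE` name, and the `simple_tokenize` helper are hypothetical, and the sketch uses only the standard-library `re` module rather than Unicode segmentation rules. It assumes that any run of characters from the Thai Unicode block (U+0E00 to U+0E7F) should be kept as a single token instead of being split character by character.

```python
import re

# Hypothetical pattern for illustration only (not wordfreq's real regex).
# A run of characters in the Thai block is left alone as one token;
# everything else falls back to ordinary \w+ word matching.
THAI_RUN = r"[\u0E00-\u0E7F]+"
TOKEN_RE = re.compile(THAI_RUN + r"|\w+")


def simple_tokenize(text):
    """Return case-folded tokens, leaving each Thai run unsegmented."""
    return [match.group(0).casefold() for match in TOKEN_RE.finditer(text)]


tokens = simple_tokenize("Hello สวัสดี world")
```

The Thai alternative is listed first so that combining vowel and tone marks, which `\w` does not match in Python's `re`, stay attached to the run they belong to. Treating the whole block as one class is a simplification, since it also covers Thai digits and symbols.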
Robyn Speer
9a007b9948
refactor the tokenizer, add include_punctuation option
...
Former-commit-id: e8e6e0a231
2015-09-15 13:26:09 -04:00
Robyn Speer
a4554fb87c
tokenize Chinese using jieba and our own frequencies
...
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Robyn Speer
4704131e13
add tests for Turkish
...
Former-commit-id: fc93c8dc9c
2015-09-04 17:00:05 -04:00
Robyn Speer
8795525372
Use the regex implementation of Unicode segmentation
...
Former-commit-id: 95998205ad
2015-08-24 17:11:08 -04:00
Andrew Lin
e88cf3fdaf
Document the NFKC-normalized ligature in the Arabic test.
...
Former-commit-id: 41e1dd41d8
2015-08-03 11:09:44 -04:00
Andrew Lin
b0fac15f98
Switch to more explanatory Unicode escapes when testing NFKC normalization.
...
Former-commit-id: 66c69e6fac
2015-07-31 19:23:42 -04:00
Joshua Chin
af8050f1b8
ensure removal of tatweels (hopefully)
...
Former-commit-id: 173278fdd3
2015-07-20 16:48:36 -04:00
Joshua Chin
e8fa25cb73
updated comments
...
Former-commit-id: 131b916c57
2015-07-17 14:50:12 -04:00
Andrew Lin
5c72e68b7e
Express the combining of word frequencies in an explicitly associative and commutative way.
...
Former-commit-id: 32b4033d63
2015-07-09 15:29:05 -04:00
Joshua Chin
d4409a2214
removed unused imports
...
Former-commit-id: b9578ae21e
2015-07-07 16:21:22 -04:00
Joshua Chin
4b398fac65
updated minimum
...
Former-commit-id: 59c03e2411
2015-07-07 15:46:33 -04:00
Joshua Chin
b3a008f992
added Arabic tests
...
Former-commit-id: f83d31a357
2015-07-07 15:10:59 -04:00
Joshua Chin
21c809416d
changed default to minimum for word_frequency
...
Former-commit-id: 9aa773aa2b
2015-07-07 15:03:26 -04:00
Joshua Chin
9c741bb341
updated tests
...
Former-commit-id: ca66a5f883
2015-07-07 14:13:28 -04:00
Robyn Speer
9615b9f843
test and document new Twitter wordlists
...
Former-commit-id: 14cb408100
2015-07-01 17:53:38 -04:00
Robyn Speer
a9b9b2f080
update data using new build
...
Former-commit-id: f9a9ee7a82
2015-07-01 11:18:39 -04:00
Robyn Speer
4997d776b9
case-fold instead of just lowercasing tokens
...
Former-commit-id: 638467f600
2015-06-30 15:14:02 -04:00
Joshua Chin
fbd15947bb
revert changes to test_not_really_random
...
Former-commit-id: bbf7b9de34
2015-06-30 11:29:14 -04:00
Joshua Chin
9b02abb5ea
changed English test to take random ASCII words
...
Former-commit-id: a49b66880e
2015-06-29 11:05:01 -04:00
Joshua Chin
d10109bb38
changed Japanese test because the most common Japanese ASCII word keeps changing
...
Former-commit-id: 5ed03b006c
2015-06-29 11:04:19 -04:00
Joshua Chin
fa89956df3
Japanese people do not 'lol', they 'w'
...
Former-commit-id: 17f11ebd26
2015-06-29 11:01:13 -04:00
Joshua Chin
a0b7211451
updated tests for emoji splitting
...
Former-commit-id: 3bcb3e84a1
2015-06-25 11:25:51 -04:00
Robyn Speer
f3958d63ae
Switch to a more precise centibel scale.
...
Former-commit-id: 7862a4d2b6
2015-06-22 17:36:30 -04:00
Joshua Chin
4706a38c7a
updated test because the new tokenizer removes URLs
...
Former-commit-id: 35f472fcf9
2015-06-18 11:38:28 -04:00
Robyn Speer
26517c1b86
tests for new wordfreq with full coverage
...
Former-commit-id: df863a5169
2015-05-21 20:34:17 -04:00