Rob Speer
3ec92a8952
Handle Japanese edge cases in simple_tokenize
2018-04-26 15:53:07 -04:00
Rob Speer
6f1a9aaff1
remove LAUGHTER_WORDS, which is now unused
This was a fun Twitter test, but we don't do that anymore
2018-03-14 17:33:35 -04:00
Rob Speer
1594ba3ad6
Test that we can leave the wordlist unspecified and get 'large' freqs
2018-03-08 18:09:57 -05:00
Rob Speer
5a5acec9ff
reorganize wordlists into 'small', 'large', and 'best'
2018-03-08 17:52:44 -05:00
Rob Speer
45b9bcdbcb
Separate preprocessing from tokenization
2018-03-08 16:26:17 -05:00
Rob Speer
e3352392cc
v1.7: update tokenization, update data, add bn and mk
2017-08-25 17:37:48 -04:00
Rob Speer
dcef5813b3
Tokenize by graphemes, not codepoints (#50)
* Tokenize by graphemes, not codepoints
* Add more documentation to TOKEN_RE
* Remove extra line break
* Update docstring - Brahmic scripts are no longer an exception
* approve using version 2017.07.28 of regex
2017-08-08 11:35:28 -04:00
Rob Speer
f03a37e19c
test that number-smashing still happens in freq lookups
2017-01-06 19:20:41 -05:00
Rob Speer
4dfa800cd8
Don't smash numbers in *all* tokenization, just when looking up freqs
I forgot momentarily that the output of the tokenizer is used by other
code.
2017-01-06 19:18:52 -05:00
Rob Speer
f671a1db7f
import new wordlists from Exquisite Corpus
2017-01-05 17:59:26 -05:00
Rob Speer
99b627a300
Revise multilingual tests
Former-commit-id: 21246f881f
2016-07-29 12:19:12 -04:00
Rob Speer
9758c69ff0
Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5. Things done in this update include:
* include Common Crawl; support 11 more languages
* new frequency-merging strategy
* New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list
* Remove kinda bad sources, namely Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is no longer an available language, and Greek is not a 'large' language after all.
* Add Korean tokenization, and include MeCab files in data
* Remove marks from more languages
* Deal with commas and cedillas in Turkish and Romanian
Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
Rob Speer
a0893af82e
Tokenization in Korean, plus abjad languages (#38)
* Remove marks from more languages
* Add Korean tokenization, and include MeCab files in data
* add a Hebrew tokenization test
* fix terminology in docstrings about abjad scripts
* combine Japanese and Korean tokenization into the same function
Former-commit-id: fec6eddcc3
2016-07-15 15:10:25 -04:00
Rob Speer
ac24b8eab4
Fix tokenization of SE Asian and South Asian scripts (#37)
Former-commit-id: 270f6c7ca6
2016-07-01 18:00:57 -04:00
Rob Speer
c3fd3bd734
fix Arabic test, where 'lol' is no longer common
Former-commit-id: da79dfb247
2016-05-11 17:01:47 -04:00
Rob Speer
c2eab6881e
move Thai test to where it makes more sense
Former-commit-id: 4ec6b56faa
2016-03-10 11:56:15 -05:00
Rob Speer
a32162c04f
Leave Thai segments alone in the default regex
Our regex already has a special case to leave Chinese and Japanese alone
when an appropriate tokenizer for the language isn't being used, as
Unicode's default segmentation would make every character into its own
token.
The same thing happens in Thai, and we don't even *have* an appropriate
tokenizer for Thai, so I've added a similar fallback.
Former-commit-id: 07f16e6f03
2016-02-22 14:32:59 -05:00
Rob Speer
963e0ff785
refactor the tokenizer, add include_punctuation option
Former-commit-id: e8e6e0a231
2015-09-15 13:26:09 -04:00
Rob Speer
91cc82f76d
tokenize Chinese using jieba and our own frequencies
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Rob Speer
63295fc397
add tests for Turkish
Former-commit-id: fc93c8dc9c
2015-09-04 17:00:05 -04:00
Rob Speer
f4cf46ab9c
Use the regex implementation of Unicode segmentation
Former-commit-id: 95998205ad
2015-08-24 17:11:08 -04:00
Andrew Lin
10bddfe09f
Document the NFKC-normalized ligature in the Arabic test.
Former-commit-id: 41e1dd41d8
2015-08-03 11:09:44 -04:00
Andrew Lin
a5553676e4
Switch to more explanatory Unicode escapes when testing NFKC normalization.
Former-commit-id: 66c69e6fac
2015-07-31 19:23:42 -04:00
Joshua Chin
423b2d8443
ensure removal of tatweels (hopefully)
Former-commit-id: 173278fdd3
2015-07-20 16:48:36 -04:00
Joshua Chin
d0e0287d71
updated comments
Former-commit-id: 131b916c57
2015-07-17 14:50:12 -04:00
Andrew Lin
081fde93e3
Express the combining of word frequencies in an explicitly associative and commutative way.
Former-commit-id: 32b4033d63
2015-07-09 15:29:05 -04:00
Joshua Chin
b145e02ce4
removed unused imports
Former-commit-id: b9578ae21e
2015-07-07 16:21:22 -04:00
Joshua Chin
927aaae920
updated minimum
Former-commit-id: 59c03e2411
2015-07-07 15:46:33 -04:00
Joshua Chin
53323f8ea7
added Arabic tests
Former-commit-id: f83d31a357
2015-07-07 15:10:59 -04:00
Joshua Chin
d88470df4e
changed default to minimum for word_frequency
Former-commit-id: 9aa773aa2b
2015-07-07 15:03:26 -04:00
Joshua Chin
54f66d49ee
updated tests
Former-commit-id: ca66a5f883
2015-07-07 14:13:28 -04:00
Rob Speer
3bf59fec57
test and document new Twitter wordlists
Former-commit-id: 14cb408100
2015-07-01 17:53:38 -04:00
Rob Speer
b84ba2bc2e
update data using new build
Former-commit-id: f9a9ee7a82
2015-07-01 11:18:39 -04:00
Rob Speer
8cac81666a
case-fold instead of just lowercasing tokens
Former-commit-id: 638467f600
2015-06-30 15:14:02 -04:00
Joshua Chin
5cc3dce834
revert changes to test_not_really_random
Former-commit-id: bbf7b9de34
2015-06-30 11:29:14 -04:00
Joshua Chin
53c558ca90
changed English test to take random ASCII words
Former-commit-id: a49b66880e
2015-06-29 11:05:01 -04:00
Joshua Chin
ea5470a85a
changed Japanese test because the most common Japanese ASCII word keeps changing
Former-commit-id: 5ed03b006c
2015-06-29 11:04:19 -04:00
Joshua Chin
000491c7cc
Japanese people do not 'lol', they 'w'
Former-commit-id: 17f11ebd26
2015-06-29 11:01:13 -04:00
Joshua Chin
09966989fb
updated tests for emoji splitting
Former-commit-id: 3bcb3e84a1
2015-06-25 11:25:51 -04:00
Rob Speer
b4600c9bd1
Switch to a more precise centibel scale.
Former-commit-id: 7862a4d2b6
2015-06-22 17:36:30 -04:00
Joshua Chin
529aa9afde
updated test because the new tokenizer removes URLs
Former-commit-id: 35f472fcf9
2015-06-18 11:38:28 -04:00
Rob Speer
5b4107bd1d
tests for new wordfreq with full coverage
Former-commit-id: df863a5169
2015-05-21 20:34:17 -04:00