9 Commits

Author SHA1 Message Date
Rob Speer
863d5be522 port test.py and test_chinese.py to pytest 2018-06-01 16:33:06 -04:00
Rob Speer
47dac3b0b8 Traditional Chinese should be preserved through tokenization 2018-03-08 18:08:55 -05:00
Rob Speer
45b9bcdbcb Separate preprocessing from tokenization 2018-03-08 16:26:17 -05:00
Rob Speer
d6cdef6039 Use langcodes when tokenizing again (it no longer connects to a DB) 2017-04-27 15:09:59 -04:00
Rob Speer
f671a1db7f import new wordlists from Exquisite Corpus 2017-01-05 17:59:26 -05:00
Rob Speer
f89ac5e400 test_chinese: fix typo in comment (Former-commit-id: 2a84a926f5) 2015-09-24 13:41:11 -04:00
Rob Speer
e3a79ab8c9 add external_wordlist option to tokenize (Former-commit-id: 669bd16c13) 2015-09-10 18:09:41 -04:00
Rob Speer
a13f459f88 Lower the frequency of phrases with inferred token boundaries (Former-commit-id: 5c8c36f4e3) 2015-09-10 14:16:22 -04:00
Rob Speer
91cc82f76d tokenize Chinese using jieba and our own frequencies (Former-commit-id: 2327f2e4d6) 2015-09-05 03:16:56 -04:00