Commit Graph

15 Commits

Author SHA1 Message Date
Elia Robyn Lake
bf05b1b1dc estimate the freq distribution of numbers 2022-03-10 18:33:42 -05:00
Elia Robyn Speer
b60ac1b803 Merge remote-tracking branch 'origin/apostrophe-consistency' 2021-09-02 18:13:53 +00:00
Robyn Speer
ed23bf3ebe specifically test that the long sequence underflows to 0 2021-02-18 15:09:31 -05:00
Robyn Speer
75a56b68fb change math for INFERRED_SPACE_FACTOR to not overflow 2021-02-18 14:44:39 -05:00
Robyn Speer
ad02d96f1b update dependencies and test for consistent results 2020-09-08 16:03:33 -04:00
Robyn Speer
86b928f967 include data from xc rebuild 2018-07-15 01:01:35 -04:00
Robyn Speer
75b4d62084 port test.py and test_chinese.py to pytest 2018-06-01 16:33:06 -04:00
Robyn Speer
8e3dff3c1c Traditional Chinese should be preserved through tokenization 2018-03-08 18:08:55 -05:00
Robyn Speer
5ab5d2ea55 Separate preprocessing from tokenization 2018-03-08 16:26:17 -05:00
Robyn Speer
71a0ad6abb Use langcodes when tokenizing again (it no longer connects to a DB) 2017-04-27 15:09:59 -04:00
Robyn Speer
7dc3f03ebd import new wordlists from Exquisite Corpus 2017-01-05 17:59:26 -05:00
Robyn Speer
4a4534c466 test_chinese: fix typo in comment (Former-commit-id: 2a84a926f5) 2015-09-24 13:41:11 -04:00
Robyn Speer
1adbb1aaf1 add external_wordlist option to tokenize (Former-commit-id: 669bd16c13) 2015-09-10 18:09:41 -04:00
Robyn Speer
f0c7c3a02c Lower the frequency of phrases with inferred token boundaries (Former-commit-id: 5c8c36f4e3) 2015-09-10 14:16:22 -04:00
Robyn Speer
a4554fb87c tokenize Chinese using jieba and our own frequencies (Former-commit-id: 2327f2e4d6) 2015-09-05 03:16:56 -04:00