Robyn Speer | ed23bf3ebe | specifically test that the long sequence underflows to 0 | 2021-02-18 15:09:31 -05:00
Robyn Speer | 75a56b68fb | change math for INFERRED_SPACE_FACTOR to not overflow | 2021-02-18 14:44:39 -05:00
Robyn Speer | 86b928f967 | include data from xc rebuild | 2018-07-15 01:01:35 -04:00
Robyn Speer | 75b4d62084 | port test.py and test_chinese.py to pytest | 2018-06-01 16:33:06 -04:00
Robyn Speer | 8e3dff3c1c | Traditional Chinese should be preserved through tokenization | 2018-03-08 18:08:55 -05:00
Robyn Speer | 5ab5d2ea55 | Separate preprocessing from tokenization | 2018-03-08 16:26:17 -05:00
Robyn Speer | 71a0ad6abb | Use langcodes when tokenizing again (it no longer connects to a DB) | 2017-04-27 15:09:59 -04:00
Robyn Speer | 7dc3f03ebd | import new wordlists from Exquisite Corpus | 2017-01-05 17:59:26 -05:00
Robyn Speer | 4a4534c466 | test_chinese: fix typo in comment (Former-commit-id: 2a84a926f5) | 2015-09-24 13:41:11 -04:00
Robyn Speer | 1adbb1aaf1 | add external_wordlist option to tokenize (Former-commit-id: 669bd16c13) | 2015-09-10 18:09:41 -04:00
Robyn Speer | f0c7c3a02c | Lower the frequency of phrases with inferred token boundaries (Former-commit-id: 5c8c36f4e3) | 2015-09-10 14:16:22 -04:00
Robyn Speer | a4554fb87c | tokenize Chinese using jieba and our own frequencies (Former-commit-id: 2327f2e4d6) | 2015-09-05 03:16:56 -04:00