Mirror of https://github.com/rspeer/wordfreq.git, synced 2024-12-23 17:31:41 +00:00

Commit 07f16e6f03
Our regex already has a special case to leave Chinese and Japanese alone when an appropriate tokenizer for the language isn't being used, as Unicode's default segmentation would make every character into its own token. The same thing happens in Thai, and we don't even *have* an appropriate tokenizer for Thai, so I've added a similar fallback.
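A minimal sketch of the fallback idea, not the repository's actual pattern: it assumes the third-party `regex` module and a hypothetical `fallback_tokenize` helper. Runs of scripts written without spaces between words (Han, Hiragana, Katakana, and now Thai) are matched as single tokens, rather than letting Unicode's default word segmentation split them character by character.

```python
import regex  # third-party "regex" module (pip install regex)

# Scripts written without spaces between words: keep a run of these
# together as one token instead of segmenting per character.
SPACELESS = r"\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Thai}"

# Try a spaceless-script run first; otherwise fall back to ordinary words.
TOKEN_RE = regex.compile(rf"[{SPACELESS}]+|\w+")

def fallback_tokenize(text):
    """Tokenize text, leaving spaceless scripts as unsegmented runs."""
    return TOKEN_RE.findall(text)

# The Thai run stays intact instead of becoming one token per character:
print(fallback_tokenize("ภาษาไทย and English"))
# ['ภาษาไทย', 'and', 'English']
```

The real regex in wordfreq handles more cases than this; the sketch only shows why grouping these scripts into single runs beats the per-character behavior of default Unicode segmentation.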
Files:
- test_chinese.py
- test_japanese.py
- test.py