mirror of
https://github.com/rspeer/wordfreq.git
synced 2024-12-25 18:18:53 +00:00
a3b37f6619
The issue here is that French text containing an apostrophe, such as
"d'un", would be split into "d'" and "un", but re-tokenizing "d'" on
its own would yield "d". Stripping apostrophes from tokens makes
tokenization closer to idempotent.
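
The behaviour described above can be illustrated with a minimal sketch. It uses a hypothetical regex tokenizer, not wordfreq's actual implementation; the names `TOKEN_RE`, `tokenize_keep`, `tokenize_strip`, and `retokenize` are assumptions made for this example only.

```python
import re

# Hypothetical toy tokenizer, NOT wordfreq's real code.
# A token is a run of letters; an apostrophe is kept on the end of a token
# only when more letters follow it (a French elision such as "d'un").
TOKEN_RE = re.compile(r"[^\W\d_]+'(?=[^\W\d_])|[^\W\d_]+")


def tokenize_keep(text):
    """Tokenize, leaving the elision apostrophe attached ("d'un" -> "d'", "un")."""
    return TOKEN_RE.findall(text.lower())


def tokenize_strip(text):
    """Tokenize, then strip apostrophes from the edges of every token."""
    return [tok.strip("'") for tok in tokenize_keep(text)]


def retokenize(tokens, tokenizer):
    """Run the tokenizer again over each token and flatten the result."""
    return [piece for tok in tokens for piece in tokenizer(tok)]


first_keep = tokenize_keep("d'un")      # ["d'", "un"]
first_strip = tokenize_strip("d'un")    # ["d", "un"]

# Keeping the apostrophe is not idempotent: "d'" alone comes back as "d".
print(retokenize(first_keep, tokenize_keep))    # ['d', 'un']
# Stripping it gives the same tokens on every pass.
print(retokenize(first_strip, tokenize_strip))  # ['d', 'un']
```

In this sketch the second pass changes the output of `tokenize_keep` but leaves the output of `tokenize_strip` unchanged, which is the property the commit is aiming for.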
Former-commit-id:
test_tokenizer.py