wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-24 01:41:39 +00:00

Robyn Speer b22a4b0f02 Strip apostrophes from edges of tokens

The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.

Former-commit-id: 5a1fc00aaa

2015-08-25 12:41:48 -04:00
test_tokenizer.py	Strip apostrophes from edges of tokens	2015-08-25 12:41:48 -04:00
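
The idempotence problem described in the commit message can be illustrated with a small sketch. This is not wordfreq's actual tokenizer: `TOKEN_RE`, `naive_tokenize`, `stripping_tokenize`, and `retokenize` are hypothetical names, and the regex is an assumed stand-in that keeps an apostrophe on a token only when a letter follows it.

```python
import re

# Hypothetical toy tokenizer, NOT wordfreq's implementation.
# A token is a run of letters, plus one trailing apostrophe if a letter follows it.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'(?=[^\W\d_]))?")


def naive_tokenize(text):
    """Split text into tokens, leaving elision apostrophes attached ("d'")."""
    return TOKEN_RE.findall(text.lower())


def stripping_tokenize(text):
    """Same split, but strip apostrophes from the edges of each token."""
    stripped = (tok.strip("'") for tok in naive_tokenize(text))
    return [tok for tok in stripped if tok]


def retokenize(tokenize, tokens):
    """Run the tokenizer again over each token it already produced."""
    return [piece for tok in tokens for piece in tokenize(tok)]


first = naive_tokenize("d'un")
print(first)                              # ["d'", "un"]
print(retokenize(naive_tokenize, first))  # ["d", "un"], differs from the first pass

fixed = stripping_tokenize("d'un")
print(fixed)                                  # ["d", "un"]
print(retokenize(stripping_tokenize, fixed))  # ["d", "un"], unchanged
```

In this sketch the naive version changes its output on a second pass because "d'" on its own no longer has a letter after the apostrophe, while the stripping version returns the same tokens both times.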
