mention tokenization change in changelog

Rob Speer 2017-01-05 19:19:31 -05:00
parent 803ebc25bb
commit a05a1c8d5c

@@ -12,6 +12,8 @@
- Drop the Common Crawl; we have enough good sources now that we don't have
  to deal with all that spam
- Add automatic transliteration of Serbian text
- Adjust tokenization of apostrophes next to vowel sounds: the French word
  "l'heure" is now tokenized similarly to "l'arc"
- Another new frequency-merging strategy (drop the highest and lowest,
  average the rest)
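
The merging strategy in the last entry (drop the highest and lowest, average the rest) reads like a trimmed mean over the per-source frequency estimates for a word. The sketch below illustrates that idea only. The function name `merge_frequencies`, the plain-list input, and the fallback for fewer than three sources are assumptions made for illustration, not the project's actual code, which may for instance work on a different scale or handle missing sources differently.

```python
def merge_frequencies(freqs):
    """Merge per-source frequency estimates with a trimmed mean.

    Hypothetical helper: drops one highest and one lowest value,
    then averages whatever remains.
    """
    if not freqs:
        raise ValueError("need at least one frequency estimate")
    if len(freqs) < 3:
        # Assumption: with fewer than three sources there is nothing
        # sensible to trim, so fall back to a plain mean.
        return sum(freqs) / len(freqs)
    # Sort, then slice off the single smallest and single largest value.
    trimmed = sorted(freqs)[1:-1]
    return sum(trimmed) / len(trimmed)


# One outlying source (100.0) no longer dominates the merged value.
merged = merge_frequencies([1.0, 2.0, 3.0, 100.0])
```

Compared with a plain mean, trimming both ends keeps a single unusually spammy or unusually sparse source from skewing the merged estimate.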