mirror of https://github.com/rspeer/wordfreq.git
synced 2024-12-24 09:51:38 +00:00
mention tokenization change in changelog
parent 39e459ac71
commit 48a5967e9a
@@ -12,6 +12,8 @@
 - Drop the Common Crawl; we have enough good sources now that we don't have
   to deal with all that spam
 - Add automatic transliteration of Serbian text
+- Adjust tokenization of apostrophes next to vowel sounds: the French word
+  "l'heure" is now tokenized similarly to "l'arc"
 - Another new frequency-merging strategy (drop the highest and lowest,
   average the rest)
 
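
The merging strategy named in the changelog above ("drop the highest and lowest, average the rest") reads like a trimmed mean. A minimal Python sketch under that assumption follows. The function name `trimmed_mean` and the requirement of at least three inputs are illustrative choices, not wordfreq's actual code, and the real implementation may handle ties, missing sources, or smoothing differently.

```python
def trimmed_mean(freqs):
    """Average the values left after dropping one highest and one lowest.

    Hypothetical helper for illustration only; not part of wordfreq's API.
    """
    if len(freqs) < 3:
        # With fewer than three values nothing would remain to average.
        raise ValueError("need at least three values")
    ordered = sorted(freqs)
    middle = ordered[1:-1]  # discard the single lowest and single highest
    return sum(middle) / len(middle)


# Made-up per-source frequencies for one word; the outlier 100.0 is discarded.
print(trimmed_mean([1.0, 2.0, 3.0, 100.0]))
```

Dropping the extremes keeps a single unusually high or low source from dominating the merged estimate, which is the usual motivation for a trimmed mean.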