update CHANGELOG for 2.0.1

Rob Speer 2018-05-01 14:47:55 -04:00
parent 3ec92a8952
commit 0a95d96b20


@@ -1,3 +1,27 @@
## Version 2.0.1 (2018-05-01)

Fixed edge cases that inserted spurious token boundaries when Japanese text is
run through `simple_tokenize`, because of a few characters that don't match any
of our "spaceless scripts".

It is unusual for Japanese text to be passed through `simple_tokenize`, because
Japanese text should instead use the Japanese-specific tokenization in
`wordfreq.mecab`.

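For illustration, here is a minimal sketch of the two paths, assuming a working
install of wordfreq with its Japanese support (MeCab and a Japanese dictionary)
and that `tokenize` and `simple_tokenize` are importable from the top-level
`wordfreq` package; the tokens shown are illustrative rather than verified
output:

```python
from wordfreq import tokenize, simple_tokenize

text = "霞ヶ関に行きました"

# Recommended for Japanese: language-aware tokenization, which dispatches
# to the MeCab-based tokenizer in wordfreq.mecab.
print(tokenize(text, "ja"))

# Generic path: simple_tokenize applies only language-independent rules,
# so it finds far fewer boundaries in unsegmented Japanese text.
print(simple_tokenize(text))
```
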
However, some downstream uses of wordfreq have justifiable reasons to pass all
terms through `simple_tokenize`, even terms that may be in Japanese, and in
those cases we want to detect only the most obvious token boundaries.

In this situation, we no longer try to detect script changes, such as between
kanji and katakana, as token boundaries. This particularly allows us to keep
together Japanese words where ヶ appears between kanji, as well as words that
use the iteration mark 々.

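As a sketch of the behavior described above (the expected tokens in the
comments follow this description rather than verified runs, and assume this
release's fix is installed):

```python
from wordfreq import simple_tokenize

# ヶ between kanji no longer triggers a script-change boundary,
# so a name like 霞ヶ関 stays together as one token.
print(simple_tokenize("霞ヶ関"))  # expected: ['霞ヶ関']

# The iteration mark 々 is likewise kept with the kanji it repeats.
print(simple_tokenize("時々"))  # expected: ['時々']
```
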
This change does not affect any word frequencies. (The Japanese word list uses
`wordfreq.mecab` for tokenization, not `simple_tokenize`.)

## Version 2.0 (2018-03-14)

The big change in this version is that text preprocessing, tokenization, and