From e0da20b0c48ee4d349bceba84e3dc7c23a96bbea Mon Sep 17 00:00:00 2001
From: Robyn Speer <rspeer@luminoso.com>
Date: Tue, 1 May 2018 14:47:55 -0400
Subject: [PATCH] update CHANGELOG for 2.0.1

---
 CHANGELOG.md | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index f812478..08e52fa 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,3 +1,27 @@
+## Version 2.0.1 (2018-05-01)
+
+Fixed edge cases that inserted spurious token boundaries when Japanese text is
+run through `simple_tokenize`, because of a few characters that don't match any
+of our "spaceless scripts".
+
+It is not a typical situation for Japanese text to be passed through
+`simple_tokenize`, because Japanese text should instead use the
+Japanese-specific tokenization in `wordfreq.mecab`.
+
+However, some downstream uses of wordfreq have justifiable reasons to pass all
+terms through `simple_tokenize`, even terms that may be in Japanese, and in
+those cases we want to detect only the most obvious token boundaries.
+
+In this situation, we no longer try to detect script changes, such as between
+kanji and katakana, as token boundaries. This particularly allows us to keep
+together Japanese words where ヶ appears between kanji, as well as words that
+use the iteration mark 々.
+
+This change does not affect any word frequencies. (The Japanese word list uses
+`wordfreq.mecab` for tokenization, not `simple_tokenize`.)
+
+
+
 ## Version 2.0 (2018-03-14)
 
 The big change in this version is that text preprocessing, tokenization, and
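
A minimal sketch of the behavior this entry describes, assuming wordfreq 2.0.1
or later; the sample words 一ヶ月 ("one month") and 時々 ("sometimes") are
illustrative choices, not examples taken from the patch itself:

```python
# Sketch of the tokenization behavior described in the changelog entry above,
# assuming wordfreq >= 2.0.1. The sample words are illustrative, not from the patch.
from wordfreq import simple_tokenize, tokenize

# ヶ between kanji no longer produces a spurious token boundary:
print(simple_tokenize("一ヶ月"))  # expected: ['一ヶ月'], kept as one token

# Likewise, the iteration mark 々 no longer forces a boundary:
print(simple_tokenize("時々"))    # expected: ['時々']

# Real Japanese word segmentation still goes through the MeCab-based path
# (requires wordfreq's optional MeCab dependency):
print(tokenize("一ヶ月は時々長い", "ja"))
```

As the entry notes, `simple_tokenize` stays deliberately conservative here: it
detects only the most obvious token boundaries and leaves genuine Japanese
segmentation to `wordfreq.mecab`.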