From e0da20b0c48ee4d349bceba84e3dc7c23a96bbea Mon Sep 17 00:00:00 2001
From: Robyn Speer <rspeer@luminoso.com>
Date: Tue, 1 May 2018 14:47:55 -0400
Subject: [PATCH] update CHANGELOG for 2.0.1

---
 CHANGELOG.md | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index f812478..08e52fa 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,3 +1,27 @@
+## Version 2.0.1 (2018-05-01)
+
+Fixed edge cases that inserted spurious token boundaries when Japanese text is
+run through `simple_tokenize`, because of a few characters that don't match any
+of our "spaceless scripts".
+
+It is not a typical situation for Japanese text to be passed through
+`simple_tokenize`, because Japanese text should instead use the
+Japanese-specific tokenization in `wordfreq.mecab`.
+
+However, some downstream uses of wordfreq have justifiable reasons to pass all
+terms through `simple_tokenize`, even terms that may be in Japanese, and in
+those cases we want to detect only the most obvious token boundaries.
+
+In this situation, we no longer try to detect script changes, such as between
+kanji and katakana, as token boundaries. This particularly allows us to keep
+together Japanese words where ヶ appears between kanji, as well as words that
+use the iteration mark 々.
+
+This change does not affect any word frequencies. (The Japanese word list uses
+`wordfreq.mecab` for tokenization, not `simple_tokenize`.)
+
+
+
 ## Version 2.0 (2018-03-14)
 
 The big change in this version is that text preprocessing, tokenization, and
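
A minimal sketch of the behavior this entry describes, assuming wordfreq 2.0.1
or later; the sample words 一ヶ月 ("one month") and 時々 ("sometimes") are
illustrative choices, not examples taken from the patch itself:

```python
# Sketch of the tokenization behavior described in the changelog entry above,
# assuming wordfreq >= 2.0.1. The sample words are illustrative, not from the patch.
from wordfreq import simple_tokenize, tokenize

# ヶ between kanji no longer produces a spurious token boundary:
print(simple_tokenize("一ヶ月"))  # expected: ['一ヶ月'], kept as one token

# Likewise, the iteration mark 々 no longer forces a boundary:
print(simple_tokenize("時々"))    # expected: ['時々']

# Real Japanese word segmentation still goes through the MeCab-based path
# (requires wordfreq's optional MeCab dependency):
print(tokenize("一ヶ月は時々長い", "ja"))
```

As the entry notes, `simple_tokenize` stays deliberately conservative here: it
detects only the most obvious token boundaries and leaves genuine Japanese
segmentation to `wordfreq.mecab`.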