mirror of https://github.com/rspeer/wordfreq.git (synced 2024-12-23 17:31:41 +00:00)
update CHANGELOG for 2.0.1
parent 666f7e51fa
commit e0da20b0c4
CHANGELOG.md (24 lines added)
@@ -1,3 +1,27 @@
## Version 2.0.1 (2018-05-01)

Fixed edge cases that inserted spurious token boundaries when Japanese text is
run through `simple_tokenize`, because of a few characters that don't match any
of our "spaceless scripts".

It is not a typical situation for Japanese text to be passed through
`simple_tokenize`, because Japanese text should instead use the
Japanese-specific tokenization in `wordfreq.mecab`.
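
For context, a minimal sketch of that recommended path, assuming a wordfreq 2.x
installation with its optional MeCab-based Japanese support available; the
sample string is purely illustrative:

```python
from wordfreq import tokenize

# With lang='ja', wordfreq dispatches to the Japanese-specific tokenizer
# in wordfreq.mecab rather than simple_tokenize. This requires wordfreq's
# optional MeCab dependency to be installed.
print(tokenize('おはようございます', 'ja'))
```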

However, some downstream uses of wordfreq have justifiable reasons to pass all
terms through `simple_tokenize`, even terms that may be in Japanese, and in
those cases we want to detect only the most obvious token boundaries.

In this situation, we no longer try to detect script changes, such as between
kanji and katakana, as token boundaries. This particularly allows us to keep
together Japanese words where ヶ appears between kanji, as well as words that
use the iteration mark 々.
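
As a hedged illustration of the new behavior (this example is not part of the
changelog itself), assuming `simple_tokenize` is importable from the top-level
`wordfreq` package as in wordfreq 2.x:

```python
from wordfreq import simple_tokenize

# After this fix, a script change inside Japanese text (such as the small
# ke ヶ between kanji, or the iteration mark 々) is no longer treated as
# a token boundary, so each of these should come back as a single token.
print(simple_tokenize('霞ヶ関'))  # place name with ヶ between kanji
print(simple_tokenize('時々'))    # word using the iteration mark 々
```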

This change does not affect any word frequencies. (The Japanese word list uses
`wordfreq.mecab` for tokenization, not `simple_tokenize`.)

## Version 2.0 (2018-03-14)

The big change in this version is that text preprocessing, tokenization, and