Document the version of Unicode used to build the regexes.

2024-12-23 17:31:41 +00:00 · 2015-07-08 18:48:33 -04:00 · 9f8464c2d1
commit 9f8464c2d1
parent cc6920d7e4
1 changed file with 6 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -21,6 +21,12 @@ install them on Ubuntu:
    sudo apt-get install mecab-ipadic-utf8 libmecab-dev
    pip3 install mecab-python3

+## Unicode data
+
+The tokenizers used to split non-Japanese phrases use regexes built using the
+`unicodedata` module from Python 3.4, which uses Unicode version 6.3.0.  To
+update these regexes, run `scripts/gen_regex.py`.
+
 ## License

 `wordfreq` is freely redistributable under the MIT license (see
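
Note on the "Unicode data" section added above: a minimal sketch of how one might check which Unicode version the running interpreter's `unicodedata` module provides before re-running `scripts/gen_regex.py`. The constant `DOCUMENTED_UNICODE_VERSION` and the helper `unicode_version_matches` are hypothetical names used for illustration only; they are not part of `wordfreq` or of this commit.

```python
import sys
import unicodedata

# Version stated in the README text added by this commit (assumed still current).
DOCUMENTED_UNICODE_VERSION = "6.3.0"


def unicode_version_matches(expected=DOCUMENTED_UNICODE_VERSION):
    """Return True if this interpreter's Unicode database is the expected version."""
    # unicodedata.unidata_version is a string such as "6.3.0".
    return unicodedata.unidata_version == expected


if __name__ == "__main__":
    python_version = sys.version.split()[0]
    print("Python", python_version, "ships Unicode", unicodedata.unidata_version)
    if not unicode_version_matches():
        # A newer Python usually bundles a newer Unicode database, so
        # regenerating the regexes may change which characters they match.
        print("Differs from the documented version", DOCUMENTED_UNICODE_VERSION)
```

If the versions differ, regenerated regexes will likely reflect the newer character database, and the version named in the README would need updating to match.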