From 89617294017bb88688c98ec1586187a469814829 Mon Sep 17 00:00:00 2001
From: Andrew Lin
Date: Wed, 8 Jul 2015 18:48:33 -0400
Subject: [PATCH] Document the version of Unicode used to build the regexes.

Former-commit-id: 9f8464c2d1dfbd870d38bff697a76582a4a3f1ff
---
 README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/README.md b/README.md
index 73afd99..c16b7d0 100644
--- a/README.md
+++ b/README.md
@@ -21,6 +21,12 @@ install them on Ubuntu:
     sudo apt-get install mecab-ipadic-utf8 libmecab-dev
     pip3 install mecab-python3
 
+## Unicode data
+
+The tokenizers used to split non-Japanese phrases use regexes built using the
+`unicodedata` module from Python 3.4, which uses Unicode version 6.3.0. To
+update these regexes, run `scripts/gen_regex.py`.
+
 ## License
 
 `wordfreq` is freely redistributable under the MIT license (see
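
Note on the added section: the Unicode version baked into a given interpreter can be checked with `unicodedata.unidata_version`, and a character class can be derived from `unicodedata.category`. The sketch below illustrates that general idea only. It is not the contents of `scripts/gen_regex.py`, and the helper names `chars_in_categories` and `to_char_class` are hypothetical.

```python
import re
import sys
import unicodedata


def chars_in_categories(prefixes):
    """Return the code points whose general category starts with one of `prefixes`.

    `prefixes` is a tuple of strings such as ("P",) for all punctuation.
    (Hypothetical helper, for illustration only.)
    """
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith(prefixes)
    ]


def to_char_class(code_points):
    """Collapse ascending code points into a regex character class string.

    Consecutive code points are merged into ranges, and every character is
    written as a \\UXXXXXXXX escape so that no regex metacharacter needs
    special handling. (Hypothetical helper, for illustration only.)
    """
    ranges = []
    for cp in code_points:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp          # extend the current range
        else:
            ranges.append([cp, cp])     # start a new range

    parts = []
    for start, end in ranges:
        if start == end:
            parts.append("\\U{:08x}".format(start))
        else:
            parts.append("\\U{:08x}-\\U{:08x}".format(start, end))
    return "[" + "".join(parts) + "]"


if __name__ == "__main__":
    # The result depends on the interpreter; the README text above ties
    # Python 3.4 to Unicode 6.3.0.
    print("Unicode data version:", unicodedata.unidata_version)

    punctuation = re.compile(to_char_class(chars_in_categories(("P",))))
    print(bool(punctuation.fullmatch("!")))
```

Because the character class is computed from whatever Unicode tables the running interpreter ships, regenerating under a newer Python may produce a different pattern, which is why recording the version in the README is useful.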