Document the version of Unicode used to build the regexes.

Former-commit-id: 9f8464c2d1
Andrew Lin 2015-07-08 18:48:33 -04:00
parent 8b3c5348e3
commit 8961729401


@@ -21,6 +21,12 @@ install them on Ubuntu:
sudo apt-get install mecab-ipadic-utf8 libmecab-dev
pip3 install mecab-python3
## Unicode data
The tokenizers that split non-Japanese phrases rely on regexes built with the
`unicodedata` module from Python 3.4, which implements Unicode version 6.3.0.
To update these regexes, run `scripts/gen_regex.py`.
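For reference, the Unicode version bundled with a given Python build can be read from `unicodedata.unidata_version`. The snippet below is only a sketch of how a category-based character class could be generated with `unicodedata`; the `char_class` helper and the chosen categories are hypothetical illustrations, not the actual contents of `scripts/gen_regex.py`.

```python
import re
import unicodedata

# Unicode version implemented by this Python's unicodedata module;
# Python 3.4 reports '6.3.0'.
print(unicodedata.unidata_version)

# Hypothetical helper: gather every code point whose general category is in
# `categories` and wrap the escaped result in a regex character class.
def char_class(categories):
    chars = ''.join(
        chr(cp) for cp in range(0x110000)
        if unicodedata.category(chr(cp)) in categories
    )
    return '[' + re.escape(chars) + ']'

# Example: treat a token as a run of letters or combining marks.
TOKEN_RE = re.compile(char_class({'Lu', 'Ll', 'Lt', 'Lm', 'Lo',
                                  'Mn', 'Mc', 'Me'}) + '+')
print(TOKEN_RE.findall('¡Hola, mundo! 123'))  # ['Hola', 'mundo']
```

Because the generated class depends on the data tables shipped with `unicodedata`, rerunning such a script under a different Python version can produce a different regex, which is why the Unicode version is worth documenting.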
## License
`wordfreq` is freely redistributable under the MIT license (see