From 89617294017bb88688c98ec1586187a469814829 Mon Sep 17 00:00:00 2001
From: Andrew Lin <alin@luminoso.com>
Date: Wed, 8 Jul 2015 18:48:33 -0400
Subject: [PATCH] Document the version of Unicode used to build the regexes.

Former-commit-id: 9f8464c2d1dfbd870d38bff697a76582a4a3f1ff
---
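Note: the Unicode version depends on the Python interpreter that runs the
generator script, so it is worth checking before regenerating the regexes.
A minimal sketch, assuming only the standard library; the helper name
`unicode_data_version` is hypothetical and is not part of `wordfreq` or
`scripts/gen_regex.py`:

```python
import unicodedata


def unicode_data_version() -> str:
    """Return the Unicode database version bundled with this interpreter."""
    # unidata_version is a module-level string such as "6.3.0";
    # the value depends on which Python version is running.
    return unicodedata.unidata_version


if __name__ == "__main__":
    print("Unicode data version:", unicode_data_version())
```

If the printed version differs from the one documented in the README, the
regenerated regexes may cover a different set of characters.
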
 README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/README.md b/README.md
index 73afd99..c16b7d0 100644
--- a/README.md
+++ b/README.md
@@ -21,6 +21,12 @@ install them on Ubuntu:
     sudo apt-get install mecab-ipadic-utf8 libmecab-dev
     pip3 install mecab-python3
 
+## Unicode data
+
+The tokenizers that split non-Japanese phrases rely on regexes generated with
+the `unicodedata` module from Python 3.4, which uses Unicode version 6.3.0.
+To regenerate these regexes, run `scripts/gen_regex.py`.
+
 ## License
 
 `wordfreq` is freely redistributable under the MIT license (see