update and clean up the tokenize() docstring

Former-commit-id: 24b16d8a5d
Robyn Speer 2015-09-24 17:47:16 -04:00
parent 4a4534c466
commit 960dc437a2


@@ -127,19 +127,25 @@ def tokenize(text, lang, include_punctuation=False, external_wordlist=False):
     - Chinese will be mapped to Simplified Chinese characters and tokenized
       using the jieba tokenizer, on a custom word list of words that can be
       looked up in wordfreq.
-    - Japanese will be delegated to the external mecab-python module.
+    - Japanese will be delegated to the external mecab-python module. It will
+      be NFKC normalized, which is stronger than NFC normalization.
     - Chinese or Japanese texts that aren't identified as the appropriate
       language will only split on punctuation and script boundaries, giving
       you untokenized globs of characters that probably represent many words.
+    - Arabic will be NFKC normalized, and will have Arabic-specific combining
+      marks and tatweels removed.
+    - Languages written in cased alphabets will be case-folded to lowercase.
     - Turkish will use a different case-folding procedure, so that capital
       I and İ map to ı and i respectively.
-    - All other languages will be tokenized using a regex that mostly
-      implements the Word Segmentation section of Unicode Annex #29.
-      See `simple_tokenize` for details.
-
-    Additionally, the text will be case-folded to lowercase, and text marked
-    as Arabic will be normalized more strongly and have combining marks and
-    tatweels removed.
+    - Languages besides Japanese and Chinese will be tokenized using a regex
+      that mostly implements the Word Segmentation section of Unicode Annex
+      #29. See `simple_tokenize` for details.

     If `external_wordlist` is True, then the Chinese wordlist in wordfreq will
     not be used for tokenization. Instead, it will use the large wordlist
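
As a quick illustration of the behavior this docstring describes, here is a minimal usage sketch (not part of the commit). It assumes wordfreq is installed, along with the optional jieba dependency for the Chinese examples; the results shown in the comments are expectations based on the docstring, not verified output.

from wordfreq import tokenize

# Cased alphabets are case-folded to lowercase; punctuation is dropped
# unless include_punctuation=True is passed.
print(tokenize('Hello, WORLD', 'en'))    # e.g. ['hello', 'world']

# Turkish uses its own case-folding: capital I maps to ı and İ maps to i.
print(tokenize('DİYARBAKIR', 'tr'))      # e.g. ['diyarbakır']

# Arabic is NFKC normalized, with combining marks and tatweels removed.
print(tokenize('العـــربية', 'ar'))       # tatweel characters stripped

# Chinese is tokenized with jieba on wordfreq's own word list by default;
# external_wordlist=True switches to jieba's larger built-in wordlist.
print(tokenize('谢谢你', 'zh'))
print(tokenize('谢谢你', 'zh', external_wordlist=True))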