clarify the tokenize docstring

Former-commit-id: bfc17fea9f
Robyn Speer 2015-10-19 12:18:12 -04:00
parent c9693c9502
commit 91a81c1bde


@@ -119,13 +119,14 @@ jieba_tokenize = None
 def tokenize(text, lang, include_punctuation=False, external_wordlist=False):
     """
     Tokenize this text in a way that's relatively simple but appropriate for
-    the language.
+    the language. Strings that are looked up in wordfreq will be run through
+    this function first, so that they can be expected to match the data.

-    So far, this means:
+    Here is what the tokenizer will do, depending on the language:

     - Chinese will be mapped to Simplified Chinese characters and tokenized
-      using the jieba tokenizer, on a custom word list of words that can be
-      looked up in wordfreq.
+      using the Jieba tokenizer, trained on a custom word list of words that
+      can be looked up in wordfreq.

     - Japanese will be delegated to the external mecab-python module. It will
       be NFKC normalized, which is stronger than NFC normalization.
@@ -146,15 +147,12 @@ def tokenize(text, lang, include_punctuation=False, external_wordlist=False):
     that mostly implements the Word Segmentation section of Unicode Annex
     #29. See `simple_tokenize` for details.

-    If `external_wordlist` is True, then the Chinese wordlist in wordfreq will
-    not be used for tokenization. Instead, it will use the large wordlist
-    packaged with the Jieba tokenizer, and it will leave Traditional Chinese
-    characters as is. This will probably give more accurate tokenization, but
-    the resulting tokens won't necessarily have word frequencies that can be
-    looked up.
-
-    Strings that are looked up in wordfreq will be run through this function
-    first, so that they can be expected to match the data.
+    The `external_wordlist` option only affects Chinese tokenization. If it's
+    True, then wordfreq will not use its own Chinese wordlist for tokenization.
+    Instead, it will use the large wordlist packaged with the Jieba tokenizer,
+    and it will leave Traditional Chinese characters as is. This will probably
+    give more accurate tokenization, but the resulting tokens won't necessarily
+    have word frequencies that can be looked up.
     """
     if lang == 'ja':
         return japanese_tokenize(text, include_punctuation)
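
For reference, a minimal usage sketch of the behavior the revised docstring describes. It assumes `tokenize` is importable from the top-level wordfreq package and that 'en' and 'zh' are the language codes in use; the keyword arguments come straight from the signature shown in the diff above.

    # Sketch only: import path and language codes are assumptions, not taken
    # from this commit.
    from wordfreq import tokenize

    # English: simple Unicode-based word segmentation (Annex #29 style).
    print(tokenize("I'm relatively simple, but appropriate.", 'en'))

    # Chinese: mapped to Simplified characters and segmented with Jieba,
    # using wordfreq's own wordlist so the tokens can be looked up.
    print(tokenize('谢谢你', 'zh'))

    # Chinese with external_wordlist=True: Jieba's larger packaged wordlist
    # is used instead, and Traditional characters are left as is; the
    # resulting tokens won't necessarily have look-up-able frequencies.
    print(tokenize('謝謝你', 'zh', external_wordlist=True))

Japanese ('ja') follows the same call pattern but, per the docstring, requires the external mecab-python module to be installed.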