Mirror of https://github.com/rspeer/wordfreq.git, synced 2024-12-24 01:41:39 +00:00
update and clean up the tokenize() docstring
Former-commit-id: 24b16d8a5d
parent 4a4534c466
commit 960dc437a2
@@ -127,19 +127,25 @@ def tokenize(text, lang, include_punctuation=False, external_wordlist=False):
     - Chinese will be mapped to Simplified Chinese characters and tokenized
       using the jieba tokenizer, on a custom word list of words that can be
       looked up in wordfreq.
-    - Japanese will be delegated to the external mecab-python module.
+
+    - Japanese will be delegated to the external mecab-python module. It will
+      be NFKC normalized, which is stronger than NFC normalization.
+
     - Chinese or Japanese texts that aren't identified as the appropriate
       language will only split on punctuation and script boundaries, giving
       you untokenized globs of characters that probably represent many words.
+
+    - Arabic will be NFKC normalized, and will have Arabic-specific combining
+      marks and tatweels removed.
+
+    - Languages written in cased alphabets will be case-folded to lowercase.
+
     - Turkish will use a different case-folding procedure, so that capital
       I and İ map to ı and i respectively.
-    - All other languages will be tokenized using a regex that mostly
-      implements the Word Segmentation section of Unicode Annex #29.
-      See `simple_tokenize` for details.
 
-    Additionally, the text will be case-folded to lowercase, and text marked
-    as Arabic will be normalized more strongly and have combining marks and
-    tatweels removed.
+    - Languages besides Japanese and Chinese will be tokenized using a regex
+      that mostly implements the Word Segmentation section of Unicode Annex
+      #29. See `simple_tokenize` for details.
 
     If `external_wordlist` is True, then the Chinese wordlist in wordfreq will
     not be used for tokenization. Instead, it will use the large wordlist
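For readers landing on this commit, a quick usage sketch of the function whose docstring changed. The calls follow the signature shown in the hunk header; the commented results restate the docstring's claims and may vary by wordfreq version (Japanese additionally requires the external mecab-python module).

    from wordfreq import tokenize

    # Cased alphabets are case-folded to lowercase; punctuation is dropped.
    tokenize('Hello, WORLD!', 'en')
    # expected, per the docstring: ['hello', 'world']

    # include_punctuation=True keeps punctuation as tokens instead.
    tokenize('Hello, WORLD!', 'en', include_punctuation=True)

    # Chinese is mapped to Simplified characters and segmented by jieba,
    # against wordfreq's own word list by default...
    tokenize('谢谢你', 'zh')

    # ...or against jieba's larger built-in word list if requested.
    tokenize('谢谢你', 'zh', external_wordlist=True)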
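The rewritten Japanese bullet says NFKC normalization is stronger than NFC. A standard-library illustration of the difference (plain unicodedata, not wordfreq code):

    import unicodedata

    # NFC only composes canonically equivalent sequences; NFKC also folds
    # compatibility characters such as full-width forms and ligatures.
    s = 'Ｐｙｔｈｏｎ ﬁle'              # full-width Latin plus an 'fi' ligature
    unicodedata.normalize('NFC', s)    # 'Ｐｙｔｈｏｎ ﬁle' (unchanged)
    unicodedata.normalize('NFKC', s)   # 'Python file'

    # Half-width katakana, common in Japanese text, is folded as well:
    unicodedata.normalize('NFKC', 'ﾃﾞｰﾀ')   # 'データ'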
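The Arabic and Turkish bullets describe plain text transformations. The sketch below illustrates what the docstring describes, using assumed Unicode ranges (tatweel U+0640, Arabic combining marks U+064B through U+065F); it is not wordfreq's actual implementation:

    import re

    # Remove tatweel and the Arabic combining marks (fathatan, shadda,
    # sukun, etc.). The exact ranges here are an assumption.
    ARABIC_MARKS = re.compile('[\u0640\u064B-\u065F]')

    def strip_arabic_marks(text):
        return ARABIC_MARKS.sub('', text)

    # Turkish case-folding: plain str.lower() maps 'İ' to 'i' plus a
    # combining dot (U+0307), so map both capital I's explicitly first.
    def turkish_casefold(text):
        return text.replace('İ', 'i').replace('I', 'ı').lower()

    turkish_casefold('İstanbul')   # 'istanbul'
    turkish_casefold('ISPARTA')    # 'ısparta'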