mirror of
https://github.com/rspeer/wordfreq.git
synced 2024-12-23 09:21:37 +00:00
Update README to describe @ tokenization
This commit is contained in:
parent 86b928f967
commit f73406c69a
@@ -1,3 +1,5 @@
+## Version 2.2
+
 ## Version 2.1 (2018-06-18)
 
 Data changes:
13 README.md
@@ -276,7 +276,8 @@ produces tokens that follow the recommendations in [Unicode
 Annex #29, Text Segmentation][uax29], including the optional rule that
 splits words between apostrophes and vowels.
 
-There are language-specific exceptions:
+There are exceptions where we change the tokenization to work better
+with certain languages:
 
 - In Arabic and Hebrew, it additionally normalizes ligatures and removes
   combining marks.
@@ -288,11 +289,21 @@ There are language-specific exceptions:
 - In Chinese, it uses the external Python library `jieba`, another optional
   dependency.
 
+- While the @ sign is usually considered a symbol and not part of a word,
+  wordfreq will allow a word to end with "@" or "@s". This is one way of
+  writing gender-neutral words in Spanish and Portuguese.
+
 [uax29]: http://unicode.org/reports/tr29/
 
 When wordfreq's frequency lists are built in the first place, the words are
 tokenized according to this function.
 
+    >>> from wordfreq import tokenize
+    >>> tokenize('l@s niñ@s', 'es')
+    ['l@s', 'niñ@s']
+    >>> zipf_frequency('l@s', 'es')
+    2.8
+
 Because tokenization in the real world is far from consistent, wordfreq will
 also try to deal gracefully when you query it with texts that actually break
 into multiple tokens:
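The diff above describes the trailing-@ rule only in prose. As a rough, self-contained illustration (this is NOT wordfreq's actual tokenizer, which follows Unicode Annex #29 and regex machinery far beyond this), the rule "a word may additionally end in `@` or `@s`" can be sketched with a small regex:

```python
import re

# Illustrative sketch only -- not wordfreq's real implementation.
# It demonstrates the rule described in the diff: a run of word
# characters may optionally end with "@" or "@s", as in the
# Spanish/Portuguese gender-neutral spellings "l@s" and "niñ@s".
TOKEN_RE = re.compile(r"\w+(?:@s?)?", re.UNICODE)

def toy_tokenize(text):
    """Lowercase the text and return word tokens, keeping a trailing '@' or '@s'."""
    return TOKEN_RE.findall(text.lower())

print(toy_tokenize("l@s niñ@s"))  # -> ['l@s', 'niñ@s']
```

A plain `\w+` pattern would split "l@s" into "l" and "s"; the optional `(?:@s?)` suffix is what keeps the @-ending forms intact, mirroring the behavior the commit documents.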