mirror of
https://github.com/rspeer/wordfreq.git
synced 2024-12-23 09:21:37 +00:00
Update README to describe @ tokenization
This commit is contained in:
parent 86b928f967
commit f73406c69a
@@ -1,3 +1,5 @@
+## Version 2.2
+
 ## Version 2.1 (2018-06-18)
 
 Data changes:
13 README.md
@@ -276,7 +276,8 @@ produces tokens that follow the recommendations in [Unicode
 Annex #29, Text Segmentation][uax29], including the optional rule that
 splits words between apostrophes and vowels.
 
-There are language-specific exceptions:
+There are exceptions where we change the tokenization to work better
+with certain languages:
 
 - In Arabic and Hebrew, it additionally normalizes ligatures and removes
   combining marks.
@@ -288,11 +289,21 @@ There are language-specific exceptions:
 - In Chinese, it uses the external Python library `jieba`, another optional
   dependency.
 
+- While the @ sign is usually considered a symbol and not part of a word,
+  wordfreq will allow a word to end with "@" or "@s". This is one way of
+  writing gender-neutral words in Spanish and Portuguese.
+
 [uax29]: http://unicode.org/reports/tr29/
 
 When wordfreq's frequency lists are built in the first place, the words are
 tokenized according to this function.
 
+    >>> from wordfreq import tokenize
+    >>> tokenize('l@s niñ@s', 'es')
+    ['l@s', 'niñ@s']
+    >>> zipf_frequency('l@s', 'es')
+    2.8
+
 Because tokenization in the real world is far from consistent, wordfreq will
 also try to deal gracefully when you query it with texts that actually break
 into multiple tokens:
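The diff above describes the trailing-@ rule only in prose. As a rough, self-contained illustration (this is NOT wordfreq's actual tokenizer, which follows Unicode Annex #29 and regex machinery far beyond this), the rule "a word may additionally end in `@` or `@s`" can be sketched with a small regex:

```python
import re

# Illustrative sketch only -- not wordfreq's real implementation.
# It demonstrates the rule described in the diff: a run of word
# characters may optionally end with "@" or "@s", as in the
# Spanish/Portuguese gender-neutral spellings "l@s" and "niñ@s".
TOKEN_RE = re.compile(r"\w+(?:@s?)?", re.UNICODE)

def toy_tokenize(text):
    """Lowercase the text and return word tokens, keeping a trailing '@' or '@s'."""
    return TOKEN_RE.findall(text.lower())

print(toy_tokenize("l@s niñ@s"))  # -> ['l@s', 'niñ@s']
```

A plain `\w+` pattern would split "l@s" into "l" and "s"; the optional `(?:@s?)` suffix is what keeps the @-ending forms intact, mirroring the behavior the commit documents.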