Merge pull request #51 from LuminosoInsight/version1.7

Version 1.7: update tokenization, update Wikipedia data, add languages
Andrew Lin 2017-09-08 17:02:05 -04:00 committed by GitHub
commit 95a13ab4ce
81 changed files with 25728 additions and 25534 deletions

CHANGELOG.md

@@ -1,3 +1,23 @@
## Version 1.7.0 (2017-08-25)
- Tokenization will always keep Unicode graphemes together, including
complex emoji introduced in Unicode 10
- Update the Wikipedia source data to April 2017
- Remove some non-words, such as the Unicode replacement character and the
pilcrow sign, from frequency lists
- Support Bengali and Macedonian, which passed the threshold of having enough
source data to be included
## Version 1.6.1 (2017-05-10)
- Depend on langcodes 1.4, with a new language-matching system that does not
depend on SQLite.
This prevents silly conflicts where langcodes' SQLite connection was
preventing langcodes from being used in threads.
## Version 1.6.0 (2017-01-05)
- Support Czech, Persian, Ukrainian, and Croatian/Bosnian/Serbian

README.md

@@ -16,70 +16,8 @@ or by getting the repository and running its setup.py:
python3 setup.py install
## Additional CJK installation
Chinese, Japanese, and Korean have additional external dependencies so that
they can be tokenized correctly. Here we'll explain how to set them up,
in increasing order of difficulty.
### Chinese
To be able to look up word frequencies in Chinese, you need Jieba, a
pure-Python Chinese tokenizer:
pip3 install jieba
### Japanese
We use MeCab, by Taku Kudo, to tokenize Japanese. To use this in wordfreq, three
things need to be installed:
* The MeCab development library (called `libmecab-dev` on Ubuntu)
* The UTF-8 version of the `ipadic` Japanese dictionary
(called `mecab-ipadic-utf8` on Ubuntu)
* The `mecab-python3` Python interface
To install these three things on Ubuntu, you can run:
```sh
sudo apt-get install libmecab-dev mecab-ipadic-utf8
pip3 install mecab-python3
```
If you choose to install `ipadic` from somewhere else or from its source code,
be sure it's configured to use UTF-8. By default it will use EUC-JP, which will
give you nonsense results.
### Korean
Korean also uses MeCab, with a Korean dictionary package by Yongwoon Lee and
Yungho Yu. This dictionary is not available as an Ubuntu package.
Here's a process you can use to install the Korean dictionary and the other
MeCab dependencies:
```sh
sudo apt-get install libmecab-dev mecab-utils
pip3 install mecab-python3
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.1-20150920.tar.gz
tar xvf mecab-ko-dic-2.0.1-20150920.tar.gz
cd mecab-ko-dic-2.0.1-20150920
./autogen.sh
make
sudo make install
```
If wordfreq cannot find the Japanese or Korean data for MeCab when asked to
tokenize those languages, it will raise an error and show you the list of
paths it searched.
Sorry that this is difficult. We tried to just package the data files we need
with wordfreq, like we do for Chinese, but PyPI would reject the package for
being too large.
See [Additional CJK installation](#additional-cjk-installation) for extra
steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
## Usage
@@ -175,10 +113,10 @@ the list, in descending frequency order.
>>> from wordfreq import top_n_list
>>> top_n_list('en', 10)
['the', 'to', 'of', 'and', 'a', 'in', 'i', 'is', 'that', 'for']
['the', 'of', 'to', 'and', 'a', 'in', 'i', 'is', 'that', 'for']
>>> top_n_list('es', 10)
['de', 'la', 'que', 'en', 'el', 'y', 'a', 'los', 'no', 'se']
['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']
`iter_wordlist(lang, wordlist='combined')` iterates through all the words in a
wordlist, in descending frequency order.
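Because the iterator is already in frequency order, you can take just the front of it without building the whole list. Here is a small sketch (the exact words depend on the data version, but the first ten English entries should match the `top_n_list` output above):

```python
from itertools import islice
from wordfreq import iter_wordlist

# iter_wordlist yields words in descending frequency order, so slicing off
# the first ten should match top_n_list('en', 10).
print(list(islice(iter_wordlist('en'), 10)))
```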
@@ -197,10 +135,12 @@ will select each random word from 2^n words.
If you happen to want an easy way to get [a memorable, xkcd-style
password][xkcd936] with 60 bits of entropy, this function will almost do the
job. In this case, you should actually run the similar function `random_ascii_words`,
limiting the selection to words that can be typed in ASCII.
job. In this case, you should actually run the similar function
`random_ascii_words`, limiting the selection to words that can be typed in
ASCII. But maybe you should just use [xkpa][].
[xkcd936]: https://xkcd.com/936/
[xkpa]: https://github.com/beala/xkcd-password
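As a concrete sketch of the xkcd-style usage described above (five words at 12 bits each gives about 60 bits of entropy; the resulting phrase is random by design, so no particular output is expected):

```python
from wordfreq import random_ascii_words

# Each word is chosen from the 2**12 most frequent ASCII-typeable words,
# so a five-word phrase carries roughly 60 bits of entropy.
passphrase = random_ascii_words(lang='en', nwords=5, bits_per_word=12)
print(passphrase)
```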
## Sources and supported languages
@@ -230,38 +170,40 @@ least 3 different sources of word frequencies:
Language Code # Large? WP Subs News Books Web Twit. Redd. Misc.
──────────────────────────────┼────────────────────────────────────────────────
Arabic ar 5 Yes │ Yes Yes Yes - Yes Yes - -
Bosnian bs [1] 3 │ Yes Yes - - - Yes - -
Bengali bn 3 - │ Yes - Yes - - Yes - -
Bosnian bs [1] 3 - │ Yes Yes - - - Yes - -
Bulgarian bg 3 - │ Yes Yes - - - Yes - -
Catalan ca 4 - │ Yes Yes Yes - - Yes - -
Chinese zh [3] 6 Yes │ Yes - Yes Yes Yes Yes - Jieba
Croatian hr [1] 3 │ Yes Yes - - - Yes - -
Czech cs 3 - │ Yes Yes - - - Yes - -
Danish da 3 - │ Yes Yes - - - Yes - -
German de 7 Yes │ Yes Yes Yes Yes Yes Yes Yes -
Greek el 3 - │ Yes Yes - - Yes - - -
Dutch nl 4 Yes │ Yes Yes Yes - - Yes - -
English en 7 Yes │ Yes Yes Yes Yes Yes Yes Yes -
Spanish es 7 Yes │ Yes Yes Yes Yes Yes Yes Yes -
Persian fa 3 - │ Yes Yes - - - Yes - -
Finnish fi 5 Yes │ Yes Yes Yes - - Yes Yes -
French fr 7 Yes │ Yes Yes Yes Yes Yes Yes Yes -
German de 7 Yes │ Yes Yes Yes Yes Yes Yes Yes -
Greek el 3 - │ Yes Yes - - Yes - - -
Hebrew he 4 - │ Yes Yes - Yes - Yes - -
Hindi hi 3 - │ Yes - - - - Yes Yes -
Croatian hr [1] 3 │ Yes Yes - - - Yes - -
Hungarian hu 3 - │ Yes Yes - - Yes - - -
Indonesian id 3 - │ Yes Yes - - - Yes - -
Italian it 7 Yes │ Yes Yes Yes Yes Yes Yes Yes -
Japanese ja 5 Yes │ Yes Yes - - Yes Yes Yes -
Korean ko 4 - │ Yes Yes - - - Yes Yes -
Macedonian mk 3 - │ Yes Yes Yes - - - - -
Malay ms 3 - │ Yes Yes - - - Yes - -
Norwegian nb [2] 4 - │ Yes Yes - - - Yes Yes -
Dutch nl 4 Yes │ Yes Yes Yes - - Yes - -
Persian fa 3 - │ Yes Yes - - - Yes - -
Polish pl 5 Yes │ Yes Yes Yes - - Yes Yes -
Portuguese pt 5 Yes │ Yes Yes Yes - Yes Yes - -
Romanian ro 3 - │ Yes Yes - - - Yes - -
Russian ru 6 Yes │ Yes Yes Yes Yes Yes Yes - -
Serbian sr [1] 3 - │ Yes Yes - - - Yes - -
Spanish es 7 Yes │ Yes Yes Yes Yes Yes Yes Yes -
Swedish sv 4 - │ Yes Yes - - - Yes Yes -
Turkish tr 3 - │ Yes Yes - - - Yes - -
Ukrainian uk 4 - │ Yes Yes - - - Yes Yes -
Chinese zh [3] 6 Yes │ Yes - Yes Yes Yes Yes - Jieba
[1] Bosnian, Croatian, and Serbian use the same underlying word list, because
they share most of their vocabulary and grammar, they were once considered the
@@ -277,7 +219,7 @@ Chinese, with primarily Mandarin Chinese vocabulary. See "Multi-script
languages" below.
Some languages provide 'large' wordlists, including words with a Zipf frequency
between 1.0 and 3.0. These are available in 12 languages that are covered by
between 1.0 and 3.0. These are available in 13 languages that are covered by
enough data sources.
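A sketch of how to ask for a 'large' wordlist (the word used here is only an illustration, and its exact values depend on the data):

```python
from wordfreq import zipf_frequency

# A rare word can fall below the default list's cutoff (Zipf 3.0) yet still
# appear in the 'large' list, which extends down to Zipf 1.0.
print(zipf_frequency('zymurgy', 'en'))                    # may be 0.0 in the default list
print(zipf_frequency('zymurgy', 'en', wordlist='large'))  # a small nonzero value, if the word is present
```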
@@ -314,7 +256,7 @@ into multiple tokens:
>>> zipf_frequency('New York', 'en')
5.35
>>> zipf_frequency('北京地铁', 'zh') # "Beijing Subway"
3.56
3.55
The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be. In Chinese,
@@ -381,6 +323,71 @@ frequencies in `cmn-Hans` (the fully specific language code for Mandarin in
Simplified Chinese), you will get the `zh` wordlist, for example.
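As a quick sketch of that matching (it assumes the usual `zh` data is installed; `top_n_list` should only read the wordlist, so it should not need the Chinese tokenizer):

```python
from wordfreq import top_n_list

# Asking for the fully specific code should fall back to the 'zh' wordlist,
# so both calls should return the same words.
print(top_n_list('cmn-Hans', 5) == top_n_list('zh', 5))
```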
## Additional CJK installation
Chinese, Japanese, and Korean have additional external dependencies so that
they can be tokenized correctly. Here we'll explain how to set them up,
in increasing order of difficulty.
### Chinese
To be able to look up word frequencies in Chinese, you need Jieba, a
pure-Python Chinese tokenizer:
pip3 install jieba
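Once Jieba is installed, Chinese behaves like any other supported language. A minimal check (the number printed depends on your wordfreq version):

```python
from wordfreq import word_frequency

# Looking up Chinese text requires the 'jieba' package, because wordfreq
# tokenizes the input before looking it up.
print(word_frequency('谢谢', 'zh'))   # "thank you"
```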
### Japanese
We use MeCab, by Taku Kudo, to tokenize Japanese. To use this in wordfreq, three
things need to be installed:
* The MeCab development library (called `libmecab-dev` on Ubuntu)
* The UTF-8 version of the `ipadic` Japanese dictionary
(called `mecab-ipadic-utf8` on Ubuntu)
* The `mecab-python3` Python interface
To install these three things on Ubuntu, you can run:
```sh
sudo apt-get install libmecab-dev mecab-ipadic-utf8
pip3 install mecab-python3
```
If you choose to install `ipadic` from somewhere else or from its source code,
be sure it's configured to use UTF-8. By default it will use EUC-JP, which will
give you nonsense results.
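Once those pieces are installed, a quick way to confirm that MeCab is being used is to tokenize some Japanese text (a sketch; the exact token boundaries produced by your install may differ):

```python
from wordfreq import tokenize, word_frequency

# If MeCab and the UTF-8 ipadic dictionary are found, this returns a list of
# dictionary words instead of raising an error about missing data.
print(tokenize('おはようございます', 'ja'))
print(word_frequency('おはよう', 'ja'))
```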
### Korean
Korean also uses MeCab, with a Korean dictionary package by Yongwoon Lee and
Yungho Yu. This dictionary is not available as an Ubuntu package.
Here's a process you can use to install the Korean dictionary and the other
MeCab dependencies:
```sh
sudo apt-get install libmecab-dev mecab-utils
pip3 install mecab-python3
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.1-20150920.tar.gz
tar xvf mecab-ko-dic-2.0.1-20150920.tar.gz
cd mecab-ko-dic-2.0.1-20150920
./autogen.sh
make
sudo make install
```
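After `make install` finishes, a similar sketch can confirm that the Korean dictionary is found (token boundaries and values are illustrative):

```python
from wordfreq import tokenize, word_frequency

# Uses MeCab with mecab-ko-dic; if the dictionary is missing, wordfreq raises
# an error listing the paths it searched (see below).
print(tokenize('감사합니다', 'ko'))
print(word_frequency('감사', 'ko'))
```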
If wordfreq cannot find the Japanese or Korean data for MeCab when asked to
tokenize those languages, it will raise an error and show you the list of
paths it searched.
Sorry that this is difficult. We tried to just package the data files we need
with wordfreq, like we do for Chinese, but PyPI would reject the package for
being too large.
## License
`wordfreq` is freely redistributable under the MIT license (see

scripts/top_n.py (new file)

@@ -0,0 +1,14 @@
"""
A quick script to output the top N words (1000 for now) in each language.
You can send the output to a file and diff it to see changes between wordfreq
versions.
"""
import wordfreq

N = 1000

for lang in sorted(wordfreq.available_languages()):
    for word in wordfreq.top_n_list(lang, N):
        print('{}\t{}'.format(lang, word))


@@ -35,6 +35,8 @@ LAUGHTER_WORDS = {
'he': 'חחח',
'bg': 'ахаха',
'uk': 'хаха',
'bn': 'হা হা',
'mk': 'хаха'
}
@@ -190,7 +192,7 @@ def test_not_really_random():
# This not only tests random_ascii_words, it makes sure we didn't end
# up with 'eos' as a very common Japanese word
eq_(random_ascii_words(nwords=4, lang='ja', bits_per_word=0),
'00 00 00 00')
'1 1 1 1')
@raises(ValueError)

(The remaining changes are to binary wordfreq data files, whose contents are not shown, plus one file whose diff was suppressed for being too large.)