Merge pull request #91 from LuminosoInsight/data-update-2.5

Version 2.5, incorporating OSCAR data
Sara Jewett 2021-04-15 14:32:10 -04:00 committed by GitHub
commit b13d35e503
67 changed files with 38864 additions and 37478 deletions

README.md

@@ -45,16 +45,16 @@ frequency as a decimal between 0 and 1.

     >>> from wordfreq import word_frequency
     >>> word_frequency('cafe', 'en')
-    1.05e-05
+    1.23e-05
     >>> word_frequency('café', 'en')
     5.62e-06
     >>> word_frequency('cafe', 'fr')
-    1.55e-06
+    1.51e-06
     >>> word_frequency('café', 'fr')
-    6.61e-05
+    5.75e-05

 `zipf_frequency` is a variation on `word_frequency` that aims to return the
@@ -72,16 +72,16 @@ one occurrence per billion words.

     >>> from wordfreq import zipf_frequency
     >>> zipf_frequency('the', 'en')
-    7.76
+    7.73
     >>> zipf_frequency('word', 'en')
     5.26
     >>> zipf_frequency('frequency', 'en')
-    4.48
+    4.36
     >>> zipf_frequency('zipf', 'en')
-    1.62
+    1.49
     >>> zipf_frequency('zipf', 'en', wordlist='small')
     0.0
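
For readers comparing the old and new values: the Zipf scale is the base-10
logarithm of a word's frequency per billion words, so each shift above reflects
a small change in the underlying frequency. A minimal sketch of the conversion,
using only that documented definition (`zipf_from_frequency` is an illustrative
helper, not part of wordfreq's API):

    from math import log10

    def zipf_from_frequency(freq):
        # Zipf value = log10 of the frequency per billion words, so a
        # frequency of 1e-9 (once per billion words) maps to Zipf 0.0.
        return log10(freq * 1e9)

    # word_frequency('cafe', 'en') == 1.23e-05 corresponds to Zipf ~4.09:
    print(round(zipf_from_frequency(1.23e-05), 2))   # 4.09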
@@ -167,41 +167,49 @@ least 3 different sources of word frequencies:

 Language     Code    #   Large? │ WP    Subs  News  Books Web   Twit. Redd. Misc.
 ────────────────────────────────┼────────────────────────────────────────────────
 Arabic       ar      5   Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
-Bengali      bn      3   -      │ Yes   -     Yes   -     -     Yes   -     -
+Bangla       bn      5   Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
 Bosnian      bs [1]  3   -      │ Yes   Yes   -     -     -     Yes   -     -
-Bulgarian    bg      3   -      │ Yes   Yes   -     -     -     Yes   -     -
+Bulgarian    bg      4   -      │ Yes   Yes   -     -     Yes   Yes   -     -
-Catalan      ca      4   -      │ Yes   Yes   Yes   -     -     Yes   -     -
+Catalan      ca      5   Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
 Chinese      zh [3]  7   Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   -     Jieba
 Croatian     hr [1]  3   -      │ Yes   Yes   -     -     -     Yes   -     -
 Czech        cs      5   Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
-Danish       da      3   -      │ Yes   Yes   -     -     -     Yes   -     -
+Danish       da      4   -      │ Yes   Yes   -     -     Yes   Yes   -     -
 Dutch        nl      5   Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
 English      en      7   Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
 Finnish      fi      6   Yes    │ Yes   Yes   Yes   -     Yes   Yes   Yes   -
 French       fr      7   Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
 German       de      7   Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
-Greek        el      3   -      │ Yes   Yes   -     -     Yes   -     -     -
+Greek        el      4   -      │ Yes   Yes   -     -     Yes   Yes   -     -
-Hebrew       he      4   -      │ Yes   Yes   -     Yes   -     Yes   -     -
+Hebrew       he      5   Yes    │ Yes   Yes   -     Yes   Yes   Yes   -     -
-Hindi        hi      3   -      │ Yes   -     -     -     -     Yes   Yes   -
+Hindi        hi      4   Yes    │ Yes   -     -     -     Yes   Yes   Yes   -
-Hungarian    hu      3   -      │ Yes   Yes   -     -     Yes   -     -     -
+Hungarian    hu      4   -      │ Yes   Yes   -     -     Yes   Yes   -     -
+Icelandic    is      3   -      │ Yes   Yes   -     -     Yes   -     -     -
 Indonesian   id      3   -      │ Yes   Yes   -     -     -     Yes   -     -
 Italian      it      7   Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
 Japanese     ja      5   Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -
 Korean       ko      4   -      │ Yes   Yes   -     -     -     Yes   Yes   -
 Latvian      lv      4   -      │ Yes   Yes   -     -     Yes   Yes   -     -
-Macedonian   mk      3   -      │ Yes   Yes   Yes   -     -     -     -     -
+Lithuanian   lt      3   -      │ Yes   Yes   -     -     Yes   -     -     -
+Macedonian   mk      5   Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
 Malay        ms      3   -      │ Yes   Yes   -     -     -     Yes   -     -
-Norwegian    nb [2]  4   -      │ Yes   Yes   -     -     -     Yes   Yes   -
+Norwegian    nb [2]  5   Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -
-Persian      fa      3   -      │ Yes   Yes   -     -     -     Yes   -     -
+Persian      fa      4   -      │ Yes   Yes   -     -     Yes   Yes   -     -
 Polish       pl      6   Yes    │ Yes   Yes   Yes   -     Yes   Yes   Yes   -
 Portuguese   pt      5   Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
-Romanian     ro      4   -      │ Yes   Yes   -     -     Yes   Yes   -     -
+Romanian     ro      3   -      │ Yes   Yes   -     -     Yes   -     -     -
-Russian      ru      6   Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   -     -
+Russian      ru      5   Yes    │ Yes   Yes   Yes   Yes   -     Yes   -     -
+Slovak       sk      3   -      │ Yes   Yes   -     -     Yes   -     -     -
+Slovenian    sl      3   -      │ Yes   Yes   -     -     Yes   -     -     -
 Serbian      sr [1]  3   -      │ Yes   Yes   -     -     -     Yes   -     -
 Spanish      es      7   Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
-Swedish      sv      4   -      │ Yes   Yes   -     -     -     Yes   Yes   -
+Swedish      sv      5   Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -
-Turkish      tr      3   -      │ Yes   Yes   -     -     -     Yes   -     -
-Ukrainian    uk      4   -      │ Yes   Yes   -     -     -     Yes   Yes   -
+Tagalog      fil     3   -      │ Yes   Yes   -     -     Yes   -     -     -
+Tamil        ta      3   -      │ Yes   -     -     -     Yes   Yes   -     -
+Turkish      tr      4   -      │ Yes   Yes   -     -     Yes   Yes   -     -
+Ukrainian    uk      5   Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -
+Urdu         ur      3   -      │ Yes   -     -     -     Yes   Yes   -     -
+Vietnamese   vi      3   -      │ Yes   Yes   -     -     Yes   -     -     -

 [1] Bosnian, Croatian, and Serbian use the same underlying word list, because
 they share most of their vocabulary and grammar, they were once considered the
@@ -232,7 +240,7 @@ the list, in descending frequency order.

     >>> from wordfreq import top_n_list
     >>> top_n_list('en', 10)
-    ['the', 'of', 'to', 'and', 'a', 'in', 'i', 'is', 'for', 'that']
+    ['the', 'to', 'and', 'of', 'a', 'in', 'i', 'is', 'for', 'that']
     >>> top_n_list('es', 10)
     ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'un']
@@ -302,16 +310,16 @@ tokenized according to this function.

     >>> tokenize('l@s niñ@s', 'es')
     ['l@s', 'niñ@s']
     >>> zipf_frequency('l@s', 'es')
-    2.82
+    3.03

 Because tokenization in the real world is far from consistent, wordfreq will
 also try to deal gracefully when you query it with texts that actually break
 into multiple tokens:

     >>> zipf_frequency('New York', 'en')
-    5.3
+    5.32
     >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.23
+    3.29

 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be. In Chinese,
@@ -326,7 +334,7 @@ you give it an uncommon combination of tokens, it will hugely over-estimate
 their frequency:

     >>> zipf_frequency('owl-flavored', 'en')
-    3.29
+    3.3
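
The half-harmonic-mean combination described above can be sketched in a few
lines (a sketch of the documented behavior; `combine_frequencies` is an
illustrative name, not wordfreq's internal function):

    def combine_frequencies(token_freqs):
        # Half harmonic mean: the reciprocal of the sum of reciprocals.
        # For two tokens this equals half their harmonic mean, and the
        # result is always dominated by the rarest token in the phrase.
        return 1.0 / sum(1.0 / f for f in token_freqs)

    # A common token (1e-4) combined with a rare one (1e-6) gives ~9.9e-7,
    # close to the rare token's own frequency:
    print(combine_frequencies([1e-4, 1e-6]))

This is also why an unseen combination like 'owl-flavored' still gets a sizable
estimate: both of its constituent tokens are reasonably common on their own.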
 ## Multi-script languages

@@ -387,7 +395,7 @@ the 'cjk' feature:

     pip install wordfreq[cjk]

 Tokenizing Chinese depends on the `jieba` package, tokenizing Japanese depends
-on `mecab-python` and `ipadic`, and tokenizing Korean depends on `mecab-python`
+on `mecab-python3` and `ipadic`, and tokenizing Korean depends on `mecab-python3`
 and `mecab-ko-dic`.

 As of version 2.4.2, you no longer have to install dictionaries separately.
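
A quick way to confirm the optional CJK support is working, assuming
`pip install wordfreq[cjk]` has succeeded (a usage sketch; the printed values
depend on the installed wordlists):

    from wordfreq import tokenize, zipf_frequency

    print(tokenize('谢谢你', 'zh'))       # segmented by jieba
    print(zipf_frequency('東京', 'ja'))   # looked up via mecab-python3 + ipadic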
@@ -523,6 +531,12 @@ The same citation in BibTeX format:

   International Conference on Language Resources and Evaluation (LREC 2016).
   http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf

+- Ortiz Suárez, P. J., Sagot, B., and Romary, L. (2019). Asynchronous pipelines
+  for processing huge corpora on medium to low resource infrastructures. In
+  Proceedings of the Workshop on Challenges in the Management of Large Corpora
+  (CMLC-7) 2019.
+  https://oscar-corpus.com/publication/2019/clmc7/asynchronous/
+
 - ParaCrawl (2018). Provision of Web-Scale Parallel Corpora for Official
   European Languages. https://paracrawl.eu/

setup.py

@@ -33,7 +33,7 @@ dependencies = [

 setup(
     name="wordfreq",
-    version='2.4.2',
+    version='2.5.0',
     maintainer='Robyn Speer',
     maintainer_email='rspeer@luminoso.com',
     url='http://github.com/LuminosoInsight/wordfreq/',

@@ -49,9 +49,8 @@ setup(
     install_requires=dependencies,
     # mecab-python3 is required for looking up Japanese or Korean word
-    # frequencies. In turn, it depends on libmecab-dev being installed on the
-    # system. It's not listed under 'install_requires' because wordfreq should
-    # be usable in other languages without it.
+    # frequencies. It's not listed under 'install_requires' because wordfreq
+    # should be usable in other languages without it.
     #
     # Similarly, jieba is required for Chinese word frequencies.
     extras_require={
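
For context, `extras_require` maps optional feature names to the dependencies
described in the comment above, so the base install stays CJK-free; a sketch of
the shape of this section (the package list is illustrative, not copied from
this commit):

    extras_require={
        # 'pip install wordfreq[cjk]' pulls in what CJK tokenization needs.
        'cjk': ['mecab-python3', 'ipadic', 'mecab-ko-dic', 'jieba'],
    },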

(test suite)

@@ -60,18 +60,45 @@ def test_most_common_words():
         return top_n_list(lang, 1)[0]

     assert get_most_common('ar') == 'في'
+    assert get_most_common('bg') == 'на'
+    assert get_most_common('bn') == 'না'
+    assert get_most_common('ca') == 'de'
     assert get_most_common('cs') == 'a'
+    assert get_most_common('da') == 'i'
+    assert get_most_common('el') == 'και'
     assert get_most_common('de') == 'die'
     assert get_most_common('en') == 'the'
     assert get_most_common('es') == 'de'
+    assert get_most_common('fi') == 'ja'
+    assert get_most_common('fil') == 'sa'
     assert get_most_common('fr') == 'de'
+    assert get_most_common('he') == 'את'
+    assert get_most_common('hi') == 'के'
+    assert get_most_common('hu') == 'a'
+    assert get_most_common('id') == 'yang'
+    assert get_most_common('is') == 'og'
     assert get_most_common('it') == 'di'
     assert get_most_common('ja') == 'の'
+    assert get_most_common('ko') == ''
+    assert get_most_common('lt') == 'ir'
+    assert get_most_common('lv') == 'un'
+    assert get_most_common('mk') == 'на'
+    assert get_most_common('ms') == 'yang'
+    assert get_most_common('nb') == 'i'
     assert get_most_common('nl') == 'de'
     assert get_most_common('pl') == 'w'
     assert get_most_common('pt') == 'de'
+    assert get_most_common('ro') == 'de'
     assert get_most_common('ru') == 'в'
-    assert get_most_common('tr') == 'bir'
+    assert get_most_common('sh') == 'je'
+    assert get_most_common('sk') == 'a'
+    assert get_most_common('sl') == 'je'
+    assert get_most_common('sv') == 'är'
+    assert get_most_common('ta') == 'ஒரு'
+    assert get_most_common('tr') == 've'
+    assert get_most_common('uk') == 'в'
+    assert get_most_common('ur') == 'کے'
+    assert get_most_common('vi') == ''
     assert get_most_common('zh') == '的'

File diff suppressed because it is too large

63 binary files not shown.