Merge pull request #91 from LuminosoInsight/data-update-2.5

Version 2.5, incorporating OSCAR data
2024-12-23 17:31:41 +00:00 · 2021-04-15 14:32:10 -04:00 · 2021-04-15 14:32:10 -04:00 · c56e633d53
commit c56e633d53
parent 4c0b29f460 2417ea0d39
67 changed files with 38864 additions and 37478 deletions
--- a/README.md
+++ b/README.md
@@ -45,16 +45,16 @@ frequency as a decimal between 0 and 1.

    >>> from wordfreq import word_frequency
    >>> word_frequency('cafe', 'en')
-    1.05e-05
+    1.23e-05

    >>> word_frequency('café', 'en')
    5.62e-06

    >>> word_frequency('cafe', 'fr')
-    1.55e-06
+    1.51e-06

    >>> word_frequency('café', 'fr')
-    6.61e-05
+    5.75e-05


 `zipf_frequency` is a variation on `word_frequency` that aims to return the
@@ -72,16 +72,16 @@ one occurrence per billion words.

    >>> from wordfreq import zipf_frequency
    >>> zipf_frequency('the', 'en')
-    7.76
+    7.73

    >>> zipf_frequency('word', 'en')
    5.26

    >>> zipf_frequency('frequency', 'en')
-    4.48
+    4.36

    >>> zipf_frequency('zipf', 'en')
-    1.62
+    1.49

    >>> zipf_frequency('zipf', 'en', wordlist='small')
    0.0
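
The relationship between the two scales in the README examples above can be sketched in plain Python. This is a minimal illustration, not wordfreq's own code: it assumes the Zipf value is the base-10 logarithm of occurrences per billion words, as the README text describes. The helper name `zipf_from_frequency`, the rounding to two decimals, and the `0.0` result for a zero frequency are assumptions modeled on the outputs shown above.

```python
import math


def zipf_from_frequency(frequency: float) -> float:
    """Convert a proportion between 0 and 1 to the Zipf scale.

    Hypothetical helper for illustration; not part of wordfreq's API.
    """
    if frequency <= 0:
        # Assumption: report 0.0 for words with no recorded frequency,
        # mirroring the wordlist='small' example in the README.
        return 0.0
    # log10(frequency * 1e9) == log10(frequency) + 9
    return round(math.log10(frequency) + 9, 2)


# One occurrence per billion words sits at 0 on this scale.
print(zipf_from_frequency(1e-9))
# A proportion of 1.23e-05 corresponds to about 4.09.
print(zipf_from_frequency(1.23e-05))
```

Each step of 1.0 on this scale is a factor of ten in frequency, which is why small changes in the underlying data show up as shifts of only a few hundredths in the examples.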
@@ -167,41 +167,49 @@ least 3 different sources of word frequencies:
    Language    Code    #  Large?   WP    Subs  News  Books Web   Twit. Redd. Misc.
    ──────────────────────────────┼────────────────────────────────────────────────
    Arabic      ar      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
-    Bengali     bn      3  -      │ Yes   -     Yes   -     -     Yes   -     -
+    Bangla      bn      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
    Bosnian     bs [1]  3  -      │ Yes   Yes   -     -     -     Yes   -     -
-    Bulgarian   bg      3  -      │ Yes   Yes   -     -     -     Yes   -     -
-    Catalan     ca      4  -      │ Yes   Yes   Yes   -     -     Yes   -     -
+    Bulgarian   bg      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -
+    Catalan     ca      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
    Chinese     zh [3]  7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   -     Jieba
    Croatian    hr [1]  3  -      │ Yes   Yes   -     -     -     Yes   -     -
    Czech       cs      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
-    Danish      da      3  -      │ Yes   Yes   -     -     -     Yes   -     -
+    Danish      da      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -
    Dutch       nl      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
    English     en      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
    Finnish     fi      6  Yes    │ Yes   Yes   Yes   -     Yes   Yes   Yes   -
    French      fr      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
    German      de      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
-    Greek       el      3  -      │ Yes   Yes   -     -     Yes   -     -     -
-    Hebrew      he      4  -      │ Yes   Yes   -     Yes   -     Yes   -     -
-    Hindi       hi      3  -      │ Yes   -     -     -     -     Yes   Yes   -
-    Hungarian   hu      3  -      │ Yes   Yes   -     -     Yes   -     -     -
+    Greek       el      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -
+    Hebrew      he      5  Yes    │ Yes   Yes   -     Yes   Yes   Yes   -     -
+    Hindi       hi      4  Yes    │ Yes   -     -     -     Yes   Yes   Yes   -
+    Hungarian   hu      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -
+    Icelandic   is      3  -      │ Yes   Yes   -     -     Yes   -     -     -
    Indonesian  id      3  -      │ Yes   Yes   -     -     -     Yes   -     -
    Italian     it      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
    Japanese    ja      5  Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -
    Korean      ko      4  -      │ Yes   Yes   -     -     -     Yes   Yes   -
    Latvian     lv      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -
-    Macedonian  mk      3  -      │ Yes   Yes   Yes   -     -     -     -     -
+    Lithuanian  lt      3  -      │ Yes   Yes   -     -     Yes   -     -     -
+    Macedonian  mk      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
    Malay       ms      3  -      │ Yes   Yes   -     -     -     Yes   -     -
-    Norwegian   nb [2]  4  -      │ Yes   Yes   -     -     -     Yes   Yes   -
-    Persian     fa      3  -      │ Yes   Yes   -     -     -     Yes   -     -
+    Norwegian   nb [2]  5  Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -
+    Persian     fa      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -
    Polish      pl      6  Yes    │ Yes   Yes   Yes   -     Yes   Yes   Yes   -
    Portuguese  pt      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
-    Romanian    ro      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -
-    Russian     ru      6  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   -     -
+    Romanian    ro      3  -      │ Yes   Yes   -     -     Yes   -     -     -
+    Russian     ru      5  Yes    │ Yes   Yes   Yes   Yes   -     Yes   -     -
+    Slovak      sk      3  -      │ Yes   Yes   -     -     Yes   -     -     -
+    Slovenian   sl      3  -      │ Yes   Yes   -     -     Yes   -     -     -
    Serbian     sr [1]  3  -      │ Yes   Yes   -     -     -     Yes   -     -
    Spanish     es      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
-    Swedish     sv      4  -      │ Yes   Yes   -     -     -     Yes   Yes   -
-    Turkish     tr      3  -      │ Yes   Yes   -     -     -     Yes   -     -
-    Ukrainian   uk      4  -      │ Yes   Yes   -     -     -     Yes   Yes   -
+    Swedish     sv      5  Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -
+    Tagalog     fil     3  -      │ Yes   Yes   -     -     Yes   -     -     -
+    Tamil       ta      3  -      │ Yes   -     -     -     Yes   Yes   -     -
+    Turkish     tr      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -
+    Ukrainian   uk      5  Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -
+    Urdu        ur      3  -      │ Yes   -     -     -     Yes   Yes   -     -
+    Vietnamese  vi      3  -      │ Yes   Yes   -     -     Yes   -     -     -

 [1] Bosnian, Croatian, and Serbian use the same underlying word list, because
 they share most of their vocabulary and grammar, they were once considered the
@@ -232,7 +240,7 @@ the list, in descending frequency order.

    >>> from wordfreq import top_n_list
    >>> top_n_list('en', 10)
-    ['the', 'of', 'to', 'and', 'a', 'in', 'i', 'is', 'for', 'that']
+    ['the', 'to', 'and', 'of', 'a', 'in', 'i', 'is', 'for', 'that']

    >>> top_n_list('es', 10)
    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'un']
@@ -302,16 +310,16 @@ tokenized according to this function.
    >>> tokenize('l@s niñ@s', 'es')
    ['l@s', 'niñ@s']
    >>> zipf_frequency('l@s', 'es')
-    2.82
+    3.03

 Because tokenization in the real world is far from consistent, wordfreq will
 also try to deal gracefully when you query it with texts that actually break
 into multiple tokens:

    >>> zipf_frequency('New York', 'en')
-    5.3
+    5.32
    >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.23
+    3.29

 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be. In Chinese,
@@ -326,7 +334,7 @@ you give it an uncommon combination of tokens, it will hugely over-estimate
 their frequency:

    >>> zipf_frequency('owl-flavored', 'en')
-    3.29
+    3.3
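
The half-harmonic-mean combination mentioned in the README text above can be sketched as follows. This is an illustration under stated assumptions rather than wordfreq's implementation: for two frequencies `a` and `b` the harmonic mean is `2ab / (a + b)`, so half of it is `ab / (a + b)`, which equals `1 / (1/a + 1/b)`. Extending this to more than two tokens through the reciprocal sum, and returning `0.0` when any token has zero frequency, are assumptions, and `half_harmonic_mean` is a hypothetical name.

```python
def half_harmonic_mean(frequencies):
    """Combine per-token frequencies as 1 / sum(1 / f).

    Hypothetical helper for illustration; not part of wordfreq's API.
    """
    freqs = list(frequencies)
    if not freqs or any(f <= 0 for f in freqs):
        # Assumption: a phrase containing an unknown token gets frequency 0.
        return 0.0
    return 1.0 / sum(1.0 / f for f in freqs)


# Two equally frequent tokens combine to half of their shared frequency.
print(half_harmonic_mean([0.5, 0.5]))  # 0.25
```

The combination looks only at the individual token frequencies, not at how often the tokens occur together, which is consistent with the over-estimate for 'owl-flavored' shown above.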


 ## Multi-script languages
@@ -387,7 +395,7 @@ the 'cjk' feature:
    pip install wordfreq[cjk]

 Tokenizing Chinese depends on the `jieba` package, tokenizing Japanese depends
-on `mecab-python` and `ipadic`, and tokenizing Korean depends on `mecab-python`
+on `mecab-python3` and `ipadic`, and tokenizing Korean depends on `mecab-python3`
 and `mecab-ko-dic`.

 As of version 2.4.2, you no longer have to install dictionaries separately.
@@ -523,6 +531,12 @@ The same citation in BibTex format:
  International Conference on Language Resources and Evaluation (LREC 2016).
  http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf

+- Ortiz Suárez, P. J., Sagot, B., and Romary, L. (2019). Asynchronous pipelines
+  for processing huge corpora on medium to low resource infrastructures. In
+  Proceedings of the Workshop on Challenges in the Management of Large Corpora
+  (CMLC-7) 2019.
+  https://oscar-corpus.com/publication/2019/clmc7/asynchronous/
+
 - ParaCrawl (2018). Provision of Web-Scale Parallel Corpora for Official
  European Languages. https://paracrawl.eu/

--- a/setup.py
+++ b/setup.py
@@ -33,7 +33,7 @@ dependencies = [

 setup(
    name="wordfreq",
-    version='2.4.2',
+    version='2.5.0',
    maintainer='Robyn Speer',
    maintainer_email='rspeer@luminoso.com',
    url='http://github.com/LuminosoInsight/wordfreq/',
@@ -49,9 +49,8 @@ setup(
    install_requires=dependencies,

    # mecab-python3 is required for looking up Japanese or Korean word
-    # frequencies. In turn, it depends on libmecab-dev being installed on the
-    # system. It's not listed under 'install_requires' because wordfreq should
-    # be usable in other languages without it.
+    # frequencies. It's not listed under 'install_requires' because wordfreq
+    # should be usable in other languages without it.
    #
    # Similarly, jieba is required for Chinese word frequencies.
    extras_require={
--- a/tests/test_general.py
+++ b/tests/test_general.py
@@ -60,18 +60,45 @@ def test_most_common_words():
        return top_n_list(lang, 1)[0]

    assert get_most_common('ar') == 'في'
+    assert get_most_common('bg') == 'на'
+    assert get_most_common('bn') == 'না'
+    assert get_most_common('ca') == 'de'
    assert get_most_common('cs') == 'a'
+    assert get_most_common('da') == 'i'
+    assert get_most_common('el') == 'και'
    assert get_most_common('de') == 'die'
    assert get_most_common('en') == 'the'
    assert get_most_common('es') == 'de'
+    assert get_most_common('fi') == 'ja'
+    assert get_most_common('fil') == 'sa'
    assert get_most_common('fr') == 'de'
+    assert get_most_common('he') == 'את'
+    assert get_most_common('hi') == 'के'
+    assert get_most_common('hu') == 'a'
+    assert get_most_common('id') == 'yang'
+    assert get_most_common('is') == 'og'
    assert get_most_common('it') == 'di'
    assert get_most_common('ja') == 'の'
+    assert get_most_common('ko') == '이'
+    assert get_most_common('lt') == 'ir'
+    assert get_most_common('lv') == 'un'
+    assert get_most_common('mk') == 'на'
+    assert get_most_common('ms') == 'yang'
+    assert get_most_common('nb') == 'i'
    assert get_most_common('nl') == 'de'
    assert get_most_common('pl') == 'w'
    assert get_most_common('pt') == 'de'
+    assert get_most_common('ro') == 'de'
    assert get_most_common('ru') == 'в'
-    assert get_most_common('tr') == 'bir'
+    assert get_most_common('sh') == 'je'
+    assert get_most_common('sk') == 'a'
+    assert get_most_common('sl') == 'je'
+    assert get_most_common('sv') == 'är'
+    assert get_most_common('ta') == 'ஒரு'
+    assert get_most_common('tr') == 've'
+    assert get_most_common('uk') == 'в'
+    assert get_most_common('ur') == 'کے'
+    assert get_most_common('vi') == 'là'
    assert get_most_common('zh') == '的'


--- a/wordfreq/data/jieba_zh.txt
+++ b/wordfreq/data/jieba_zh.txt
--- a/wordfreq/data/large_ar.msgpack.gz
+++ b/wordfreq/data/large_ar.msgpack.gz
--- a/wordfreq/data/large_bn.msgpack.gz
+++ b/wordfreq/data/large_bn.msgpack.gz
--- a/wordfreq/data/large_ca.msgpack.gz
+++ b/wordfreq/data/large_ca.msgpack.gz
--- a/wordfreq/data/large_cs.msgpack.gz
+++ b/wordfreq/data/large_cs.msgpack.gz
--- a/wordfreq/data/large_de.msgpack.gz
+++ b/wordfreq/data/large_de.msgpack.gz
--- a/wordfreq/data/large_en.msgpack.gz
+++ b/wordfreq/data/large_en.msgpack.gz
--- a/wordfreq/data/large_es.msgpack.gz
+++ b/wordfreq/data/large_es.msgpack.gz
--- a/wordfreq/data/large_fi.msgpack.gz
+++ b/wordfreq/data/large_fi.msgpack.gz
--- a/wordfreq/data/large_fr.msgpack.gz
+++ b/wordfreq/data/large_fr.msgpack.gz
--- a/wordfreq/data/large_he.msgpack.gz
+++ b/wordfreq/data/large_he.msgpack.gz
--- a/wordfreq/data/large_it.msgpack.gz
+++ b/wordfreq/data/large_it.msgpack.gz
--- a/wordfreq/data/large_ja.msgpack.gz
+++ b/wordfreq/data/large_ja.msgpack.gz
--- a/wordfreq/data/large_mk.msgpack.gz
+++ b/wordfreq/data/large_mk.msgpack.gz
--- a/wordfreq/data/large_nb.msgpack.gz
+++ b/wordfreq/data/large_nb.msgpack.gz
--- a/wordfreq/data/large_nl.msgpack.gz
+++ b/wordfreq/data/large_nl.msgpack.gz
--- a/wordfreq/data/large_pl.msgpack.gz
+++ b/wordfreq/data/large_pl.msgpack.gz
--- a/wordfreq/data/large_pt.msgpack.gz
+++ b/wordfreq/data/large_pt.msgpack.gz
--- a/wordfreq/data/large_ru.msgpack.gz
+++ b/wordfreq/data/large_ru.msgpack.gz
--- a/wordfreq/data/large_sv.msgpack.gz
+++ b/wordfreq/data/large_sv.msgpack.gz
--- a/wordfreq/data/large_uk.msgpack.gz
+++ b/wordfreq/data/large_uk.msgpack.gz
--- a/wordfreq/data/large_zh.msgpack.gz
+++ b/wordfreq/data/large_zh.msgpack.gz
--- a/wordfreq/data/small_ar.msgpack.gz
+++ b/wordfreq/data/small_ar.msgpack.gz
--- a/wordfreq/data/small_bg.msgpack.gz
+++ b/wordfreq/data/small_bg.msgpack.gz
--- a/wordfreq/data/small_bn.msgpack.gz
+++ b/wordfreq/data/small_bn.msgpack.gz
--- a/wordfreq/data/small_ca.msgpack.gz
+++ b/wordfreq/data/small_ca.msgpack.gz
--- a/wordfreq/data/small_cs.msgpack.gz
+++ b/wordfreq/data/small_cs.msgpack.gz
--- a/wordfreq/data/small_da.msgpack.gz
+++ b/wordfreq/data/small_da.msgpack.gz
--- a/wordfreq/data/small_de.msgpack.gz
+++ b/wordfreq/data/small_de.msgpack.gz
--- a/wordfreq/data/small_el.msgpack.gz
+++ b/wordfreq/data/small_el.msgpack.gz
--- a/wordfreq/data/small_en.msgpack.gz
+++ b/wordfreq/data/small_en.msgpack.gz
--- a/wordfreq/data/small_es.msgpack.gz
+++ b/wordfreq/data/small_es.msgpack.gz
--- a/wordfreq/data/small_fa.msgpack.gz
+++ b/wordfreq/data/small_fa.msgpack.gz
--- a/wordfreq/data/small_fi.msgpack.gz
+++ b/wordfreq/data/small_fi.msgpack.gz
--- a/wordfreq/data/small_fil.msgpack.gz
+++ b/wordfreq/data/small_fil.msgpack.gz
--- a/wordfreq/data/small_fr.msgpack.gz
+++ b/wordfreq/data/small_fr.msgpack.gz
--- a/wordfreq/data/small_he.msgpack.gz
+++ b/wordfreq/data/small_he.msgpack.gz
--- a/wordfreq/data/small_hi.msgpack.gz
+++ b/wordfreq/data/small_hi.msgpack.gz
--- a/wordfreq/data/small_hu.msgpack.gz
+++ b/wordfreq/data/small_hu.msgpack.gz
--- a/wordfreq/data/small_id.msgpack.gz
+++ b/wordfreq/data/small_id.msgpack.gz
--- a/wordfreq/data/small_is.msgpack.gz
+++ b/wordfreq/data/small_is.msgpack.gz
--- a/wordfreq/data/small_it.msgpack.gz
+++ b/wordfreq/data/small_it.msgpack.gz
--- a/wordfreq/data/small_ja.msgpack.gz
+++ b/wordfreq/data/small_ja.msgpack.gz
--- a/wordfreq/data/small_ko.msgpack.gz
+++ b/wordfreq/data/small_ko.msgpack.gz
--- a/wordfreq/data/small_lt.msgpack.gz
+++ b/wordfreq/data/small_lt.msgpack.gz
--- a/wordfreq/data/small_lv.msgpack.gz
+++ b/wordfreq/data/small_lv.msgpack.gz
--- a/wordfreq/data/small_mk.msgpack.gz
+++ b/wordfreq/data/small_mk.msgpack.gz
--- a/wordfreq/data/small_ms.msgpack.gz
+++ b/wordfreq/data/small_ms.msgpack.gz
--- a/wordfreq/data/small_nb.msgpack.gz
+++ b/wordfreq/data/small_nb.msgpack.gz
--- a/wordfreq/data/small_nl.msgpack.gz
+++ b/wordfreq/data/small_nl.msgpack.gz
--- a/wordfreq/data/small_pl.msgpack.gz
+++ b/wordfreq/data/small_pl.msgpack.gz
--- a/wordfreq/data/small_pt.msgpack.gz
+++ b/wordfreq/data/small_pt.msgpack.gz
--- a/wordfreq/data/small_ro.msgpack.gz
+++ b/wordfreq/data/small_ro.msgpack.gz
--- a/wordfreq/data/small_ru.msgpack.gz
+++ b/wordfreq/data/small_ru.msgpack.gz
--- a/wordfreq/data/small_sh.msgpack.gz
+++ b/wordfreq/data/small_sh.msgpack.gz
--- a/wordfreq/data/small_sk.msgpack.gz
+++ b/wordfreq/data/small_sk.msgpack.gz
--- a/wordfreq/data/small_sl.msgpack.gz
+++ b/wordfreq/data/small_sl.msgpack.gz
--- a/wordfreq/data/small_sv.msgpack.gz
+++ b/wordfreq/data/small_sv.msgpack.gz
--- a/wordfreq/data/small_ta.msgpack.gz
+++ b/wordfreq/data/small_ta.msgpack.gz
--- a/wordfreq/data/small_tr.msgpack.gz
+++ b/wordfreq/data/small_tr.msgpack.gz
--- a/wordfreq/data/small_uk.msgpack.gz
+++ b/wordfreq/data/small_uk.msgpack.gz
--- a/wordfreq/data/small_ur.msgpack.gz
+++ b/wordfreq/data/small_ur.msgpack.gz
--- a/wordfreq/data/small_vi.msgpack.gz
+++ b/wordfreq/data/small_vi.msgpack.gz
--- a/wordfreq/data/small_zh.msgpack.gz
+++ b/wordfreq/data/small_zh.msgpack.gz