Update documentation and bump version to 1.6

2024-12-23 17:31:41 +00:00 · 2017-01-05 19:18:06 -05:00 · 2017-01-05 19:18:06 -05:00
commit 39e459ac71
parent 23c7c8e936
3 changed files with 102 additions and 54 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,3 +1,21 @@
+## Version 1.6.0 (2017-01-05)
+
+- Support Czech, Persian, Ukrainian, and Croatian/Bosnian/Serbian
+- Add large lists in Chinese, Finnish, Japanese, and Polish
+- Data is now collected and built using Exquisite Corpus
+  (https://github.com/rspeer/exquisite-corpus)
+- Add word frequencies from OPUS OpenSubtitles 2016
+- Add word frequencies from the MOKK Hungarian Webcorpus
+- Expand Google Books Ngrams data to cover 8 languages
+- Expand language detection on Reddit to cover 13 languages with large enough
+  Reddit communities
+- Drop the Common Crawl; we have enough good sources now that we don't have
+  to deal with all that spam
+- Add automatic transliteration of Serbian text
+- Another new frequency-merging strategy (drop the highest and lowest,
+  average the rest)
+
+
 ## Version 1.5.1 (2016-08-19)

 - Bug fix: Made it possible to load the Japanese or Korean dictionary when the
--- a/README.md
+++ b/README.md
@@ -205,65 +205,83 @@ limiting the selection to words that can be typed in ASCII.

 ## Sources and supported languages

-We compiled word frequencies from seven different sources, providing us
-examples of word usage on different topics at different levels of formality.
-The sources (and the abbreviations we'll use for them) are:
+This data comes from a Luminoso project called [Exquisite Corpus][xc], whose
+goal is to download good, varied, multilingual corpus data, process it
+appropriately, and combine it into unified resources such as wordfreq.

- **LeedsIC**: The Leeds Internet Corpus
- **SUBTLEX**: The SUBTLEX word frequency lists
- **OpenSub**: Data derived from OpenSubtitles but not from SUBTLEX
- **Twitter**: Messages sampled from Twitter's public stream
- **Wpedia**: The full text of Wikipedia in 2015
- **Reddit**: The corpus of Reddit comments through May 2015
- **CCrawl**: Text extracted from the Common Crawl and language-detected with cld2
- **Other**: We get additional English frequencies from Google Books Syntactic
-  Ngrams 2013, and Chinese frequencies from the frequency dictionary that
-  comes with the Jieba tokenizer.
+[xc]: https://github.com/rspeer/exquisite-corpus

-The following 27 languages are supported, with reasonable tokenization and at
+Exquisite Corpus compiles 8 different domains of text, some of which themselves
+come from multiple sources:
+
+- **Wikipedia**, representing encyclopedic text
+- **Subtitles**, from OPUS OpenSubtitles 2016 and SUBTLEX
+- **News**, from NewsCrawl 2014 and GlobalVoices
+- **Books**, from Google Books Ngrams 2012
+- **Web** text, from the Leeds Internet Corpus and the MOKK Hungarian Webcorpus
+- **Twitter**, representing short-form social media
+- **Reddit**, representing potentially longer Internet comments
+- **Miscellaneous** word frequencies: in Chinese, we import a free wordlist
+  that comes with the Jieba word segmenter, whose provenance we don't really know
+
+The following languages are supported, with reasonable tokenization and at
 least 3 different sources of word frequencies:

-    Language    Code    Sources Large?   SUBTLEX OpenSub LeedsIC Twitter Wpedia  CCrawl  Reddit  Other
-    ───────────────────────────────────┼──────────────────────────────────────────────────────────────
-    Arabic      ar      5       Yes    │ -       Yes     Yes     Yes     Yes     Yes     -       -
-    Bulgarian   bg      3       -      │ -       Yes     -       -       Yes     Yes     -       -
-    Catalan     ca      3       -      │ -       Yes     -       Yes     Yes     -       -       -
-    Danish      da      3       -      │ -       Yes     -       -       Yes     Yes     -       -
-    German      de      5       Yes    │ Yes     -       Yes     Yes     Yes     Yes     -       -
-    Greek       el      4       -      │ -       Yes     Yes     -       Yes     Yes     -       -
-    English     en      7       Yes    │ Yes     Yes     Yes     Yes     Yes     -       Yes     Google Books
-    Spanish     es      6       Yes    │ -       Yes     Yes     Yes     Yes     Yes     Yes     -
-    Finnish     fi      3       -      │ -       Yes     -       -       Yes     Yes     -       -
-    French      fr      5       Yes    │ -       Yes     Yes     Yes     Yes     Yes     -       -
-    Hebrew      he      4       -      │ -       Yes     -       Yes     Yes     Yes     -       -
-    Hindi       hi      3       -      │ -       -       -       Yes     Yes     Yes     -       -
-    Hungarian   hu      3       -      │ -       Yes     -       -       Yes     Yes     -       -
-    Indonesian  id      4       -      │ -       Yes     -       Yes     Yes     Yes     -       -
-    Italian     it      5       Yes    │ -       Yes     Yes     Yes     Yes     Yes     -       -
-    Japanese    ja      4       -      │ -       -       Yes     Yes     Yes     Yes     -       -
-    Korean      ko      3       -      │ -       -       -       Yes     Yes     Yes     -       -
-    Malay       ms      4       -      │ -       Yes     -       Yes     Yes     Yes     -       -
-    Norwegian   nb[1]   3       -      │ -       Yes     -       -       Yes     Yes     -       -
-    Dutch       nl      5       Yes    │ Yes     Yes     -       Yes     Yes     Yes     -       -
-    Polish      pl      4       -      │ -       Yes     -       Yes     Yes     Yes     -       -
-    Portuguese  pt      5       Yes    │ -       Yes     Yes     Yes     Yes     Yes     -       -
-    Romanian    ro      3       -      │ -       Yes     -       -       Yes     Yes     -       -
-    Russian     ru      5       Yes    │ -       Yes     Yes     Yes     Yes     Yes     -       -
-    Swedish     sv      4       -      │ -       Yes     -       Yes     Yes     Yes     -       -
-    Turkish     tr      4       -      │ -       Yes     -       Yes     Yes     Yes     -       -
-    Chinese     zh[2]   5       -      │ Yes     -       Yes     -       Yes     Yes     -       Jieba
+    Language    Code    #  Large?   WP    Subs  News  Books Web   Twit. Redd. Misc.
+    ──────────────────────────────┼────────────────────────────────────────────────
+    Arabic      ar      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
+    Bosnian     bs[1]   3         │ Yes   Yes   -     -     -     Yes   -     -
+    Bulgarian   bg      3  -      │ Yes   Yes   -     -     -     Yes   -     -
+    Catalan     ca      4  -      │ Yes   Yes   Yes   -     -     Yes   -     -
+    Czech       cs      3  -      │ Yes   Yes   -     -     -     Yes   -     -
+    Danish      da      3  -      │ Yes   Yes   -     -     -     Yes   -     -
+    German      de      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
+    Greek       el      3  -      │ Yes   Yes   -     -     Yes   -     -     -
+    English     en      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
+    Spanish     es      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
+    Persian     fa      3  -      │ Yes   Yes   -     -     -     Yes   -     -
+    Finnish     fi      5  Yes    │ Yes   Yes   Yes   -     -     Yes   Yes   -
+    French      fr      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
+    Hebrew      he      4  -      │ Yes   Yes   -     Yes   -     Yes   -     -
+    Hindi       hi      3  -      │ Yes   -     -     -     -     Yes   Yes   -
+    Croatian    hr[1]   3         │ Yes   Yes   -     -     -     Yes   -     -
+    Hungarian   hu      3  -      │ Yes   Yes   -     -     Yes   -     -     -
+    Indonesian  id      3  -      │ Yes   Yes   -     -     -     Yes   -     -
+    Italian     it      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
+    Japanese    ja      5  Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -
+    Korean      ko      4  -      │ Yes   Yes   -     -     -     Yes   Yes   -
+    Malay       ms      3  -      │ Yes   Yes   -     -     -     Yes   -     -
+    Norwegian   nb[2]   4  -      │ Yes   Yes   -     -     -     Yes   Yes   -
+    Dutch       nl      4  Yes    │ Yes   Yes   Yes   -     -     Yes   -     -
+    Polish      pl      5  Yes    │ Yes   Yes   Yes   -     -     Yes   Yes   -
+    Portuguese  pt      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
+    Romanian    ro      3  -      │ Yes   Yes   -     -     -     Yes   -     -
+    Russian     ru      6  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   -     -
+    Serbian     sr[1]   3  -      │ Yes   Yes   -     -     -     Yes   -     -
+    Swedish     sv      4  -      │ Yes   Yes   -     -     -     Yes   Yes   -
+    Turkish     tr      3  -      │ Yes   Yes   -     -     -     Yes   -     -
+    Ukrainian   uk      4  -      │ Yes   Yes   -     -     -     Yes   Yes   -
+    Chinese     zh[3]   6  Yes    │ Yes   -     Yes   Yes   Yes   Yes   -     Jieba

-[1] The Norwegian text we have is specifically written in Norwegian Bokmål, so
+[1] Bosnian, Croatian, and Serbian use the same underlying word list, because
+they are mutually intelligible and have large amounts of vocabulary in common.
+This word list can also be accessed with the language code `sh`.
+We list them separately to emphasize that the word list is appropriate for
+looking up frequencies in any of those languages, even though the idea of a
+unified Serbo-Croatian language is losing popularity. Lookups in `sr` or `sh`
+will also automatically unify Cyrillic and Latin spellings.
+
+[2] The Norwegian text we have is specifically written in Norwegian Bokmål, so
 we give it the language code 'nb'. We would use 'nn' for Nynorsk, but there
 isn't enough data to include it in wordfreq.

-[2] This data represents text written in both Simplified and Traditional
-Chinese. (SUBTLEX is mostly Simplified, while Wikipedia is mostly Traditional.)
-The characters are mapped to one another so they can use the same word
-frequency list.
+[3] This data represents text written in both Simplified and Traditional
+Chinese. (SUBTLEX is mostly Simplified, for example, while Wikipedia is mostly
+Traditional.) The characters are mapped to one another so they can use the same
+underlying word frequency list.

 Some languages provide 'large' wordlists, including words with a Zipf frequency
-between 1.0 and 3.0. These are available in 9 languages that are covered by
+between 1.0 and 3.0. These are available in 12 languages that are covered by
 enough data sources.


@@ -298,9 +316,9 @@ also try to deal gracefully when you query it with texts that actually break
 into multiple tokens:

    >>> zipf_frequency('New York', 'en')
-    5.07
+    5.31
    >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.58
+    3.56

 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be. In Chinese,
@@ -315,7 +333,7 @@ you give it an uncommon combination of tokens, it will hugely over-estimate
 their frequency:

    >>> zipf_frequency('owl-flavored', 'en')
-    3.19
+    3.26


 ## License
@@ -371,7 +389,8 @@ If you use wordfreq in your research, please cite it! We publish the code
 through Zenodo so that it can be reliably cited using a DOI. The current
 citation is:

-> Robyn Speer, Joshua Chin, Andrew Lin, Lance Nathan, & Sara Jewett. (2016). wordfreq: v1.5.1 [Data set]. Zenodo. http://doi.org/10.5281/zenodo.61937
+> Robyn Speer, Joshua Chin, Andrew Lin, Lance Nathan, & Sara Jewett. (2016).
+> wordfreq: v1.5.1 [Data set]. Zenodo. http://doi.org/10.5281/zenodo.61937

 The same citation in BibTex format:

@@ -393,6 +412,12 @@ The same citation in BibTex format:

 ## Citations to work that wordfreq is built on

+- Bojar, O., Chatterjee, R., Federmann, C., Haddow, B., Huck, M., Hokamp, C.,
+  Koehn, P., Logacheva, V., Monz, C., Negri, M., Post, M., Scarton, C.,
+  Specia, L., & Turchi, M. (2015). Findings of the 2015 Workshop on Statistical
+  Machine Translation.
+  http://www.statmt.org/wmt15/results.html
+
 - Brysbaert, M. & New, B. (2009). Moving beyond Kucera and Francis: A Critical
  Evaluation of Current Word Frequency Norms and the Introduction of a New and
  Improved Word Frequency Measure for American English. Behavior Research
@@ -418,6 +443,11 @@ The same citation in BibTex format:
 - Davis, M. (2012). Unicode text segmentation. Unicode Standard Annex, 29.
  http://unicode.org/reports/tr29/

+- Halácsy, P., Kornai, A., Németh, L., Rung, A., Szakadát, I., & Trón, V.
+  (2004). Creating open language resources for Hungarian. In Proceedings of the
+  4th international conference on Language Resources and Evaluation (LREC2004).
+  http://mokk.bme.hu/resources/webcorpus/
+
 - Keuleers, E., Brysbaert, M. & New, B. (2010). SUBTLEX-NL: A new frequency
  measure for Dutch words based on film subtitles. Behavior Research Methods,
  42(3), 643-650.
--- a/setup.py
+++ b/setup.py
@@ -34,7 +34,7 @@ if sys.version_info < (3, 4):

 setup(
    name="wordfreq",
-    version='1.5.2',
+    version='1.6',
    maintainer='Luminoso Technologies, Inc.',
    maintainer_email='info@luminoso.com',
    url='http://github.com/LuminosoInsight/wordfreq/',
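
Note on the CHANGELOG entry "drop the highest and lowest, average the rest": the diff does not show how this merging strategy is implemented, so the following is only a minimal sketch of the idea as worded there. The function name `trimmed_mean_frequency`, the plain-list input, and the choice to reject fewer than 3 sources (matching the README's "at least 3 different sources") are assumptions made for illustration, not wordfreq's or Exquisite Corpus's actual code.

```python
def trimmed_mean_frequency(freqs):
    """Merge one word's per-source frequencies (hypothetical helper).

    Drops the single highest and single lowest value, then averages
    whatever remains. With exactly 3 sources this is the median.
    """
    if len(freqs) < 3:
        # Assumption: with fewer than 3 sources nothing would remain
        # after dropping both extremes, so we refuse to merge.
        raise ValueError("need at least 3 source frequencies")
    ordered = sorted(freqs)
    middle = ordered[1:-1]  # everything except the lowest and highest
    return sum(middle) / len(middle)
```

With three sources the result is simply the middle value, so one unusually spammy or unusually sparse source cannot pull the merged frequency up or down on its own.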