update the README for Chinese

2025-01-14 21:25:58 +00:00 · 2015-09-05 03:42:54 -04:00 · d576e3294b
commit d576e3294b
parent 2327f2e4d6
1 changed file with 39 additions and 33 deletions
--- a/README.md
+++ b/README.md
@@ -111,47 +111,50 @@ limiting the selection to words that can be typed in ASCII.
 ## Sources and supported languages
-We compiled word frequencies from five different sources, providing us examples
+We compiled word frequencies from seven different sources, providing us
-of word usage on different topics at different levels of formality. The sources
+examples of word usage on different topics at different levels of formality.
-(and the abbreviations we'll use for them) are:
+The sources (and the abbreviations we'll use for them) are:
 - **GBooks**: Google Books Ngrams 2013
 - **LeedsIC**: The Leeds Internet Corpus
 - **SUBTLEX**: The SUBTLEX word frequency lists
 - **OpenSub**: Data derived from OpenSubtitles but not from SUBTLEX
 - **Twitter**: Messages sampled from Twitter's public stream
- **Wikipedia**: The full text of Wikipedia in 2015
+- **Wpedia**: The full text of Wikipedia in 2015
 - **Other**: We get additional English frequencies from Google Books Syntactic
  Ngrams 2013, and Chinese frequencies from the frequency dictionary that
  comes with the Jieba tokenizer.
-The following 12 languages are well-supported, with reasonable tokenization and
+
 The following 17 languages are well-supported, with reasonable tokenization and
 at least 3 different sources of word frequencies:
-    Language    Code    GBooks  SUBTLEX OpenSub LeedsIC Twitter Wikipedia
+    Language    Code    SUBTLEX OpenSub LeedsIC Twitter Wpedia  Other
-    ──────────────────┼──────────────────────────────────────────────────
+    ──────────────────┼─────────────────────────────────────────────────────
-    Arabic      ar    │ -       -       Yes     Yes     Yes     Yes
+    Arabic      ar    │ -       Yes     Yes     Yes     Yes     -
-    German      de    │ -       Yes     -       Yes     Yes[1]  Yes
+    German      de    │ Yes     -       Yes     Yes[1]  Yes     -
-    Greek       el    │ -       -       Yes     Yes     Yes     Yes
+    Greek       el    │ -       Yes     Yes     Yes     Yes     -
-    English     en    │ Yes     Yes     Yes     Yes     Yes     Yes
+    English     en    │ Yes     Yes     Yes     Yes     Yes     Google Books
-    Spanish     es    │ -       -       Yes     Yes     Yes     Yes
+    Spanish     es    │ -       Yes     Yes     Yes     Yes     -
-    French      fr    │ -       -       Yes     Yes     Yes     Yes
+    French      fr    │ -       Yes     Yes     Yes     Yes     -
-    Indonesian  id    │ -       -       Yes     -       Yes     Yes
+    Indonesian  id    │ -       Yes     -       Yes     Yes     -
-    Italian     it    │ -       -       Yes     Yes     Yes     Yes
+    Italian     it    │ -       Yes     Yes     Yes     Yes     -
-    Japanese    ja    │ -       -       -       Yes     Yes     Yes
+    Japanese    ja    │ -       -       Yes     Yes     Yes     -
-    Malay       ms    │ -       -       Yes     -       Yes     Yes
+    Malay       ms    │ -       Yes     -       Yes     Yes     -
-    Dutch       nl    │ -       Yes     Yes     -       Yes     Yes
+    Dutch       nl    │ Yes     Yes     -       Yes     Yes     -
-    Polish      pl    │ -       -       Yes     -       Yes     Yes
+    Polish      pl    │ -       Yes     -       Yes     Yes     -
-    Portuguese  pt    │ -       -       Yes     Yes     Yes     Yes
+    Portuguese  pt    │ -       Yes     Yes     Yes     Yes     -
-    Russian     ru    │ -       -       Yes     Yes     Yes     Yes
+    Russian     ru    │ -       Yes     Yes     Yes     Yes     -
-    Swedish     sv    │ -       -       Yes     -       Yes     Yes
+    Swedish     sv    │ -       Yes     -       Yes     Yes     -
-    Turkish     tr    │ -       -       Yes     -       Yes     Yes
+    Turkish     tr    │ -       Yes     -       Yes     Yes     -
    Chinese     zh    │ Yes     Yes     Yes     -       -       Jieba
 These languages are only marginally supported so far. We have too few data
 sources so far in Korean (feel free to suggest some), and we are lacking
 tokenization support for Chinese.
-    Language    Code    GBooks  SUBTLEX LeedsIC OpenSub Twitter Wikipedia
+Additionally, Korean is marginally supported. You can look up frequencies in
-    ──────────────────┼──────────────────────────────────────────────────
+it, but we have too few data sources for it so far:
-    Korean      ko    │ -       -       -       -       Yes     Yes
+
-    Chinese     zh    │ -       Yes     Yes     Yes     -       -
+    Language    Code    SUBTLEX LeedsIC OpenSub Twitter Wpedia
    ──────────────────┼───────────────────────────────────────
    Korean      ko    │ -       -       -       Yes     Yes
 [1] We've counted the frequencies from tweets in German, such as they are, but
 you should be aware that German is not a frequently-used language on Twitter.
@@ -172,7 +175,8 @@ There are language-specific exceptions:
 - In Japanese, instead of using the regex library, it uses the external library
  `mecab-python3`. This is an optional dependency of wordfreq, and compiling
  it requires the `libmecab-dev` system package to be installed.
- It does not yet attempt to tokenize Chinese ideograms.
+- In Chinese, it uses the external Python library `jieba`, another optional
  dependency.
 [uax29]: http://unicode.org/reports/tr29/
@@ -184,7 +188,9 @@ also try to deal gracefully when you query it with texts that actually break
 into multiple tokens:
    >>> word_frequency('New York', 'en')
-    0.0002632772081925718
+    0.0002315934248950231
    >>> word_frequency('北京地铁', 'zh')  # "Beijing Subway"
    2.342123813395707e-05
 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be.
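
As a rough illustration of the combination step described above, here is a minimal sketch. It assumes the half-harmonic mean of two frequencies `a` and `b` is `a * b / (a + b)`, and that a phrase with more than two tokens is handled by folding that function over the per-token frequencies from left to right. The helper name `combine_token_frequencies` and its handling of zero frequencies are hypothetical and not taken from the wordfreq source.

```python
from functools import reduce


def half_harmonic_mean(a, b):
    """Half of the harmonic mean of two positive numbers: a*b / (a+b)."""
    return (a * b) / (a + b)


def combine_token_frequencies(freqs):
    """Fold per-token frequencies into one estimate for the whole phrase.

    Hypothetical helper: returns 0.0 when the list is empty or when any
    token has a non-positive frequency (an assumption for this sketch).
    """
    if not freqs or any(f <= 0 for f in freqs):
        return 0.0
    # A single-token list is returned unchanged by reduce().
    return reduce(half_harmonic_mean, freqs)
```

For positive inputs, the combined value is always smaller than the smallest per-token frequency, which matches the intuition that a multi-word phrase is rarer than any of its parts. For instance, two tokens that each have frequency 0.5 combine to 0.25 under this sketch.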