update the README for Chinese

This commit is contained in:
Rob Speer 2015-09-05 03:42:54 -04:00
parent 2327f2e4d6
commit d576e3294b


@@ -111,47 +111,50 @@ limiting the selection to words that can be typed in ASCII.
## Sources and supported languages

We compiled word frequencies from seven different sources, providing us
examples of word usage on different topics at different levels of formality.
The sources (and the abbreviations we'll use for them) are:

- **LeedsIC**: The Leeds Internet Corpus
- **SUBTLEX**: The SUBTLEX word frequency lists
- **OpenSub**: Data derived from OpenSubtitles but not from SUBTLEX
- **Twitter**: Messages sampled from Twitter's public stream
- **Wpedia**: The full text of Wikipedia in 2015
- **Other**: We get additional English frequencies from Google Books Syntactic
  Ngrams 2013, and Chinese frequencies from the frequency dictionary that
  comes with the Jieba tokenizer.

The following 17 languages are well-supported, with reasonable tokenization and
at least 3 different sources of word frequencies:
    Language   Code │ SUBTLEX OpenSub LeedsIC Twitter Wpedia Other
    ────────────────┼─────────────────────────────────────────────────
    Arabic     ar   │  -      Yes     Yes     Yes     Yes    -
    German     de   │  Yes    -       Yes     Yes[1]  Yes    -
    Greek      el   │  -      Yes     Yes     Yes     Yes    -
    English    en   │  Yes    Yes     Yes     Yes     Yes    Google Books
    Spanish    es   │  -      Yes     Yes     Yes     Yes    -
    French     fr   │  -      Yes     Yes     Yes     Yes    -
    Indonesian id   │  -      Yes     -       Yes     Yes    -
    Italian    it   │  -      Yes     Yes     Yes     Yes    -
    Japanese   ja   │  -      -       Yes     Yes     Yes    -
    Malay      ms   │  -      Yes     -       Yes     Yes    -
    Dutch      nl   │  Yes    Yes     -       Yes     Yes    -
    Polish     pl   │  -      Yes     -       Yes     Yes    -
    Portuguese pt   │  -      Yes     Yes     Yes     Yes    -
    Russian    ru   │  -      Yes     Yes     Yes     Yes    -
    Swedish    sv   │  -      Yes     -       Yes     Yes    -
    Turkish    tr   │  -      Yes     -       Yes     Yes    -
    Chinese    zh   │  Yes    Yes     Yes     -       -      Jieba
Additionally, Korean is marginally supported. You can look up frequencies in
it, but we have too few data sources for it so far:

    Language   Code │ SUBTLEX LeedsIC OpenSub Twitter Wpedia
    ────────────────┼───────────────────────────────────────
    Korean     ko   │  -      -       -       Yes     Yes
[1] We've counted the frequencies from tweets in German, such as they are, but
you should be aware that German is not a frequently-used language on Twitter.
@@ -172,7 +175,8 @@ There are language-specific exceptions:
- In Japanese, instead of using the regex library, it uses the external library
  `mecab-python3`. This is an optional dependency of wordfreq, and compiling
  it requires the `libmecab-dev` system package to be installed.
- In Chinese, it uses the external Python library `jieba`, another optional
  dependency.
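The language-specific dispatch described above can be sketched as follows. This is an illustrative outline, not wordfreq's actual code: `simple_tokenize` is a hypothetical stand-in for the default regex-based tokenizer (the real one applies Unicode UAX #29 word segmentation via the `regex` library), while the Japanese and Chinese branches use the optional `mecab-python3` and `jieba` dependencies mentioned above.

```python
import re

def simple_tokenize(text):
    # Hypothetical stand-in for the default tokenizer: grabs runs of word
    # characters and case-folds them. The real tokenizer follows the
    # Unicode UAX #29 word-boundary rules instead of this simple pattern.
    return [token.casefold() for token in re.findall(r"\w+", text)]

def tokenize(text, lang):
    # Sketch of the dispatch described above. `mecab-python3` and `jieba`
    # are optional dependencies, so they are imported only when needed.
    if lang == 'ja':
        import MeCab                  # needs mecab-python3 + libmecab-dev
        return MeCab.Tagger('-Owakati').parse(text).split()
    if lang == 'zh':
        import jieba                  # pure-Python Chinese tokenizer
        return list(jieba.cut(text))
    return simple_tokenize(text)

print(tokenize('New York', 'en'))     # ['new', 'york']
```

Deferring the imports means wordfreq stays usable for the other languages even when the optional tokenizers are not installed.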
[uax29]: http://unicode.org/reports/tr29/
@@ -184,7 +188,9 @@ also try to deal gracefully when you query it with texts that actually break
into multiple tokens:
    >>> word_frequency('New York', 'en')
    0.0002315934248950231
    >>> word_frequency('北京地铁', 'zh')   # "Beijing Subway"
    2.342123813395707e-05
The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be.
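One plausible reading of "half-harmonic mean" is the reciprocal of the sum of reciprocals: for two tokens this is 1 / (1/f1 + 1/f2), which is half their harmonic mean, hence the name. The sketch below assumes that formula and illustrative frequencies; it is not necessarily wordfreq's exact implementation.

```python
def half_harmonic_mean(freqs):
    # Combine per-token frequencies into one phrase-frequency estimate.
    # For two tokens this is 1 / (1/f1 + 1/f2), i.e. half their harmonic
    # mean. (Assumed formula -- see the note above.)
    return 1.0 / sum(1.0 / f for f in freqs)

# If 'new' and 'york' each had frequency 0.001, the phrase estimate would be:
print(half_harmonic_mean([0.001, 0.001]))   # 0.0005
```

This estimate is deliberately conservative: it is always at most the smallest per-token frequency, reflecting that a phrase can be no more common than its rarest word.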