mirror of https://github.com/rspeer/wordfreq.git
synced 2025-01-14 21:25:58 +00:00

update the README for Chinese

parent 2327f2e4d6
commit d576e3294b

72 README.md
@@ -111,47 +111,50 @@ limiting the selection to words that can be typed in ASCII.
 
 ## Sources and supported languages
 
-We compiled word frequencies from five different sources, providing us examples
-of word usage on different topics at different levels of formality. The sources
-(and the abbreviations we'll use for them) are:
+We compiled word frequencies from seven different sources, providing us
+examples of word usage on different topics at different levels of formality.
+The sources (and the abbreviations we'll use for them) are:
 
-- **GBooks**: Google Books Ngrams 2013
 - **LeedsIC**: The Leeds Internet Corpus
 - **SUBTLEX**: The SUBTLEX word frequency lists
 - **OpenSub**: Data derived from OpenSubtitles but not from SUBTLEX
 - **Twitter**: Messages sampled from Twitter's public stream
-- **Wikipedia**: The full text of Wikipedia in 2015
+- **Wpedia**: The full text of Wikipedia in 2015
+- **Other**: We get additional English frequencies from Google Books Syntactic
+  Ngrams 2013, and Chinese frequencies from the frequency dictionary that
+  comes with the Jieba tokenizer.
 
-The following 12 languages are well-supported, with reasonable tokenization and
+
+The following 17 languages are well-supported, with reasonable tokenization and
 at least 3 different sources of word frequencies:
 
-    Language      Code  GBooks   SUBTLEX  OpenSub  LeedsIC  Twitter  Wikipedia
-    ──────────────────┼──────────────────────────────────────────────────
-    Arabic        ar  │ -        -        Yes      Yes      Yes      Yes
-    German        de  │ -        Yes      -        Yes      Yes[1]   Yes
-    Greek         el  │ -        -        Yes      Yes      Yes      Yes
-    English       en  │ Yes      Yes      Yes      Yes      Yes      Yes
-    Spanish       es  │ -        -        Yes      Yes      Yes      Yes
-    French        fr  │ -        -        Yes      Yes      Yes      Yes
-    Indonesian    id  │ -        -        Yes      -        Yes      Yes
-    Italian       it  │ -        -        Yes      Yes      Yes      Yes
-    Japanese      ja  │ -        -        -        Yes      Yes      Yes
-    Malay         ms  │ -        -        Yes      -        Yes      Yes
-    Dutch         nl  │ -        Yes      Yes      -        Yes      Yes
-    Polish        pl  │ -        -        Yes      -        Yes      Yes
-    Portuguese    pt  │ -        -        Yes      Yes      Yes      Yes
-    Russian       ru  │ -        -        Yes      Yes      Yes      Yes
-    Swedish       sv  │ -        -        Yes      -        Yes      Yes
-    Turkish       tr  │ -        -        Yes      -        Yes      Yes
+    Language      Code  SUBTLEX  OpenSub  LeedsIC  Twitter  Wpedia   Other
+    ──────────────────┼─────────────────────────────────────────────────────
+    Arabic        ar  │ -        Yes      Yes      Yes      Yes      -
+    German        de  │ Yes      -        Yes      Yes[1]   Yes      -
+    Greek         el  │ -        Yes      Yes      Yes      Yes      -
+    English       en  │ Yes      Yes      Yes      Yes      Yes      Google Books
+    Spanish       es  │ -        Yes      Yes      Yes      Yes      -
+    French        fr  │ -        Yes      Yes      Yes      Yes      -
+    Indonesian    id  │ -        Yes      -        Yes      Yes      -
+    Italian       it  │ -        Yes      Yes      Yes      Yes      -
+    Japanese      ja  │ -        -        Yes      Yes      Yes      -
+    Malay         ms  │ -        Yes      -        Yes      Yes      -
+    Dutch         nl  │ Yes      Yes      -        Yes      Yes      -
+    Polish        pl  │ -        Yes      -        Yes      Yes      -
+    Portuguese    pt  │ -        Yes      Yes      Yes      Yes      -
+    Russian       ru  │ -        Yes      Yes      Yes      Yes      -
+    Swedish       sv  │ -        Yes      -        Yes      Yes      -
+    Turkish       tr  │ -        Yes      -        Yes      Yes      -
+    Chinese       zh  │ Yes      Yes      Yes      -        -        Jieba
 
-These languages are only marginally supported so far. We have too few data
-sources so far in Korean (feel free to suggest some), and we are lacking
-tokenization support for Chinese.
 
-    Language      Code  GBooks   SUBTLEX  LeedsIC  OpenSub  Twitter  Wikipedia
-    ──────────────────┼──────────────────────────────────────────────────
-    Korean        ko  │ -        -        -        -        Yes      Yes
-    Chinese       zh  │ -        Yes      Yes      Yes      -        -
+Additionally, Korean is marginally supported. You can look up frequencies in
+it, but we have too few data sources for it so far:
+
+    Language      Code  SUBTLEX  LeedsIC  OpenSub  Twitter  Wpedia
+    ──────────────────┼───────────────────────────────────────
+    Korean        ko  │ -        -        -        Yes      Yes
 
 [1] We've counted the frequencies from tweets in German, such as they are, but
 you should be aware that German is not a frequently-used language on Twitter.
@@ -172,7 +175,8 @@ There are language-specific exceptions:
 - In Japanese, instead of using the regex library, it uses the external library
   `mecab-python3`. This is an optional dependency of wordfreq, and compiling
   it requires the `libmecab-dev` system package to be installed.
-- It does not yet attempt to tokenize Chinese ideograms.
+- In Chinese, it uses the external Python library `jieba`, another optional
+  dependency.
 
 [uax29]: http://unicode.org/reports/tr29/
 
@@ -184,7 +188,9 @@ also try to deal gracefully when you query it with texts that actually break
 into multiple tokens:
 
     >>> word_frequency('New York', 'en')
-    0.0002632772081925718
+    0.0002315934248950231
+    >>> word_frequency('北京地铁', 'zh')  # "Beijing Subway"
+    2.342123813395707e-05
 
 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be.
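
Note on the last hunk: the README names a "half-harmonic-mean function" without giving its formula. The sketch below is one plausible reading, not code taken from wordfreq. It assumes the function is half of the ordinary harmonic mean of two values, `a * b / (a + b)`, and that a phrase's frequency is obtained by folding that function over its tokens' frequencies. The helper `combined_frequency`, its handling of empty input and zero frequencies, and the sample numbers are all hypothetical.

```python
from functools import reduce


def half_harmonic_mean(a: float, b: float) -> float:
    """Half of the harmonic mean of two positive numbers: a*b / (a+b)."""
    return (a * b) / (a + b)


def combined_frequency(token_freqs):
    """Fold per-token frequencies into one phrase-level estimate.

    Hypothetical helper for illustration only. Returns 0.0 for empty
    input or when any token has a non-positive frequency, which is an
    assumption about how unseen tokens should be treated.
    """
    freqs = list(token_freqs)
    if not freqs or any(f <= 0 for f in freqs):
        return 0.0
    return reduce(half_harmonic_mean, freqs)


# Made-up per-token frequencies, not values from wordfreq's data.
print(combined_frequency([2e-3, 5e-4]))
```

Under this reading the combined value is always smaller than the rarest token's frequency, which matches the intuition that a multi-token phrase cannot be more common than any of its parts.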