Mirror of https://github.com/rspeer/wordfreq.git, synced 2025-01-13 12:45:58 +00:00
update README for 1.7; sort language list in English order
This commit is contained in:
parent 46e32fbd36
commit fb4a7db6f7

README.md (163 lines changed)

@@ -16,70 +16,8 @@ or by getting the repository and running its setup.py:
 
     python3 setup.py install
 
-## Additional CJK installation
-
-Chinese, Japanese, and Korean have additional external dependencies so that
-they can be tokenized correctly. Here we'll explain how to set them up,
-in increasing order of difficulty.
-
-
-### Chinese
-
-To be able to look up word frequencies in Chinese, you need Jieba, a
-pure-Python Chinese tokenizer:
-
-    pip3 install jieba
-
-
-### Japanese
-
-We use MeCab, by Taku Kudo, to tokenize Japanese. To use this in wordfreq, three
-things need to be installed:
-
-* The MeCab development library (called `libmecab-dev` on Ubuntu)
-* The UTF-8 version of the `ipadic` Japanese dictionary
-  (called `mecab-ipadic-utf8` on Ubuntu)
-* The `mecab-python3` Python interface
-
-To install these three things on Ubuntu, you can run:
-
-```sh
-sudo apt-get install libmecab-dev mecab-ipadic-utf8
-pip3 install mecab-python3
-```
-
-If you choose to install `ipadic` from somewhere else or from its source code,
-be sure it's configured to use UTF-8. By default it will use EUC-JP, which will
-give you nonsense results.
-
-
-### Korean
-
-Korean also uses MeCab, with a Korean dictionary package by Yongwoon Lee and
-Yungho Yu. This dictionary is not available as an Ubuntu package.
-
-Here's a process you can use to install the Korean dictionary and the other
-MeCab dependencies:
-
-```sh
-sudo apt-get install libmecab-dev mecab-utils
-pip3 install mecab-python3
-wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.1-20150920.tar.gz
-tar xvf mecab-ko-dic-2.0.1-20150920.tar.gz
-cd mecab-ko-dic-2.0.1-20150920
-./autogen.sh
-make
-sudo make install
-```
-
-If wordfreq cannot find the Japanese or Korean data for MeCab when asked to
-tokenize those languages, it will raise an error and show you the list of
-paths it searched.
-
-Sorry that this is difficult. We tried to just package the data files we need
-with wordfreq, like we do for Chinese, but PyPI would reject the package for
-being too large.
-
+See [Additional CJK installation](#additional-cjk-installation) for extra
+steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
 
 ## Usage
@@ -175,10 +113,10 @@ the list, in descending frequency order.
 
     >>> from wordfreq import top_n_list
     >>> top_n_list('en', 10)
-    ['the', 'to', 'of', 'and', 'a', 'in', 'i', 'is', 'that', 'for']
+    ['the', 'of', 'to', 'and', 'a', 'in', 'i', 'is', 'that', 'for']
 
     >>> top_n_list('es', 10)
-    ['de', 'la', 'que', 'en', 'el', 'y', 'a', 'los', 'no', 'se']
+    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']
 
 `iter_wordlist(lang, wordlist='combined')` iterates through all the words in a
 wordlist, in descending frequency order.
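(Editorial aside, not part of the diff: the hunk above describes `iter_wordlist` but never shows it in use, so here is a minimal sketch. It assumes only the wordfreq package itself; the exact words returned depend on the installed data.)

```python
# Sketch: peek at a wordlist lazily instead of materializing it with top_n_list.
from itertools import islice

from wordfreq import iter_wordlist

# iter_wordlist yields words in descending frequency order, so slicing the
# first few entries behaves like a small top-N query.
first_ten = list(islice(iter_wordlist('en', wordlist='combined'), 10))
print(first_ten)
```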
@@ -197,10 +135,12 @@ will select each random word from 2^n words.
 
 If you happen to want an easy way to get [a memorable, xkcd-style
 password][xkcd936] with 60 bits of entropy, this function will almost do the
-job. In this case, you should actually run the similar function `random_ascii_words`,
-limiting the selection to words that can be typed in ASCII.
+job. In this case, you should actually run the similar function
+`random_ascii_words`, limiting the selection to words that can be typed in
+ASCII. But maybe you should just use [xkpa][].
 
 [xkcd936]: https://xkcd.com/936/
+[xkpa]: https://github.com/beala/xkcd-password
 
 
 ## Sources and supported languages
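(Editorial aside: a hedged passphrase sketch for the paragraph above. It assumes `random_ascii_words` accepts the `nwords` and `bits_per_word` keyword arguments that the README describes for `random_words`; if the installed version differs, adjust the call.)

```python
# Sketch: an xkcd-style passphrase. Five words, each drawn from the 2^12 most
# common ASCII-typeable English words, gives roughly 5 * 12 = 60 bits of entropy.
from wordfreq import random_ascii_words

passphrase = random_ascii_words(lang='en', nwords=5, bits_per_word=12)
print(passphrase)
```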
@@ -230,38 +170,40 @@ least 3 different sources of word frequencies:
 Language    Code   #  Large?  │ WP    Subs  News  Books Web   Twit. Redd. Misc.
 ──────────────────────────────┼────────────────────────────────────────────────
 Arabic      ar     5  Yes     │ Yes   Yes   Yes   -     Yes   Yes   -     -
-Bosnian     bs [1] 3          │ Yes   Yes   -     -     -     Yes   -     -
+Bengali     bn     3  -       │ Yes   -     Yes   -     -     Yes   -     -
+Bosnian     bs [1] 3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Bulgarian   bg     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Catalan     ca     4  -       │ Yes   Yes   Yes   -     -     Yes   -     -
+Chinese     zh [3] 6  Yes     │ Yes   -     Yes   Yes   Yes   Yes   -     Jieba
+Croatian    hr [1] 3          │ Yes   Yes   -     -     -     Yes   -     -
 Czech       cs     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Danish      da     3  -       │ Yes   Yes   -     -     -     Yes   -     -
-German      de     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
-Greek       el     3  -       │ Yes   Yes   -     -     Yes   -     -     -
+Dutch       nl     4  Yes     │ Yes   Yes   Yes   -     -     Yes   -     -
 English     en     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
-Spanish     es     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
-Persian     fa     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Finnish     fi     5  Yes     │ Yes   Yes   Yes   -     -     Yes   Yes   -
 French      fr     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
+German      de     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
+Greek       el     3  -       │ Yes   Yes   -     -     Yes   -     -     -
 Hebrew      he     4  -       │ Yes   Yes   -     Yes   -     Yes   -     -
 Hindi       hi     3  -       │ Yes   -     -     -     -     Yes   Yes   -
-Croatian    hr [1] 3          │ Yes   Yes   -     -     -     Yes   -     -
 Hungarian   hu     3  -       │ Yes   Yes   -     -     Yes   -     -     -
 Indonesian  id     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Italian     it     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
 Japanese    ja     5  Yes     │ Yes   Yes   -     -     Yes   Yes   Yes   -
 Korean      ko     4  -       │ Yes   Yes   -     -     -     Yes   Yes   -
+Macedonian  mk     3  -       │ Yes   Yes   Yes   -     -     -     -     -
 Malay       ms     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Norwegian   nb [2] 4  -       │ Yes   Yes   -     -     -     Yes   Yes   -
-Dutch       nl     4  Yes     │ Yes   Yes   Yes   -     -     Yes   -     -
+Persian     fa     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Polish      pl     5  Yes     │ Yes   Yes   Yes   -     -     Yes   Yes   -
 Portuguese  pt     5  Yes     │ Yes   Yes   Yes   -     Yes   Yes   -     -
 Romanian    ro     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Russian     ru     6  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   -     -
 Serbian     sr [1] 3  -       │ Yes   Yes   -     -     -     Yes   -     -
+Spanish     es     7  Yes     │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
 Swedish     sv     4  -       │ Yes   Yes   -     -     -     Yes   Yes   -
 Turkish     tr     3  -       │ Yes   Yes   -     -     -     Yes   -     -
 Ukrainian   uk     4  -       │ Yes   Yes   -     -     -     Yes   Yes   -
-Chinese     zh [3] 6  Yes     │ Yes   -     Yes   Yes   Yes   Yes   -     Jieba
 
 [1] Bosnian, Croatian, and Serbian use the same underlying word list, because
 they share most of their vocabulary and grammar, they were once considered the
@@ -277,7 +219,7 @@ Chinese, with primarily Mandarin Chinese vocabulary. See "Multi-script
 languages" below.
 
 Some languages provide 'large' wordlists, including words with a Zipf frequency
-between 1.0 and 3.0. These are available in 12 languages that are covered by
+between 1.0 and 3.0. These are available in 13 languages that are covered by
 enough data sources.
 
 
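(Editorial aside on the 'large' wordlists mentioned above: a sketch assuming `available_languages` and `zipf_frequency` accept a `wordlist` argument, and that the Zipf scale is the base-10 logarithm of a word's frequency per billion words, so Zipf 3.0 means once per million words.)

```python
# Sketch: list the languages that ship a 'large' wordlist, then look up a
# rarer word with it and convert the Zipf value back to a plain frequency.
from wordfreq import available_languages, zipf_frequency

large_langs = sorted(available_languages(wordlist='large'))
print(len(large_langs), "languages have a 'large' wordlist:", large_langs)

zipf = zipf_frequency('amoxicillin', 'en', wordlist='large')
freq = 10 ** (zipf - 9)  # Zipf = log10(frequency per billion words)
print(zipf, freq)
```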
@@ -314,7 +256,7 @@ into multiple tokens:
     >>> zipf_frequency('New York', 'en')
     5.35
     >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.56
+    3.55
 
 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be. In Chinese,
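(Editorial aside on the half-harmonic-mean mentioned in the hunk above: a sketch of one way to read that rule. For two tokens it is half their harmonic mean, i.e. 1/(1/f1 + 1/f2); extending it to n tokens as 1/Σ(1/fi) is an assumption here, not a statement of wordfreq's exact internals.)

```python
# Sketch: estimate the frequency of a multi-token phrase from its tokens
# using a half-harmonic-mean style combination.
from wordfreq import word_frequency


def half_harmonic_mean(freqs):
    """Return 1 / (1/f1 + ... + 1/fn); half the harmonic mean when n == 2."""
    freqs = list(freqs)
    if not freqs or any(f == 0 for f in freqs):
        return 0.0
    return 1.0 / sum(1.0 / f for f in freqs)


estimate = half_harmonic_mean(word_frequency(t, 'en') for t in ['new', 'york'])
print(estimate)  # comparable to what the combined 'New York' entry reflects
```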
@@ -381,6 +323,71 @@ frequencies in `cmn-Hans` (the fully specific language code for Mandarin in
 Simplified Chinese), you will get the `zh` wordlist, for example.
 
 
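(Editorial aside on the language-matching behavior in the context lines above: a minimal sketch. It assumes the Chinese dependencies described in the section added below are installed, and relies only on `word_frequency` accepting a BCP 47 code such as `cmn-Hans`.)

```python
# Sketch: asking for 'cmn-Hans' (Mandarin in Simplified Chinese) falls back to
# the 'zh' wordlist, so both lookups should report the same frequency.
from wordfreq import word_frequency

print(word_frequency('谢谢', 'cmn-Hans'))  # "thank you"
print(word_frequency('谢谢', 'zh'))
```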
+## Additional CJK installation
+
+Chinese, Japanese, and Korean have additional external dependencies so that
+they can be tokenized correctly. Here we'll explain how to set them up,
+in increasing order of difficulty.
+
+
+### Chinese
+
+To be able to look up word frequencies in Chinese, you need Jieba, a
+pure-Python Chinese tokenizer:
+
+    pip3 install jieba
+
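(Editorial aside, not part of the added README text: once Jieba is installed, a quick way to confirm the Chinese pipeline works is to tokenize a phrase and look a word up. `tokenize` and `word_frequency` are existing wordfreq functions; the phrase is an arbitrary example.)

```python
# Sketch: after `pip3 install jieba`, Chinese text can be split into words
# and looked up. "我喜欢北京" means roughly "I like Beijing".
from wordfreq import tokenize, word_frequency

print(tokenize('我喜欢北京', 'zh'))  # e.g. ['我', '喜欢', '北京']
print(word_frequency('北京', 'zh'))  # a small positive frequency
```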
+
+### Japanese
+
+We use MeCab, by Taku Kudo, to tokenize Japanese. To use this in wordfreq, three
+things need to be installed:
+
+* The MeCab development library (called `libmecab-dev` on Ubuntu)
+* The UTF-8 version of the `ipadic` Japanese dictionary
+  (called `mecab-ipadic-utf8` on Ubuntu)
+* The `mecab-python3` Python interface
+
+To install these three things on Ubuntu, you can run:
+
+```sh
+sudo apt-get install libmecab-dev mecab-ipadic-utf8
+pip3 install mecab-python3
+```
+
+If you choose to install `ipadic` from somewhere else or from its source code,
+be sure it's configured to use UTF-8. By default it will use EUC-JP, which will
+give you nonsense results.
+
+
+### Korean
+
+Korean also uses MeCab, with a Korean dictionary package by Yongwoon Lee and
+Yungho Yu. This dictionary is not available as an Ubuntu package.
+
+Here's a process you can use to install the Korean dictionary and the other
+MeCab dependencies:
+
+```sh
+sudo apt-get install libmecab-dev mecab-utils
+pip3 install mecab-python3
+wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.1-20150920.tar.gz
+tar xvf mecab-ko-dic-2.0.1-20150920.tar.gz
+cd mecab-ko-dic-2.0.1-20150920
+./autogen.sh
+make
+sudo make install
+```
+
+If wordfreq cannot find the Japanese or Korean data for MeCab when asked to
+tokenize those languages, it will raise an error and show you the list of
+paths it searched.
+
+Sorry that this is difficult. We tried to just package the data files we need
+with wordfreq, like we do for Chinese, but PyPI would reject the package for
+being too large.
+
+
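(Editorial aside: after the MeCab packages above are installed, a check like the one below will either print frequencies or raise the error described in the previous paragraph, listing the dictionary paths that were searched. The words are arbitrary examples.)

```python
# Sketch: verify that MeCab and its dictionaries are found for Japanese and
# Korean; a missing dictionary surfaces as an error naming the searched paths.
from wordfreq import word_frequency

print(word_frequency('猫', 'ja'))     # Japanese: "cat"
print(word_frequency('고양이', 'ko'))  # Korean: "cat"
```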
 ## License
 
 `wordfreq` is freely redistributable under the MIT license (see