mirror of
https://github.com/rspeer/wordfreq.git
synced 2025-01-13 12:45:58 +00:00
update README for 1.7; sort language list in English order
parent 46e32fbd36
commit fb4a7db6f7
163
README.md
@@ -16,70 +16,8 @@ or by getting the repository and running its setup.py:

     python3 setup.py install

-
-## Additional CJK installation
-
-Chinese, Japanese, and Korean have additional external dependencies so that
-they can be tokenized correctly. Here we'll explain how to set them up,
-in increasing order of difficulty.
-
-
-### Chinese
-
-To be able to look up word frequencies in Chinese, you need Jieba, a
-pure-Python Chinese tokenizer:
-
-    pip3 install jieba
-
-
-### Japanese
-
-We use MeCab, by Taku Kudo, to tokenize Japanese. To use this in wordfreq, three
-things need to be installed:
-
-* The MeCab development library (called `libmecab-dev` on Ubuntu)
-* The UTF-8 version of the `ipadic` Japanese dictionary
-  (called `mecab-ipadic-utf8` on Ubuntu)
-* The `mecab-python3` Python interface
-
-To install these three things on Ubuntu, you can run:
-
-```sh
-sudo apt-get install libmecab-dev mecab-ipadic-utf8
-pip3 install mecab-python3
-```
-
-If you choose to install `ipadic` from somewhere else or from its source code,
-be sure it's configured to use UTF-8. By default it will use EUC-JP, which will
-give you nonsense results.
-
-
-### Korean
-
-Korean also uses MeCab, with a Korean dictionary package by Yongwoon Lee and
-Yungho Yu. This dictionary is not available as an Ubuntu package.
-
-Here's a process you can use to install the Korean dictionary and the other
-MeCab dependencies:
-
-```sh
-sudo apt-get install libmecab-dev mecab-utils
-pip3 install mecab-python3
-wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.1-20150920.tar.gz
-tar xvf mecab-ko-dic-2.0.1-20150920.tar.gz
-cd mecab-ko-dic-2.0.1-20150920
-./autogen.sh
-make
-sudo make install
-```
-
-If wordfreq cannot find the Japanese or Korean data for MeCab when asked to
-tokenize those languages, it will raise an error and show you the list of
-paths it searched.
-
-Sorry that this is difficult. We tried to just package the data files we need
-with wordfreq, like we do for Chinese, but PyPI would reject the package for
-being too large.
+See [Additional CJK installation][#additional-cjk-installation] for extra
+steps that are necessary to get Chinese, Japanese, and Korean word frequencies.


 ## Usage
@@ -175,10 +113,10 @@ the list, in descending frequency order.

     >>> from wordfreq import top_n_list
     >>> top_n_list('en', 10)
-    ['the', 'to', 'of', 'and', 'a', 'in', 'i', 'is', 'that', 'for']
+    ['the', 'of', 'to', 'and', 'a', 'in', 'i', 'is', 'that', 'for']

     >>> top_n_list('es', 10)
-    ['de', 'la', 'que', 'en', 'el', 'y', 'a', 'los', 'no', 'se']
+    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']

 `iter_wordlist(lang, wordlist='combined')` iterates through all the words in a
 wordlist, in descending frequency order.
@@ -197,10 +135,12 @@ will select each random word from 2^n words.

 If you happen to want an easy way to get [a memorable, xkcd-style
 password][xkcd936] with 60 bits of entropy, this function will almost do the
-job. In this case, you should actually run the similar function `random_ascii_words`,
-limiting the selection to words that can be typed in ASCII.
+job. In this case, you should actually run the similar function
+`random_ascii_words`, limiting the selection to words that can be typed in
+ASCII. But maybe you should just use [xkpa][].

 [xkcd936]: https://xkcd.com/936/
+[xkpa]: https://github.com/beala/xkcd-password


 ## Sources and supported languages
@@ -230,38 +170,40 @@ least 3 different sources of word frequencies:
 Language    Code    #  Large?   WP    Subs  News  Books Web   Twit. Redd. Misc.
 ──────────────────────────────┼────────────────────────────────────────────────
 Arabic      ar      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
-Bosnian     bs [1]  3         │ Yes   Yes   -     -     -     Yes   -     -
+Bengali     bn      3  -      │ Yes   -     Yes   -     -     Yes   -     -
+Bosnian     bs [1]  3  -      │ Yes   Yes   -     -     -     Yes   -     -
 Bulgarian   bg      3  -      │ Yes   Yes   -     -     -     Yes   -     -
 Catalan     ca      4  -      │ Yes   Yes   Yes   -     -     Yes   -     -
+Chinese     zh [3]  6  Yes    │ Yes   -     Yes   Yes   Yes   Yes   -     Jieba
+Croatian    hr [1]  3         │ Yes   Yes   -     -     -     Yes   -     -
 Czech       cs      3  -      │ Yes   Yes   -     -     -     Yes   -     -
 Danish      da      3  -      │ Yes   Yes   -     -     -     Yes   -     -
-German      de      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
-Greek       el      3  -      │ Yes   Yes   -     -     Yes   -     -     -
+Dutch       nl      4  Yes    │ Yes   Yes   Yes   -     -     Yes   -     -
 English     en      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
-Spanish     es      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
-Persian     fa      3  -      │ Yes   Yes   -     -     -     Yes   -     -
 Finnish     fi      5  Yes    │ Yes   Yes   Yes   -     -     Yes   Yes   -
 French      fr      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
+German      de      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
+Greek       el      3  -      │ Yes   Yes   -     -     Yes   -     -     -
 Hebrew      he      4  -      │ Yes   Yes   -     Yes   -     Yes   -     -
 Hindi       hi      3  -      │ Yes   -     -     -     -     Yes   Yes   -
-Croatian    hr [1]  3         │ Yes   Yes   -     -     -     Yes   -     -
 Hungarian   hu      3  -      │ Yes   Yes   -     -     Yes   -     -     -
 Indonesian  id      3  -      │ Yes   Yes   -     -     -     Yes   -     -
 Italian     it      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
 Japanese    ja      5  Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -
 Korean      ko      4  -      │ Yes   Yes   -     -     -     Yes   Yes   -
+Macedonian  mk      3  -      │ Yes   Yes   Yes   -     -     -     -     -
 Malay       ms      3  -      │ Yes   Yes   -     -     -     Yes   -     -
 Norwegian   nb [2]  4  -      │ Yes   Yes   -     -     -     Yes   Yes   -
-Dutch       nl      4  Yes    │ Yes   Yes   Yes   -     -     Yes   -     -
+Persian     fa      3  -      │ Yes   Yes   -     -     -     Yes   -     -
 Polish      pl      5  Yes    │ Yes   Yes   Yes   -     -     Yes   Yes   -
 Portuguese  pt      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
 Romanian    ro      3  -      │ Yes   Yes   -     -     -     Yes   -     -
 Russian     ru      6  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   -     -
 Serbian     sr [1]  3  -      │ Yes   Yes   -     -     -     Yes   -     -
+Spanish     es      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
 Swedish     sv      4  -      │ Yes   Yes   -     -     -     Yes   Yes   -
 Turkish     tr      3  -      │ Yes   Yes   -     -     -     Yes   -     -
 Ukrainian   uk      4  -      │ Yes   Yes   -     -     -     Yes   Yes   -
-Chinese     zh [3]  6  Yes    │ Yes   -     Yes   Yes   Yes   Yes   -     Jieba

 [1] Bosnian, Croatian, and Serbian use the same underlying word list, because
 they share most of their vocabulary and grammar, they were once considered the
@@ -277,7 +219,7 @@ Chinese, with primarily Mandarin Chinese vocabulary. See "Multi-script
 languages" below.

 Some languages provide 'large' wordlists, including words with a Zipf frequency
-between 1.0 and 3.0. These are available in 12 languages that are covered by
+between 1.0 and 3.0. These are available in 13 languages that are covered by
 enough data sources.


@@ -314,7 +256,7 @@ into multiple tokens:
     >>> zipf_frequency('New York', 'en')
     5.35
     >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.56
+    3.55

 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be. In Chinese,
@@ -381,6 +323,71 @@ frequencies in `cmn-Hans` (the fully specific language code for Mandarin in
 Simplified Chinese), you will get the `zh` wordlist, for example.


+## Additional CJK installation
+
+Chinese, Japanese, and Korean have additional external dependencies so that
+they can be tokenized correctly. Here we'll explain how to set them up,
+in increasing order of difficulty.
+
+
+### Chinese
+
+To be able to look up word frequencies in Chinese, you need Jieba, a
+pure-Python Chinese tokenizer:
+
+    pip3 install jieba
+
+
+### Japanese
+
+We use MeCab, by Taku Kudo, to tokenize Japanese. To use this in wordfreq, three
+things need to be installed:
+
+* The MeCab development library (called `libmecab-dev` on Ubuntu)
+* The UTF-8 version of the `ipadic` Japanese dictionary
+  (called `mecab-ipadic-utf8` on Ubuntu)
+* The `mecab-python3` Python interface
+
+To install these three things on Ubuntu, you can run:
+
+```sh
+sudo apt-get install libmecab-dev mecab-ipadic-utf8
+pip3 install mecab-python3
+```
+
+If you choose to install `ipadic` from somewhere else or from its source code,
+be sure it's configured to use UTF-8. By default it will use EUC-JP, which will
+give you nonsense results.
+
+
+### Korean
+
+Korean also uses MeCab, with a Korean dictionary package by Yongwoon Lee and
+Yungho Yu. This dictionary is not available as an Ubuntu package.
+
+Here's a process you can use to install the Korean dictionary and the other
+MeCab dependencies:
+
+```sh
+sudo apt-get install libmecab-dev mecab-utils
+pip3 install mecab-python3
+wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.1-20150920.tar.gz
+tar xvf mecab-ko-dic-2.0.1-20150920.tar.gz
+cd mecab-ko-dic-2.0.1-20150920
+./autogen.sh
+make
+sudo make install
+```
+
+If wordfreq cannot find the Japanese or Korean data for MeCab when asked to
+tokenize those languages, it will raise an error and show you the list of
+paths it searched.
+
+Sorry that this is difficult. We tried to just package the data files we need
+with wordfreq, like we do for Chinese, but PyPI would reject the package for
+being too large.
+
+
 ## License

 `wordfreq` is freely redistributable under the MIT license (see