Mirror of https://github.com/rspeer/wordfreq.git, synced 2024-12-23 09:21:37 +00:00
Merge pull request #51 from LuminosoInsight/version1.7
Version 1.7: update tokenization, update Wikipedia data, add languages
This commit is contained in: commit 95a13ab4ce

CHANGELOG.md (20 changed lines)
@@ -1,3 +1,23 @@

## Version 1.7.0 (2017-08-25)

- Tokenization will always keep Unicode graphemes together, including
  complex emoji introduced in Unicode 10 (see the sketch after this list)
- Update the Wikipedia source data to April 2017
- Remove some non-words, such as the Unicode replacement character and the
  pilcrow sign, from frequency lists
- Support Bengali and Macedonian, which passed the threshold of having enough
  source data to be included
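A minimal sketch of the grapheme-preserving tokenization from the first bullet; the sample string and the expected output are assumptions for illustration, not output captured from this release:

```python
from wordfreq import tokenize

# A family emoji is one grapheme built from several codepoints joined with
# zero-width joiners; the updated tokenizer should keep it as a single token.
print(tokenize('I love my 👨‍👩‍👧‍👦', 'en'))
# expected (assumed): ['i', 'love', 'my', '👨‍👩‍👧‍👦']
```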
## Version 1.6.1 (2017-05-10)

- Depend on langcodes 1.4, with a new language-matching system that does not
  depend on SQLite.

  This prevents silly conflicts where langcodes' SQLite connection was
  preventing langcodes from being used in threads.


## Version 1.6.0 (2017-01-05)

- Support Czech, Persian, Ukrainian, and Croatian/Bosnian/Serbian

README.md (163 changed lines)

@@ -16,70 +16,8 @@ or by getting the repository and running its setup.py:

    python3 setup.py install

## Additional CJK installation

Chinese, Japanese, and Korean have additional external dependencies so that
they can be tokenized correctly. Here we'll explain how to set them up,
in increasing order of difficulty.

### Chinese

To be able to look up word frequencies in Chinese, you need Jieba, a
pure-Python Chinese tokenizer:

    pip3 install jieba

### Japanese

We use MeCab, by Taku Kudo, to tokenize Japanese. To use this in wordfreq, three
things need to be installed:

* The MeCab development library (called `libmecab-dev` on Ubuntu)
* The UTF-8 version of the `ipadic` Japanese dictionary
  (called `mecab-ipadic-utf8` on Ubuntu)
* The `mecab-python3` Python interface

To install these three things on Ubuntu, you can run:

```sh
sudo apt-get install libmecab-dev mecab-ipadic-utf8
pip3 install mecab-python3
```

If you choose to install `ipadic` from somewhere else or from its source code,
be sure it's configured to use UTF-8. By default it will use EUC-JP, which will
give you nonsense results.

### Korean

Korean also uses MeCab, with a Korean dictionary package by Yongwoon Lee and
Yungho Yu. This dictionary is not available as an Ubuntu package.

Here's a process you can use to install the Korean dictionary and the other
MeCab dependencies:

```sh
sudo apt-get install libmecab-dev mecab-utils
pip3 install mecab-python3
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.1-20150920.tar.gz
tar xvf mecab-ko-dic-2.0.1-20150920.tar.gz
cd mecab-ko-dic-2.0.1-20150920
./autogen.sh
make
sudo make install
```

If wordfreq cannot find the Japanese or Korean data for MeCab when asked to
tokenize those languages, it will raise an error and show you the list of
paths it searched.

Sorry that this is difficult. We tried to just package the data files we need
with wordfreq, like we do for Chinese, but PyPI would reject the package for
being too large.

See [Additional CJK installation](#additional-cjk-installation) for extra
steps that are necessary to get Chinese, Japanese, and Korean word frequencies.

## Usage

@@ -175,10 +113,10 @@ the list, in descending frequency order.

    >>> from wordfreq import top_n_list
    >>> top_n_list('en', 10)
    ['the', 'to', 'of', 'and', 'a', 'in', 'i', 'is', 'that', 'for']
    ['the', 'of', 'to', 'and', 'a', 'in', 'i', 'is', 'that', 'for']

    >>> top_n_list('es', 10)
    ['de', 'la', 'que', 'en', 'el', 'y', 'a', 'los', 'no', 'se']
    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'no', 'se']

`iter_wordlist(lang, wordlist='combined')` iterates through all the words in a
wordlist, in descending frequency order.
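For instance, a minimal sketch that peeks at the first few entries of a wordlist; the language is arbitrary and the exact words depend on the data installed:

```python
from itertools import islice

from wordfreq import iter_wordlist

# iter_wordlist yields words in descending frequency order, so slicing the
# front of the iterator is another way to look at the most common words.
for word in islice(iter_wordlist('en'), 5):
    print(word)
```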
@@ -197,10 +135,12 @@ will select each random word from 2^n words.

If you happen to want an easy way to get [a memorable, xkcd-style
password][xkcd936] with 60 bits of entropy, this function will almost do the
job. In this case, you should actually run the similar function `random_ascii_words`,
limiting the selection to words that can be typed in ASCII.

job. In this case, you should actually run the similar function
`random_ascii_words`, limiting the selection to words that can be typed in
ASCII. But maybe you should just use [xkpa][].

[xkcd936]: https://xkcd.com/936/
[xkpa]: https://github.com/beala/xkcd-password
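A small sketch of that passphrase use, assuming the default of 12 bits per word so that five words give about 60 bits of entropy; the keyword arguments match those exercised in the test change further down:

```python
from wordfreq import random_ascii_words

# Five words at 12 bits per word is roughly a 60-bit passphrase.
print(random_ascii_words(lang='en', nwords=5, bits_per_word=12))
```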

## Sources and supported languages

@@ -230,38 +170,40 @@ least 3 different sources of word frequencies:

    Language   Code   # Large?  WP  Subs News Books Web Twit. Redd. Misc.
    ──────────────────────────┼──────────────────────────────────────────
    Arabic     ar     5 Yes   │ Yes Yes  Yes  -     Yes Yes   -     -
    Bosnian    bs [1] 3       │ Yes Yes  -    -     -   Yes   -     -
    Bengali    bn     3 -     │ Yes -    Yes  -     -   Yes   -     -
    Bosnian    bs [1] 3 -     │ Yes Yes  -    -     -   Yes   -     -
    Bulgarian  bg     3 -     │ Yes Yes  -    -     -   Yes   -     -
    Catalan    ca     4 -     │ Yes Yes  Yes  -     -   Yes   -     -
    Chinese    zh [3] 6 Yes   │ Yes -    Yes  Yes   Yes Yes   -     Jieba
    Croatian   hr [1] 3       │ Yes Yes  -    -     -   Yes   -     -
    Czech      cs     3 -     │ Yes Yes  -    -     -   Yes   -     -
    Danish     da     3 -     │ Yes Yes  -    -     -   Yes   -     -
    German     de     7 Yes   │ Yes Yes  Yes  Yes   Yes Yes   Yes   -
    Greek      el     3 -     │ Yes Yes  -    -     Yes -     -     -
    Dutch      nl     4 Yes   │ Yes Yes  Yes  -     -   Yes   -     -
    English    en     7 Yes   │ Yes Yes  Yes  Yes   Yes Yes   Yes   -
    Spanish    es     7 Yes   │ Yes Yes  Yes  Yes   Yes Yes   Yes   -
    Persian    fa     3 -     │ Yes Yes  -    -     -   Yes   -     -
    Finnish    fi     5 Yes   │ Yes Yes  Yes  -     -   Yes   Yes   -
    French     fr     7 Yes   │ Yes Yes  Yes  Yes   Yes Yes   Yes   -
    German     de     7 Yes   │ Yes Yes  Yes  Yes   Yes Yes   Yes   -
    Greek      el     3 -     │ Yes Yes  -    -     Yes -     -     -
    Hebrew     he     4 -     │ Yes Yes  -    Yes   -   Yes   -     -
    Hindi      hi     3 -     │ Yes -    -    -     -   Yes   Yes   -
    Croatian   hr [1] 3       │ Yes Yes  -    -     -   Yes   -     -
    Hungarian  hu     3 -     │ Yes Yes  -    -     Yes -     -     -
    Indonesian id     3 -     │ Yes Yes  -    -     -   Yes   -     -
    Italian    it     7 Yes   │ Yes Yes  Yes  Yes   Yes Yes   Yes   -
    Japanese   ja     5 Yes   │ Yes Yes  -    -     Yes Yes   Yes   -
    Korean     ko     4 -     │ Yes Yes  -    -     -   Yes   Yes   -
    Macedonian mk     3 -     │ Yes Yes  Yes  -     -   -     -     -
    Malay      ms     3 -     │ Yes Yes  -    -     -   Yes   -     -
    Norwegian  nb [2] 4 -     │ Yes Yes  -    -     -   Yes   Yes   -
    Dutch      nl     4 Yes   │ Yes Yes  Yes  -     -   Yes   -     -
    Persian    fa     3 -     │ Yes Yes  -    -     -   Yes   -     -
    Polish     pl     5 Yes   │ Yes Yes  Yes  -     -   Yes   Yes   -
    Portuguese pt     5 Yes   │ Yes Yes  Yes  -     Yes Yes   -     -
    Romanian   ro     3 -     │ Yes Yes  -    -     -   Yes   -     -
    Russian    ru     6 Yes   │ Yes Yes  Yes  Yes   Yes Yes   -     -
    Serbian    sr [1] 3 -     │ Yes Yes  -    -     -   Yes   -     -
    Spanish    es     7 Yes   │ Yes Yes  Yes  Yes   Yes Yes   Yes   -
    Swedish    sv     4 -     │ Yes Yes  -    -     -   Yes   Yes   -
    Turkish    tr     3 -     │ Yes Yes  -    -     -   Yes   -     -
    Ukrainian  uk     4 -     │ Yes Yes  -    -     -   Yes   Yes   -
    Chinese    zh [3] 6 Yes   │ Yes -    Yes  Yes   Yes Yes   -     Jieba

[1] Bosnian, Croatian, and Serbian use the same underlying word list, because
they share most of their vocabulary and grammar, they were once considered the
@@ -277,7 +219,7 @@ Chinese, with primarily Mandarin Chinese vocabulary. See "Multi-script

languages" below.

Some languages provide 'large' wordlists, including words with a Zipf frequency
between 1.0 and 3.0. These are available in 12 languages that are covered by
between 1.0 and 3.0. These are available in 13 languages that are covered by
enough data sources.
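A rough sketch of what the 'large' wordlists add; the word chosen and the behavior described in the comments are assumptions, not measurements from this data:

```python
from wordfreq import zipf_frequency

# A rare-ish word: likely absent (0.0) from the default 'combined' list,
# but present in the 'large' list, which reaches down to Zipf 1.0.
print(zipf_frequency('amoxicillin', 'en'))
print(zipf_frequency('amoxicillin', 'en', wordlist='large'))
```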
@@ -314,7 +256,7 @@ into multiple tokens:

    >>> zipf_frequency('New York', 'en')
    5.35
    >>> zipf_frequency('北京地铁', 'zh') # "Beijing Subway"
    3.56
    3.55

The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be. In Chinese,
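As a worked sketch of the half-harmonic-mean combination mentioned above: for two token frequencies f1 and f2, half of their harmonic mean is 1 / (1/f1 + 1/f2), which is dominated by the rarer token. The numbers are made up, and any additional smoothing wordfreq applies is not modeled here:

```python
def half_harmonic_mean(f1, f2):
    # Half of the harmonic mean of two frequencies.
    return 1.0 / (1.0 / f1 + 1.0 / f2)

# Made-up per-token frequencies for a two-token phrase.
print(half_harmonic_mean(1e-3, 1e-5))  # ~9.9e-06, close to the rarer frequency
```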
@@ -381,6 +323,71 @@ frequencies in `cmn-Hans` (the fully specific language code for Mandarin in

Simplified Chinese), you will get the `zh` wordlist, for example.


## Additional CJK installation

Chinese, Japanese, and Korean have additional external dependencies so that
they can be tokenized correctly. Here we'll explain how to set them up,
in increasing order of difficulty.

### Chinese

To be able to look up word frequencies in Chinese, you need Jieba, a
pure-Python Chinese tokenizer:

    pip3 install jieba

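Once Jieba is installed, Chinese lookups work like any other language. A minimal sketch; the word and the magnitude in the comment are illustrative assumptions:

```python
from wordfreq import word_frequency, zipf_frequency

# '谢谢' ("thank you"); Jieba handles the tokenization behind the scenes.
print(word_frequency('谢谢', 'zh'))  # a proportion of all words, e.g. around 1e-4
print(zipf_frequency('谢谢', 'zh'))  # the same quantity on the Zipf scale
```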
### Japanese

We use MeCab, by Taku Kudo, to tokenize Japanese. To use this in wordfreq, three
things need to be installed:

* The MeCab development library (called `libmecab-dev` on Ubuntu)
* The UTF-8 version of the `ipadic` Japanese dictionary
  (called `mecab-ipadic-utf8` on Ubuntu)
* The `mecab-python3` Python interface

To install these three things on Ubuntu, you can run:

```sh
sudo apt-get install libmecab-dev mecab-ipadic-utf8
pip3 install mecab-python3
```

If you choose to install `ipadic` from somewhere else or from its source code,
be sure it's configured to use UTF-8. By default it will use EUC-JP, which will
give you nonsense results.

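With MeCab and `mecab-python3` installed, Japanese text can be tokenized and looked up. A minimal sketch; no output is shown because it depends on the installed dictionary:

```python
from wordfreq import tokenize, word_frequency

# MeCab segments the sentence into words before any frequency lookup.
print(tokenize('おはようございます', 'ja'))
print(word_frequency('猫', 'ja'))  # frequency of "cat" as a proportion of words
```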
### Korean

Korean also uses MeCab, with a Korean dictionary package by Yongwoon Lee and
Yungho Yu. This dictionary is not available as an Ubuntu package.

Here's a process you can use to install the Korean dictionary and the other
MeCab dependencies:

```sh
sudo apt-get install libmecab-dev mecab-utils
pip3 install mecab-python3
wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.1-20150920.tar.gz
tar xvf mecab-ko-dic-2.0.1-20150920.tar.gz
cd mecab-ko-dic-2.0.1-20150920
./autogen.sh
make
sudo make install
```

If wordfreq cannot find the Japanese or Korean data for MeCab when asked to
tokenize those languages, it will raise an error and show you the list of
paths it searched.

Sorry that this is difficult. We tried to just package the data files we need
with wordfreq, like we do for Chinese, but PyPI would reject the package for
being too large.

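Once the Korean dictionary above is installed, lookups work like any other language. A minimal sketch; the word choice is an illustrative assumption:

```python
from wordfreq import zipf_frequency

# '감사합니다' ("thank you"), segmented by MeCab with the mecab-ko-dic dictionary.
print(zipf_frequency('감사합니다', 'ko'))
```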
## License

`wordfreq` is freely redistributable under the MIT license (see

scripts/top_n.py (new file, 14 lines)

@@ -0,0 +1,14 @@
"""
A quick script to output the top N words (1000 for now) in each language.
You can send the output to a file and diff it to see changes between wordfreq
versions.
"""
import wordfreq


N = 1000


for lang in sorted(wordfreq.available_languages()):
    for word in wordfreq.top_n_list(lang, 1000):
        print('{}\t{}'.format(lang, word))
@@ -35,6 +35,8 @@ LAUGHTER_WORDS = {

    'he': 'חחח',
    'bg': 'ахаха',
    'uk': 'хаха',
    'bn': 'হা হা',
    'mk': 'хаха'
}
@@ -190,7 +192,7 @@ def test_not_really_random():

    # This not only tests random_ascii_words, it makes sure we didn't end
    # up with 'eos' as a very common Japanese word
    eq_(random_ascii_words(nwords=4, lang='ja', bits_per_word=0),
        '00 00 00 00')
        '1 1 1 1')


@raises(ValueError)
BIN wordfreq/data/combined_bn.msgpack.gz (new file)
BIN wordfreq/data/combined_mk.msgpack.gz (new file)
BIN wordfreq/data/twitter_bn.msgpack.gz (new file)

The commit's other binary data files are not shown, and one file diff was
suppressed because it is too large.