mirror of
https://github.com/rspeer/wordfreq.git
synced 2024-12-23 17:31:41 +00:00
Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5. Things done in this update include:
* include Common Crawl; support 11 more languages (see the example after this list)
* new frequency-merging strategy
* New sources: Chinese from Wikipedia (mostly Traditional), and a large Dutch wordlist
* Remove lower-quality sources, namely Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is dropped as an available language, and Greek is no longer a 'large' language.
* Add Korean tokenization, and include MeCab files in data
* Remove marks from more languages
* Deal with commas and cedillas in Turkish and Romanian
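
As a rough sketch of what the expanded language support looks like from the public API (the Korean lookup mirrors the updated test in this commit; exact frequency values depend on the built data):

    >>> from wordfreq import word_frequency
    >>> # Korean laughter should now have a nonzero frequency in the informal
    >>> # 'twitter' wordlist
    >>> word_frequency('ᄏᄏᄏ', 'ko', wordlist='twitter') > 0
    True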
Former-commit-id: e6a8f028e3
README.md (114 lines changed)
@@ -60,16 +60,16 @@ frequencies by a million (1e6) to get more readable numbers:
 
     >>> from wordfreq import word_frequency
     >>> word_frequency('cafe', 'en') * 1e6
-    14.45439770745928
+    12.88249551693135
 
     >>> word_frequency('café', 'en') * 1e6
-    4.7863009232263805
+    3.3884415613920273
 
     >>> word_frequency('cafe', 'fr') * 1e6
-    2.0417379446695274
+    2.6302679918953817
 
     >>> word_frequency('café', 'fr') * 1e6
-    77.62471166286912
+    87.09635899560814
 
 
 `zipf_frequency` is a variation on `word_frequency` that aims to return the
@@ -85,20 +85,21 @@ described above, the minimum Zipf value appearing in these lists is 1.0 for the
 for words that do not appear in the given wordlist, although it should mean
 one occurrence per billion words.
 
+    >>> from wordfreq import zipf_frequency
     >>> zipf_frequency('the', 'en')
-    7.59
+    7.67
 
     >>> zipf_frequency('word', 'en')
-    5.34
+    5.39
 
     >>> zipf_frequency('frequency', 'en')
-    4.44
+    4.19
 
     >>> zipf_frequency('zipf', 'en')
     0.0
 
     >>> zipf_frequency('zipf', 'en', wordlist='large')
-    1.42
+    1.65
 
 
 The parameters to `word_frequency` and `zipf_frequency` are:
@@ -128,10 +129,10 @@ the list, in descending frequency order.
 
     >>> from wordfreq import top_n_list
     >>> top_n_list('en', 10)
-    ['the', 'of', 'to', 'in', 'and', 'a', 'i', 'you', 'is', 'it']
+    ['the', 'i', 'to', 'a', 'and', 'of', 'you', 'in', 'that', 'is']
 
     >>> top_n_list('es', 10)
-    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'no', 'los', 'es']
+    ['de', 'que', 'la', 'y', 'a', 'en', 'el', 'no', 'los', 'es']
 
 `iter_wordlist(lang, wordlist='combined')` iterates through all the words in a
 wordlist, in descending frequency order.
@@ -168,48 +169,56 @@ The sources (and the abbreviations we'll use for them) are:
 
 - **Twitter**: Messages sampled from Twitter's public stream
 - **Wpedia**: The full text of Wikipedia in 2015
 - **Reddit**: The corpus of Reddit comments through May 2015
+- **CCrawl**: Text extracted from the Common Crawl and language-detected with cld2
 - **Other**: We get additional English frequencies from Google Books Syntactic
   Ngrams 2013, and Chinese frequencies from the frequency dictionary that
   comes with the Jieba tokenizer.
 
-The following 17 languages are well-supported, with reasonable tokenization and
-at least 3 different sources of word frequencies:
+The following 27 languages are supported, with reasonable tokenization and at
+least 3 different sources of word frequencies:
 
-    Language   Code  │ SUBTLEX  OpenSub  LeedsIC  Twitter  Wpedia  Reddit  Other
-    ──────────────────┼─────────────────────────────────────────────────────
-    Arabic     ar    │    -       Yes      Yes      Yes     Yes      -      -
-    German     de    │   Yes       -       Yes     Yes[1]   Yes      -      -
-    Greek      el    │    -       Yes      Yes      Yes     Yes      -      -
-    English    en    │   Yes      Yes      Yes      Yes     Yes     Yes     Google Books
-    Spanish    es    │    -       Yes      Yes      Yes     Yes      -      -
-    French     fr    │    -       Yes      Yes      Yes     Yes      -      -
-    Indonesian id    │    -       Yes       -       Yes     Yes      -      -
-    Italian    it    │    -       Yes      Yes      Yes     Yes      -      -
-    Japanese   ja    │    -        -       Yes      Yes     Yes      -      -
-    Malay      ms    │    -       Yes       -       Yes     Yes      -      -
-    Dutch      nl    │   Yes      Yes       -       Yes     Yes      -      -
-    Polish     pl    │    -       Yes       -       Yes     Yes      -      -
-    Portuguese pt    │    -       Yes      Yes      Yes     Yes      -      -
-    Russian    ru    │    -       Yes      Yes      Yes     Yes      -      -
-    Swedish    sv    │    -       Yes       -       Yes     Yes      -      -
-    Turkish    tr    │    -       Yes       -       Yes     Yes      -      -
-    Chinese    zh    │   Yes       -       Yes       -       -       -      Jieba
+    Language    Code   Sources  Large?  │ SUBTLEX  OpenSub  LeedsIC  Twitter  Wpedia  CCrawl  Reddit  Other
+    ───────────────────────────────────┼──────────────────────────────────────────────────────────────
+    Arabic      ar     5        Yes     │    -       Yes      Yes      Yes     Yes     Yes      -      -
+    Bulgarian   bg     3        -       │    -       Yes       -        -      Yes     Yes      -      -
+    Catalan     ca     3        -       │    -       Yes       -       Yes     Yes      -       -      -
+    Danish      da     3        -       │    -       Yes       -        -      Yes     Yes      -      -
+    German      de     5        Yes     │   Yes       -       Yes      Yes     Yes     Yes      -      -
+    Greek       el     4        -       │    -       Yes      Yes       -      Yes     Yes      -      -
+    English     en     7        Yes     │   Yes      Yes      Yes      Yes     Yes      -      Yes     Google Books
+    Spanish     es     6        Yes     │    -       Yes      Yes      Yes     Yes     Yes     Yes     -
+    Finnish     fi     3        -       │    -       Yes       -        -      Yes     Yes      -      -
+    French      fr     5        Yes     │    -       Yes      Yes      Yes     Yes     Yes      -      -
+    Hebrew      he     4        -       │    -       Yes       -       Yes     Yes     Yes      -      -
+    Hindi       hi     3        -       │    -        -        -       Yes     Yes     Yes      -      -
+    Hungarian   hu     3        -       │    -       Yes       -        -      Yes     Yes      -      -
+    Indonesian  id     4        -       │    -       Yes       -       Yes     Yes     Yes      -      -
+    Italian     it     5        Yes     │    -       Yes      Yes      Yes     Yes     Yes      -      -
+    Japanese    ja     4        -       │    -        -       Yes      Yes     Yes     Yes      -      -
+    Korean      ko     3        -       │    -        -        -       Yes     Yes     Yes      -      -
+    Malay       ms     4        -       │    -       Yes       -       Yes     Yes     Yes      -      -
+    Norwegian   nb[1]  3        -       │    -       Yes       -        -      Yes     Yes      -      -
+    Dutch       nl     5        Yes     │   Yes      Yes       -       Yes     Yes     Yes      -      -
+    Polish      pl     4        -       │    -       Yes       -       Yes     Yes     Yes      -      -
+    Portuguese  pt     5        Yes     │    -       Yes      Yes      Yes     Yes     Yes      -      -
+    Romanian    ro     3        -       │    -       Yes       -        -      Yes     Yes      -      -
+    Russian     ru     5        Yes     │    -       Yes      Yes      Yes     Yes     Yes      -      -
+    Swedish     sv     4        -       │    -       Yes       -       Yes     Yes     Yes      -      -
+    Turkish     tr     4        -       │    -       Yes       -       Yes     Yes     Yes      -      -
+    Chinese     zh[2]  5        -       │   Yes       -       Yes       -      Yes     Yes      -      Jieba
 
-Additionally, Korean is marginally supported. You can look up frequencies in
-it, but it will be insufficiently tokenized into words, and we have too few
-data sources for it so far:
+[1] The Norwegian text we have is specifically written in Norwegian Bokmål, so
+we give it the language code 'nb'. We would use 'nn' for Nynorsk, but there
+isn't enough data to include it in wordfreq.
 
-    Language   Code  │ SUBTLEX  OpenSub  LeedsIC  Twitter  Wpedia  Reddit
-    ──────────────────┼───────────────────────────────────────────────
-    Korean     ko    │    -        -        -       Yes     Yes      -
+[2] This data represents text written in both Simplified and Traditional
+Chinese. (SUBTLEX is mostly Simplified, while Wikipedia is mostly Traditional.)
+The characters are mapped to one another so they can use the same word
+frequency list.
 
-The 'large' wordlists are available in English, German, Spanish, French, and
-Portuguese.
-
-[1] We've counted the frequencies from tweets in German, such as they are, but
-you should be aware that German is not a frequently-used language on Twitter.
-Germans just don't tweet that much.
+Some languages provide 'large' wordlists, including words with a Zipf frequency
+between 1.0 and 3.0. These are available in 9 languages that are covered by
+enough data sources.
 
 
 ## Tokenization
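
A quick way to sanity-check the new coverage from Python (a sketch: `available_languages` is the same helper the updated tests call below; it should report 27 languages for the default 'combined' wordlist and a smaller set for the informal 'twitter' list, assuming the data built for this release matches the configuration in this commit):

    >>> from wordfreq import available_languages
    >>> len(available_languages())
    27
    >>> sorted(available_languages('twitter'))[:5]
    ['ar', 'ca', 'de', 'en', 'es']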
@@ -223,10 +232,13 @@ splits words between apostrophes and vowels.
 
 There are language-specific exceptions:
 
-- In Arabic, it additionally normalizes ligatures and removes combining marks.
+- In Arabic and Hebrew, it additionally normalizes ligatures and removes
+  combining marks.
 
-- In Japanese, instead of using the regex library, it uses the external library
-  `mecab-python3`. This is an optional dependency of wordfreq, and compiling
-  it requires the `libmecab-dev` system package to be installed.
+- In Japanese and Korean, instead of using the regex library, it uses the
+  external library `mecab-python3`. This is an optional dependency of wordfreq,
+  and compiling it requires the `libmecab-dev` system package to be installed.
 
 - In Chinese, it uses the external Python library `jieba`, another optional
   dependency.
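
For example, the language-specific handling above is visible directly through `tokenize`; these calls mirror the updated tests further down in this diff:

    >>> from wordfreq import tokenize
    >>> tokenize('דֻּגְמָה', 'he')     # Hebrew vowel points are removed
    ['דגמה']
    >>> tokenize('KİȘİNİN', 'tr')      # Turkish dotted İ and comma-below Ș are normalized
    ['kişinin']
    >>> tokenize('ACELAŞI', 'ro')      # Romanian ş with cedilla becomes ș with comma below
    ['același']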
@@ -240,9 +252,9 @@ also try to deal gracefully when you query it with texts that actually break
 into multiple tokens:
 
     >>> zipf_frequency('New York', 'en')
-    5.31
+    5.07
     >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.51
+    3.58
 
 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be. In Chinese,
@@ -257,7 +269,7 @@ you give it an uncommon combination of tokens, it will hugely over-estimate
 their frequency:
 
     >>> zipf_frequency('owl-flavored', 'en')
-    3.18
+    3.19
 
 
 ## License
setup.py (2 lines changed)
@@ -34,7 +34,7 @@ if sys.version_info < (3, 4):
 
 setup(
     name="wordfreq",
-    version='1.4.2',
+    version='1.5',
     maintainer='Luminoso Technologies, Inc.',
     maintainer_email='info@luminoso.com',
     url='http://github.com/LuminosoInsight/wordfreq/',
@@ -19,23 +19,43 @@ def test_freq_examples():
 def test_languages():
     # Make sure the number of available languages doesn't decrease
     avail = available_languages()
-    assert_greater(len(avail), 15)
+    assert_greater(len(avail), 26)
 
+    avail_twitter = available_languages('twitter')
+    assert_greater(len(avail_twitter), 15)
     # Look up a word representing laughter in each language, and make sure
-    # it has a non-zero frequency.
-    for lang in avail:
-        if lang in {'zh', 'ja'}:
+    # it has a non-zero frequency in the informal 'twitter' list.
+    for lang in avail_twitter:
+        if lang == 'zh' or lang == 'ja':
             text = '笑'
+        elif lang == 'ko':
+            text = 'ᄏᄏᄏ'
         elif lang == 'ar':
             text = 'ههههه'
+        elif lang == 'ca' or lang == 'es':
+            text = 'jaja'
+        elif lang in {'de', 'nb', 'sv', 'da'}:
+            text = 'haha'
+        elif lang == 'pt':
+            text = 'kkkk'
+        elif lang == 'he':
+            text = 'חחח'
+        elif lang == 'ru':
+            text = 'лол'
+        elif lang == 'bg':
+            text = 'хаха'
+        elif lang == 'ro':
+            text = 'haha'
+        elif lang == 'el':
+            text = 'χαχα'
         else:
             text = 'lol'
-        assert_greater(word_frequency(text, lang), 0)
+        assert_greater(word_frequency(text, lang, wordlist='twitter'), 0, (text, lang))
 
         # Make up a weirdly verbose language code and make sure
         # we still get it
         new_lang_code = '%s-001-x-fake-extension' % lang.upper()
-        assert_greater(word_frequency(text, new_lang_code), 0, (text, new_lang_code))
+        assert_greater(word_frequency(text, new_lang_code, wordlist='twitter'), 0, (text, new_lang_code))
 
 
 def test_twitter():
@@ -62,7 +82,7 @@ def test_most_common_words():
         """
         return top_n_list(lang, 1)[0]
 
-    eq_(get_most_common('ar'), 'في')
+    eq_(get_most_common('ar'), 'من')
     eq_(get_most_common('de'), 'die')
     eq_(get_most_common('en'), 'the')
     eq_(get_most_common('es'), 'de')
@@ -144,12 +164,12 @@ def test_not_really_random():
     # This not only tests random_ascii_words, it makes sure we didn't end
     # up with 'eos' as a very common Japanese word
     eq_(random_ascii_words(nwords=4, lang='ja', bits_per_word=0),
-        'rt rt rt rt')
+        '1 1 1 1')
 
 
 @raises(ValueError)
 def test_not_enough_ascii():
-    random_ascii_words(lang='zh')
+    random_ascii_words(lang='zh', bits_per_word=14)
 
 
 def test_arabic():
@@ -199,3 +219,10 @@ def test_other_languages():
     # Remove vowel points in Hebrew
     eq_(tokenize('דֻּגְמָה', 'he'), ['דגמה'])
 
+    # Deal with commas, cedillas, and I's in Turkish
+    eq_(tokenize('kișinin', 'tr'), ['kişinin'])
+    eq_(tokenize('KİȘİNİN', 'tr'), ['kişinin'])
+
+    # Deal with cedillas that should be commas-below in Romanian
+    eq_(tokenize('acelaşi', 'ro'), ['același'])
+    eq_(tokenize('ACELAŞI', 'ro'), ['același'])
Binary data files (contents not shown):

    New wordlists added in this commit:
        wordfreq/data/combined_bg.msgpack.gz    wordfreq/data/combined_ca.msgpack.gz
        wordfreq/data/combined_da.msgpack.gz    wordfreq/data/combined_fi.msgpack.gz
        wordfreq/data/combined_he.msgpack.gz    wordfreq/data/combined_hi.msgpack.gz
        wordfreq/data/combined_hu.msgpack.gz    wordfreq/data/combined_nb.msgpack.gz
        wordfreq/data/combined_ro.msgpack.gz    wordfreq/data/large_ar.msgpack.gz
        wordfreq/data/large_it.msgpack.gz       wordfreq/data/large_nl.msgpack.gz
        wordfreq/data/large_ru.msgpack.gz       wordfreq/data/twitter_ca.msgpack.gz
        wordfreq/data/twitter_he.msgpack.gz     wordfreq/data/twitter_hi.msgpack.gz

    Many other existing binary data files were modified (names not shown), and
    one large file's diff is suppressed because of its size.
@@ -23,6 +23,9 @@ def mecab_tokenize(text, lang):
         raise ValueError("Can't run MeCab on language %r" % lang)
     analyzer = MECAB_ANALYZERS[lang]
     text = unicodedata.normalize('NFKC', text.strip())
+    analyzed = analyzer.parse(text)
+    if not analyzed:
+        return []
     return [line.split('\t')[0]
-            for line in analyzer.parse(text).split('\n')
+            for line in analyzed.split('\n')
             if line != '' and line != 'EOS']
@@ -116,11 +116,26 @@ def simple_tokenize(text, include_punctuation=False):
 def turkish_tokenize(text, include_punctuation=False):
     """
     Like `simple_tokenize`, but modifies i's so that they case-fold correctly
-    in Turkish.
+    in Turkish, and modifies 'comma-below' characters to use cedillas.
     """
     text = unicodedata.normalize('NFC', text).replace('İ', 'i').replace('I', 'ı')
     token_expr = TOKEN_RE_WITH_PUNCTUATION if include_punctuation else TOKEN_RE
-    return [token.strip("'").casefold() for token in token_expr.findall(text)]
+    return [
+        commas_to_cedillas(token.strip("'").casefold())
+        for token in token_expr.findall(text)
+    ]
+
+
+def romanian_tokenize(text, include_punctuation=False):
+    """
+    Like `simple_tokenize`, but modifies the letters ş and ţ (with cedillas)
+    to use commas-below instead.
+    """
+    token_expr = TOKEN_RE_WITH_PUNCTUATION if include_punctuation else TOKEN_RE
+    return [
+        cedillas_to_commas(token.strip("'").casefold())
+        for token in token_expr.findall(text)
+    ]
 
 
 def tokenize_mecab_language(text, lang, include_punctuation=False):
@@ -161,6 +176,34 @@ def remove_marks(text):
     return MARK_RE.sub('', text)
 
 
+def commas_to_cedillas(text):
+    """
+    Convert s and t with commas (ș and ț) to cedillas (ş and ţ), which is
+    preferred in Turkish.
+    """
+    return text.replace(
+        '\N{LATIN SMALL LETTER S WITH COMMA BELOW}',
+        '\N{LATIN SMALL LETTER S WITH CEDILLA}'
+    ).replace(
+        '\N{LATIN SMALL LETTER T WITH COMMA BELOW}',
+        '\N{LATIN SMALL LETTER T WITH CEDILLA}'
+    )
+
+
+def cedillas_to_commas(text):
+    """
+    Convert s and t with cedillas (ş and ţ) to commas (ș and ț), which is
+    preferred in Romanian.
+    """
+    return text.replace(
+        '\N{LATIN SMALL LETTER S WITH CEDILLA}',
+        '\N{LATIN SMALL LETTER S WITH COMMA BELOW}'
+    ).replace(
+        '\N{LATIN SMALL LETTER T WITH CEDILLA}',
+        '\N{LATIN SMALL LETTER T WITH COMMA BELOW}'
+    )
+
+
 def tokenize(text, lang, include_punctuation=False, external_wordlist=False):
     """
     Tokenize this text in a way that's relatively simple but appropriate for
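
A small usage sketch for the two helpers just added (assuming they are imported from the tokenizer module being edited here; they are not part of the public wordfreq API, and the example word is only an illustration):

    >>> commas_to_cedillas('știință')   # comma-below letters converted to cedillas, as Turkish prefers
    'ştiinţă'
    >>> cedillas_to_commas('ştiinţă')   # and back to commas below, as Romanian prefers
    'știință'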
@@ -263,6 +306,8 @@ def tokenize(text, lang, include_punctuation=False, external_wordlist=False):
         return chinese_tokenize(text, include_punctuation, external_wordlist)
     elif lang == 'tr':
         return turkish_tokenize(text, include_punctuation)
+    elif lang == 'ro':
+        return romanian_tokenize(text, include_punctuation)
     elif lang in {'ar', 'bal', 'fa', 'ku', 'ps', 'sd', 'tk', 'ug', 'ur', 'he', 'yi'}:
         # Abjad languages
         text = remove_marks(unicodedata.normalize('NFKC', text))
@@ -91,6 +91,9 @@ rule convert_google_syntactic_ngrams
 rule count
   command = python -m wordfreq_builder.cli.count_tokens $in $out
 
+rule count_langtagged
+  command = python -m wordfreq_builder.cli.count_tokens_langtagged $in $out -l $language
+
 rule merge
   command = python -m wordfreq_builder.cli.merge_freqs -o $out -c $cutoff -l $lang $in
 
New file (the count_tokens_langtagged command-line entry point referenced by the build rule above):

@@ -0,0 +1,21 @@
+"""
+Count tokens of text in a particular language, taking input from a
+tab-separated file whose first column is a language code. Lines in all
+languages except the specified one will be skipped.
+"""
+from wordfreq_builder.word_counts import count_tokens_langtagged, write_wordlist
+import argparse
+
+
+def handle_counts(filename_in, filename_out, lang):
+    counts = count_tokens_langtagged(filename_in, lang)
+    write_wordlist(counts, filename_out)
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('filename_in', help='name of input file containing tokens')
+    parser.add_argument('filename_out', help='name of output file')
+    parser.add_argument('-l', '--language', help='language tag to filter lines for')
+    args = parser.parse_args()
+    handle_counts(args.filename_in, args.filename_out, args.language)
@@ -10,15 +10,20 @@ CONFIG = {
         #
         # Consider adding:
         # 'th' when we get tokenization for it
-        # 'hi' when we stop messing up its tokenization
         # 'tl' with one more data source
+        # 'el' if we can filter out kaomoji
         'twitter': [
-            'ar', 'de', 'el', 'en', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
-            'pl', 'pt', 'ru', 'sv', 'tr'
+            'ar', 'ca', 'de', 'en', 'es', 'fr', 'he', 'hi', 'id', 'it',
+            'ja', 'ko', 'ms', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr'
         ],
+        # Languages with large Wikipedias. (Languages whose Wikipedia dump is
+        # at least 200 MB of .xml.bz2 are included. Some widely-spoken
+        # languages with 100 MB are also included, specifically Malay and
+        # Hindi.)
        'wikipedia': [
-            'ar', 'de', 'en', 'el', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
-            'pl', 'pt', 'ru', 'sv', 'tr'
+            'ar', 'ca', 'de', 'el', 'en', 'es', 'fr', 'he', 'hi', 'id', 'it',
+            'ja', 'ko', 'ms', 'nb', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr', 'zh',
+            'bg', 'da', 'fi', 'hu', 'ro', 'uk'
         ],
         'opensubtitles': [
             # This list includes languages where the most common word in
@@ -43,9 +48,20 @@ CONFIG = {
         'jieba': ['zh'],
 
         # About 99.2% of Reddit is in English. There are pockets of
-        # conversation in other languages, but we're concerned that they're not
+        # conversation in other languages, some of which may not be
         # representative enough for learning general word frequencies.
-        'reddit': ['en']
+        #
+        # However, there seem to be Spanish subreddits that are general enough
+        # (including /r/es and /r/mexico).
+        'reddit': ['en', 'es'],
+
+        # Well-represented languages in the Common Crawl
+        # It's possible we could add 'uk' to the list, needs more checking
+        'commoncrawl': [
+            'ar', 'bg', 'cs', 'da', 'de', 'el', 'es', 'fa', 'fi', 'fr',
+            'he', 'hi', 'hu', 'id', 'it', 'ja', 'ko', 'ms', 'nb', 'nl',
+            'pl', 'pt', 'ro', 'ru', 'sk', 'sv', 'ta', 'tr', 'vi', 'zh'
+        ],
     },
     # Subtlex languages that need to be pre-processed
     'wordlist_paths': {
@@ -54,6 +70,7 @@ CONFIG = {
         'opensubtitles': 'generated/opensubtitles/opensubtitles_{lang}.{ext}',
         'leeds': 'generated/leeds/leeds_internet_{lang}.{ext}',
         'google-books': 'generated/google-books/google_books_{lang}.{ext}',
+        'commoncrawl': 'generated/commoncrawl/commoncrawl_{lang}.{ext}',
         'subtlex-en': 'generated/subtlex/subtlex_{lang}.{ext}',
         'subtlex-other': 'generated/subtlex/subtlex_{lang}.{ext}',
         'jieba': 'generated/jieba/jieba_{lang}.{ext}',
@@ -64,8 +81,15 @@ CONFIG = {
         'twitter-dist': 'dist/twitter_{lang}.{ext}',
         'jieba-dist': 'dist/jieba_{lang}.{ext}'
     },
-    'min_sources': 2,
-    'big-lists': ['en', 'fr', 'es', 'pt', 'de']
+    'min_sources': 3,
+    'big-lists': ['en', 'fr', 'es', 'pt', 'de', 'ar', 'it', 'nl', 'ru'],
+
+    # When dealing with language tags that come straight from cld2, we need
+    # to un-standardize a few of them
+    'cld2-language-aliases': {
+        'nb': 'no',
+        'he': 'iw',
+        'jw': 'jv'
+    }
 }
 
@@ -87,6 +87,10 @@ def make_ninja_deps(rules_filename, out=sys.stdout):
             data_filename('source-lists/jieba'),
             CONFIG['sources']['jieba']
         ),
+        commoncrawl_deps(
+            data_filename('raw-input/commoncrawl'),
+            CONFIG['sources']['commoncrawl']
+        ),
         combine_lists(all_languages())
     ))
 
@@ -117,6 +121,19 @@ def wikipedia_deps(dirname_in, languages):
     return lines
 
 
+def commoncrawl_deps(dirname_in, languages):
+    lines = []
+    for language in languages:
+        if language in CONFIG['cld2-language-aliases']:
+            language_alias = CONFIG['cld2-language-aliases'][language]
+        else:
+            language_alias = language
+        input_file = dirname_in + '/{}.txt.gz'.format(language_alias)
+        count_file = wordlist_filename('commoncrawl', language, 'counts.txt')
+        add_dep(lines, 'count_langtagged', input_file, count_file, params={'language': language_alias})
+    return lines
+
+
 def google_books_deps(dirname_in):
     # Get English data from the split-up files of the Google Syntactic N-grams
     # 2013 corpus.
@@ -2,10 +2,12 @@ from wordfreq import simple_tokenize, tokenize
 from collections import defaultdict
 from operator import itemgetter
 from ftfy import fix_text
+import statistics
 import math
 import csv
 import msgpack
 import gzip
+import unicodedata
 import regex
 
@@ -36,6 +38,28 @@ def count_tokens(filename):
     return counts
 
 
+def count_tokens_langtagged(filename, lang):
+    """
+    Count tokens that appear in an already language-tagged file, in which each
+    line begins with a language code followed by a tab.
+    """
+    counts = defaultdict(int)
+    if filename.endswith('gz'):
+        infile = gzip.open(filename, 'rt', encoding='utf-8', errors='replace')
+    else:
+        infile = open(filename, encoding='utf-8', errors='replace')
+    for line in infile:
+        if '\t' not in line:
+            continue
+        line_lang, text = line.split('\t', 1)
+        if line_lang == lang:
+            tokens = tokenize(text.strip(), lang)
+            for token in tokens:
+                counts[token] += 1
+    infile.close()
+    return counts
+
+
 def read_values(filename, cutoff=0, max_words=1e8, lang=None):
     """
     Read words and their frequency or count values from a CSV file. Returns
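
A sketch of the input format this expects and how it would be called (the filename and its contents here are made up for illustration):

    # sample.txt, one line per document, in the form "<language code><tab><text>":
    #     fr<tab>le chat est sur le tapis
    #     en<tab>the cat is on the mat
    #     fr<tab>bonjour tout le monde

    >>> counts = count_tokens_langtagged('sample.txt', 'fr')
    >>> counts['le']    # only the two French lines are counted
    3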
@@ -137,7 +161,7 @@ def merge_counts(count_dicts):
 def merge_freqs(freq_dicts):
     """
     Merge multiple dictionaries of frequencies, representing each word with
-    the word's average frequency over all sources.
+    the median of the word's frequency over all sources.
     """
     vocab = set()
     for freq_dict in freq_dicts:
@@ -146,15 +170,45 @@
     merged = defaultdict(float)
     N = len(freq_dicts)
     for term in vocab:
-        term_total = 0.
+        freqs = []
+        missing_values = 0
         for freq_dict in freq_dicts:
-            term_total += freq_dict.get(term, 0.)
-        merged[term] = term_total / N
+            freq = freq_dict.get(term, 0.)
+            if freq < 1e-8:
+                # Usually we trust the median of the wordlists, but when at
+                # least 2 wordlists say a word exists and the rest say it
+                # doesn't, we kind of want to listen to the two that have
+                # information about the word. The word might be a word that's
+                # inconsistently accounted for, such as an emoji or a word
+                # containing an apostrophe.
+                #
+                # So, once we see at least 2 values that are very low or
+                # missing, we ignore further low values in the median. A word
+                # that appears in 2 sources gets a reasonable frequency, while
+                # a word that appears in 1 source still gets dropped.
+
+                missing_values += 1
+                if missing_values > 2:
+                    continue
+
+            freqs.append(freq)
+
+        if freqs:
+            median = statistics.median(freqs)
+            if median > 0.:
+                merged[term] = median
+
+    total = sum(merged.values())
+
+    # Normalize the merged values so that they add up to 0.99 (based on
+    # a rough estimate that 1% of tokens will be out-of-vocabulary in a
+    # wordlist of this size).
+    for term in merged:
+        merged[term] = merged[term] / total * 0.99
     return merged
 
 
-def write_wordlist(freqs, filename, cutoff=1e-8):
+def write_wordlist(freqs, filename, cutoff=1e-9):
     """
     Write a dictionary of either raw counts or frequencies to a file of
     comma-separated values.
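
A worked sketch of the new merging strategy with toy numbers (assuming `merge_freqs` is imported from this module; the merged values are rescaled at the end so the whole list sums to 0.99):

    >>> sources = [
    ...     {'cat': 4e-5, 'dog': 2e-5},
    ...     {'cat': 6e-5, 'dog': 1e-5},
    ...     {'cat': 5e-5},              # 'dog' is missing from this source
    ... ]
    >>> merged = merge_freqs(sources)
    >>> # 'cat' gets the median of [4e-5, 6e-5, 5e-5], which is 5e-5, and 'dog'
    >>> # the median of [2e-5, 1e-5, 0.0], which is 1e-5; the single missing
    >>> # value counts as 0 rather than disqualifying the word. After the final
    >>> # normalization, 'cat' comes out 5 times as frequent as 'dog'.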
@@ -226,7 +280,6 @@ def correct_apostrophe_trimming(freqs):
     removed.
     """
     if (freqs.get('wouldn', 0) > 1e-6 and freqs.get('couldn', 0) > 1e-6):
-        print("Applying apostrophe trimming")
         for trim_word, trim_prob in APOSTROPHE_TRIMMED_PROB.items():
             if trim_word in freqs:
                 freq = freqs[trim_word]