Mirror of https://github.com/rspeer/wordfreq.git (synced 2024-12-23 17:31:41 +00:00)
Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5. Things done in this update include:
* Include the Common Crawl as a source, and support 11 more languages
* New frequency-merging strategy: use the median across sources rather than the mean
* New sources: Chinese from Wikipedia (mostly Traditional), and a big Dutch list
* Remove low-quality sources, namely Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is dropped as an available language, and Greek is no longer a 'large' language.
* Add Korean tokenization, and include MeCab files in the data
* Remove marks from more languages
* Deal with commas and cedillas in Turkish and Romanian
Former-commit-id: e6a8f028e3
Parent: a0893af82e
Commit: 9758c69ff0
README.md (114 lines changed)
@@ -60,16 +60,16 @@ frequencies by a million (1e6) to get more readable numbers:

     >>> from wordfreq import word_frequency
     >>> word_frequency('cafe', 'en') * 1e6
-    14.45439770745928
+    12.88249551693135

     >>> word_frequency('café', 'en') * 1e6
-    4.7863009232263805
+    3.3884415613920273

     >>> word_frequency('cafe', 'fr') * 1e6
-    2.0417379446695274
+    2.6302679918953817

     >>> word_frequency('café', 'fr') * 1e6
-    77.62471166286912
+    87.09635899560814

 `zipf_frequency` is a variation on `word_frequency` that aims to return the
@@ -85,20 +85,21 @@ described above, the minimum Zipf value appearing in these lists is 1.0 for the
 for words that do not appear in the given wordlist, although it should mean
 one occurrence per billion words.

     >>> from wordfreq import zipf_frequency
     >>> zipf_frequency('the', 'en')
-    7.59
+    7.67

     >>> zipf_frequency('word', 'en')
-    5.34
+    5.39

     >>> zipf_frequency('frequency', 'en')
-    4.44
+    4.19

     >>> zipf_frequency('zipf', 'en')
     0.0

     >>> zipf_frequency('zipf', 'en', wordlist='large')
-    1.42
+    1.65

 The parameters to `word_frequency` and `zipf_frequency` are:
@@ -128,10 +129,10 @@ the list, in descending frequency order.

     >>> from wordfreq import top_n_list
     >>> top_n_list('en', 10)
-    ['the', 'of', 'to', 'in', 'and', 'a', 'i', 'you', 'is', 'it']
+    ['the', 'i', 'to', 'a', 'and', 'of', 'you', 'in', 'that', 'is']

     >>> top_n_list('es', 10)
-    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'no', 'los', 'es']
+    ['de', 'que', 'la', 'y', 'a', 'en', 'el', 'no', 'los', 'es']

 `iter_wordlist(lang, wordlist='combined')` iterates through all the words in a
 wordlist, in descending frequency order.
@@ -168,48 +169,56 @@ The sources (and the abbreviations we'll use for them) are:

 - **Twitter**: Messages sampled from Twitter's public stream
 - **Wpedia**: The full text of Wikipedia in 2015
 - **Reddit**: The corpus of Reddit comments through May 2015
+- **CCrawl**: Text extracted from the Common Crawl and language-detected with cld2
 - **Other**: We get additional English frequencies from Google Books Syntactic
   Ngrams 2013, and Chinese frequencies from the frequency dictionary that
   comes with the Jieba tokenizer.

-The following 17 languages are well-supported, with reasonable tokenization and
-at least 3 different sources of word frequencies:
+The following 27 languages are supported, with reasonable tokenization and at
+least 3 different sources of word frequencies:

-    Language   Code  SUBTLEX OpenSub LeedsIC Twitter Wpedia Reddit Other
-    ──────────────────┼─────────────────────────────────────────────────────
-    Arabic     ar    │ -       Yes     Yes     Yes     Yes    -      -
-    German     de    │ Yes     -       Yes     Yes[1]  Yes    -      -
-    Greek      el    │ -       Yes     Yes     Yes     Yes    -      -
-    English    en    │ Yes     Yes     Yes     Yes     Yes    Yes    Google Books
-    Spanish    es    │ -       Yes     Yes     Yes     Yes    -      -
-    French     fr    │ -       Yes     Yes     Yes     Yes    -      -
-    Indonesian id    │ -       Yes     -       Yes     Yes    -      -
-    Italian    it    │ -       Yes     Yes     Yes     Yes    -      -
-    Japanese   ja    │ -       -       Yes     Yes     Yes    -      -
-    Malay      ms    │ -       Yes     -       Yes     Yes    -      -
-    Dutch      nl    │ Yes     Yes     -       Yes     Yes    -      -
-    Polish     pl    │ -       Yes     -       Yes     Yes    -      -
-    Portuguese pt    │ -       Yes     Yes     Yes     Yes    -      -
-    Russian    ru    │ -       Yes     Yes     Yes     Yes    -      -
-    Swedish    sv    │ -       Yes     -       Yes     Yes    -      -
-    Turkish    tr    │ -       Yes     -       Yes     Yes    -      -
-    Chinese    zh    │ Yes     -       Yes     -       -      -      Jieba
+    Language    Code   Sources Large? │ SUBTLEX OpenSub LeedsIC Twitter Wpedia CCrawl Reddit Other
+    ───────────────────────────────────┼─────────────────────────────────────────────────────────────
+    Arabic      ar     5       Yes    │ -       Yes     Yes     Yes     Yes    Yes    -      -
+    Bulgarian   bg     3       -      │ -       Yes     -       -       Yes    Yes    -      -
+    Catalan     ca     3       -      │ -       Yes     -       Yes     Yes    -      -      -
+    Danish      da     3       -      │ -       Yes     -       -       Yes    Yes    -      -
+    German      de     5       Yes    │ Yes     -       Yes     Yes     Yes    Yes    -      -
+    Greek       el     4       -      │ -       Yes     Yes     -       Yes    Yes    -      -
+    English     en     7       Yes    │ Yes     Yes     Yes     Yes     Yes    -      Yes    Google Books
+    Spanish     es     6       Yes    │ -       Yes     Yes     Yes     Yes    Yes    Yes    -
+    Finnish     fi     3       -      │ -       Yes     -       -       Yes    Yes    -      -
+    French      fr     5       Yes    │ -       Yes     Yes     Yes     Yes    Yes    -      -
+    Hebrew      he     4       -      │ -       Yes     -       Yes     Yes    Yes    -      -
+    Hindi       hi     3       -      │ -       -       -       Yes     Yes    Yes    -      -
+    Hungarian   hu     3       -      │ -       Yes     -       -       Yes    Yes    -      -
+    Indonesian  id     4       -      │ -       Yes     -       Yes     Yes    Yes    -      -
+    Italian     it     5       Yes    │ -       Yes     Yes     Yes     Yes    Yes    -      -
+    Japanese    ja     4       -      │ -       -       Yes     Yes     Yes    Yes    -      -
+    Korean      ko     3       -      │ -       -       -       Yes     Yes    Yes    -      -
+    Malay       ms     4       -      │ -       Yes     -       Yes     Yes    Yes    -      -
+    Norwegian   nb[1]  3       -      │ -       Yes     -       -       Yes    Yes    -      -
+    Dutch       nl     5       Yes    │ Yes     Yes     -       Yes     Yes    Yes    -      -
+    Polish      pl     4       -      │ -       Yes     -       Yes     Yes    Yes    -      -
+    Portuguese  pt     5       Yes    │ -       Yes     Yes     Yes     Yes    Yes    -      -
+    Romanian    ro     3       -      │ -       Yes     -       -       Yes    Yes    -      -
+    Russian     ru     5       Yes    │ -       Yes     Yes     Yes     Yes    Yes    -      -
+    Swedish     sv     4       -      │ -       Yes     -       Yes     Yes    Yes    -      -
+    Turkish     tr     4       -      │ -       Yes     -       Yes     Yes    Yes    -      -
+    Chinese     zh[2]  5       -      │ Yes     -       Yes     -       Yes    Yes    -      Jieba

-Additionally, Korean is marginally supported. You can look up frequencies in
-it, but it will be insufficiently tokenized into words, and we have too few
-data sources for it so far:
-
-    Language   Code  SUBTLEX OpenSub LeedsIC Twitter Wpedia Reddit
-    ──────────────────┼───────────────────────────────────────────────
-    Korean     ko    │ -       -       -       Yes     Yes    -
-
-The 'large' wordlists are available in English, German, Spanish, French, and
-Portuguese.
-
-[1] We've counted the frequencies from tweets in German, such as they are, but
-you should be aware that German is not a frequently-used language on Twitter.
-Germans just don't tweet that much.
+[1] The Norwegian text we have is specifically written in Norwegian Bokmål, so
+we give it the language code 'nb'. We would use 'nn' for Nynorsk, but there
+isn't enough data to include it in wordfreq.
+
+[2] This data represents text written in both Simplified and Traditional
+Chinese. (SUBTLEX is mostly Simplified, while Wikipedia is mostly Traditional.)
+The characters are mapped to one another so they can use the same word
+frequency list.
+
+Some languages provide 'large' wordlists, including words with a Zipf frequency
+between 1.0 and 3.0. These are available in 9 languages that are covered by
+enough data sources.


 ## Tokenization
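As an aside, the set of supported languages can also be inspected at runtime. A small sketch, not from the repository, assuming a wordlist name can be passed the same way the tests later in this diff pass 'twitter':

    from wordfreq import available_languages

    # Languages that have a 'combined' wordlist (the 27 in the table above):
    print(sorted(available_languages()))

    # Languages that have a 'large' wordlist (the ones marked "Yes" under Large?):
    print(sorted(available_languages('large')))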
@@ -223,10 +232,13 @@ splits words between apostrophes and vowels.

 There are language-specific exceptions:

-- In Arabic, it additionally normalizes ligatures and removes combining marks.
-- In Japanese, instead of using the regex library, it uses the external library
-  `mecab-python3`. This is an optional dependency of wordfreq, and compiling
-  it requires the `libmecab-dev` system package to be installed.
+- In Arabic and Hebrew, it additionally normalizes ligatures and removes
+  combining marks.
+
+- In Japanese and Korean, instead of using the regex library, it uses the
+  external library `mecab-python3`. This is an optional dependency of wordfreq,
+  and compiling it requires the `libmecab-dev` system package to be installed.

 - In Chinese, it uses the external Python library `jieba`, another optional
   dependency.
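A quick illustration of the optional-dependency paths described above (this snippet is not from the repository, and its output depends on the installed MeCab dictionary and jieba version):

    from wordfreq import tokenize

    # Uses mecab-python3, an optional dependency, for Japanese:
    tokenize('おはようございます', 'ja')

    # Uses jieba, another optional dependency, for Chinese:
    tokenize('北京地铁', 'zh')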
@@ -240,9 +252,9 @@ also try to deal gracefully when you query it with texts that actually break
 into multiple tokens:

     >>> zipf_frequency('New York', 'en')
-    5.31
+    5.07
     >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.51
+    3.58

 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be. In Chinese,
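For reference, the half-harmonic-mean combination mentioned above can be sketched as follows (an illustration, not the repository's exact implementation); for two tokens it is half of their harmonic mean, so a phrase is always estimated as rarer than its rarest token:

    def half_harmonic_mean(f1, f2):
        # Equivalent to 1 / (1/f1 + 1/f2).
        return (f1 * f2) / (f1 + f2)

    half_harmonic_mean(1e-4, 1e-6)   # ~9.9e-7, slightly below the rarer frequency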
@@ -257,7 +269,7 @@ you give it an uncommon combination of tokens, it will hugely over-estimate
 their frequency:

     >>> zipf_frequency('owl-flavored', 'en')
-    3.18
+    3.19


 ## License
setup.py (2 lines changed)
@@ -34,7 +34,7 @@ if sys.version_info < (3, 4):

 setup(
     name="wordfreq",
-    version='1.4.2',
+    version='1.5',
     maintainer='Luminoso Technologies, Inc.',
     maintainer_email='info@luminoso.com',
     url='http://github.com/LuminosoInsight/wordfreq/',
@@ -19,23 +19,43 @@ def test_freq_examples():

 def test_languages():
     # Make sure the number of available languages doesn't decrease
     avail = available_languages()
-    assert_greater(len(avail), 15)
+    assert_greater(len(avail), 26)
+
+    avail_twitter = available_languages('twitter')
+    assert_greater(len(avail_twitter), 15)

     # Look up a word representing laughter in each language, and make sure
-    # it has a non-zero frequency.
-    for lang in avail:
-        if lang in {'zh', 'ja'}:
+    # it has a non-zero frequency in the informal 'twitter' list.
+    for lang in avail_twitter:
+        if lang == 'zh' or lang == 'ja':
             text = '笑'
+        elif lang == 'ko':
+            text = 'ᄏᄏᄏ'
+        elif lang == 'ar':
+            text = 'ههههه'
+        elif lang == 'ca' or lang == 'es':
+            text = 'jaja'
+        elif lang in {'de', 'nb', 'sv', 'da'}:
+            text = 'haha'
+        elif lang == 'pt':
+            text = 'kkkk'
+        elif lang == 'he':
+            text = 'חחח'
+        elif lang == 'ru':
+            text = 'лол'
+        elif lang == 'bg':
+            text = 'хаха'
+        elif lang == 'ro':
+            text = 'haha'
+        elif lang == 'el':
+            text = 'χαχα'
         else:
             text = 'lol'
-        assert_greater(word_frequency(text, lang), 0)
+        assert_greater(word_frequency(text, lang, wordlist='twitter'), 0, (text, lang))

         # Make up a weirdly verbose language code and make sure
         # we still get it
         new_lang_code = '%s-001-x-fake-extension' % lang.upper()
-        assert_greater(word_frequency(text, new_lang_code), 0, (text, new_lang_code))
+        assert_greater(word_frequency(text, new_lang_code, wordlist='twitter'), 0, (text, new_lang_code))


 def test_twitter():

@@ -62,7 +82,7 @@ def test_most_common_words():
         """
         return top_n_list(lang, 1)[0]

-    eq_(get_most_common('ar'), 'في')
+    eq_(get_most_common('ar'), 'من')
     eq_(get_most_common('de'), 'die')
     eq_(get_most_common('en'), 'the')
     eq_(get_most_common('es'), 'de')

@@ -144,12 +164,12 @@ def test_not_really_random():
     # This not only tests random_ascii_words, it makes sure we didn't end
     # up with 'eos' as a very common Japanese word
     eq_(random_ascii_words(nwords=4, lang='ja', bits_per_word=0),
-        'rt rt rt rt')
+        '1 1 1 1')


 @raises(ValueError)
 def test_not_enough_ascii():
-    random_ascii_words(lang='zh')
+    random_ascii_words(lang='zh', bits_per_word=14)


 def test_arabic():

@@ -199,3 +219,10 @@ def test_other_languages():
     # Remove vowel points in Hebrew
     eq_(tokenize('דֻּגְמָה', 'he'), ['דגמה'])
+
+    # Deal with commas, cedillas, and I's in Turkish
+    eq_(tokenize('kișinin', 'tr'), ['kişinin'])
+    eq_(tokenize('KİȘİNİN', 'tr'), ['kişinin'])
+
+    # Deal with cedillas that should be commas-below in Romanian
+    eq_(tokenize('acelaşi', 'ro'), ['același'])
+    eq_(tokenize('ACELAŞI', 'ro'), ['același'])
@@ -282,15 +282,15 @@ def zipf_frequency(word, lang, wordlist='combined', minimum=0.):
     """
     Get the frequency of `word`, in the language with code `lang`, on the Zipf
     scale.

     The Zipf scale is a logarithmic frequency scale proposed by Marc Brysbaert,
     who compiled the SUBTLEX data. The goal of the Zipf scale is to map
     reasonable word frequencies to understandable, small positive numbers.

     A word rates as x on the Zipf scale when it occurs 10**x times per billion
     words. For example, a word that occurs once per million words is at 3.0 on
     the Zipf scale.

     Zipf values for reasonable words are between 0 and 8. The value this
     function returns will always be at least as large as `minimum`, even for a
     word that never appears. The default minimum is 0, representing words
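To make the scale concrete: a Zipf value is just a shifted base-10 logarithm of the proportional frequency that `word_frequency` returns. A minimal sketch (the helper name here is made up for illustration, not part of wordfreq's API):

    import math

    def zipf_from_frequency(freq):
        # A word at Zipf value x occurs 10**x times per billion words,
        # so x = log10(freq per billion) = log10(freq) + 9.
        return math.log10(freq) + 9.0

    # A word that occurs once per million words (frequency 1e-6) is at 3.0:
    assert round(zipf_from_frequency(1e-6), 2) == 3.0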
Binary data files changed (contents not shown). New wordlist files added in this commit:

    wordfreq/data/combined_bg.msgpack.gz, combined_ca, combined_da, combined_fi,
    combined_he, combined_hi, combined_hu, combined_nb, combined_ro
    wordfreq/data/large_ar.msgpack.gz, large_it, large_nl, large_ru
    wordfreq/data/twitter_ca.msgpack.gz, twitter_he, twitter_hi

Many existing combined_*, large_*, and twitter_* data files were also modified
(binary diffs not shown), and one large text-file diff was suppressed because of
its size.
@@ -23,6 +23,9 @@ def mecab_tokenize(text, lang):
         raise ValueError("Can't run MeCab on language %r" % lang)
     analyzer = MECAB_ANALYZERS[lang]
     text = unicodedata.normalize('NFKC', text.strip())
+    analyzed = analyzer.parse(text)
+    if not analyzed:
+        return []
     return [line.split('\t')[0]
-            for line in analyzer.parse(text).split('\n')
+            for line in analyzed.split('\n')
             if line != '' and line != 'EOS']
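For context, MeCab's default output is one token per line in the form `surface<TAB>features`, terminated by an `EOS` line; the comprehension above keeps only the surface forms. An illustrative sketch with a made-up sample string:

    sample = '今日\t名詞,...\nは\t助詞,...\nEOS\n'
    tokens = [line.split('\t')[0]
              for line in sample.split('\n')
              if line != '' and line != 'EOS']
    # tokens == ['今日', 'は']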
@@ -116,11 +116,26 @@ def simple_tokenize(text, include_punctuation=False):
 def turkish_tokenize(text, include_punctuation=False):
     """
     Like `simple_tokenize`, but modifies i's so that they case-fold correctly
-    in Turkish.
+    in Turkish, and modifies 'comma-below' characters to use cedillas.
     """
     text = unicodedata.normalize('NFC', text).replace('İ', 'i').replace('I', 'ı')
     token_expr = TOKEN_RE_WITH_PUNCTUATION if include_punctuation else TOKEN_RE
-    return [token.strip("'").casefold() for token in token_expr.findall(text)]
+    return [
+        commas_to_cedillas(token.strip("'").casefold())
+        for token in token_expr.findall(text)
+    ]
+
+
+def romanian_tokenize(text, include_punctuation=False):
+    """
+    Like `simple_tokenize`, but modifies the letters ş and ţ (with cedillas)
+    to use commas-below instead.
+    """
+    token_expr = TOKEN_RE_WITH_PUNCTUATION if include_punctuation else TOKEN_RE
+    return [
+        cedillas_to_commas(token.strip("'").casefold())
+        for token in token_expr.findall(text)
+    ]


 def tokenize_mecab_language(text, lang, include_punctuation=False):

@@ -161,6 +176,34 @@ def remove_marks(text):
     return MARK_RE.sub('', text)


+def commas_to_cedillas(text):
+    """
+    Convert s and t with commas (ș and ț) to cedillas (ş and ţ), which is
+    preferred in Turkish.
+    """
+    return text.replace(
+        '\N{LATIN SMALL LETTER S WITH COMMA BELOW}',
+        '\N{LATIN SMALL LETTER S WITH CEDILLA}'
+    ).replace(
+        '\N{LATIN SMALL LETTER T WITH COMMA BELOW}',
+        '\N{LATIN SMALL LETTER T WITH CEDILLA}'
+    )
+
+
+def cedillas_to_commas(text):
+    """
+    Convert s and t with cedillas (ş and ţ) to commas (ș and ț), which is
+    preferred in Romanian.
+    """
+    return text.replace(
+        '\N{LATIN SMALL LETTER S WITH CEDILLA}',
+        '\N{LATIN SMALL LETTER S WITH COMMA BELOW}'
+    ).replace(
+        '\N{LATIN SMALL LETTER T WITH CEDILLA}',
+        '\N{LATIN SMALL LETTER T WITH COMMA BELOW}'
+    )
+
+
 def tokenize(text, lang, include_punctuation=False, external_wordlist=False):
     """
     Tokenize this text in a way that's relatively simple but appropriate for

@@ -263,6 +306,8 @@ def tokenize(text, lang, include_punctuation=False, external_wordlist=False):
         return chinese_tokenize(text, include_punctuation, external_wordlist)
     elif lang == 'tr':
         return turkish_tokenize(text, include_punctuation)
+    elif lang == 'ro':
+        return romanian_tokenize(text, include_punctuation)
     elif lang in {'ar', 'bal', 'fa', 'ku', 'ps', 'sd', 'tk', 'ug', 'ur', 'he', 'yi'}:
         # Abjad languages
         text = remove_marks(unicodedata.normalize('NFKC', text))
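The effect of the new Turkish and Romanian code paths, as exercised by the tests earlier in this diff:

    from wordfreq import tokenize

    tokenize('KİȘİNİN', 'tr')   # ['kişinin']: İ case-folds to i, comma-below becomes cedilla
    tokenize('ACELAŞI', 'ro')   # ['același']: cedilla becomes comma-below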
@@ -91,6 +91,9 @@ rule convert_google_syntactic_ngrams
 rule count
   command = python -m wordfreq_builder.cli.count_tokens $in $out

+rule count_langtagged
+  command = python -m wordfreq_builder.cli.count_tokens_langtagged $in $out -l $language
+
 rule merge
   command = python -m wordfreq_builder.cli.merge_freqs -o $out -c $cutoff -l $lang $in
@@ -0,0 +1,21 @@
+"""
+Count tokens of text in a particular language, taking input from a
+tab-separated file whose first column is a language code. Lines in all
+languages except the specified one will be skipped.
+"""
+from wordfreq_builder.word_counts import count_tokens_langtagged, write_wordlist
+import argparse
+
+
+def handle_counts(filename_in, filename_out, lang):
+    counts = count_tokens_langtagged(filename_in, lang)
+    write_wordlist(counts, filename_out)
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('filename_in', help='name of input file containing tokens')
+    parser.add_argument('filename_out', help='name of output file')
+    parser.add_argument('-l', '--language', help='language tag to filter lines for')
+    args = parser.parse_args()
+    handle_counts(args.filename_in, args.filename_out, args.language)
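For illustration, the new script can also be driven from Python (the file names here are hypothetical; the input format is the tab-separated one described in the docstring above):

    from wordfreq_builder.cli.count_tokens_langtagged import handle_counts

    # Each input line looks like:  de<TAB>some German text
    # Only lines tagged 'de' are tokenized and counted; the counts are then
    # written out as comma-separated values by write_wordlist().
    handle_counts('raw-input/commoncrawl/de.txt.gz', 'commoncrawl_de.counts.txt', 'de')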
@@ -10,15 +10,20 @@ CONFIG = {
         #
         # Consider adding:
         # 'th' when we get tokenization for it
-        # 'hi' when we stop messing up its tokenization
         # 'tl' with one more data source
+        # 'el' if we can filter out kaomoji
         'twitter': [
-            'ar', 'de', 'el', 'en', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
-            'pl', 'pt', 'ru', 'sv', 'tr'
+            'ar', 'ca', 'de', 'en', 'es', 'fr', 'he', 'hi', 'id', 'it',
+            'ja', 'ko', 'ms', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr'
         ],
         # Languages with large Wikipedias. (Languages whose Wikipedia dump is
         # at least 200 MB of .xml.bz2 are included. Some widely-spoken
         # languages with 100 MB are also included, specifically Malay and
         # Hindi.)
         'wikipedia': [
-            'ar', 'de', 'en', 'el', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
-            'pl', 'pt', 'ru', 'sv', 'tr'
+            'ar', 'ca', 'de', 'el', 'en', 'es', 'fr', 'he', 'hi', 'id', 'it',
+            'ja', 'ko', 'ms', 'nb', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr', 'zh',
+            'bg', 'da', 'fi', 'hu', 'ro', 'uk'
         ],
         'opensubtitles': [
             # This list includes languages where the most common word in

@@ -43,9 +48,20 @@
         'jieba': ['zh'],

         # About 99.2% of Reddit is in English. There are pockets of
-        # conversation in other languages, but we're concerned that they're not
+        # conversation in other languages, some of which may not be
         # representative enough for learning general word frequencies.
-        'reddit': ['en']
+        #
+        # However, there seem to be Spanish subreddits that are general enough
+        # (including /r/es and /r/mexico).
+        'reddit': ['en', 'es'],
+
+        # Well-represented languages in the Common Crawl
+        # It's possible we could add 'uk' to the list, needs more checking
+        'commoncrawl': [
+            'ar', 'bg', 'cs', 'da', 'de', 'el', 'es', 'fa', 'fi', 'fr',
+            'he', 'hi', 'hu', 'id', 'it', 'ja', 'ko', 'ms', 'nb', 'nl',
+            'pl', 'pt', 'ro', 'ru', 'sk', 'sv', 'ta', 'tr', 'vi', 'zh'
+        ],
     },
     # Subtlex languages that need to be pre-processed
     'wordlist_paths': {

@@ -54,6 +70,7 @@
         'opensubtitles': 'generated/opensubtitles/opensubtitles_{lang}.{ext}',
         'leeds': 'generated/leeds/leeds_internet_{lang}.{ext}',
         'google-books': 'generated/google-books/google_books_{lang}.{ext}',
+        'commoncrawl': 'generated/commoncrawl/commoncrawl_{lang}.{ext}',
         'subtlex-en': 'generated/subtlex/subtlex_{lang}.{ext}',
         'subtlex-other': 'generated/subtlex/subtlex_{lang}.{ext}',
         'jieba': 'generated/jieba/jieba_{lang}.{ext}',

@@ -64,8 +81,15 @@
         'twitter-dist': 'dist/twitter_{lang}.{ext}',
         'jieba-dist': 'dist/jieba_{lang}.{ext}'
     },
-    'min_sources': 2,
-    'big-lists': ['en', 'fr', 'es', 'pt', 'de']
+    'min_sources': 3,
+    'big-lists': ['en', 'fr', 'es', 'pt', 'de', 'ar', 'it', 'nl', 'ru'],
+
+    # When dealing with language tags that come straight from cld2, we need
+    # to un-standardize a few of them
+    'cld2-language-aliases': {
+        'nb': 'no',
+        'he': 'iw',
+        'jw': 'jv'
+    }
 }
@@ -87,6 +87,10 @@ def make_ninja_deps(rules_filename, out=sys.stdout):
             data_filename('source-lists/jieba'),
             CONFIG['sources']['jieba']
         ),
+        commoncrawl_deps(
+            data_filename('raw-input/commoncrawl'),
+            CONFIG['sources']['commoncrawl']
+        ),
         combine_lists(all_languages())
     ))

@@ -117,6 +121,19 @@ def wikipedia_deps(dirname_in, languages):
     return lines


+def commoncrawl_deps(dirname_in, languages):
+    lines = []
+    for language in languages:
+        if language in CONFIG['cld2-language-aliases']:
+            language_alias = CONFIG['cld2-language-aliases'][language]
+        else:
+            language_alias = language
+        input_file = dirname_in + '/{}.txt.gz'.format(language_alias)
+        count_file = wordlist_filename('commoncrawl', language, 'counts.txt')
+        add_dep(lines, 'count_langtagged', input_file, count_file, params={'language': language_alias})
+    return lines
+
+
 def google_books_deps(dirname_in):
     # Get English data from the split-up files of the Google Syntactic N-grams
     # 2013 corpus.
@@ -2,10 +2,12 @@ from wordfreq import simple_tokenize, tokenize
 from collections import defaultdict
 from operator import itemgetter
 from ftfy import fix_text
+import statistics
 import math
 import csv
 import msgpack
 import gzip
 import unicodedata
+import regex


@@ -36,6 +38,28 @@ def count_tokens(filename):
     return counts


+def count_tokens_langtagged(filename, lang):
+    """
+    Count tokens that appear in an already language-tagged file, in which each
+    line begins with a language code followed by a tab.
+    """
+    counts = defaultdict(int)
+    if filename.endswith('gz'):
+        infile = gzip.open(filename, 'rt', encoding='utf-8', errors='replace')
+    else:
+        infile = open(filename, encoding='utf-8', errors='replace')
+    for line in infile:
+        if '\t' not in line:
+            continue
+        line_lang, text = line.split('\t', 1)
+        if line_lang == lang:
+            tokens = tokenize(text.strip(), lang)
+            for token in tokens:
+                counts[token] += 1
+    infile.close()
+    return counts
+
+
 def read_values(filename, cutoff=0, max_words=1e8, lang=None):
     """
     Read words and their frequency or count values from a CSV file. Returns
@@ -137,7 +161,7 @@ def merge_counts(count_dicts):
 def merge_freqs(freq_dicts):
     """
     Merge multiple dictionaries of frequencies, representing each word with
-    the word's average frequency over all sources.
+    the median of the word's frequency over all sources.
     """
     vocab = set()
     for freq_dict in freq_dicts:

@@ -146,15 +170,45 @@
     merged = defaultdict(float)
     N = len(freq_dicts)
     for term in vocab:
-        term_total = 0.
+        freqs = []
+        missing_values = 0
         for freq_dict in freq_dicts:
-            term_total += freq_dict.get(term, 0.)
-        merged[term] = term_total / N
+            freq = freq_dict.get(term, 0.)
+            if freq < 1e-8:
+                # Usually we trust the median of the wordlists, but when at
+                # least 2 wordlists say a word exists and the rest say it
+                # doesn't, we kind of want to listen to the two that have
+                # information about the word. The word might be a word that's
+                # inconsistently accounted for, such as an emoji or a word
+                # containing an apostrophe.
+                #
+                # So, once we see at least 2 values that are very low or
+                # missing, we ignore further low values in the median. A word
+                # that appears in 2 sources gets a reasonable frequency, while
+                # a word that appears in 1 source still gets dropped.
+                missing_values += 1
+                if missing_values > 2:
+                    continue
+
+            freqs.append(freq)
+
+        if freqs:
+            median = statistics.median(freqs)
+            if median > 0.:
+                merged[term] = median
+
+    total = sum(merged.values())
+
+    # Normalize the merged values so that they add up to 0.99 (based on
+    # a rough estimate that 1% of tokens will be out-of-vocabulary in a
+    # wordlist of this size).
+    for term in merged:
+        merged[term] = merged[term] / total * 0.99
+
     return merged


-def write_wordlist(freqs, filename, cutoff=1e-8):
+def write_wordlist(freqs, filename, cutoff=1e-9):
     """
     Write a dictionary of either raw counts or frequencies to a file of
     comma-separated values.
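To see why the cap of two missing values matters, here is a worked example with made-up numbers. A word that appears in only two of five sources would get a median of 0 (and be dropped) if every missing source counted, but keeps a reasonable frequency once the extra zeros are ignored:

    from statistics import median

    freqs_seen = [2e-5, 3e-5, 0.0, 0.0, 0.0]   # present in 2 of 5 sources

    median(freqs_seen)                         # 0.0   -> the word would be dropped

    capped = [2e-5, 3e-5, 0.0, 0.0]            # only the first 2 zeros are kept
    median(capped)                             # 1e-05 -> a reasonable merged frequency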
@@ -226,7 +280,6 @@ def correct_apostrophe_trimming(freqs):
     removed.
     """
     if (freqs.get('wouldn', 0) > 1e-6 and freqs.get('couldn', 0) > 1e-6):
-        print("Applying apostrophe trimming")
         for trim_word, trim_prob in APOSTROPHE_TRIMMED_PROB.items():
             if trim_word in freqs:
                 freq = freqs[trim_word]