Add Common Crawl data and more languages (#39)

This changes the version from 1.4.2 to 1.5. Things done in this update include:

* Include Common Crawl data; support 11 more languages

* New frequency-merging strategy: take the median across sources instead of the mean

* New sources: Chinese from Wikipedia (mostly Traditional), a large Dutch word list

* Remove lower-quality sources, namely Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is dropped as an available language, and Greek no longer qualifies as a 'large' language.

* Add Korean tokenization, and include MeCab files in the data

* Remove combining marks from more languages

* Deal with commas and cedillas in Turkish and Romanian (see the sketch after this list)
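
As a rough illustration of the user-visible effect (a hedged sketch drawn from the new tests in this commit, not an official example):

    from wordfreq import word_frequency, tokenize

    # Korean now has MeCab tokenization and enough data sources to be listed;
    # the new tests assert a nonzero frequency for this word in the Twitter list.
    assert word_frequency('ᄏᄏᄏ', 'ko', wordlist='twitter') > 0

    # Turkish tokens come out with the cedilla forms it prefers, and Romanian
    # tokens with the comma-below forms it prefers, regardless of the input.
    assert tokenize('KİȘİNİN', 'tr') == ['kişinin']
    assert tokenize('ACELAŞI', 'ro') == ['același']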



Former-commit-id: e6a8f028e3
Robyn Speer, 2016-07-28 19:23:17 -04:00 (committed by Lance Nathan)
parent 0a2bfb2710
commit 2a41d4dc5e
68 changed files with 24828 additions and 36204 deletions

README.md

@@ -60,16 +60,16 @@ frequencies by a million (1e6) to get more readable numbers:

     >>> from wordfreq import word_frequency
     >>> word_frequency('cafe', 'en') * 1e6
-    14.45439770745928
+    12.88249551693135
     >>> word_frequency('café', 'en') * 1e6
-    4.7863009232263805
+    3.3884415613920273
     >>> word_frequency('cafe', 'fr') * 1e6
-    2.0417379446695274
+    2.6302679918953817
     >>> word_frequency('café', 'fr') * 1e6
-    77.62471166286912
+    87.09635899560814

 `zipf_frequency` is a variation on `word_frequency` that aims to return the
@@ -85,20 +85,21 @@ described above, the minimum Zipf value appearing in these lists is 1.0 for the
 for words that do not appear in the given wordlist, although it should mean
 one occurrence per billion words.

+    >>> from wordfreq import zipf_frequency
     >>> zipf_frequency('the', 'en')
-    7.59
+    7.67
     >>> zipf_frequency('word', 'en')
-    5.34
+    5.39
     >>> zipf_frequency('frequency', 'en')
-    4.44
+    4.19
     >>> zipf_frequency('zipf', 'en')
     0.0
     >>> zipf_frequency('zipf', 'en', wordlist='large')
-    1.42
+    1.65

 The parameters to `word_frequency` and `zipf_frequency` are:
@@ -128,10 +129,10 @@ the list, in descending frequency order.

     >>> from wordfreq import top_n_list
     >>> top_n_list('en', 10)
-    ['the', 'of', 'to', 'in', 'and', 'a', 'i', 'you', 'is', 'it']
+    ['the', 'i', 'to', 'a', 'and', 'of', 'you', 'in', 'that', 'is']
     >>> top_n_list('es', 10)
-    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'no', 'los', 'es']
+    ['de', 'que', 'la', 'y', 'a', 'en', 'el', 'no', 'los', 'es']

 `iter_wordlist(lang, wordlist='combined')` iterates through all the words in a
 wordlist, in descending frequency order.
@@ -168,48 +169,56 @@ The sources (and the abbreviations we'll use for them) are:

 - **Twitter**: Messages sampled from Twitter's public stream
 - **Wpedia**: The full text of Wikipedia in 2015
 - **Reddit**: The corpus of Reddit comments through May 2015
+- **CCrawl**: Text extracted from the Common Crawl and language-detected with cld2
 - **Other**: We get additional English frequencies from Google Books Syntactic
   Ngrams 2013, and Chinese frequencies from the frequency dictionary that
   comes with the Jieba tokenizer.

-The following 17 languages are well-supported, with reasonable tokenization and
-at least 3 different sources of word frequencies:
-
-    Language   Code  SUBTLEX OpenSub LeedsIC Twitter Wpedia Reddit Other
-    ──────────────────┼─────────────────────────────────────────────────────
-    Arabic     ar     │ -      Yes     Yes     Yes     Yes    -      -
-    German     de     │ Yes    -       Yes     Yes[1]  Yes    -      -
-    Greek      el     │ -      Yes     Yes     Yes     Yes    -      -
-    English    en     │ Yes    Yes     Yes     Yes     Yes    Yes    Google Books
-    Spanish    es     │ -      Yes     Yes     Yes     Yes    -      -
-    French     fr     │ -      Yes     Yes     Yes     Yes    -      -
-    Indonesian id     │ -      Yes     -       Yes     Yes    -      -
-    Italian    it     │ -      Yes     Yes     Yes     Yes    -      -
-    Japanese   ja     │ -      -       Yes     Yes     Yes    -      -
-    Malay      ms     │ -      Yes     -       Yes     Yes    -      -
-    Dutch      nl     │ Yes    Yes     -       Yes     Yes    -      -
-    Polish     pl     │ -      Yes     -       Yes     Yes    -      -
-    Portuguese pt     │ -      Yes     Yes     Yes     Yes    -      -
-    Russian    ru     │ -      Yes     Yes     Yes     Yes    -      -
-    Swedish    sv     │ -      Yes     -       Yes     Yes    -      -
-    Turkish    tr     │ -      Yes     -       Yes     Yes    -      -
-    Chinese    zh     │ Yes    -       Yes     -       -      -      Jieba
-
-Additionally, Korean is marginally supported. You can look up frequencies in
-it, but it will be insufficiently tokenized into words, and we have too few
-data sources for it so far:
-
-    Language   Code  SUBTLEX OpenSub LeedsIC Twitter Wpedia Reddit
-    ──────────────────┼───────────────────────────────────────────────
-    Korean     ko     │ -      -       -       Yes     Yes    -
-
-The 'large' wordlists are available in English, German, Spanish, French, and
-Portuguese.
-
-[1] We've counted the frequencies from tweets in German, such as they are, but
-you should be aware that German is not a frequently-used language on Twitter.
-Germans just don't tweet that much.
+The following 27 languages are supported, with reasonable tokenization and at
+least 3 different sources of word frequencies:
+
+    Language   Code   Sources Large?  SUBTLEX OpenSub LeedsIC Twitter Wpedia CCrawl Reddit Other
+    ───────────────────────────────────┼──────────────────────────────────────────────────────────────
+    Arabic     ar      5      Yes     │ -      Yes     Yes     Yes     Yes    Yes    -      -
+    Bulgarian  bg      3      -       │ -      Yes     -       -       Yes    Yes    -      -
+    Catalan    ca      3      -       │ -      Yes     -       Yes     Yes    -      -      -
+    Danish     da      3      -       │ -      Yes     -       -       Yes    Yes    -      -
+    German     de      5      Yes     │ Yes    -       Yes     Yes     Yes    Yes    -      -
+    Greek      el      4      -       │ -      Yes     Yes     -       Yes    Yes    -      -
+    English    en      7      Yes     │ Yes    Yes     Yes     Yes     Yes    -      Yes    Google Books
+    Spanish    es      6      Yes     │ -      Yes     Yes     Yes     Yes    Yes    Yes    -
+    Finnish    fi      3      -       │ -      Yes     -       -       Yes    Yes    -      -
+    French     fr      5      Yes     │ -      Yes     Yes     Yes     Yes    Yes    -      -
+    Hebrew     he      4      -       │ -      Yes     -       Yes     Yes    Yes    -      -
+    Hindi      hi      3      -       │ -      -       -       Yes     Yes    Yes    -      -
+    Hungarian  hu      3      -       │ -      Yes     -       -       Yes    Yes    -      -
+    Indonesian id      4      -       │ -      Yes     -       Yes     Yes    Yes    -      -
+    Italian    it      5      Yes     │ -      Yes     Yes     Yes     Yes    Yes    -      -
+    Japanese   ja      4      -       │ -      -       Yes     Yes     Yes    Yes    -      -
+    Korean     ko      3      -       │ -      -       -       Yes     Yes    Yes    -      -
+    Malay      ms      4      -       │ -      Yes     -       Yes     Yes    Yes    -      -
+    Norwegian  nb[1]   3      -       │ -      Yes     -       -       Yes    Yes    -      -
+    Dutch      nl      5      Yes     │ Yes    Yes     -       Yes     Yes    Yes    -      -
+    Polish     pl      4      -       │ -      Yes     -       Yes     Yes    Yes    -      -
+    Portuguese pt      5      Yes     │ -      Yes     Yes     Yes     Yes    Yes    -      -
+    Romanian   ro      3      -       │ -      Yes     -       -       Yes    Yes    -      -
+    Russian    ru      5      Yes     │ -      Yes     Yes     Yes     Yes    Yes    -      -
+    Swedish    sv      4      -       │ -      Yes     -       Yes     Yes    Yes    -      -
+    Turkish    tr      4      -       │ -      Yes     -       Yes     Yes    Yes    -      -
+    Chinese    zh[2]   5      -       │ Yes    -       Yes     -       Yes    Yes    -      Jieba
+
+[1] The Norwegian text we have is specifically written in Norwegian Bokmål, so
+we give it the language code 'nb'. We would use 'nn' for Nynorsk, but there
+isn't enough data to include it in wordfreq.
+
+[2] This data represents text written in both Simplified and Traditional
+Chinese. (SUBTLEX is mostly Simplified, while Wikipedia is mostly Traditional.)
+The characters are mapped to one another so they can use the same word
+frequency list.
+
+Some languages provide 'large' wordlists, including words with a Zipf frequency
+between 1.0 and 3.0. These are available in 9 languages that are covered by
+enough data sources.

 ## Tokenization
@@ -223,10 +232,13 @@ splits words between apostrophes and vowels.

 There are language-specific exceptions:

-- In Arabic, it additionally normalizes ligatures and removes combining marks.
+- In Arabic and Hebrew, it additionally normalizes ligatures and removes
+  combining marks.

-- In Japanese, instead of using the regex library, it uses the external library
-  `mecab-python3`. This is an optional dependency of wordfreq, and compiling
-  it requires the `libmecab-dev` system package to be installed.
+- In Japanese and Korean, instead of using the regex library, it uses the
+  external library `mecab-python3`. This is an optional dependency of wordfreq,
+  and compiling it requires the `libmecab-dev` system package to be installed.

 - In Chinese, it uses the external Python library `jieba`, another optional
   dependency.
@@ -240,9 +252,9 @@ also try to deal gracefully when you query it with texts that actually break
 into multiple tokens:

     >>> zipf_frequency('New York', 'en')
-    5.31
+    5.07
     >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.51
+    3.58

 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be. In Chinese,
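
As a hedged aside on that combination step: for two tokens with frequencies f1 and f2, half of their harmonic mean is 1 / (1/f1 + 1/f2), which always comes out a little below the rarer token's frequency. The sketch below is an interpretation of the README's description, not code taken from wordfreq itself.

    # Half the harmonic mean of two token frequencies: an estimate for the
    # phrase that is always somewhat lower than its rarer token's frequency.
    def half_harmonic_mean(f1, f2):
        return 1.0 / (1.0 / f1 + 1.0 / f2)

    print(half_harmonic_mean(1e-3, 1e-5))  # ~9.9e-06, just under 1e-5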
@@ -257,7 +269,7 @@ you give it an uncommon combination of tokens, it will hugely over-estimate
 their frequency:

     >>> zipf_frequency('owl-flavored', 'en')
-    3.18
+    3.19

 ## License


@@ -34,7 +34,7 @@ if sys.version_info < (3, 4):

 setup(
     name="wordfreq",
-    version='1.4.2',
+    version='1.5',
     maintainer='Luminoso Technologies, Inc.',
     maintainer_email='info@luminoso.com',
     url='http://github.com/LuminosoInsight/wordfreq/',


@@ -19,23 +19,43 @@ def test_freq_examples():

 def test_languages():
     # Make sure the number of available languages doesn't decrease
     avail = available_languages()
-    assert_greater(len(avail), 15)
+    assert_greater(len(avail), 26)
+
+    avail_twitter = available_languages('twitter')
+    assert_greater(len(avail_twitter), 15)

     # Look up a word representing laughter in each language, and make sure
-    # it has a non-zero frequency.
-    for lang in avail:
-        if lang in {'zh', 'ja'}:
+    # it has a non-zero frequency in the informal 'twitter' list.
+    for lang in avail_twitter:
+        if lang == 'zh' or lang == 'ja':
             text = '笑'
+        elif lang == 'ko':
+            text = 'ᄏᄏᄏ'
         elif lang == 'ar':
             text = 'ههههه'
+        elif lang == 'ca' or lang == 'es':
+            text = 'jaja'
+        elif lang in {'de', 'nb', 'sv', 'da'}:
+            text = 'haha'
+        elif lang == 'pt':
+            text = 'kkkk'
+        elif lang == 'he':
+            text = 'חחח'
+        elif lang == 'ru':
+            text = 'лол'
+        elif lang == 'bg':
+            text = 'хаха'
+        elif lang == 'ro':
+            text = 'haha'
+        elif lang == 'el':
+            text = 'χαχα'
         else:
             text = 'lol'
-        assert_greater(word_frequency(text, lang), 0)
+        assert_greater(word_frequency(text, lang, wordlist='twitter'), 0, (text, lang))

         # Make up a weirdly verbose language code and make sure
         # we still get it
         new_lang_code = '%s-001-x-fake-extension' % lang.upper()
-        assert_greater(word_frequency(text, new_lang_code), 0, (text, new_lang_code))
+        assert_greater(word_frequency(text, new_lang_code, wordlist='twitter'), 0, (text, new_lang_code))


 def test_twitter():
@@ -62,7 +82,7 @@ def test_most_common_words():
         """
         return top_n_list(lang, 1)[0]

-    eq_(get_most_common('ar'), 'في')
+    eq_(get_most_common('ar'), 'من')
     eq_(get_most_common('de'), 'die')
     eq_(get_most_common('en'), 'the')
     eq_(get_most_common('es'), 'de')
@@ -144,12 +164,12 @@ def test_not_really_random():
     # This not only tests random_ascii_words, it makes sure we didn't end
     # up with 'eos' as a very common Japanese word
     eq_(random_ascii_words(nwords=4, lang='ja', bits_per_word=0),
-        'rt rt rt rt')
+        '1 1 1 1')


 @raises(ValueError)
 def test_not_enough_ascii():
-    random_ascii_words(lang='zh')
+    random_ascii_words(lang='zh', bits_per_word=14)


 def test_arabic():
@@ -199,3 +219,10 @@ def test_other_languages():

     # Remove vowel points in Hebrew
     eq_(tokenize('דֻּגְמָה', 'he'), ['דגמה'])
+
+    # Deal with commas, cedillas, and I's in Turkish
+    eq_(tokenize('kișinin', 'tr'), ['kişinin'])
+    eq_(tokenize('KİȘİNİN', 'tr'), ['kişinin'])
+
+    # Deal with cedillas that should be commas-below in Romanian
+    eq_(tokenize('acelaşi', 'ro'), ['același'])
+    eq_(tokenize('ACELAŞI', 'ro'), ['același'])


@@ -282,15 +282,15 @@ def zipf_frequency(word, lang, wordlist='combined', minimum=0.):
     """
     Get the frequency of `word`, in the language with code `lang`, on the Zipf
     scale.

     The Zipf scale is a logarithmic frequency scale proposed by Marc Brysbaert,
     who compiled the SUBTLEX data. The goal of the Zipf scale is to map
     reasonable word frequencies to understandable, small positive numbers.

     A word rates as x on the Zipf scale when it occurs 10**x times per billion
     words. For example, a word that occurs once per million words is at 3.0 on
     the Zipf scale.

     Zipf values for reasonable words are between 0 and 8. The value this
     function returns will always be at least as large as `minimum`, even for a
     word that never appears. The default minimum is 0, representing words
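
A quick worked example of the scale that docstring defines (illustration only, not code from this commit):

    import math

    # Once per million words is 1000 occurrences per billion words, so the
    # Zipf value is log10(1000) = 3.0, matching the docstring above.
    frequency = 1e-6
    zipf_value = math.log10(frequency * 1e9)
    assert abs(zipf_value - 3.0) < 1e-9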

(Binary wordlist data files updated; diffs not shown. One large text file's diff was also suppressed.)


@@ -23,6 +23,9 @@ def mecab_tokenize(text, lang):
         raise ValueError("Can't run MeCab on language %r" % lang)
     analyzer = MECAB_ANALYZERS[lang]
     text = unicodedata.normalize('NFKC', text.strip())
+    analyzed = analyzer.parse(text)
+    if not analyzed:
+        return []
     return [line.split('\t')[0]
-            for line in analyzer.parse(text).split('\n')
+            for line in analyzed.split('\n')
             if line != '' and line != 'EOS']
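
A hedged usage sketch of the guard added above (the `wordfreq.mecab` module path is an assumption, and whether MeCab returns an empty result for empty input depends on the MeCab build):

    from wordfreq.mecab import mecab_tokenize  # module path assumed

    print(mecab_tokenize('東京に住んでいます', 'ja'))  # a list of MeCab surface forms
    # If MeCab parses the input to nothing, the new guard returns [] instead of
    # failing on a falsy parse result.
    print(mecab_tokenize('', 'ja'))  # []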


@@ -116,11 +116,26 @@ def simple_tokenize(text, include_punctuation=False):

 def turkish_tokenize(text, include_punctuation=False):
     """
     Like `simple_tokenize`, but modifies i's so that they case-fold correctly
-    in Turkish.
+    in Turkish, and modifies 'comma-below' characters to use cedillas.
     """
     text = unicodedata.normalize('NFC', text).replace('İ', 'i').replace('I', 'ı')
     token_expr = TOKEN_RE_WITH_PUNCTUATION if include_punctuation else TOKEN_RE
-    return [token.strip("'").casefold() for token in token_expr.findall(text)]
+    return [
+        commas_to_cedillas(token.strip("'").casefold())
+        for token in token_expr.findall(text)
+    ]
+
+
+def romanian_tokenize(text, include_punctuation=False):
+    """
+    Like `simple_tokenize`, but modifies the letters ş and ţ (with cedillas)
+    to use commas-below instead.
+    """
+    token_expr = TOKEN_RE_WITH_PUNCTUATION if include_punctuation else TOKEN_RE
+    return [
+        cedillas_to_commas(token.strip("'").casefold())
+        for token in token_expr.findall(text)
+    ]


 def tokenize_mecab_language(text, lang, include_punctuation=False):
@@ -161,6 +176,34 @@ def remove_marks(text):
     return MARK_RE.sub('', text)


+def commas_to_cedillas(text):
+    """
+    Convert s and t with commas (ș and ț) to cedillas (ş and ţ), which is
+    preferred in Turkish.
+    """
+    return text.replace(
+        '\N{LATIN SMALL LETTER S WITH COMMA BELOW}',
+        '\N{LATIN SMALL LETTER S WITH CEDILLA}'
+    ).replace(
+        '\N{LATIN SMALL LETTER T WITH COMMA BELOW}',
+        '\N{LATIN SMALL LETTER T WITH CEDILLA}'
+    )
+
+
+def cedillas_to_commas(text):
+    """
+    Convert s and t with cedillas (ş and ţ) to commas (ș and ț), which is
+    preferred in Romanian.
+    """
+    return text.replace(
+        '\N{LATIN SMALL LETTER S WITH CEDILLA}',
+        '\N{LATIN SMALL LETTER S WITH COMMA BELOW}'
+    ).replace(
+        '\N{LATIN SMALL LETTER T WITH CEDILLA}',
+        '\N{LATIN SMALL LETTER T WITH COMMA BELOW}'
+    )
+
+
 def tokenize(text, lang, include_punctuation=False, external_wordlist=False):
     """
     Tokenize this text in a way that's relatively simple but appropriate for
@@ -263,6 +306,8 @@ def tokenize(text, lang, include_punctuation=False, external_wordlist=False):
         return chinese_tokenize(text, include_punctuation, external_wordlist)
     elif lang == 'tr':
         return turkish_tokenize(text, include_punctuation)
+    elif lang == 'ro':
+        return romanian_tokenize(text, include_punctuation)
     elif lang in {'ar', 'bal', 'fa', 'ku', 'ps', 'sd', 'tk', 'ug', 'ur', 'he', 'yi'}:
         # Abjad languages
         text = remove_marks(unicodedata.normalize('NFKC', text))


@@ -91,6 +91,9 @@ rule convert_google_syntactic_ngrams
 rule count
   command = python -m wordfreq_builder.cli.count_tokens $in $out

+rule count_langtagged
+  command = python -m wordfreq_builder.cli.count_tokens_langtagged $in $out -l $language
+
 rule merge
   command = python -m wordfreq_builder.cli.merge_freqs -o $out -c $cutoff -l $lang $in


@@ -0,0 +1,21 @@
+"""
+Count tokens of text in a particular language, taking input from a
+tab-separated file whose first column is a language code. Lines in all
+languages except the specified one will be skipped.
+"""
+from wordfreq_builder.word_counts import count_tokens_langtagged, write_wordlist
+import argparse
+
+
+def handle_counts(filename_in, filename_out, lang):
+    counts = count_tokens_langtagged(filename_in, lang)
+    write_wordlist(counts, filename_out)
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('filename_in', help='name of input file containing tokens')
+    parser.add_argument('filename_out', help='name of output file')
+    parser.add_argument('-l', '--language', help='language tag to filter lines for')
+    args = parser.parse_args()
+    handle_counts(args.filename_in, args.filename_out, args.language)
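
A hedged usage sketch of the new counter, calling the builder functions directly instead of going through ninja (the file names and contents below are invented for illustration):

    from wordfreq_builder.word_counts import count_tokens_langtagged, write_wordlist

    # Each line of the input is "<language code>\t<text>".
    with open('tagged_sample.txt', 'w', encoding='utf-8') as f:
        f.write('en\tthe cat sat on the mat\n')
        f.write('fr\tle chat est sur le tapis\n')  # skipped: not the requested language

    counts = count_tokens_langtagged('tagged_sample.txt', 'en')
    print(counts['the'])  # 2
    write_wordlist(counts, 'counts_en.csv')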


@@ -10,15 +10,20 @@ CONFIG = {
        #
        # Consider adding:
        # 'th' when we get tokenization for it
-       # 'hi' when we stop messing up its tokenization
        # 'tl' with one more data source
+       # 'el' if we can filter out kaomoji
        'twitter': [
-           'ar', 'de', 'el', 'en', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
-           'pl', 'pt', 'ru', 'sv', 'tr'
+           'ar', 'ca', 'de', 'en', 'es', 'fr', 'he', 'hi', 'id', 'it',
+           'ja', 'ko', 'ms', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr'
        ],
+       # Languages with large Wikipedias. (Languages whose Wikipedia dump is
+       # at least 200 MB of .xml.bz2 are included. Some widely-spoken
+       # languages with 100 MB are also included, specifically Malay and
+       # Hindi.)
        'wikipedia': [
-           'ar', 'de', 'en', 'el', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
-           'pl', 'pt', 'ru', 'sv', 'tr'
+           'ar', 'ca', 'de', 'el', 'en', 'es', 'fr', 'he', 'hi', 'id', 'it',
+           'ja', 'ko', 'ms', 'nb', 'nl', 'pl', 'pt', 'ru', 'sv', 'tr', 'zh',
+           'bg', 'da', 'fi', 'hu', 'ro', 'uk'
        ],
        'opensubtitles': [
            # This list includes languages where the most common word in
@@ -43,9 +48,20 @@ CONFIG = {
        'jieba': ['zh'],

        # About 99.2% of Reddit is in English. There are pockets of
-       # conversation in other languages, but we're concerned that they're not
+       # conversation in other languages, some of which may not be
        # representative enough for learning general word frequencies.
-       'reddit': ['en']
+       #
+       # However, there seem to be Spanish subreddits that are general enough
+       # (including /r/es and /r/mexico).
+       'reddit': ['en', 'es'],
+
+       # Well-represented languages in the Common Crawl
+       # It's possible we could add 'uk' to the list, needs more checking
+       'commoncrawl': [
+           'ar', 'bg', 'cs', 'da', 'de', 'el', 'es', 'fa', 'fi', 'fr',
+           'he', 'hi', 'hu', 'id', 'it', 'ja', 'ko', 'ms', 'nb', 'nl',
+           'pl', 'pt', 'ro', 'ru', 'sk', 'sv', 'ta', 'tr', 'vi', 'zh'
+       ],
    },
    # Subtlex languages that need to be pre-processed
    'wordlist_paths': {
@@ -54,6 +70,7 @@ CONFIG = {
        'opensubtitles': 'generated/opensubtitles/opensubtitles_{lang}.{ext}',
        'leeds': 'generated/leeds/leeds_internet_{lang}.{ext}',
        'google-books': 'generated/google-books/google_books_{lang}.{ext}',
+       'commoncrawl': 'generated/commoncrawl/commoncrawl_{lang}.{ext}',
        'subtlex-en': 'generated/subtlex/subtlex_{lang}.{ext}',
        'subtlex-other': 'generated/subtlex/subtlex_{lang}.{ext}',
        'jieba': 'generated/jieba/jieba_{lang}.{ext}',
@@ -64,8 +81,15 @@ CONFIG = {
        'twitter-dist': 'dist/twitter_{lang}.{ext}',
        'jieba-dist': 'dist/jieba_{lang}.{ext}'
    },
-   'min_sources': 2,
-   'big-lists': ['en', 'fr', 'es', 'pt', 'de']
+   'min_sources': 3,
+   'big-lists': ['en', 'fr', 'es', 'pt', 'de', 'ar', 'it', 'nl', 'ru'],
+
+   # When dealing with language tags that come straight from cld2, we need
+   # to un-standardize a few of them
+   'cld2-language-aliases': {
+       'nb': 'no',
+       'he': 'iw',
+       'jw': 'jv'
+   }
 }


@@ -87,6 +87,10 @@ def make_ninja_deps(rules_filename, out=sys.stdout):
             data_filename('source-lists/jieba'),
             CONFIG['sources']['jieba']
         ),
+        commoncrawl_deps(
+            data_filename('raw-input/commoncrawl'),
+            CONFIG['sources']['commoncrawl']
+        ),
         combine_lists(all_languages())
     ))
@@ -117,6 +121,19 @@ def wikipedia_deps(dirname_in, languages):
     return lines


+def commoncrawl_deps(dirname_in, languages):
+    lines = []
+    for language in languages:
+        if language in CONFIG['cld2-language-aliases']:
+            language_alias = CONFIG['cld2-language-aliases'][language]
+        else:
+            language_alias = language
+        input_file = dirname_in + '/{}.txt.gz'.format(language_alias)
+        count_file = wordlist_filename('commoncrawl', language, 'counts.txt')
+        add_dep(lines, 'count_langtagged', input_file, count_file, params={'language': language_alias})
+    return lines
+
+
 def google_books_deps(dirname_in):
     # Get English data from the split-up files of the Google Syntactic N-grams
     # 2013 corpus.


@@ -2,10 +2,12 @@ from wordfreq import simple_tokenize, tokenize
 from collections import defaultdict
 from operator import itemgetter
 from ftfy import fix_text
+import statistics
 import math
 import csv
 import msgpack
 import gzip
+import unicodedata
 import regex
@@ -36,6 +38,28 @@ def count_tokens(filename):
     return counts


+def count_tokens_langtagged(filename, lang):
+    """
+    Count tokens that appear in an already language-tagged file, in which each
+    line begins with a language code followed by a tab.
+    """
+    counts = defaultdict(int)
+    if filename.endswith('gz'):
+        infile = gzip.open(filename, 'rt', encoding='utf-8', errors='replace')
+    else:
+        infile = open(filename, encoding='utf-8', errors='replace')
+    for line in infile:
+        if '\t' not in line:
+            continue
+        line_lang, text = line.split('\t', 1)
+        if line_lang == lang:
+            tokens = tokenize(text.strip(), lang)
+            for token in tokens:
+                counts[token] += 1
+    infile.close()
+    return counts
+
+
 def read_values(filename, cutoff=0, max_words=1e8, lang=None):
     """
     Read words and their frequency or count values from a CSV file. Returns
@@ -137,7 +161,7 @@ def merge_counts(count_dicts):
 def merge_freqs(freq_dicts):
     """
     Merge multiple dictionaries of frequencies, representing each word with
-    the word's average frequency over all sources.
+    the median of the word's frequency over all sources.
     """
     vocab = set()
     for freq_dict in freq_dicts:
@@ -146,15 +170,45 @@ def merge_freqs(freq_dicts):
     merged = defaultdict(float)
     N = len(freq_dicts)
     for term in vocab:
-        term_total = 0.
+        freqs = []
+        missing_values = 0
         for freq_dict in freq_dicts:
-            term_total += freq_dict.get(term, 0.)
-        merged[term] = term_total / N
+            freq = freq_dict.get(term, 0.)
+            if freq < 1e-8:
+                # Usually we trust the median of the wordlists, but when at
+                # least 2 wordlists say a word exists and the rest say it
+                # doesn't, we kind of want to listen to the two that have
+                # information about the word. The word might be a word that's
+                # inconsistently accounted for, such as an emoji or a word
+                # containing an apostrophe.
+                #
+                # So, once we see at least 2 values that are very low or
+                # missing, we ignore further low values in the median. A word
+                # that appears in 2 sources gets a reasonable frequency, while
+                # a word that appears in 1 source still gets dropped.
+                missing_values += 1
+                if missing_values > 2:
+                    continue
+            freqs.append(freq)
+
+        if freqs:
+            median = statistics.median(freqs)
+            if median > 0.:
+                merged[term] = median
+
+    total = sum(merged.values())
+
+    # Normalize the merged values so that they add up to 0.99 (based on
+    # a rough estimate that 1% of tokens will be out-of-vocabulary in a
+    # wordlist of this size).
+    for term in merged:
+        merged[term] = merged[term] / total * 0.99
+
     return merged


-def write_wordlist(freqs, filename, cutoff=1e-8):
+def write_wordlist(freqs, filename, cutoff=1e-9):
     """
     Write a dictionary of either raw counts or frequencies to a file of
     comma-separated values.
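
To make the new merging strategy concrete, here is a hedged, self-contained walkthrough of how `merge_freqs` behaves after this change (the input frequencies are invented for illustration):

    from wordfreq_builder.word_counts import merge_freqs

    # Invented frequencies from three hypothetical sources.
    sources = [
        {'cat': 1e-4, 'owl': 2e-6},
        {'cat': 2e-4},
        {'cat': 3e-4, 'owl': 4e-6},
    ]
    merged = merge_freqs(sources)

    # 'cat' appears in every source, so it gets the median of its three values
    # (2e-4) before normalization. 'owl' is missing from one source, which
    # contributes a zero, but the median of [0, 2e-6, 4e-6] is still 2e-6, so
    # the word survives. A word found in only one of the three sources would
    # have a median of 0 and be dropped. Finally, all merged values are
    # rescaled to sum to 0.99.
    assert 'owl' in merged and merged['cat'] > merged['owl']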
@@ -226,7 +280,6 @@ def correct_apostrophe_trimming(freqs):
     removed.
     """
     if (freqs.get('wouldn', 0) > 1e-6 and freqs.get('couldn', 0) > 1e-6):
-        print("Applying apostrophe trimming")
         for trim_word, trim_prob in APOSTROPHE_TRIMMED_PROB.items():
             if trim_word in freqs:
                 freq = freqs[trim_word]