Merge pull request #26 from LuminosoInsight/greek-and-turkish

Add SUBTLEX, support Turkish, expand Greek

Former-commit-id: acbb25e6f6
Andrew Lin 2015-09-10 13:48:33 -04:00
commit 66f1afe4d7
46 changed files with 324 additions and 69 deletions

2
.gitignore vendored

@ -7,3 +7,5 @@ pip-log.txt
.coverage .coverage
*~ *~
wordfreq-data.tar.gz wordfreq-data.tar.gz
.idea
build.dot

101
README.md

@ -26,7 +26,7 @@ install them on Ubuntu:
## Usage ## Usage
wordfreq provides access to estimates of the frequency with which a word is wordfreq provides access to estimates of the frequency with which a word is
used, in 15 languages (see *Supported languages* below). It loads used, in 16 languages (see *Supported languages* below). It loads
efficiently-packed data structures that contain all words that appear at least efficiently-packed data structures that contain all words that appear at least
once per million words. once per million words.
@ -118,34 +118,38 @@ of word usage on different topics at different levels of formality. The sources
- **GBooks**: Google Books Ngrams 2013 - **GBooks**: Google Books Ngrams 2013
- **LeedsIC**: The Leeds Internet Corpus - **LeedsIC**: The Leeds Internet Corpus
- **OpenSub**: OpenSubtitles - **OpenSub**: OpenSubtitles
- **SUBTLEX**: The SUBTLEX word frequency lists
- **Twitter**: Messages sampled from Twitter's public stream - **Twitter**: Messages sampled from Twitter's public stream
- **Wikipedia**: The full text of Wikipedia in 2015 - **Wikipedia**: The full text of Wikipedia in 2015
The following 12 languages are well-supported, using at least 3 different sources The following 14 languages are well-supported, with reasonable tokenization and
of word frequencies: at least 3 different sources of word frequencies:
Language Code GBooks LeedsIC OpenSub Twitter Wikipedia Language Code GBooks SUBTLEX LeedsIC OpenSub Twitter Wikipedia
──────────────────┼────────────────────────────────────────── ──────────────────┼──────────────────────────────────────────────────
Arabic ar │ - Yes Yes Yes Yes Arabic ar │ - - Yes Yes Yes Yes
German de │ - Yes Yes Yes[1] Yes German de │ - Yes Yes - Yes[1] Yes
English en │ Yes Yes Yes Yes Yes Greek el │ - - Yes Yes Yes Yes
Spanish es │ - Yes Yes Yes Yes English en │ Yes Yes Yes Yes Yes Yes
French fr │ - Yes Yes Yes Yes Spanish es │ - - Yes Yes Yes Yes
Indonesian id │ - - Yes Yes Yes French fr │ - - Yes Yes Yes Yes
Italian it │ - Yes Yes Yes Yes Indonesian id │ - - - Yes Yes Yes
Japanese ja │ - Yes - Yes Yes Italian it │ - - Yes Yes Yes Yes
Malay ms │ - - Yes Yes Yes Japanese ja │ - - Yes - Yes Yes
Dutch nl │ - - Yes Yes Yes Malay ms │ - - - Yes Yes Yes
Portuguese pt │ - Yes Yes Yes Yes Dutch nl │ - Yes - Yes Yes Yes
Russian ru │ - Yes Yes Yes Yes Portuguese pt │ - - Yes Yes Yes Yes
Russian ru │ - - Yes Yes Yes Yes
Turkish tr │ - - - Yes Yes Yes
These 3 languages are only marginally supported so far: These languages are only marginally supported so far. We have too few data
sources so far in Korean (feel free to suggest some), and we are lacking
tokenization support for Chinese.
Language Code GBooks LeedsIC OpenSub Twitter Wikipedia Language Code GBooks SUBTLEX LeedsIC OpenSub Twitter Wikipedia
──────────────────┼────────────────────────────────────────── ──────────────────┼──────────────────────────────────────────────────
Greek el │ - Yes Yes - - Korean ko │ - - - - Yes Yes
Korean ko │ - - - Yes Yes Chinese zh │ - Yes Yes Yes - -
Chinese zh │ - Yes Yes - -
[1] We've counted the frequencies from tweets in German, such as they are, but [1] We've counted the frequencies from tweets in German, such as they are, but
you should be aware that German is not a frequently-used language on Twitter. you should be aware that German is not a frequently-used language on Twitter.
@ -219,7 +223,58 @@ sources:
- Wikipedia, the free encyclopedia (http://www.wikipedia.org) - Wikipedia, the free encyclopedia (http://www.wikipedia.org)
It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK, and
SUBTLEX-CH, created by Marc Brysbaert et al. and available at
http://crr.ugent.be/programs-data/subtitle-frequencies.
I (Robyn Speer) have
obtained permission by e-mail from Marc Brysbaert to distribute these wordlists
in wordfreq, to be used for any purpose, not just for academic use, under these
conditions:
- Wordfreq and code derived from it must credit the SUBTLEX authors.
- It must remain clear that SUBTLEX is freely available data.
These terms are similar to the Creative Commons Attribution-ShareAlike license.
Some additional data was collected by a custom application that watches the Some additional data was collected by a custom application that watches the
streaming Twitter API, in accordance with Twitter's Developer Agreement & streaming Twitter API, in accordance with Twitter's Developer Agreement &
Policy. This software gives statistics about words that are commonly used on Policy. This software gives statistics about words that are commonly used on
Twitter; it does not display or republish any Twitter content. Twitter; it does not display or republish any Twitter content.
## Citations to work that wordfreq is built on
- Brysbaert, M. & New, B. (2009). Moving beyond Kucera and Francis: A Critical
Evaluation of Current Word Frequency Norms and the Introduction of a New and
Improved Word Frequency Measure for American English. Behavior Research
Methods, 41 (4), 977-990.
http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf
- Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A. M., Bölte, J., & Böhl, A.
(2015). The word frequency effect. Experimental Psychology.
http://econtent.hogrefe.com/doi/abs/10.1027/1618-3169/a000123?journalCode=zea
- Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character
frequencies based on film subtitles. PLoS One, 5(6), e10729.
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729
- Dave, H. (2011). Frequency word lists.
https://invokeit.wordpress.com/frequency-word-lists/
- Davis, M. (2012). Unicode text segmentation. Unicode Standard Annex, 29.
http://unicode.org/reports/tr29/
- Keuleers, E., Brysbaert, M. & New, B. (2010). SUBTLEX-NL: A new frequency
measure for Dutch words based on film subtitles. Behavior Research Methods,
42(3), 643-650.
http://crr.ugent.be/papers/SUBTLEX-NL_BRM.pdf
- Kudo, T. (2005). Mecab: Yet another part-of-speech and morphological
analyzer.
http://mecab.sourceforge.net/
- van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014).
SUBTLEX-UK: A new and improved word frequency database for British English.
The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.
http://www.tandfonline.com/doi/pdf/10.1080/17470218.2013.850521


@ -1,30 +1,39 @@
""" This file generates a graph of the dependencies for the ninja build.""" """ This file generates a graph of the dependencies for the ninja build."""
import sys import sys
import re
def ninja_to_dot(): def ninja_to_dot():
def last_component(path): def simplified_filename(path):
return path.split('/')[-1] component = path.split('/')[-1]
return re.sub(
r'[0-9]+-of', 'NN-of',
re.sub(r'part[0-9]+', 'partNN', component)
)
print("digraph G {") print("digraph G {")
print('rankdir="LR";') print('rankdir="LR";')
seen_edges = set()
for line in sys.stdin: for line in sys.stdin:
line = line.rstrip() line = line.rstrip()
if line.startswith('build'): if line.startswith('build'):
# the output file is the first argument; strip off the colon that # the output file is the first argument; strip off the colon that
# comes from ninja syntax # comes from ninja syntax
output_text, input_text = line.split(':') output_text, input_text = line.split(':')
outfiles = [last_component(part) for part in output_text.split(' ')[1:]] outfiles = [simplified_filename(part) for part in output_text.split(' ')[1:]]
inputs = input_text.strip().split(' ') inputs = input_text.strip().split(' ')
infiles = [last_component(part) for part in inputs[1:]] infiles = [simplified_filename(part) for part in inputs[1:]]
operation = inputs[0] operation = inputs[0]
for infile in infiles: for infile in infiles:
if infile == '|': if infile == '|':
# external dependencies start here; let's not graph those # external dependencies start here; let's not graph those
break break
for outfile in outfiles: for outfile in outfiles:
print('"%s" -> "%s" [label="%s"]' % (infile, outfile, operation)) edge = '"%s" -> "%s" [label="%s"]' % (infile, outfile, operation)
if edge not in seen_edges:
seen_edges.add(edge)
print(edge)
print("}") print("}")
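A minimal standalone sketch of the two changes in this hunk. `simplified_filename` is copied from the new side of the diff; the `unique_edges` helper and the sample paths are hypothetical and exist only to show why collapsing numbered parts makes a `seen_edges` set necessary.

```python
import re


def simplified_filename(path):
    # Keep only the last path component, then collapse numbered pieces so
    # that, for example, every "partNN" slice maps to the same graph node.
    component = path.split('/')[-1]
    return re.sub(
        r'[0-9]+-of', 'NN-of',
        re.sub(r'part[0-9]+', 'partNN', component)
    )


def unique_edges(pairs, operation):
    # Hypothetical helper: format edges the way the script does and drop
    # duplicates while preserving first-seen order.
    seen_edges = set()
    edges = []
    for infile, outfile in pairs:
        edge = '"%s" -> "%s" [label="%s"]' % (
            simplified_filename(infile),
            simplified_filename(outfile),
            operation,
        )
        if edge not in seen_edges:
            seen_edges.add(edge)
            edges.append(edge)
    return edges


# Made-up input paths: two numbered slices feeding the same output file.
pairs = [
    ('data/slices/tweets.part01', 'generated/twitter/tweets.en.txt'),
    ('data/slices/tweets.part02', 'generated/twitter/tweets.en.txt'),
]
edges = unique_edges(pairs, 'tokenize_twitter')
print('\n'.join(edges))
```

Both sample inputs simplify to the same node name, so without the set the graph would contain the same edge twice.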


@ -19,7 +19,7 @@ def test_freq_examples():
def test_languages(): def test_languages():
# Make sure the number of available languages doesn't decrease # Make sure the number of available languages doesn't decrease
avail = available_languages() avail = available_languages()
assert_greater(len(avail), 14) assert_greater(len(avail), 15)
# Laughter is the universal language # Laughter is the universal language
for lang in avail: for lang in avail:
@ -36,7 +36,7 @@ def test_languages():
def test_twitter(): def test_twitter():
avail = available_languages('twitter') avail = available_languages('twitter')
assert_greater(len(avail), 12) assert_greater(len(avail), 14)
for lang in avail: for lang in avail:
assert_greater(word_frequency('rt', lang, 'twitter'), assert_greater(word_frequency('rt', lang, 'twitter'),
@ -68,6 +68,7 @@ def test_most_common_words():
eq_(get_most_common('nl'), 'de') eq_(get_most_common('nl'), 'de')
eq_(get_most_common('pt'), 'de') eq_(get_most_common('pt'), 'de')
eq_(get_most_common('ru'), 'в') eq_(get_most_common('ru'), 'в')
eq_(get_most_common('tr'), 'bir')
eq_(get_most_common('zh'), '') eq_(get_most_common('zh'), '')
@ -111,6 +112,8 @@ def test_tokenization():
def test_casefolding(): def test_casefolding():
eq_(tokenize('WEISS', 'de'), ['weiss']) eq_(tokenize('WEISS', 'de'), ['weiss'])
eq_(tokenize('weiß', 'de'), ['weiss']) eq_(tokenize('weiß', 'de'), ['weiss'])
eq_(tokenize('İstanbul', 'tr'), ['istanbul'])
eq_(tokenize('SIKISINCA', 'tr'), ['sıkısınca'])
def test_phrase_freq(): def test_phrase_freq():

Binary file not shown.



@ -65,6 +65,15 @@ def simple_tokenize(text):
return [token.strip("'").casefold() for token in TOKEN_RE.findall(text)] return [token.strip("'").casefold() for token in TOKEN_RE.findall(text)]
def turkish_tokenize(text):
"""
Like `simple_tokenize`, but modifies i's so that they case-fold correctly
in Turkish.
"""
text = unicodedata.normalize('NFC', text).replace('İ', 'i').replace('I', 'ı')
return [token.strip("'").casefold() for token in TOKEN_RE.findall(text)]
def remove_arabic_marks(text): def remove_arabic_marks(text):
""" """
Remove decorations from Arabic words: Remove decorations from Arabic words:
@ -90,6 +99,8 @@ def tokenize(text, lang):
- Chinese or Japanese texts that aren't identified as the appropriate - Chinese or Japanese texts that aren't identified as the appropriate
language will only split on punctuation and script boundaries, giving language will only split on punctuation and script boundaries, giving
you untokenized globs of characters that probably represent many words. you untokenized globs of characters that probably represent many words.
- Turkish will use a different case-folding procedure, so that capital
I and İ map to ı and i respectively.
- All other languages will be tokenized using a regex that mostly - All other languages will be tokenized using a regex that mostly
implements the Word Segmentation section of Unicode Annex #29. implements the Word Segmentation section of Unicode Annex #29.
See `simple_tokenize` for details. See `simple_tokenize` for details.
@ -107,6 +118,9 @@ def tokenize(text, lang):
from wordfreq.mecab import mecab_tokenize from wordfreq.mecab import mecab_tokenize
return mecab_tokenize(text) return mecab_tokenize(text)
if lang == 'tr':
return turkish_tokenize(text)
if lang == 'ar': if lang == 'ar':
text = remove_arabic_marks(unicodedata.normalize('NFKC', text)) text = remove_arabic_marks(unicodedata.normalize('NFKC', text))
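The reason Turkish needs its own path can be seen with a small sketch. The `turkish_casefold` helper below is hypothetical; it isolates the pre-processing step of `turkish_tokenize` from the diff and leaves out the `TOKEN_RE` splitting, which is not shown here.

```python
import unicodedata


def turkish_casefold(text):
    # Map the two Turkish capital I's to their Turkish lowercase forms
    # before applying Python's language-independent casefold().
    text = unicodedata.normalize('NFC', text).replace('İ', 'i').replace('I', 'ı')
    return text.casefold()


# With the pre-processing step, both capitals fold the Turkish way.
print(turkish_casefold('İstanbul'))
print(turkish_casefold('SIKISINCA'))

# Without it, casefold() turns 'I' into a dotted 'i' and turns 'İ' into
# 'i' followed by a combining dot above (U+0307).
print(ascii('İstanbul'.casefold()))
print('SIKISINCA'.casefold())
```

The replacement has to happen before `casefold()`, because once the text has been folded the distinction between the two capital letters is no longer recoverable.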


@ -161,3 +161,34 @@ longer represents the words 'don' and 'won', as we assume most of their
frequency comes from "don't" and "won't". Words that turned into similarly frequency comes from "don't" and "won't". Words that turned into similarly
common words, however, were left alone: this list doesn't represent "can't" common words, however, were left alone: this list doesn't represent "can't"
because the word was left as "can". because the word was left as "can".
### SUBTLEX
Marc Brysbaert gave us permission by e-mail to use the SUBTLEX word lists in
wordfreq and derived works without the "academic use" restriction, under the
following reasonable conditions:
- Wordfreq and code derived from it must credit the SUBTLEX authors.
(See the citations in the top-level `README.md` file.)
- It must remain clear that SUBTLEX is freely available data.
`data/source-lists/subtlex` contains the following files:
- `subtlex.de.txt`, which was downloaded as [SUBTLEX-DE raw file.xlsx][subtlex-de],
and exported from Excel format to tab-separated UTF-8 using LibreOffice
- `subtlex.en-US.txt`, which was downloaded as [subtlexus5.zip][subtlex-us],
extracted, and converted from ISO-8859-1 to UTF-8
- `subtlex.en-GB.txt`, which was downloaded as
[SUBTLEX-UK\_all.xlsx][subtlex-uk], and exported from Excel format to
tab-separated UTF-8 using LibreOffice
- `subtlex.nl.txt`, which was downloaded as
[SUBTLEX-NL.cd-above2.txt.zip][subtlex-nl] and extracted
- `subtlex.zh.txt`, which was downloaded as
[subtlexch131210.zip][subtlex-ch] and extracted
[subtlex-de]: http://crr.ugent.be/SUBTLEX-DE/SUBTLEX-DE%20raw%20file.xlsx
[subtlex-us]: http://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus/subtlexus5.zip
[subtlex-uk]: http://crr.ugent.be/papers/SUBTLEX-UK_all.xlsx
[subtlex-nl]: http://crr.ugent.be/subtlex-nl/SUBTLEX-NL.cd-above2.txt.zip
[subtlex-ch]: http://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexch/subtlexch131210.zip

BIN
wordfreq_builder/build.png Normal file

Binary file not shown. Size: 1.9 MiB


@ -1 +0,0 @@
ef54b21e931c530f5b75c1cd87c5841cc4691e43


@ -56,6 +56,12 @@ rule convert_leeds
rule convert_opensubtitles rule convert_opensubtitles
command = tr ' ' ',' < $in > $out command = tr ' ' ',' < $in > $out
# To convert SUBTLEX, we take the 1st and Nth columns, strip the header,
# run it through ftfy, convert tabs to commas and spurious CSV formatting to
# spaces, and remove lines with unfixable half-mojibake.
rule convert_subtlex
command = cut -f $textcol,$freqcol $in | tail -n +$startrow | ftfy | tr ' ",' ', ' | grep -v 'â,' > $out
# Convert and clean up the Google Books Syntactic N-grams data. Concatenate all # Convert and clean up the Google Books Syntactic N-grams data. Concatenate all
# the input files, keep only the single words and their counts, and only keep # the input files, keep only the single words and their counts, and only keep
# lines with counts of 100 or more. # lines with counts of 100 or more.
@ -71,7 +77,10 @@ rule count
command = python -m wordfreq_builder.cli.count_tokens $in $out command = python -m wordfreq_builder.cli.count_tokens $in $out
rule merge rule merge
command = python -m wordfreq_builder.cli.combine_lists -o $out $in command = python -m wordfreq_builder.cli.merge_freqs -o $out -c $cutoff $in
rule merge_counts
command = python -m wordfreq_builder.cli.merge_counts -o $out $in
rule freqs2cB rule freqs2cB
command = python -m wordfreq_builder.cli.freqs_to_cB $lang $in $out command = python -m wordfreq_builder.cli.freqs_to_cB $lang $in $out


@ -1,12 +1,13 @@
from wordfreq_builder.word_counts import read_freqs, merge_freqs, write_wordlist from wordfreq_builder.word_counts import read_values, merge_counts, write_wordlist
import argparse import argparse
def merge_lists(input_names, output_name): def merge_lists(input_names, output_name):
freq_dicts = [] count_dicts = []
for input_name in input_names: for input_name in input_names:
freq_dicts.append(read_freqs(input_name, cutoff=2)) values, total = read_values(input_name, cutoff=0)
merged = merge_freqs(freq_dicts) count_dicts.append(values)
merged = merge_counts(count_dicts)
write_wordlist(merged, output_name) write_wordlist(merged, output_name)


@ -0,0 +1,20 @@
from wordfreq_builder.word_counts import read_freqs, merge_freqs, write_wordlist
import argparse
def merge_lists(input_names, output_name, cutoff):
freq_dicts = []
for input_name in input_names:
freq_dicts.append(read_freqs(input_name, cutoff=cutoff))
merged = merge_freqs(freq_dicts)
write_wordlist(merged, output_name)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('-o', '--output', help='filename to write the output to', default='combined-freqs.csv')
parser.add_argument('-c', '--cutoff', type=int, help='stop after seeing a count below this', default=2)
parser.add_argument('inputs', help='names of input files to merge', nargs='+')
args = parser.parse_args()
merge_lists(args.inputs, args.output, args.cutoff)
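A quick way to check how the new `-c`/`--cutoff` option is parsed is to build the same parser and feed it an explicit argument list. The `build_parser` wrapper and the file names below are hypothetical; the three `add_argument` calls are the ones from the new file.

```python
import argparse


def build_parser():
    # Same options as the new merge_freqs command-line script.
    parser = argparse.ArgumentParser()
    parser.add_argument('-o', '--output',
                        help='filename to write the output to',
                        default='combined-freqs.csv')
    parser.add_argument('-c', '--cutoff', type=int,
                        help='stop after seeing a count below this',
                        default=2)
    parser.add_argument('inputs', help='names of input files to merge',
                        nargs='+')
    return parser


# Explicit values, in the order the ninja 'merge' rule passes them.
args = build_parser().parse_args(
    ['-o', 'combined_tr.csv', '-c', '5', 'twitter_tr.csv', 'wikipedia_tr.csv']
)
print(args.output, args.cutoff, args.inputs)

# With only an input file, the defaults from the script apply.
defaults = build_parser().parse_args(['only.csv'])
print(defaults.output, defaults.cutoff)
```

Because `type=int` is set, the cutoff arrives as an integer and can be compared directly against the numeric values read from the CSV files.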


@ -8,20 +8,25 @@ CONFIG = {
'sources': { 'sources': {
# A list of language codes (possibly un-standardized) that we'll # A list of language codes (possibly un-standardized) that we'll
# look up in filenames for these various data sources. # look up in filenames for these various data sources.
#
# Consider adding:
# 'th' when we get tokenization for it
# 'hi' when we stop messing up its tokenization
# 'tl' because it's probably ready right now
# 'pl' because we have 3 sources for it
'twitter': [ 'twitter': [
'ar', 'de', 'en', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl', 'ar', 'de', 'el', 'en', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
'pt', 'ru', 'pt', 'ru', 'tr'
# can be added later: 'th', 'tr'
], ],
'wikipedia': [ 'wikipedia': [
'ar', 'de', 'en', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl', 'ar', 'de', 'en', 'el', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
'pt', 'ru' 'pt', 'ru', 'tr'
# many more can be added
], ],
'opensubtitles': [ 'opensubtitles': [
# All languages where the most common word in OpenSubtitles # This list includes languages where the most common word in
# appears at least 5000 times # OpenSubtitles appears at least 5000 times. However, we exclude
'ar', 'bg', 'bs', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es', 'et', # German, where SUBTLEX has done better processing of the same data.
'ar', 'bg', 'bs', 'ca', 'cs', 'da', 'el', 'en', 'es', 'et',
'fa', 'fi', 'fr', 'he', 'hr', 'hu', 'id', 'is', 'it', 'lt', 'lv', 'fa', 'fi', 'fr', 'he', 'hr', 'hu', 'id', 'is', 'it', 'lt', 'lv',
'mk', 'ms', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sq', 'mk', 'ms', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sq',
'sr', 'sv', 'tr', 'uk', 'zh' 'sr', 'sv', 'tr', 'uk', 'zh'
@ -33,14 +38,19 @@ CONFIG = {
'en', 'en',
# Using the 2012 data, we could get French, German, Italian, # Using the 2012 data, we could get French, German, Italian,
# Russian, Spanish, and (Simplified) Chinese. # Russian, Spanish, and (Simplified) Chinese.
] ],
'subtlex-en': ['en'],
'subtlex-other': ['de', 'nl', 'zh'],
}, },
# Subtlex languages that need to be pre-processed
'wordlist_paths': { 'wordlist_paths': {
'twitter': 'generated/twitter/tweets-2014.{lang}.{ext}', 'twitter': 'generated/twitter/tweets-2014.{lang}.{ext}',
'wikipedia': 'generated/wikipedia/wikipedia_{lang}.{ext}', 'wikipedia': 'generated/wikipedia/wikipedia_{lang}.{ext}',
'opensubtitles': 'generated/opensubtitles/opensubtitles_{lang}.{ext}', 'opensubtitles': 'generated/opensubtitles/opensubtitles_{lang}.{ext}',
'leeds': 'generated/leeds/leeds_internet_{lang}.{ext}', 'leeds': 'generated/leeds/leeds_internet_{lang}.{ext}',
'google-books': 'generated/google-books/google_books_{lang}.{ext}', 'google-books': 'generated/google-books/google_books_{lang}.{ext}',
'subtlex-en': 'generated/subtlex/subtlex_{lang}.{ext}',
'subtlex-other': 'generated/subtlex/subtlex_{lang}.{ext}',
'combined': 'generated/combined/combined_{lang}.{ext}', 'combined': 'generated/combined/combined_{lang}.{ext}',
'combined-dist': 'dist/combined_{lang}.{ext}', 'combined-dist': 'dist/combined_{lang}.{ext}',
'twitter-dist': 'dist/twitter_{lang}.{ext}' 'twitter-dist': 'dist/twitter_{lang}.{ext}'
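The path templates above are ordinary `str.format` patterns. The sketch below shows how they expand for the new SUBTLEX sources; `expand_path` is a hypothetical stand-in for the builder's `wordlist_filename` helper, whose actual implementation is not part of this diff.

```python
# A subset of the 'wordlist_paths' entries from the new CONFIG.
WORDLIST_PATHS = {
    'subtlex-en': 'generated/subtlex/subtlex_{lang}.{ext}',
    'subtlex-other': 'generated/subtlex/subtlex_{lang}.{ext}',
    'combined-dist': 'dist/combined_{lang}.{ext}',
}


def expand_path(source, lang, ext='txt'):
    # Hypothetical helper: fill in the language code and file extension.
    return WORDLIST_PATHS[source].format(lang=lang, ext=ext)


# Regional English lists are processed separately, then merged into 'en'.
print(expand_path('subtlex-en', 'en-US', 'processed.txt'))
print(expand_path('subtlex-en', 'en', 'counts.txt'))

# Other SUBTLEX languages share the same template.
print(expand_path('subtlex-other', 'de', 'counts.txt'))
```

Both SUBTLEX sources use one template, which appears to be safe only because their language lists (`['en']` and `['de', 'nl', 'zh']`) do not overlap.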


@ -5,7 +5,8 @@ import sys
import pathlib import pathlib
HEADER = """# This file is automatically generated. Do not edit it. HEADER = """# This file is automatically generated. Do not edit it.
# You can regenerate it using the 'wordfreq-build-deps' command. # You can change its behavior by editing wordfreq_builder/ninja.py,
# and regenerate it by running 'make'.
""" """
TMPDIR = data_filename('tmp') TMPDIR = data_filename('tmp')
@ -76,6 +77,18 @@ def make_ninja_deps(rules_filename, out=sys.stdout):
CONFIG['sources']['opensubtitles'] CONFIG['sources']['opensubtitles']
) )
) )
lines.extend(
subtlex_en_deps(
data_filename('source-lists/subtlex'),
CONFIG['sources']['subtlex-en']
)
)
lines.extend(
subtlex_other_deps(
data_filename('source-lists/subtlex'),
CONFIG['sources']['subtlex-other']
)
)
lines.extend(combine_lists(all_languages())) lines.extend(combine_lists(all_languages()))
print('\n'.join(lines), file=out) print('\n'.join(lines), file=out)
@ -140,7 +153,8 @@ def twitter_deps(input_filename, slice_prefix, combined_prefix, slices,
for language in languages for language in languages
] ]
add_dep(lines, 'tokenize_twitter', slice_file, language_outputs, add_dep(lines, 'tokenize_twitter', slice_file, language_outputs,
params={'prefix': slice_file}) params={'prefix': slice_file},
extra='wordfreq_builder/tokenizers.py')
for language in languages: for language in languages:
combined_output = wordlist_filename('twitter', language, 'tokens.txt') combined_output = wordlist_filename('twitter', language, 'tokens.txt')
@ -188,12 +202,69 @@ def opensubtitles_deps(dirname_in, languages):
prefix=dirname_in, lang=language prefix=dirname_in, lang=language
) )
reformatted_file = wordlist_filename( reformatted_file = wordlist_filename(
'opensubtitles', language, 'counts.txt') 'opensubtitles', language, 'counts.txt'
)
add_dep(lines, 'convert_opensubtitles', input_file, reformatted_file) add_dep(lines, 'convert_opensubtitles', input_file, reformatted_file)
return lines return lines
# Which columns of the SUBTLEX data files do the word and its frequency appear
# in?
SUBTLEX_COLUMN_MAP = {
'de': (1, 3),
'el': (2, 3),
'en': (1, 2),
'nl': (1, 2),
'zh': (1, 5)
}
def subtlex_en_deps(dirname_in, languages):
lines = []
assert languages == ['en']
regions = ['en-US', 'en-GB']
processed_files = []
for region in regions:
input_file = '{prefix}/subtlex.{region}.txt'.format(
prefix=dirname_in, region=region
)
textcol, freqcol = SUBTLEX_COLUMN_MAP['en']
processed_file = wordlist_filename('subtlex-en', region, 'processed.txt')
processed_files.append(processed_file)
add_dep(
lines, 'convert_subtlex', input_file, processed_file,
params={'textcol': textcol, 'freqcol': freqcol, 'startrow': 2}
)
output_file = wordlist_filename('subtlex-en', 'en', 'counts.txt')
add_dep(lines, 'merge_counts', processed_files, output_file)
return lines
def subtlex_other_deps(dirname_in, languages):
lines = []
for language in languages:
input_file = '{prefix}/subtlex.{lang}.txt'.format(
prefix=dirname_in, lang=language
)
processed_file = wordlist_filename('subtlex-other', language, 'processed.txt')
output_file = wordlist_filename('subtlex-other', language, 'counts.txt')
textcol, freqcol = SUBTLEX_COLUMN_MAP[language]
# Skip one header line by setting 'startrow' to 2 (because tail is 1-based).
# I hope we don't need to configure this by language anymore.
add_dep(
lines, 'convert_subtlex', input_file, processed_file,
params={'textcol': textcol, 'freqcol': freqcol, 'startrow': 2}
)
add_dep(
lines, 'merge_counts', processed_file, output_file
)
return lines
def combine_lists(languages): def combine_lists(languages):
lines = [] lines = []
for language in languages: for language in languages:
@ -204,7 +275,8 @@ def combine_lists(languages):
] ]
output_file = wordlist_filename('combined', language) output_file = wordlist_filename('combined', language)
add_dep(lines, 'merge', input_files, output_file, add_dep(lines, 'merge', input_files, output_file,
extra='wordfreq_builder/word_counts.py') extra='wordfreq_builder/word_counts.py',
params={'cutoff': 2})
output_cBpack = wordlist_filename( output_cBpack = wordlist_filename(
'combined-dist', language, 'msgpack.gz') 'combined-dist', language, 'msgpack.gz')


@ -13,7 +13,8 @@ CLD2_BAD_CHAR_RANGE = "[%s]" % "".join(
'\ufdd0-\ufdef', '\ufdd0-\ufdef',
'\N{HANGUL FILLER}', '\N{HANGUL FILLER}',
'\N{HANGUL CHOSEONG FILLER}', '\N{HANGUL CHOSEONG FILLER}',
'\N{HANGUL JUNGSEONG FILLER}' '\N{HANGUL JUNGSEONG FILLER}',
'<>'
] + ] +
[chr(65534+65536*x+y) for x in range(17) for y in range(2)] [chr(65534+65536*x+y) for x in range(17) for y in range(2)]
) )


@ -32,9 +32,40 @@ def count_tokens(filename):
return counts return counts
def read_values(filename, cutoff=0, lang=None):
"""
Read words and their frequency or count values from a CSV file. Returns
a dictionary of values and the total of all values.
Only words with a value greater than or equal to `cutoff` are returned.
If `cutoff` is greater than 0, the csv file must be sorted by value
in descending order.
If lang is given, it will apply language specific preprocessing
operations.
"""
values = defaultdict(float)
total = 0.
with open(filename, encoding='utf-8', newline='') as infile:
for key, strval in csv.reader(infile):
val = float(strval)
key = fix_text(key)
if val < cutoff:
break
tokens = tokenize(key, lang) if lang is not None else simple_tokenize(key)
for token in tokens:
# Use += so that, if we give the reader concatenated files with
# duplicates, it does the right thing
values[token] += val
total += val
return values, total
def read_freqs(filename, cutoff=0, lang=None): def read_freqs(filename, cutoff=0, lang=None):
""" """
Read words and their frequencies from a CSV file. Read words and their frequencies from a CSV file, normalizing the
frequencies to add up to 1.
Only words with a frequency greater than or equal to `cutoff` are returned. Only words with a frequency greater than or equal to `cutoff` are returned.
@ -44,24 +75,11 @@ def read_freqs(filename, cutoff=0, lang=None):
If lang is given, read_freqs will apply language specific preprocessing If lang is given, read_freqs will apply language specific preprocessing
operations. operations.
""" """
raw_counts = defaultdict(float) values, total = read_values(filename, cutoff, lang)
total = 0. for word in values:
with open(filename, encoding='utf-8', newline='') as infile: values[word] /= total
for key, strval in csv.reader(infile):
val = float(strval)
if val < cutoff:
break
tokens = tokenize(key, lang) if lang is not None else simple_tokenize(key)
for token in tokens:
# Use += so that, if we give the reader concatenated files with
# duplicates, it does the right thing
raw_counts[fix_text(token)] += val
total += val
for word in raw_counts: return values
raw_counts[word] /= total
return raw_counts
def freqs_to_cBpack(in_filename, out_filename, cutoff=-600, lang=None): def freqs_to_cBpack(in_filename, out_filename, cutoff=-600, lang=None):
@ -96,6 +114,17 @@ def freqs_to_cBpack(in_filename, out_filename, cutoff=-600, lang=None):
msgpack.dump(cBpack_data, outfile) msgpack.dump(cBpack_data, outfile)
def merge_counts(count_dicts):
"""
Merge multiple dictionaries of counts by adding their entries.
"""
merged = defaultdict(int)
for count_dict in count_dicts:
for term, count in count_dict.items():
merged[term] += count
return merged
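The difference between merging raw counts and working with normalized frequencies can be sketched as follows. `merge_counts` is copied from the diff; `normalize` is a hypothetical helper that mirrors the division loop in `read_freqs` in simplified form (it sums the dictionary values instead of accumulating a running total per CSV row), and the counts are made up.

```python
from collections import defaultdict


def merge_counts(count_dicts):
    """
    Merge multiple dictionaries of counts by adding their entries.
    """
    merged = defaultdict(int)
    for count_dict in count_dicts:
        for term, count in count_dict.items():
            merged[term] += count
    return merged


def normalize(values):
    # Hypothetical helper: scale values so that they add up to 1.
    total = sum(values.values())
    return {word: value / total for word, value in values.items()}


# Made-up counts standing in for the en-US and en-GB SUBTLEX lists.
us_counts = {'the': 100, 'color': 30}
gb_counts = {'the': 50, 'colour': 20}

merged = merge_counts([us_counts, gb_counts])
freqs = normalize(merged)
print(dict(merged))
print(freqs)
```

Adding counts before normalizing weights each regional list by its size, which seems to be the intent of routing the two English SUBTLEX files through `merge_counts` rather than `merge_freqs`.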
def merge_freqs(freq_dicts): def merge_freqs(freq_dicts):
""" """
Merge multiple dictionaries of frequencies, representing each word with Merge multiple dictionaries of frequencies, representing each word with