Merge pull request #27 from LuminosoInsight/chinese-and-more

Improve Chinese, Greek, English; add Turkish, Polish, Swedish
Andrew Lin 2015-09-24 13:25:21 -04:00
commit 710eaabbe1
56 changed files with 36546 additions and 102 deletions

View File

@ -26,7 +26,7 @@ install them on Ubuntu:
## Usage
wordfreq provides access to estimates of the frequency with which a word is
used, in 16 languages (see *Supported languages* below). It loads
used, in 18 languages (see *Supported languages* below). It loads
efficiently-packed data structures that contain all words that appear at least
once per million words.
@ -111,45 +111,49 @@ limiting the selection to words that can be typed in ASCII.
## Sources and supported languages
We compiled word frequencies from five different sources, providing us examples
of word usage on different topics at different levels of formality. The sources
(and the abbreviations we'll use for them) are:
We compiled word frequencies from seven different sources, providing us
examples of word usage on different topics at different levels of formality.
The sources (and the abbreviations we'll use for them) are:
- **GBooks**: Google Books Ngrams 2013
- **LeedsIC**: The Leeds Internet Corpus
- **OpenSub**: OpenSubtitles
- **SUBTLEX**: The SUBTLEX word frequency lists
- **OpenSub**: Data derived from OpenSubtitles but not from SUBTLEX
- **Twitter**: Messages sampled from Twitter's public stream
- **Wikipedia**: The full text of Wikipedia in 2015
- **Wpedia**: The full text of Wikipedia in 2015
- **Other**: We get additional English frequencies from Google Books Syntactic
Ngrams 2013, and Chinese frequencies from the frequency dictionary that
comes with the Jieba tokenizer.
The following 14 languages are well-supported, with reasonable tokenization and
The following 17 languages are well-supported, with reasonable tokenization and
at least 3 different sources of word frequencies:
Language    Code    GBooks  SUBTLEX LeedsIC OpenSub Twitter Wikipedia
──────────────────┼──────────────────────────────────────────────────
Arabic      ar    │ -       -       Yes     Yes     Yes     Yes
German      de    │ -       Yes     Yes     -       Yes[1]  Yes
Greek       el    │ -       -       Yes     Yes     Yes     Yes
English     en    │ Yes     Yes     Yes     Yes     Yes     Yes
Spanish     es    │ -       -       Yes     Yes     Yes     Yes
French      fr    │ -       -       Yes     Yes     Yes     Yes
Indonesian  id    │ -       -       -       Yes     Yes     Yes
Italian     it    │ -       -       Yes     Yes     Yes     Yes
Japanese    ja    │ -       -       Yes     -       Yes     Yes
Malay       ms    │ -       -       -       Yes     Yes     Yes
Dutch       nl    │ -       Yes     -       Yes     Yes     Yes
Portuguese  pt    │ -       -       Yes     Yes     Yes     Yes
Russian     ru    │ -       -       Yes     Yes     Yes     Yes
Turkish     tr    │ -       -       -       Yes     Yes     Yes
Language    Code    SUBTLEX OpenSub LeedsIC Twitter Wpedia  Other
──────────────────┼─────────────────────────────────────────────────────
Arabic      ar    │ -       Yes     Yes     Yes     Yes     -
German      de    │ Yes     -       Yes     Yes[1]  Yes     -
Greek       el    │ -       Yes     Yes     Yes     Yes     -
English     en    │ Yes     Yes     Yes     Yes     Yes     Google Books
Spanish     es    │ -       Yes     Yes     Yes     Yes     -
French      fr    │ -       Yes     Yes     Yes     Yes     -
Indonesian  id    │ -       Yes     -       Yes     Yes     -
Italian     it    │ -       Yes     Yes     Yes     Yes     -
Japanese    ja    │ -       -       Yes     Yes     Yes     -
Malay       ms    │ -       Yes     -       Yes     Yes     -
Dutch       nl    │ Yes     Yes     -       Yes     Yes     -
Polish      pl    │ -       Yes     -       Yes     Yes     -
Portuguese  pt    │ -       Yes     Yes     Yes     Yes     -
Russian     ru    │ -       Yes     Yes     Yes     Yes     -
Swedish     sv    │ -       Yes     -       Yes     Yes     -
Turkish     tr    │ -       Yes     -       Yes     Yes     -
Chinese     zh    │ Yes     -       Yes     -       -       Jieba
These languages are only marginally supported so far. We have too few data
sources so far in Korean (feel free to suggest some), and we are lacking
tokenization support for Chinese.
Language    Code    GBooks  SUBTLEX LeedsIC OpenSub Twitter Wikipedia
──────────────────┼──────────────────────────────────────────────────
Korean      ko    │ -       -       -       -       Yes     Yes
Chinese     zh    │ -       Yes     Yes     Yes     -       -
Additionally, Korean is marginally supported. You can look up frequencies in
it, but we have too few data sources for it so far:
Language    Code    SUBTLEX OpenSub LeedsIC Twitter Wpedia
──────────────────┼───────────────────────────────────────
Korean      ko    │ -       -       -       Yes     Yes
[1] We've counted the frequencies from tweets in German, such as they are, but
you should be aware that German is not a frequently-used language on Twitter.
@ -170,7 +174,8 @@ There are language-specific exceptions:
- In Japanese, instead of using the regex library, it uses the external library
`mecab-python3`. This is an optional dependency of wordfreq, and compiling
it requires the `libmecab-dev` system package to be installed.
- It does not yet attempt to tokenize Chinese ideograms.
- In Chinese, it uses the external Python library `jieba`, another optional
dependency.
[uax29]: http://unicode.org/reports/tr29/
@ -182,10 +187,14 @@ also try to deal gracefully when you query it with texts that actually break
into multiple tokens:
>>> word_frequency('New York', 'en')
0.0002632772081925718
0.0002315934248950231
>>> word_frequency('北京地铁', 'zh') # "Beijing Subway"
3.2187603965715087e-06
The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be.
to provide an estimate of what their combined frequency would be. In languages
written without spaces, there is also a penalty to the word frequency for each
word break that must be inferred.
This implicitly assumes that you're asking about words that frequently appear
together. It's not multiplying the frequencies, because that would assume they
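As a rough sketch of how that combination behaves (illustration only, not wordfreq's API; the helper name and the input frequencies are made up, and the factor of 10 is the INFERRED_SPACE_FACTOR defined in wordfreq/__init__.py later in this diff):

def combined_frequency(token_freqs, inferred_breaks=0):
    # Half-harmonic mean: the reciprocal of the sum of reciprocals.
    freq = 1.0 / sum(1.0 / f for f in token_freqs)
    # Penalize each word break that had to be inferred, by a factor of 10 each.
    return freq / (10.0 ** inferred_breaks)

# Two tokens of equal frequency f combine to f / 2; with one inferred break,
# as in the Chinese example above, the estimate becomes f / 20.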
@ -223,14 +232,14 @@ sources:
- Wikipedia, the free encyclopedia (http://www.wikipedia.org)
It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK, and
SUBTLEX-CH, created by Marc Brysbaert et al. and available at
It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK,
SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al.
(see citations below) and available at
http://crr.ugent.be/programs-data/subtitle-frequencies.
I (Rob Speer) have
obtained permission by e-mail from Marc Brysbaert to distribute these wordlists
in wordfreq, to be used for any purpose, not just for academic use, under these
conditions:
I (Rob Speer) have obtained permission by e-mail from Marc Brysbaert to
distribute these wordlists in wordfreq, to be used for any purpose, not just
for academic use, under these conditions:
- Wordfreq and code derived from it must credit the SUBTLEX authors.
- It must remain clear that SUBTLEX is freely available data.
@ -254,6 +263,11 @@ Twitter; it does not display or republish any Twitter content.
(2015). The word frequency effect. Experimental Psychology.
http://econtent.hogrefe.com/doi/abs/10.1027/1618-3169/a000123?journalCode=zea
- Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A.
(2011). The word frequency effect: A review of recent developments and
implications for the choice of frequency estimates in German. Experimental
Psychology, 58, 412-424.
- Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character
frequencies based on film subtitles. PLoS One, 5(6), e10729.
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729
@ -277,4 +291,3 @@ Twitter; it does not display or republish any Twitter content.
SUBTLEX-UK: A new and improved word frequency database for British English.
The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.
http://www.tandfonline.com/doi/pdf/10.1080/17470218.2013.850521

View File

@ -0,0 +1,50 @@
"""
Generate a msgpack file, _chinese_mapping.msgpack.gz, that maps Traditional
Chinese characters to their Simplified Chinese equivalents.
This is meant to be a normalization of text, somewhat like case-folding -- not
an actual translator, a task for which this method would be unsuitable. We
store word frequencies using Simplified Chinese characters so that, in the
large number of cases where a Traditional Chinese word has an obvious
Simplified Chinese mapping, we can get a frequency for it that's the same in
Simplified and Traditional Chinese.
Generating this mapping requires the external Chinese conversion tool OpenCC.
"""
import unicodedata
import itertools
import os
import msgpack
import gzip
def make_hanzi_table(filename):
with open(filename, 'w', encoding='utf-8') as out:
for codept in itertools.chain(range(0x3400, 0xa000), range(0xf900, 0xfb00), range(0x20000, 0x30000)):
char = chr(codept)
if unicodedata.category(char) != 'Cn':
print('%5X\t%s' % (codept, char), file=out)
def make_hanzi_converter(table_in, msgpack_out):
table = {}
with open(table_in, encoding='utf-8') as infile:
for line in infile:
hexcode, char = line.rstrip('\n').split('\t')
codept = int(hexcode, 16)
assert len(char) == 1
if chr(codept) != char:
table[codept] = char
with gzip.open(msgpack_out, 'wb') as outfile:
msgpack.dump(table, outfile, encoding='utf-8')
def build():
make_hanzi_table('/tmp/han_in.txt')
os.system('opencc -c zht2zhs.ini < /tmp/han_in.txt > /tmp/han_out.txt')
make_hanzi_converter('/tmp/han_out.txt', '_chinese_mapping.msgpack.gz')
if __name__ == '__main__':
build()
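Once built, the table is simply a dict mapping Traditional codepoints to Simplified characters, so it can be applied with str.translate, as wordfreq/chinese.py does later in this diff. A minimal sketch of reading it back, mirroring the load call used in wordfreq/chinese.py (expected result shown for illustration):

import gzip
import msgpack

# Load the codepoint-to-character table and apply it as a character mapping.
table = msgpack.load(gzip.open('_chinese_mapping.msgpack.gz', 'rb'), encoding='utf-8')
print('漢字'.translate(table))   # expected output: 汉字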

View File

@ -33,7 +33,7 @@ if sys.version_info < (3, 4):
setup(
name="wordfreq",
version='1.1',
version='1.2',
maintainer='Luminoso Technologies, Inc.',
maintainer_email='info@luminoso.com',
url='http://github.com/LuminosoInsight/wordfreq/',
@ -50,8 +50,11 @@ setup(
# turn, it depends on libmecab-dev being installed on the system. It's not
# listed under 'install_requires' because wordfreq should be usable in
# other languages without it.
#
# Similarly, jieba is required for Chinese word frequencies.
extras_require={
'mecab': 'mecab-python3'
'mecab': 'mecab-python3',
'jieba': 'jieba'
},
tests_require=['mecab-python3'],
tests_require=['mecab-python3', 'jieba'],
)

View File

@ -162,8 +162,8 @@ def test_ar():
def test_ideographic_fallback():
# Try tokenizing Chinese text -- it should remain stuck together.
eq_(tokenize('中国文字', 'zh'), ['中国文字'])
# Try tokenizing Chinese text as English -- it should remain stuck together.
eq_(tokenize('中国文字', 'en'), ['中国文字'])
# When Japanese is tagged with the wrong language, it will be split
# at script boundaries.

tests/test_chinese.py (new file, 47 lines)
View File

@ -0,0 +1,47 @@
from nose.tools import eq_, assert_almost_equal, assert_greater
from wordfreq import tokenize, word_frequency


def test_tokens():
    # Let's test on some Chinese text that has unusual combinations of
    # syllables, because it is about an American vice-president.
    #
    # (He was the Chinese Wikipedia's featured article of the day when I
    # wrote this test.)
    hobart = '加勒特·霍巴特'  # Garret Hobart, or "jiā lè tè huò bā tè".

    # He was the sixth American vice president to die in office.
    fact_simplified = '他是历史上第六位在任期内去世的美国副总统。'
    fact_traditional = '他是歷史上第六位在任期內去世的美國副總統。'

    # His name breaks into five pieces, with the only piece staying together
    # being the one that means 'Bart'. The dot is not included as a token.
    eq_(
        tokenize(hobart, 'zh'),
        ['加', '勒', '特', '霍', '巴特']
    )

    eq_(
        tokenize(fact_simplified, 'zh'),
        [
            # he / is / in history / #6 / counter for people
            '他', '是', '历史上', '第六', '位',
            # during / term of office / in / die
            '在', '任期', '内', '去世',
            # of / U.S. / deputy / president
            '的', '美国', '副', '总统'
        ]
    )

    # You match the same tokens if you look it up in Traditional Chinese.
    eq_(tokenize(fact_simplified, 'zh'), tokenize(fact_traditional, 'zh'))
    assert_greater(word_frequency(fact_traditional, 'zh'), 0)


def test_combination():
    xiexie_freq = word_frequency('谢谢', 'zh')   # "Thanks"
    assert_almost_equal(
        word_frequency('谢谢谢谢', 'zh'),
        xiexie_freq / 20
    )
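The factor of 20 in that test is the combination rule at work. A worked check (illustration only, using a placeholder frequency), assuming the factor-of-10 penalty defined as INFERRED_SPACE_FACTOR in wordfreq/__init__.py:

f = 1e-4                                   # placeholder frequency for '谢谢'
half_harmonic = 1.0 / (1.0 / f + 1.0 / f)  # two equal tokens combine to f / 2
penalized = half_harmonic / 10.0           # one inferred word break divides by 10
assert abs(penalized - f / 20) < 1e-18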

View File

@ -15,6 +15,19 @@ logger = logging.getLogger(__name__)
CACHE_SIZE = 100000
DATA_PATH = pathlib.Path(resource_filename('wordfreq', 'data'))
# Chinese and Japanese are written without spaces. In Chinese, in particular,
# we have to infer word boundaries from the frequencies of the words they
# would create. When this happens, we should adjust the resulting frequency
# to avoid creating a bias toward improbable word combinations.
INFERRED_SPACE_LANGUAGES = {'zh'}
# We'll divide the frequency by 10 for each token boundary that was inferred.
# (We determined the factor of 10 empirically by looking at words in the
# Chinese wordlist that weren't common enough to be identified by the
# tokenizer. These words would get split into multiple tokens, and their
# inferred frequency would be on average 9.77 times higher than their actual
# frequency.)
INFERRED_SPACE_FACTOR = 10.0
# simple_tokenize is imported so that other things can import it from here.
# Suppress the pyflakes warning.
@ -80,10 +93,11 @@ def available_languages(wordlist='combined'):
"""
available = {}
for path in DATA_PATH.glob('*.msgpack.gz'):
list_name = path.name.split('.')[0]
name, lang = list_name.split('_')
if name == wordlist:
available[lang] = str(path)
if not path.name.startswith('_'):
list_name = path.name.split('.')[0]
name, lang = list_name.split('_')
if name == wordlist:
available[lang] = str(path)
return available
@ -181,7 +195,12 @@ def _word_frequency(word, lang, wordlist, minimum):
return minimum
one_over_result += 1.0 / freqs[token]
return max(1.0 / one_over_result, minimum)
freq = 1.0 / one_over_result
if lang in INFERRED_SPACE_LANGUAGES:
freq /= INFERRED_SPACE_FACTOR ** (len(tokens) - 1)
return max(freq, minimum)
def word_frequency(word, lang, wordlist='combined', minimum=0.):
"""

wordfreq/chinese.py (new file, 20 lines)
View File

@ -0,0 +1,20 @@
from pkg_resources import resource_filename
import jieba
import msgpack
import gzip

DICT_FILENAME = resource_filename('wordfreq', 'data/jieba_zh.txt')
SIMP_MAP_FILENAME = resource_filename('wordfreq', 'data/_chinese_mapping.msgpack.gz')
SIMPLIFIED_MAP = msgpack.load(gzip.open(SIMP_MAP_FILENAME), encoding='utf-8')
jieba_tokenizer = None


def simplify_chinese(text):
    return text.translate(SIMPLIFIED_MAP).casefold()


def jieba_tokenize(text):
    global jieba_tokenizer
    if jieba_tokenizer is None:
        jieba_tokenizer = jieba.Tokenizer(dictionary=DICT_FILENAME)
    return jieba_tokenizer.lcut(simplify_chinese(text), HMM=False)
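A quick interactive sketch of how these two helpers are expected to behave (illustrative output; the exact tokens depend on the bundled Jieba dictionary):

>>> from wordfreq.chinese import simplify_chinese, jieba_tokenize
>>> simplify_chinese('漢字')    # Traditional characters map to their Simplified forms
'汉字'
>>> jieba_tokenize('谢谢你')    # "thank you"; expected to split into two words
['谢谢', '你']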

Binary files not shown (19 files)

wordfreq/data/jieba_zh.txt (new file, 36124 lines)

File diff suppressed because it is too large.

Binary files not shown (17 files)

View File

@ -1,5 +1,6 @@
import regex
import unicodedata
from pkg_resources import resource_filename
TOKEN_RE = regex.compile(r"""
@ -87,6 +88,7 @@ def remove_arabic_marks(text):
mecab_tokenize = None
jieba_tokenize = None
def tokenize(text, lang):
"""
Tokenize this text in a way that's relatively simple but appropriate for
@ -115,8 +117,17 @@ def tokenize(text, lang):
if lang == 'ja':
global mecab_tokenize
if mecab_tokenize is None:
from wordfreq.mecab import mecab_tokenize
return mecab_tokenize(text)
from wordfreq.japanese import mecab_tokenize
tokens = mecab_tokenize(text)
return [token.casefold() for token in tokens if TOKEN_RE.match(token)]
if lang == 'zh':
global jieba_tokenize
if jieba_tokenize is None:
from wordfreq.chinese import jieba_tokenize
tokens = jieba_tokenize(text)
return [token.casefold() for token in tokens if TOKEN_RE.match(token)]
if lang == 'tr':
return turkish_tokenize(text)

Binary file not shown.

Image changed (1.9 MiB before, 1.9 MiB after)

View File

@ -32,10 +32,15 @@ rule wiki2text
command = bunzip2 -c $in | wiki2text > $out
# To tokenize Japanese, we run it through Mecab and take the first column.
# We don't have a plan for tokenizing Chinese yet.
rule tokenize_japanese
command = mecab -b 1048576 < $in | cut -f 1 | grep -v "EOS" > $out
# Process Chinese by converting all Traditional Chinese characters to
# Simplified equivalents -- not because that's a good way to get readable
# text, but because that's how we're going to look them up.
rule simplify_chinese
command = python -m wordfreq_builder.cli.simplify_chinese < $in > $out
# Tokenizing text from Twitter requires us to language-detect and tokenize
# in the same step.
rule tokenize_twitter
@ -62,6 +67,13 @@ rule convert_opensubtitles
rule convert_subtlex
command = cut -f $textcol,$freqcol $in | tail -n +$startrow | ftfy | tr ' ",' ', ' | grep -v 'â,' > $out
rule convert_jieba
command = cut -d ' ' -f 1,2 $in | grep -v '[,"]' | tr ' ' ',' > $out
rule counts_to_jieba
command = python -m wordfreq_builder.cli.counts_to_jieba $in $out
# Convert and clean up the Google Books Syntactic N-grams data. Concatenate all
# the input files, keep only the single words and their counts, and only keep
# lines with counts of 100 or more.
@ -77,13 +89,13 @@ rule count
command = python -m wordfreq_builder.cli.count_tokens $in $out
rule merge
command = python -m wordfreq_builder.cli.merge_freqs -o $out -c $cutoff $in
command = python -m wordfreq_builder.cli.merge_freqs -o $out -c $cutoff -l $lang $in
rule merge_counts
command = python -m wordfreq_builder.cli.merge_counts -o $out $in
rule freqs2cB
command = python -m wordfreq_builder.cli.freqs_to_cB $lang $in $out
command = python -m wordfreq_builder.cli.freqs_to_cB $in $out
rule cat
command = cat $in > $out

View File

@ -0,0 +1,15 @@
from wordfreq_builder.word_counts import read_values, write_jieba
import argparse
def handle_counts(filename_in, filename_out):
freqs, total = read_values(filename_in, cutoff=1e-6)
write_jieba(freqs, filename_out)
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('filename_in', help='name of input wordlist')
parser.add_argument('filename_out', help='name of output Jieba-compatible wordlist')
args = parser.parse_args()
handle_counts(args.filename_in, args.filename_out)

View File

@ -4,8 +4,7 @@ import argparse
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('language', help='language of the input file')
parser.add_argument('filename_in', help='name of input file containing tokens')
parser.add_argument('filename_out', help='name of output file')
args = parser.parse_args()
freqs_to_cBpack(args.filename_in, args.filename_out, lang=args.language)
freqs_to_cBpack(args.filename_in, args.filename_out)

View File

@ -2,10 +2,16 @@ from wordfreq_builder.word_counts import read_freqs, merge_freqs, write_wordlist
import argparse
def merge_lists(input_names, output_name, cutoff):
def merge_lists(input_names, output_name, cutoff, lang):
freq_dicts = []
# Don't use Chinese tokenization while building wordlists, as that would
# create a circular dependency.
if lang == 'zh':
lang = None
for input_name in input_names:
freq_dicts.append(read_freqs(input_name, cutoff=cutoff))
freq_dicts.append(read_freqs(input_name, cutoff=cutoff, lang=lang))
merged = merge_freqs(freq_dicts)
write_wordlist(merged, output_name)
@ -14,7 +20,8 @@ if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('-o', '--output', help='filename to write the output to', default='combined-freqs.csv')
parser.add_argument('-c', '--cutoff', type=int, help='stop after seeing a count below this', default=2)
parser.add_argument('-l', '--language', help='language code for which language the words are in', default=None)
parser.add_argument('inputs', help='names of input files to merge', nargs='+')
args = parser.parse_args()
merge_lists(args.inputs, args.output, args.cutoff)
merge_lists(args.inputs, args.output, args.cutoff, args.language)

View File

@ -0,0 +1,11 @@
from wordfreq.chinese import simplify_chinese
import sys
def main():
for line in sys.stdin:
sys.stdout.write(simplify_chinese(line))
if __name__ == '__main__':
main()

View File

@ -1,35 +1,34 @@
import os
CONFIG = {
'version': '1.0b',
# data_dir is a relative or absolute path to where the wordlist data
# is stored
'data_dir': 'data',
'sources': {
# A list of language codes (possibly un-standardized) that we'll
# look up in filenames for these various data sources.
# A list of language codes that we'll look up in filenames for these
# various data sources.
#
# Consider adding:
# 'th' when we get tokenization for it
# 'hi' when we stop messing up its tokenization
# 'tl' because it's probably ready right now
# 'pl' because we have 3 sources for it
# 'tl' with one more data source
'twitter': [
'ar', 'de', 'el', 'en', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
'pt', 'ru', 'tr'
'pl', 'pt', 'ru', 'sv', 'tr'
],
'wikipedia': [
'ar', 'de', 'en', 'el', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
'pt', 'ru', 'tr'
'pl', 'pt', 'ru', 'sv', 'tr'
],
'opensubtitles': [
# This list includes languages where the most common word in
# OpenSubtitles appears at least 5000 times. However, we exclude
# German, where SUBTLEX has done better processing of the same data.
# languages where SUBTLEX has apparently done a better job,
# specifically German and Chinese.
'ar', 'bg', 'bs', 'ca', 'cs', 'da', 'el', 'en', 'es', 'et',
'fa', 'fi', 'fr', 'he', 'hr', 'hu', 'id', 'is', 'it', 'lt', 'lv',
'mk', 'ms', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sq',
'sr', 'sv', 'tr', 'uk', 'zh'
'sr', 'sv', 'tr', 'uk'
],
'leeds': [
'ar', 'de', 'el', 'en', 'es', 'fr', 'it', 'ja', 'pt', 'ru', 'zh'
@ -41,6 +40,7 @@ CONFIG = {
],
'subtlex-en': ['en'],
'subtlex-other': ['de', 'nl', 'zh'],
'jieba': ['zh']
},
# Subtlex languages that need to be pre-processed
'wordlist_paths': {
@ -51,9 +51,11 @@ CONFIG = {
'google-books': 'generated/google-books/google_books_{lang}.{ext}',
'subtlex-en': 'generated/subtlex/subtlex_{lang}.{ext}',
'subtlex-other': 'generated/subtlex/subtlex_{lang}.{ext}',
'jieba': 'generated/jieba/jieba_{lang}.{ext}',
'combined': 'generated/combined/combined_{lang}.{ext}',
'combined-dist': 'dist/combined_{lang}.{ext}',
'twitter-dist': 'dist/twitter_{lang}.{ext}'
'twitter-dist': 'dist/twitter_{lang}.{ext}',
'jieba-dist': 'dist/jieba_{lang}.{ext}'
},
'min_sources': 2
}

View File

@ -3,6 +3,7 @@ from wordfreq_builder.config import (
)
import sys
import pathlib
import itertools
HEADER = """# This file is automatically generated. Do not edit it.
# You can change its behavior by editing wordfreq_builder/ninja.py,
@ -45,51 +46,43 @@ def make_ninja_deps(rules_filename, out=sys.stdout):
# The first dependency is to make sure the build file is up to date.
add_dep(lines, 'build_deps', 'rules.ninja', 'build.ninja',
extra='wordfreq_builder/ninja.py')
lines.extend(
lines.extend(itertools.chain(
twitter_deps(
data_filename('raw-input/twitter/all-2014.txt'),
slice_prefix=data_filename('slices/twitter/tweets-2014'),
combined_prefix=data_filename('generated/twitter/tweets-2014'),
slices=40,
languages=CONFIG['sources']['twitter']
)
)
lines.extend(
),
wikipedia_deps(
data_filename('raw-input/wikipedia'),
CONFIG['sources']['wikipedia']
)
)
lines.extend(
),
google_books_deps(
data_filename('raw-input/google-books')
)
)
lines.extend(
),
leeds_deps(
data_filename('source-lists/leeds'),
CONFIG['sources']['leeds']
)
)
lines.extend(
),
opensubtitles_deps(
data_filename('source-lists/opensubtitles'),
CONFIG['sources']['opensubtitles']
)
)
lines.extend(
),
subtlex_en_deps(
data_filename('source-lists/subtlex'),
CONFIG['sources']['subtlex-en']
)
)
lines.extend(
),
subtlex_other_deps(
data_filename('source-lists/subtlex'),
CONFIG['sources']['subtlex-other']
)
)
lines.extend(combine_lists(all_languages()))
),
jieba_deps(
data_filename('source-lists/jieba'),
CONFIG['sources']['jieba']
),
combine_lists(all_languages())
))
print('\n'.join(lines), file=out)
@ -189,8 +182,14 @@ def leeds_deps(dirname_in, languages):
input_file = '{prefix}/internet-{lang}-forms.num'.format(
prefix=dirname_in, lang=language
)
if language == 'zh':
step2_file = wordlist_filename('leeds', 'zh-Hans', 'converted.txt')
add_dep(lines, 'simplify_chinese', input_file, step2_file)
else:
step2_file = input_file
reformatted_file = wordlist_filename('leeds', language, 'counts.txt')
add_dep(lines, 'convert_leeds', input_file, reformatted_file)
add_dep(lines, 'convert_leeds', step2_file, reformatted_file)
return lines
@ -201,14 +200,38 @@ def opensubtitles_deps(dirname_in, languages):
input_file = '{prefix}/{lang}.txt'.format(
prefix=dirname_in, lang=language
)
if language == 'zh':
step2_file = wordlist_filename('opensubtitles', 'zh-Hans', 'converted.txt')
add_dep(lines, 'simplify_chinese', input_file, step2_file)
else:
step2_file = input_file
reformatted_file = wordlist_filename(
'opensubtitles', language, 'counts.txt'
)
add_dep(lines, 'convert_opensubtitles', input_file, reformatted_file)
add_dep(lines, 'convert_opensubtitles', step2_file, reformatted_file)
return lines
def jieba_deps(dirname_in, languages):
lines = []
# Because there's Chinese-specific handling here, the valid options for
# 'languages' are [] and ['zh']. Make sure it's one of those.
if not languages:
return lines
assert languages == ['zh']
input_file = '{prefix}/dict.txt.big'.format(prefix=dirname_in)
transformed_file = wordlist_filename(
'jieba', 'zh-Hans', 'converted.txt'
)
reformatted_file = wordlist_filename(
'jieba', 'zh', 'counts.txt'
)
add_dep(lines, 'simplify_chinese', input_file, transformed_file)
add_dep(lines, 'convert_jieba', transformed_file, reformatted_file)
return lines
# Which columns of the SUBTLEX data files do the word and its frequency appear
# in?
SUBTLEX_COLUMN_MAP = {
@ -222,6 +245,9 @@ SUBTLEX_COLUMN_MAP = {
def subtlex_en_deps(dirname_in, languages):
lines = []
# Either subtlex_en is turned off, or it's just in English
if not languages:
return lines
assert languages == ['en']
regions = ['en-US', 'en-GB']
processed_files = []
@ -253,10 +279,16 @@ def subtlex_other_deps(dirname_in, languages):
output_file = wordlist_filename('subtlex-other', language, 'counts.txt')
textcol, freqcol = SUBTLEX_COLUMN_MAP[language]
if language == 'zh':
step2_file = wordlist_filename('subtlex-other', 'zh-Hans', 'converted.txt')
add_dep(lines, 'simplify_chinese', input_file, step2_file)
else:
step2_file = input_file
# Skip one header line by setting 'startrow' to 2 (because tail is 1-based).
# I hope we don't need to configure this by language anymore.
add_dep(
lines, 'convert_subtlex', input_file, processed_file,
lines, 'convert_subtlex', step2_file, processed_file,
params={'textcol': textcol, 'freqcol': freqcol, 'startrow': 2}
)
add_dep(
@ -276,10 +308,11 @@ def combine_lists(languages):
output_file = wordlist_filename('combined', language)
add_dep(lines, 'merge', input_files, output_file,
extra='wordfreq_builder/word_counts.py',
params={'cutoff': 2})
params={'cutoff': 2, 'lang': language})
output_cBpack = wordlist_filename(
'combined-dist', language, 'msgpack.gz')
'combined-dist', language, 'msgpack.gz'
)
add_dep(lines, 'freqs2cB', output_file, output_cBpack,
extra='wordfreq_builder/word_counts.py',
params={'lang': language})
@ -297,6 +330,12 @@ def combine_lists(languages):
lines.append('default {}'.format(output_cBpack))
# Write a Jieba-compatible frequency file for Chinese tokenization
chinese_combined = wordlist_filename('combined', 'zh')
jieba_output = wordlist_filename('jieba-dist', 'zh')
add_dep(lines, 'counts_to_jieba', chinese_combined, jieba_output,
extra=['wordfreq_builder/word_counts.py', 'wordfreq_builder/cli/counts_to_jieba.py'])
lines.append('default {}'.format(jieba_output))
return lines

View File

@ -32,6 +32,12 @@ def cld2_surface_tokenizer(text):
text = TWITTER_HANDLE_RE.sub('', text)
text = TCO_RE.sub('', text)
lang = cld2_detect_language(text)
# Don't allow tokenization in Chinese when language-detecting, because
# the Chinese tokenizer may not be built yet
if lang == 'zh':
lang = 'en'
tokens = tokenize(text, lang)
return lang, tokens

View File

@ -12,6 +12,7 @@ import regex
# Match common cases of URLs: the schema http:// or https:// followed by
# non-whitespace characters.
URL_RE = regex.compile(r'https?://(?:\S)+')
HAN_RE = regex.compile(r'[\p{Script=Han}]+')
def count_tokens(filename):
@ -42,8 +43,8 @@ def read_values(filename, cutoff=0, lang=None):
If `cutoff` is greater than 0, the csv file must be sorted by value
in descending order.
If lang is given, it will apply language specific preprocessing
operations.
If `lang` is given, it will apply language-specific tokenization to the
words that it reads.
"""
values = defaultdict(float)
total = 0.
@ -79,10 +80,13 @@ def read_freqs(filename, cutoff=0, lang=None):
for word in values:
values[word] /= total
if lang == 'en':
values = correct_apostrophe_trimming(values)
return values
def freqs_to_cBpack(in_filename, out_filename, cutoff=-600, lang=None):
def freqs_to_cBpack(in_filename, out_filename, cutoff=-600):
"""
Convert a csv file of words and their frequencies to a file in the
idiosyncratic 'cBpack' format.
@ -93,7 +97,7 @@ def freqs_to_cBpack(in_filename, out_filename, cutoff=-600, lang=None):
This cutoff should not be stacked with a cutoff in `read_freqs`; doing
so would skew the resulting frequencies.
"""
freqs = read_freqs(in_filename, cutoff=0, lang=lang)
freqs = read_freqs(in_filename, cutoff=0, lang=None)
cBpack = []
for token, freq in freqs.items():
cB = round(math.log10(freq) * 100)
@ -162,3 +166,65 @@ def write_wordlist(freqs, filename, cutoff=1e-8):
break
if not ('"' in word or ',' in word):
writer.writerow([word, str(freq)])
def write_jieba(freqs, filename):
"""
Write a dictionary of frequencies in a format that can be used for Jieba
tokenization of Chinese.
"""
with open(filename, 'w', encoding='utf-8', newline='\n') as outfile:
items = sorted(freqs.items(), key=lambda item: (-item[1], item[0]))
for word, freq in items:
if HAN_RE.search(word):
# Only store this word as a token if it contains at least one
# Han character.
fake_count = round(freq * 1e9)
print('%s %d' % (word, fake_count), file=outfile)
# APOSTROPHE_TRIMMED_PROB represents the probability that this word has had
# "'t" removed from it, based on counts from Twitter, for which we have
# accurate token counts from our own tokenizer.
APOSTROPHE_TRIMMED_PROB = {
'don': 0.99,
'didn': 1.,
'can': 0.35,
'won': 0.74,
'isn': 1.,
'wasn': 1.,
'wouldn': 1.,
'doesn': 1.,
'couldn': 1.,
'ain': 0.99,
'aren': 1.,
'shouldn': 1.,
'haven': 0.96,
'weren': 1.,
'hadn': 1.,
'hasn': 1.,
'mustn': 1.,
'needn': 1.,
}
def correct_apostrophe_trimming(freqs):
"""
If what we got was an English wordlist that has been tokenized with
apostrophes as token boundaries, as indicated by the frequencies of the
words "wouldn" and "couldn", then correct the spurious tokens we get by
adding "'t" in about the proportion we expect to see in the wordlist.
We could also adjust the frequency of "t", but then we would be favoring
the token "s" over it, as "'s" leaves behind no indication when it's been
removed.
"""
if (freqs.get('wouldn', 0) > 1e-6 and freqs.get('couldn', 0) > 1e-6):
print("Applying apostrophe trimming")
for trim_word, trim_prob in APOSTROPHE_TRIMMED_PROB.items():
if trim_word in freqs:
freq = freqs[trim_word]
freqs[trim_word] = freq * (1 - trim_prob)
freqs[trim_word + "'t"] = freq * trim_prob
return freqs
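To make the effect concrete, here is a small worked example with made-up frequencies (not taken from any real wordlist):

from wordfreq_builder.word_counts import correct_apostrophe_trimming

freqs = {'wouldn': 2e-5, 'couldn': 1e-5, 'don': 1e-3}
fixed = correct_apostrophe_trimming(freqs)
# 'wouldn' has trim probability 1.0, so all of its mass moves to "wouldn't":
#   fixed['wouldn'] == 0.0, fixed["wouldn't"] == 2e-5
# 'don' has trim probability 0.99, so it keeps about 1% of its frequency:
#   fixed['don'] is about 1e-5, fixed["don't"] is about 9.9e-4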