mirror of https://github.com/rspeer/wordfreq.git (synced 2024-12-23 09:21:37 +00:00)

Merge pull request #27 from LuminosoInsight/chinese-and-more

Improve Chinese, Greek, English; add Turkish, Polish, Swedish

Former-commit-id: 710eaabbe1
This commit is contained in: commit bb4653f16f
95  README.md

@@ -26,7 +26,7 @@ install them on Ubuntu:

## Usage

wordfreq provides access to estimates of the frequency with which a word is
used, in 16 languages (see *Supported languages* below). It loads
used, in 18 languages (see *Supported languages* below). It loads
efficiently-packed data structures that contain all words that appear at least
once per million words.
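A minimal sanity-check sketch of the lookup described here (the function is the real wordfreq API; the comparison avoids quoting an exact number, since the value depends on the packaged data):

    >>> from wordfreq import word_frequency
    >>> 0 < word_frequency('the', 'en') < 1
    True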
@@ -111,45 +111,49 @@ limiting the selection to words that can be typed in ASCII.

## Sources and supported languages

We compiled word frequencies from five different sources, providing us examples
of word usage on different topics at different levels of formality. The sources
(and the abbreviations we'll use for them) are:
We compiled word frequencies from seven different sources, providing us
examples of word usage on different topics at different levels of formality.
The sources (and the abbreviations we'll use for them) are:

- **GBooks**: Google Books Ngrams 2013
- **LeedsIC**: The Leeds Internet Corpus
- **OpenSub**: OpenSubtitles
- **SUBTLEX**: The SUBTLEX word frequency lists
- **OpenSub**: Data derived from OpenSubtitles but not from SUBTLEX
- **Twitter**: Messages sampled from Twitter's public stream
- **Wikipedia**: The full text of Wikipedia in 2015
- **Wpedia**: The full text of Wikipedia in 2015
- **Other**: We get additional English frequencies from Google Books Syntactic
  Ngrams 2013, and Chinese frequencies from the frequency dictionary that
  comes with the Jieba tokenizer.

The following 14 languages are well-supported, with reasonable tokenization and
The following 17 languages are well-supported, with reasonable tokenization and
at least 3 different sources of word frequencies:

    Language    Code  GBooks  SUBTLEX  LeedsIC  OpenSub  Twitter  Wikipedia
    ──────────────────┼──────────────────────────────────────────────────
    Arabic      ar    │ -       -        Yes      Yes      Yes      Yes
    German      de    │ -       Yes      Yes      -        Yes[1]   Yes
    Greek       el    │ -       -        Yes      Yes      Yes      Yes
    English     en    │ Yes     Yes      Yes      Yes      Yes      Yes
    Spanish     es    │ -       -        Yes      Yes      Yes      Yes
    French      fr    │ -       -        Yes      Yes      Yes      Yes
    Indonesian  id    │ -       -        -        Yes      Yes      Yes
    Italian     it    │ -       -        Yes      Yes      Yes      Yes
    Japanese    ja    │ -       -        Yes      -        Yes      Yes
    Malay       ms    │ -       -        -        Yes      Yes      Yes
    Dutch       nl    │ -       Yes      -        Yes      Yes      Yes
    Portuguese  pt    │ -       -        Yes      Yes      Yes      Yes
    Russian     ru    │ -       -        Yes      Yes      Yes      Yes
    Turkish     tr    │ -       -        -        Yes      Yes      Yes

    Language    Code  SUBTLEX  OpenSub  LeedsIC  Twitter  Wpedia  Other
    ──────────────────┼─────────────────────────────────────────────────────
    Arabic      ar    │ -        Yes      Yes      Yes      Yes     -
    German      de    │ Yes      -        Yes      Yes[1]   Yes     -
    Greek       el    │ -        Yes      Yes      Yes      Yes     -
    English     en    │ Yes      Yes      Yes      Yes      Yes     Google Books
    Spanish     es    │ -        Yes      Yes      Yes      Yes     -
    French      fr    │ -        Yes      Yes      Yes      Yes     -
    Indonesian  id    │ -        Yes      -        Yes      Yes     -
    Italian     it    │ -        Yes      Yes      Yes      Yes     -
    Japanese    ja    │ -        -        Yes      Yes      Yes     -
    Malay       ms    │ -        Yes      -        Yes      Yes     -
    Dutch       nl    │ Yes      Yes      -        Yes      Yes     -
    Polish      pl    │ -        Yes      -        Yes      Yes     -
    Portuguese  pt    │ -        Yes      Yes      Yes      Yes     -
    Russian     ru    │ -        Yes      Yes      Yes      Yes     -
    Swedish     sv    │ -        Yes      -        Yes      Yes     -
    Turkish     tr    │ -        Yes      -        Yes      Yes     -
    Chinese     zh    │ Yes      -        Yes      -        -       Jieba

These languages are only marginally supported so far. We have too few data
sources so far in Korean (feel free to suggest some), and we are lacking
tokenization support for Chinese.

    Language    Code  GBooks  SUBTLEX  LeedsIC  OpenSub  Twitter  Wikipedia
    ──────────────────┼──────────────────────────────────────────────────
    Korean      ko    │ -       -        -        -        Yes      Yes
    Chinese     zh    │ -       Yes      Yes      Yes      -        -

Additionally, Korean is marginally supported. You can look up frequencies in
it, but we have too few data sources for it so far:

    Language    Code  SUBTLEX  OpenSub  LeedsIC  Twitter  Wpedia
    ──────────────────┼───────────────────────────────────────
    Korean      ko    │ -        -        -        Yes      Yes

[1] We've counted the frequencies from tweets in German, such as they are, but
you should be aware that German is not a frequently-used language on Twitter.

@@ -170,7 +174,8 @@ There are language-specific exceptions:

- In Japanese, instead of using the regex library, it uses the external library
  `mecab-python3`. This is an optional dependency of wordfreq, and compiling
  it requires the `libmecab-dev` system package to be installed.
- It does not yet attempt to tokenize Chinese ideograms.
- In Chinese, it uses the external Python library `jieba`, another optional
  dependency.

[uax29]: http://unicode.org/reports/tr29/

@@ -182,10 +187,14 @@ also try to deal gracefully when you query it with texts that actually break
into multiple tokens:

    >>> word_frequency('New York', 'en')
    0.0002632772081925718
    0.0002315934248950231
    >>> word_frequency('北京地铁', 'zh')   # "Beijing Subway"
    3.2187603965715087e-06

The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be.
to provide an estimate of what their combined frequency would be. In languages
written without spaces, there is also a penalty to the word frequency for each
word break that must be inferred.
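As a rough sketch of that combination (the factor of 10 per inferred break comes from INFERRED_SPACE_FACTOR in the wordfreq/__init__.py change later in this diff; this is the same arithmetic, not the library's exact code):

    def combined_frequency(token_freqs, inferred_breaks=0, penalty=10.0):
        # Half-harmonic mean: the reciprocal of the sum of reciprocals.
        freq = 1.0 / sum(1.0 / f for f in token_freqs)
        # Divide by the penalty once per word break that had to be inferred.
        return freq / penalty ** inferred_breaks

    # e.g. combined_frequency([f1, f2], inferred_breaks=1) for a two-token
    # Chinese lookup; inferred_breaks stays 0 for space-separated languages.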
This implicitly assumes that you're asking about words that frequently appear
together. It's not multiplying the frequencies, because that would assume they

@@ -223,14 +232,14 @@ sources:

- Wikipedia, the free encyclopedia (http://www.wikipedia.org)

It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK, and
SUBTLEX-CH, created by Marc Brysbaert et al. and available at
It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK,
SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al.
(see citations below) and available at
http://crr.ugent.be/programs-data/subtitle-frequencies.

I (Robyn Speer) have
obtained permission by e-mail from Marc Brysbaert to distribute these wordlists
in wordfreq, to be used for any purpose, not just for academic use, under these
conditions:
I (Robyn Speer) have obtained permission by e-mail from Marc Brysbaert to
distribute these wordlists in wordfreq, to be used for any purpose, not just
for academic use, under these conditions:

- Wordfreq and code derived from it must credit the SUBTLEX authors.
- It must remain clear that SUBTLEX is freely available data.

@@ -254,6 +263,11 @@ Twitter; it does not display or republish any Twitter content.
  (2015). The word frequency effect. Experimental Psychology.
  http://econtent.hogrefe.com/doi/abs/10.1027/1618-3169/a000123?journalCode=zea

- Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A.
  (2011). The word frequency effect: A review of recent developments and
  implications for the choice of frequency estimates in German. Experimental
  Psychology, 58, 412-424.

- Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character
  frequencies based on film subtitles. PLoS One, 5(6), e10729.
  http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729

@@ -277,4 +291,3 @@ Twitter; it does not display or republish any Twitter content.
  SUBTLEX-UK: A new and improved word frequency database for British English.
  The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.
  http://www.tandfonline.com/doi/pdf/10.1080/17470218.2013.850521
50  scripts/make_chinese_mapping.py  Normal file

@@ -0,0 +1,50 @@
"""
Generate a msgpack file, _chinese_mapping.msgpack.gz, that maps Traditional
Chinese characters to their Simplified Chinese equivalents.

This is meant to be a normalization of text, somewhat like case-folding -- not
an actual translator, a task for which this method would be unsuitable. We
store word frequencies using Simplified Chinese characters so that, in the
large number of cases where a Traditional Chinese word has an obvious
Simplified Chinese mapping, we can get a frequency for it that's the same in
Simplified and Traditional Chinese.

Generating this mapping requires the external Chinese conversion tool OpenCC.
"""
import unicodedata
import itertools
import os
import msgpack
import gzip


def make_hanzi_table(filename):
    with open(filename, 'w', encoding='utf-8') as out:
        for codept in itertools.chain(range(0x3400, 0xa000), range(0xf900, 0xfb00), range(0x20000, 0x30000)):
            char = chr(codept)
            if unicodedata.category(char) != 'Cn':
                print('%5X\t%s' % (codept, char), file=out)


def make_hanzi_converter(table_in, msgpack_out):
    table = {}
    with open(table_in, encoding='utf-8') as infile:
        for line in infile:
            hexcode, char = line.rstrip('\n').split('\t')
            codept = int(hexcode, 16)
            assert len(char) == 1
            if chr(codept) != char:
                table[codept] = char
    with gzip.open(msgpack_out, 'wb') as outfile:
        msgpack.dump(table, outfile, encoding='utf-8')


def build():
    make_hanzi_table('/tmp/han_in.txt')
    os.system('opencc -c zht2zhs.ini < /tmp/han_in.txt > /tmp/han_out.txt')
    make_hanzi_converter('/tmp/han_out.txt', '_chinese_mapping.msgpack.gz')


if __name__ == '__main__':
    build()
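To sanity-check the generated table, a sketch like the following could be used (it assumes the same older msgpack-python `encoding` argument that this script and wordfreq/chinese.py use, and the output file name written by build() above; the expected result relies on the standard Traditional-to-Simplified mapping of 漢語 to 汉语):

    import gzip
    import msgpack

    # Load the {codepoint: simplified character} table produced above.
    with gzip.open('_chinese_mapping.msgpack.gz', 'rb') as f:
        mapping = msgpack.load(f, encoding='utf-8')

    print('漢語'.translate(mapping))   # expected: 汉语

This is essentially how `simplify_chinese` in wordfreq/chinese.py (later in this diff) applies the mapping with `str.translate`.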
9  setup.py

@@ -33,7 +33,7 @@ if sys.version_info < (3, 4):

setup(
    name="wordfreq",
    version='1.1',
    version='1.2',
    maintainer='Luminoso Technologies, Inc.',
    maintainer_email='info@luminoso.com',
    url='http://github.com/LuminosoInsight/wordfreq/',
@@ -50,8 +50,11 @@ setup(
    # turn, it depends on libmecab-dev being installed on the system. It's not
    # listed under 'install_requires' because wordfreq should be usable in
    # other languages without it.
    #
    # Similarly, jieba is required for Chinese word frequencies.
    extras_require={
        'mecab': 'mecab-python3'
        'mecab': 'mecab-python3',
        'jieba': 'jieba'
    },
    tests_require=['mecab-python3'],
    tests_require=['mecab-python3', 'jieba'],
)
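With both extras declared, the optional tokenizers can presumably be installed as `pip install wordfreq[mecab]` or `pip install wordfreq[jieba]`; the bracketed names are simply the `extras_require` keys above, resolved by standard setuptools/pip behavior.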
@@ -162,8 +162,8 @@ def test_ar():

def test_ideographic_fallback():
    # Try tokenizing Chinese text -- it should remain stuck together.
    eq_(tokenize('中国文字', 'zh'), ['中国文字'])
    # Try tokenizing Chinese text as English -- it should remain stuck together.
    eq_(tokenize('中国文字', 'en'), ['中国文字'])

    # When Japanese is tagged with the wrong language, it will be split
    # at script boundaries.
47  tests/test_chinese.py  Normal file

@@ -0,0 +1,47 @@
from nose.tools import eq_, assert_almost_equal, assert_greater
from wordfreq import tokenize, word_frequency


def test_tokens():
    # Let's test on some Chinese text that has unusual combinations of
    # syllables, because it is about an American vice-president.
    #
    # (He was the Chinese Wikipedia's featured article of the day when I
    # wrote this test.)

    hobart = '加勒特·霍巴特'   # Garret Hobart, or "jiā lè tè huò bā tè".

    # He was the sixth American vice president to die in office.
    fact_simplified = '他是历史上第六位在任期内去世的美国副总统。'
    fact_traditional = '他是歷史上第六位在任期內去世的美國副總統。'

    # His name breaks into five pieces, with the only piece staying together
    # being the one that means 'Bart'. The dot is not included as a token.
    eq_(
        tokenize(hobart, 'zh'),
        ['加', '勒', '特', '霍', '巴特']
    )

    eq_(
        tokenize(fact_simplified, 'zh'),
        [
            # he / is / in history / #6 / counter for people
            '他', '是', '历史上', '第六', '位',
            # during / term of office / in / die
            '在', '任期', '内', '去世',
            # of / U.S. / deputy / president
            '的', '美国', '副', '总统'
        ]
    )

    # You match the same tokens if you look it up in Traditional Chinese.
    eq_(tokenize(fact_simplified, 'zh'), tokenize(fact_traditional, 'zh'))
    assert_greater(word_frequency(fact_traditional, 'zh'), 0)


def test_combination():
    xiexie_freq = word_frequency('谢谢', 'zh')   # "Thanks"
    assert_almost_equal(
        word_frequency('谢谢谢谢', 'zh'),
        xiexie_freq / 20
    )
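The expected divisor of 20 follows from the two mechanisms this commit introduces: '谢谢谢谢' tokenizes as '谢谢' twice, the half-harmonic mean of two equal frequencies f is f / 2, and the single inferred word break divides the result by the INFERRED_SPACE_FACTOR of 10, giving f / 20.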
@@ -15,6 +15,19 @@ logger = logging.getLogger(__name__)
CACHE_SIZE = 100000
DATA_PATH = pathlib.Path(resource_filename('wordfreq', 'data'))

# Chinese and Japanese are written without spaces. In Chinese, in particular,
# we have to infer word boundaries from the frequencies of the words they
# would create. When this happens, we should adjust the resulting frequency
# to avoid creating a bias toward improbable word combinations.
INFERRED_SPACE_LANGUAGES = {'zh'}

# We'll divide the frequency by 10 for each token boundary that was inferred.
# (We determined the factor of 10 empirically by looking at words in the
# Chinese wordlist that weren't common enough to be identified by the
# tokenizer. These words would get split into multiple tokens, and their
# inferred frequency would be on average 9.77 times higher than their actual
# frequency.)
INFERRED_SPACE_FACTOR = 10.0

# simple_tokenize is imported so that other things can import it from here.
# Suppress the pyflakes warning.
@@ -80,10 +93,11 @@ def available_languages(wordlist='combined'):
    """
    available = {}
    for path in DATA_PATH.glob('*.msgpack.gz'):
        list_name = path.name.split('.')[0]
        name, lang = list_name.split('_')
        if name == wordlist:
            available[lang] = str(path)
        if not path.name.startswith('_'):
            list_name = path.name.split('.')[0]
            name, lang = list_name.split('_')
            if name == wordlist:
                available[lang] = str(path)
    return available

@@ -181,7 +195,12 @@ def _word_frequency(word, lang, wordlist, minimum):
            return minimum
        one_over_result += 1.0 / freqs[token]

    return max(1.0 / one_over_result, minimum)
    freq = 1.0 / one_over_result

    if lang in INFERRED_SPACE_LANGUAGES:
        freq /= INFERRED_SPACE_FACTOR ** (len(tokens) - 1)

    return max(freq, minimum)


def word_frequency(word, lang, wordlist='combined', minimum=0.):
    """
20  wordfreq/chinese.py  Normal file

@@ -0,0 +1,20 @@
from pkg_resources import resource_filename
import jieba
import msgpack
import gzip

DICT_FILENAME = resource_filename('wordfreq', 'data/jieba_zh.txt')
SIMP_MAP_FILENAME = resource_filename('wordfreq', 'data/_chinese_mapping.msgpack.gz')
SIMPLIFIED_MAP = msgpack.load(gzip.open(SIMP_MAP_FILENAME), encoding='utf-8')
jieba_tokenizer = None


def simplify_chinese(text):
    return text.translate(SIMPLIFIED_MAP).casefold()


def jieba_tokenize(text):
    global jieba_tokenizer
    if jieba_tokenizer is None:
        jieba_tokenizer = jieba.Tokenizer(dictionary=DICT_FILENAME)
    return jieba_tokenizer.lcut(simplify_chinese(text), HMM=False)
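A quick sketch of how these two helpers behave (the examples are illustrative rather than taken from the test suite, though the second result is the same split that tests/test_chinese.py above relies on):

    >>> simplify_chinese('漢語')
    '汉语'
    >>> jieba_tokenize('谢谢谢谢')
    ['谢谢', '谢谢']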
BIN    wordfreq/data/_chinese_mapping.msgpack.gz  Normal file  (binary file not shown)
BIN    wordfreq/data/combined_pl.msgpack.gz       Normal file  (binary file not shown)
BIN    wordfreq/data/combined_sv.msgpack.gz       Normal file  (binary file not shown)
36124  wordfreq/data/jieba_zh.txt                 Normal file  (file diff suppressed because it is too large)
BIN    wordfreq/data/twitter_pl.msgpack.gz        Normal file  (binary file not shown)
BIN    wordfreq/data/twitter_sv.msgpack.gz        Normal file  (binary file not shown)
(A number of other binary files also changed; binary files not shown.)
@@ -1,5 +1,6 @@
import regex
import unicodedata
from pkg_resources import resource_filename


TOKEN_RE = regex.compile(r"""
@@ -87,6 +88,7 @@ def remove_arabic_marks(text):


mecab_tokenize = None
jieba_tokenize = None
def tokenize(text, lang):
    """
    Tokenize this text in a way that's relatively simple but appropriate for
@@ -115,8 +117,17 @@ def tokenize(text, lang):
    if lang == 'ja':
        global mecab_tokenize
        if mecab_tokenize is None:
            from wordfreq.mecab import mecab_tokenize
        return mecab_tokenize(text)
            from wordfreq.japanese import mecab_tokenize
        tokens = mecab_tokenize(text)
        return [token.casefold() for token in tokens if TOKEN_RE.match(token)]

    if lang == 'zh':
        global jieba_tokenize
        if jieba_tokenize is None:
            from wordfreq.chinese import jieba_tokenize
        tokens = jieba_tokenize(text)
        return [token.casefold() for token in tokens if TOKEN_RE.match(token)]


    if lang == 'tr':
        return turkish_tokenize(text)
Binary image file not shown. (Before: 1.9 MiB, After: 1.9 MiB)
@@ -32,10 +32,15 @@ rule wiki2text
    command = bunzip2 -c $in | wiki2text > $out

# To tokenize Japanese, we run it through Mecab and take the first column.
# We don't have a plan for tokenizing Chinese yet.
rule tokenize_japanese
    command = mecab -b 1048576 < $in | cut -f 1 | grep -v "EOS" > $out

# Process Chinese by converting all Traditional Chinese characters to
# Simplified equivalents -- not because that's a good way to get readable
# text, but because that's how we're going to look them up.
rule simplify_chinese
    command = python -m wordfreq_builder.cli.simplify_chinese < $in > $out

# Tokenizing text from Twitter requires us to language-detect and tokenize
# in the same step.
rule tokenize_twitter
@@ -62,6 +67,13 @@ rule convert_opensubtitles
rule convert_subtlex
    command = cut -f $textcol,$freqcol $in | tail -n +$startrow | ftfy | tr ' ",' ', ' | grep -v 'â,' > $out

rule convert_jieba
    command = cut -d ' ' -f 1,2 $in | grep -v '[,"]' | tr ' ' ',' > $out
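The `convert_jieba` rule takes Jieba's space-separated dictionary lines (word, count, and an optional part-of-speech tag), keeps the first two fields, drops entries containing commas or quotes, and rewrites them as the `word,count` CSV that the rest of the pipeline consumes.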
rule counts_to_jieba
    command = python -m wordfreq_builder.cli.counts_to_jieba $in $out


# Convert and clean up the Google Books Syntactic N-grams data. Concatenate all
# the input files, keep only the single words and their counts, and only keep
# lines with counts of 100 or more.
@@ -77,13 +89,13 @@ rule count
    command = python -m wordfreq_builder.cli.count_tokens $in $out

rule merge
    command = python -m wordfreq_builder.cli.merge_freqs -o $out -c $cutoff $in
    command = python -m wordfreq_builder.cli.merge_freqs -o $out -c $cutoff -l $lang $in

rule merge_counts
    command = python -m wordfreq_builder.cli.merge_counts -o $out $in

rule freqs2cB
    command = python -m wordfreq_builder.cli.freqs_to_cB $lang $in $out
    command = python -m wordfreq_builder.cli.freqs_to_cB $in $out

rule cat
    command = cat $in > $out
15  wordfreq_builder/wordfreq_builder/cli/counts_to_jieba.py  Normal file

@@ -0,0 +1,15 @@
from wordfreq_builder.word_counts import read_values, write_jieba
import argparse


def handle_counts(filename_in, filename_out):
    freqs, total = read_values(filename_in, cutoff=1e-6)
    write_jieba(freqs, filename_out)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('filename_in', help='name of input wordlist')
    parser.add_argument('filename_out', help='name of output Jieba-compatible wordlist')
    args = parser.parse_args()
    handle_counts(args.filename_in, args.filename_out)
@@ -4,8 +4,7 @@ import argparse

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('language', help='language of the input file')
    parser.add_argument('filename_in', help='name of input file containing tokens')
    parser.add_argument('filename_out', help='name of output file')
    args = parser.parse_args()
    freqs_to_cBpack(args.filename_in, args.filename_out, lang=args.language)
    freqs_to_cBpack(args.filename_in, args.filename_out)
@@ -2,10 +2,16 @@ from wordfreq_builder.word_counts import read_freqs, merge_freqs, write_wordlist
import argparse


def merge_lists(input_names, output_name, cutoff):
def merge_lists(input_names, output_name, cutoff, lang):
    freq_dicts = []

    # Don't use Chinese tokenization while building wordlists, as that would
    # create a circular dependency.
    if lang == 'zh':
        lang = None

    for input_name in input_names:
        freq_dicts.append(read_freqs(input_name, cutoff=cutoff))
        freq_dicts.append(read_freqs(input_name, cutoff=cutoff, lang=lang))
    merged = merge_freqs(freq_dicts)
    write_wordlist(merged, output_name)

@@ -14,7 +20,8 @@ if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-o', '--output', help='filename to write the output to', default='combined-freqs.csv')
    parser.add_argument('-c', '--cutoff', type=int, help='stop after seeing a count below this', default=2)
    parser.add_argument('-l', '--language', help='language code for which language the words are in', default=None)
    parser.add_argument('inputs', help='names of input files to merge', nargs='+')
    args = parser.parse_args()
    merge_lists(args.inputs, args.output, args.cutoff)
    merge_lists(args.inputs, args.output, args.cutoff, args.language)
11  wordfreq_builder/wordfreq_builder/cli/simplify_chinese.py  Normal file

@@ -0,0 +1,11 @@
from wordfreq.chinese import simplify_chinese
import sys


def main():
    for line in sys.stdin:
        sys.stdout.write(simplify_chinese(line))


if __name__ == '__main__':
    main()
@@ -1,35 +1,34 @@
import os

CONFIG = {
    'version': '1.0b',
    # data_dir is a relative or absolute path to where the wordlist data
    # is stored
    'data_dir': 'data',
    'sources': {
        # A list of language codes (possibly un-standardized) that we'll
        # look up in filenames for these various data sources.
        # A list of language codes that we'll look up in filenames for these
        # various data sources.
        #
        # Consider adding:
        # 'th' when we get tokenization for it
        # 'hi' when we stop messing up its tokenization
        # 'tl' because it's probably ready right now
        # 'pl' because we have 3 sources for it
        # 'tl' with one more data source
        'twitter': [
            'ar', 'de', 'el', 'en', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
            'pt', 'ru', 'tr'
            'pl', 'pt', 'ru', 'sv', 'tr'
        ],
        'wikipedia': [
            'ar', 'de', 'en', 'el', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
            'pt', 'ru', 'tr'
            'pl', 'pt', 'ru', 'sv', 'tr'
        ],
        'opensubtitles': [
            # This list includes languages where the most common word in
            # OpenSubtitles appears at least 5000 times. However, we exclude
            # German, where SUBTLEX has done better processing of the same data.
            # languages where SUBTLEX has apparently done a better job,
            # specifically German and Chinese.
            'ar', 'bg', 'bs', 'ca', 'cs', 'da', 'el', 'en', 'es', 'et',
            'fa', 'fi', 'fr', 'he', 'hr', 'hu', 'id', 'is', 'it', 'lt', 'lv',
            'mk', 'ms', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sq',
            'sr', 'sv', 'tr', 'uk', 'zh'
            'sr', 'sv', 'tr', 'uk'
        ],
        'leeds': [
            'ar', 'de', 'el', 'en', 'es', 'fr', 'it', 'ja', 'pt', 'ru', 'zh'
@@ -41,6 +40,7 @@ CONFIG = {
        ],
        'subtlex-en': ['en'],
        'subtlex-other': ['de', 'nl', 'zh'],
        'jieba': ['zh']
    },
    # Subtlex languages that need to be pre-processed
    'wordlist_paths': {
@@ -51,9 +51,11 @@ CONFIG = {
        'google-books': 'generated/google-books/google_books_{lang}.{ext}',
        'subtlex-en': 'generated/subtlex/subtlex_{lang}.{ext}',
        'subtlex-other': 'generated/subtlex/subtlex_{lang}.{ext}',
        'jieba': 'generated/jieba/jieba_{lang}.{ext}',
        'combined': 'generated/combined/combined_{lang}.{ext}',
        'combined-dist': 'dist/combined_{lang}.{ext}',
        'twitter-dist': 'dist/twitter_{lang}.{ext}'
        'twitter-dist': 'dist/twitter_{lang}.{ext}',
        'jieba-dist': 'dist/jieba_{lang}.{ext}'
    },
    'min_sources': 2
}
@@ -3,6 +3,7 @@ from wordfreq_builder.config import (
)
import sys
import pathlib
import itertools

HEADER = """# This file is automatically generated. Do not edit it.
# You can change its behavior by editing wordfreq_builder/ninja.py,
@@ -45,51 +46,43 @@ def make_ninja_deps(rules_filename, out=sys.stdout):
    # The first dependency is to make sure the build file is up to date.
    add_dep(lines, 'build_deps', 'rules.ninja', 'build.ninja',
            extra='wordfreq_builder/ninja.py')
    lines.extend(
    lines.extend(itertools.chain(
        twitter_deps(
            data_filename('raw-input/twitter/all-2014.txt'),
            slice_prefix=data_filename('slices/twitter/tweets-2014'),
            combined_prefix=data_filename('generated/twitter/tweets-2014'),
            slices=40,
            languages=CONFIG['sources']['twitter']
        )
    )
    lines.extend(
        ),
        wikipedia_deps(
            data_filename('raw-input/wikipedia'),
            CONFIG['sources']['wikipedia']
        )
    )
    lines.extend(
        ),
        google_books_deps(
            data_filename('raw-input/google-books')
        )
    )
    lines.extend(
        ),
        leeds_deps(
            data_filename('source-lists/leeds'),
            CONFIG['sources']['leeds']
        )
    )
    lines.extend(
        ),
        opensubtitles_deps(
            data_filename('source-lists/opensubtitles'),
            CONFIG['sources']['opensubtitles']
        )
    )
    lines.extend(
        ),
        subtlex_en_deps(
            data_filename('source-lists/subtlex'),
            CONFIG['sources']['subtlex-en']
        )
    )
    lines.extend(
        ),
        subtlex_other_deps(
            data_filename('source-lists/subtlex'),
            CONFIG['sources']['subtlex-other']
        )
    )
    lines.extend(combine_lists(all_languages()))
        ),
        jieba_deps(
            data_filename('source-lists/jieba'),
            CONFIG['sources']['jieba']
        ),
        combine_lists(all_languages())
    ))

    print('\n'.join(lines), file=out)

@@ -189,8 +182,14 @@ def leeds_deps(dirname_in, languages):
        input_file = '{prefix}/internet-{lang}-forms.num'.format(
            prefix=dirname_in, lang=language
        )
        if language == 'zh':
            step2_file = wordlist_filename('leeds', 'zh-Hans', 'converted.txt')
            add_dep(lines, 'simplify_chinese', input_file, step2_file)
        else:
            step2_file = input_file

        reformatted_file = wordlist_filename('leeds', language, 'counts.txt')
        add_dep(lines, 'convert_leeds', input_file, reformatted_file)
        add_dep(lines, 'convert_leeds', step2_file, reformatted_file)

    return lines

@@ -201,14 +200,38 @@ def opensubtitles_deps(dirname_in, languages):
        input_file = '{prefix}/{lang}.txt'.format(
            prefix=dirname_in, lang=language
        )
        if language == 'zh':
            step2_file = wordlist_filename('opensubtitles', 'zh-Hans', 'converted.txt')
            add_dep(lines, 'simplify_chinese', input_file, step2_file)
        else:
            step2_file = input_file
        reformatted_file = wordlist_filename(
            'opensubtitles', language, 'counts.txt'
        )
        add_dep(lines, 'convert_opensubtitles', input_file, reformatted_file)
        add_dep(lines, 'convert_opensubtitles', step2_file, reformatted_file)

    return lines


def jieba_deps(dirname_in, languages):
    lines = []
    # Because there's Chinese-specific handling here, the valid options for
    # 'languages' are [] and ['zh']. Make sure it's one of those.
    if not languages:
        return lines
    assert languages == ['zh']
    input_file = '{prefix}/dict.txt.big'.format(prefix=dirname_in)
    transformed_file = wordlist_filename(
        'jieba', 'zh-Hans', 'converted.txt'
    )
    reformatted_file = wordlist_filename(
        'jieba', 'zh', 'counts.txt'
    )
    add_dep(lines, 'simplify_chinese', input_file, transformed_file)
    add_dep(lines, 'convert_jieba', transformed_file, reformatted_file)
    return lines


# Which columns of the SUBTLEX data files do the word and its frequency appear
# in?
SUBTLEX_COLUMN_MAP = {
@@ -222,6 +245,9 @@ SUBTLEX_COLUMN_MAP = {

def subtlex_en_deps(dirname_in, languages):
    lines = []
    # Either subtlex_en is turned off, or it's just in English
    if not languages:
        return lines
    assert languages == ['en']
    regions = ['en-US', 'en-GB']
    processed_files = []
@@ -253,10 +279,16 @@ def subtlex_other_deps(dirname_in, languages):
        output_file = wordlist_filename('subtlex-other', language, 'counts.txt')
        textcol, freqcol = SUBTLEX_COLUMN_MAP[language]

        if language == 'zh':
            step2_file = wordlist_filename('subtlex-other', 'zh-Hans', 'converted.txt')
            add_dep(lines, 'simplify_chinese', input_file, step2_file)
        else:
            step2_file = input_file

        # Skip one header line by setting 'startrow' to 2 (because tail is 1-based).
        # I hope we don't need to configure this by language anymore.
        add_dep(
            lines, 'convert_subtlex', input_file, processed_file,
            lines, 'convert_subtlex', step2_file, processed_file,
            params={'textcol': textcol, 'freqcol': freqcol, 'startrow': 2}
        )
        add_dep(
@@ -276,10 +308,11 @@ def combine_lists(languages):
        output_file = wordlist_filename('combined', language)
        add_dep(lines, 'merge', input_files, output_file,
                extra='wordfreq_builder/word_counts.py',
                params={'cutoff': 2})
                params={'cutoff': 2, 'lang': language})

        output_cBpack = wordlist_filename(
            'combined-dist', language, 'msgpack.gz')
            'combined-dist', language, 'msgpack.gz'
        )
        add_dep(lines, 'freqs2cB', output_file, output_cBpack,
                extra='wordfreq_builder/word_counts.py',
                params={'lang': language})
@@ -297,6 +330,12 @@ def combine_lists(languages):

        lines.append('default {}'.format(output_cBpack))

    # Write a Jieba-compatible frequency file for Chinese tokenization
    chinese_combined = wordlist_filename('combined', 'zh')
    jieba_output = wordlist_filename('jieba-dist', 'zh')
    add_dep(lines, 'counts_to_jieba', chinese_combined, jieba_output,
            extra=['wordfreq_builder/word_counts.py', 'wordfreq_builder/cli/counts_to_jieba.py'])
    lines.append('default {}'.format(jieba_output))
    return lines
@@ -32,6 +32,12 @@ def cld2_surface_tokenizer(text):
    text = TWITTER_HANDLE_RE.sub('', text)
    text = TCO_RE.sub('', text)
    lang = cld2_detect_language(text)

    # Don't allow tokenization in Chinese when language-detecting, because
    # the Chinese tokenizer may not be built yet
    if lang == 'zh':
        lang = 'en'

    tokens = tokenize(text, lang)
    return lang, tokens
@@ -12,6 +12,7 @@ import regex
# Match common cases of URLs: the schema http:// or https:// followed by
# non-whitespace characters.
URL_RE = regex.compile(r'https?://(?:\S)+')
HAN_RE = regex.compile(r'[\p{Script=Han}]+')


def count_tokens(filename):
@@ -42,8 +43,8 @@ def read_values(filename, cutoff=0, lang=None):
    If `cutoff` is greater than 0, the csv file must be sorted by value
    in descending order.

    If lang is given, it will apply language specific preprocessing
    operations.
    If `lang` is given, it will apply language-specific tokenization to the
    words that it reads.
    """
    values = defaultdict(float)
    total = 0.
@@ -79,10 +80,13 @@ def read_freqs(filename, cutoff=0, lang=None):
    for word in values:
        values[word] /= total

    if lang == 'en':
        values = correct_apostrophe_trimming(values)

    return values


def freqs_to_cBpack(in_filename, out_filename, cutoff=-600, lang=None):
def freqs_to_cBpack(in_filename, out_filename, cutoff=-600):
    """
    Convert a csv file of words and their frequencies to a file in the
    idiosyncratic 'cBpack' format.
@@ -93,7 +97,7 @@ def freqs_to_cBpack(in_filename, out_filename, cutoff=-600, lang=None):
    This cutoff should not be stacked with a cutoff in `read_freqs`; doing
    so would skew the resulting frequencies.
    """
    freqs = read_freqs(in_filename, cutoff=0, lang=lang)
    freqs = read_freqs(in_filename, cutoff=0, lang=None)
    cBpack = []
    for token, freq in freqs.items():
        cB = round(math.log10(freq) * 100)
@@ -162,3 +166,65 @@ def write_wordlist(freqs, filename, cutoff=1e-8):
            break
        if not ('"' in word or ',' in word):
            writer.writerow([word, str(freq)])


def write_jieba(freqs, filename):
    """
    Write a dictionary of frequencies in a format that can be used for Jieba
    tokenization of Chinese.
    """
    with open(filename, 'w', encoding='utf-8', newline='\n') as outfile:
        items = sorted(freqs.items(), key=lambda item: (-item[1], item[0]))
        for word, freq in items:
            if HAN_RE.search(word):
                # Only store this word as a token if it contains at least one
                # Han character.
                fake_count = round(freq * 1e9)
                print('%s %d' % (word, fake_count), file=outfile)
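For a sense of the output format: a word with relative frequency 0.0012 gets a fake count of 1200000 and is written as a line like `谢谢 1200000`, the space-separated word/count layout that Jieba's dictionary loader accepts.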
# APOSTROPHE_TRIMMED_PROB represents the probability that this word has had
# "'t" removed from it, based on counts from Twitter, which we know
# accurate token counts for based on our own tokenizer.

APOSTROPHE_TRIMMED_PROB = {
    'don': 0.99,
    'didn': 1.,
    'can': 0.35,
    'won': 0.74,
    'isn': 1.,
    'wasn': 1.,
    'wouldn': 1.,
    'doesn': 1.,
    'couldn': 1.,
    'ain': 0.99,
    'aren': 1.,
    'shouldn': 1.,
    'haven': 0.96,
    'weren': 1.,
    'hadn': 1.,
    'hasn': 1.,
    'mustn': 1.,
    'needn': 1.,
}


def correct_apostrophe_trimming(freqs):
    """
    If what we got was an English wordlist that has been tokenized with
    apostrophes as token boundaries, as indicated by the frequencies of the
    words "wouldn" and "couldn", then correct the spurious tokens we get by
    adding "'t" in about the proportion we expect to see in the wordlist.

    We could also adjust the frequency of "t", but then we would be favoring
    the token "s" over it, as "'s" leaves behind no indication when it's been
    removed.
    """
    if (freqs.get('wouldn', 0) > 1e-6 and freqs.get('couldn', 0) > 1e-6):
        print("Applying apostrophe trimming")
        for trim_word, trim_prob in APOSTROPHE_TRIMMED_PROB.items():
            if trim_word in freqs:
                freq = freqs[trim_word]
                freqs[trim_word] = freq * (1 - trim_prob)
                freqs[trim_word + "'t"] = freq * trim_prob
    return freqs