Merge pull request #34 from LuminosoInsight/big-list

wordfreq 1.4: some bigger wordlists, better use of language detection

Former-commit-id: e7b34fb655
This commit is contained in:
Andrew Lin 2016-05-11 16:27:51 -04:00
commit 7a55e0ed86
52 changed files with 291 additions and 122 deletions

README.md
View File

@@ -39,11 +39,18 @@ For example:
## Usage
wordfreq provides access to estimates of the frequency with which a word is
used, in 18 languages (see *Supported languages* below). It loads
efficiently-packed data structures that contain all words that appear at least
once per million words.
used, in 18 languages (see *Supported languages* below).
The most useful function is:
It provides three kinds of pre-built wordlists:
- `'combined'` lists, containing words that appear at least once per
million words, averaged across all data sources.
- `'twitter'` lists, containing words that appear at least once per
million words on Twitter alone.
- `'large'` lists, containing words that appear at least once per 100
million words, averaged across all data sources.
The most straightforward function is:
word_frequency(word, lang, wordlist='combined', minimum=0.0)
@@ -64,7 +71,37 @@ frequencies by a million (1e6) to get more readable numbers:
>>> word_frequency('café', 'fr') * 1e6
77.62471166286912
The parameters are:
`zipf_frequency` is a variation on `word_frequency` that aims to return the
word frequency on a human-friendly logarithmic scale. The Zipf scale was
proposed by Marc Brysbaert, who created the SUBTLEX lists. The Zipf frequency
of a word is the base-10 logarithm of the number of times it appears per
billion words. A word with Zipf value 6 appears once per thousand words, for
example, and a word with Zipf value 3 appears once per million words.
Reasonable Zipf values are between 0 and 8, but because of the cutoffs
described above, the minimum Zipf value appearing in these lists is 1.0 for the
'large' wordlists and 3.0 for all others. We use 0 as the default Zipf value
for words that do not appear in the given wordlist, even though a value of 0
would nominally mean one occurrence per billion words.
>>> zipf_frequency('the', 'en')
7.59
>>> zipf_frequency('word', 'en')
5.34
>>> zipf_frequency('frequency', 'en')
4.44
>>> zipf_frequency('zipf', 'en')
0.0
>>> zipf_frequency('zipf', 'en', wordlist='large')
1.42
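The two scales are directly related: a Zipf value is the base-10 logarithm of
the frequency per billion words, rounded to two decimal places to match
wordfreq's internal precision. As a sketch, you can recover it from
`word_frequency` yourself:
>>> import math
>>> round(math.log10(word_frequency('word', 'en') * 1e9), 2)
5.34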
The parameters to `word_frequency` and `zipf_frequency` are:
- `word`: a Unicode string containing the word to look up. Ideally the word
is a single token according to our tokenizer, but if not, there is still
@@ -73,21 +110,18 @@ The parameters are:
- `lang`: the BCP 47 or ISO 639 code of the language to use, such as 'en'.
- `wordlist`: which set of word frequencies to use. Current options are
'combined', which combines up to five different sources, and
'twitter', which returns frequencies observed on Twitter alone.
'combined', 'twitter', and 'large'.
- `minimum`: If the word is not in the list or has a frequency lower than
`minimum`, return `minimum` instead. In some applications, you'll want
to set `minimum=1e-6` to avoid a discontinuity where the list ends, because
a frequency of 1e-6 (1 per million) is the threshold for being included in
the list at all.
`minimum`, return `minimum` instead. You may want to set this to the minimum
value contained in the wordlist, to avoid a discontinuity where the wordlist
ends.
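For example, the cutoffs described above are once per million words (1e-6) for
the 'combined' and 'twitter' lists and once per 100 million words (1e-8) for
the 'large' lists, so matching values of `minimum` would look like this (a
sketch):
word_frequency('owl', 'en', minimum=1e-6)
word_frequency('owl', 'en', wordlist='large', minimum=1e-8)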
Other functions:
`tokenize(text, lang)` splits text in the given language into words, in the same
way that the words in wordfreq's data were counted in the first place. See
*Tokenization*. Tokenizing Japanese requires the optional dependency `mecab-python3`
to be installed.
*Tokenization*.
`top_n_list(lang, n, wordlist='combined')` returns the most common *n* words in
the list, in descending frequency order.
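For example (a sketch; the exact words and their order depend on the packaged
data, so outputs aren't shown):
top_n_list('en', 100)                    # the 100 most common English words
top_n_list('en', 100, wordlist='large')  # the same, from the larger list
tokenize('New York is a place', 'en')    # the tokens wordfreq would count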
@@ -133,6 +167,7 @@ The sources (and the abbreviations we'll use for them) are:
- **OpenSub**: Data derived from OpenSubtitles but not from SUBTLEX
- **Twitter**: Messages sampled from Twitter's public stream
- **Wpedia**: The full text of Wikipedia in 2015
- **Reddit**: The corpus of Reddit comments through May 2015
- **Other**: We get additional English frequencies from Google Books Syntactic
Ngrams 2013, and Chinese frequencies from the frequency dictionary that
comes with the Jieba tokenizer.
@@ -140,33 +175,37 @@ The sources (and the abbreviations we'll use for them) are:
The following 17 languages are well-supported, with reasonable tokenization and
at least 3 different sources of word frequencies:
Language Code SUBTLEX OpenSub LeedsIC Twitter Wpedia Other
Language Code SUBTLEX OpenSub LeedsIC Twitter Wpedia Reddit Other
──────────────────┼─────────────────────────────────────────────────────
Arabic ar │ - Yes Yes Yes Yes -
German de │ Yes - Yes Yes[1] Yes -
Greek el │ - Yes Yes Yes Yes -
English en │ Yes Yes Yes Yes Yes Google Books
Spanish es │ - Yes Yes Yes Yes -
French fr │ - Yes Yes Yes Yes -
Indonesian id │ - Yes - Yes Yes -
Italian it │ - Yes Yes Yes Yes -
Japanese ja │ - - Yes Yes Yes -
Malay ms │ - Yes - Yes Yes -
Dutch nl │ Yes Yes - Yes Yes -
Polish pl │ - Yes - Yes Yes -
Portuguese pt │ - Yes Yes Yes Yes -
Russian ru │ - Yes Yes Yes Yes -
Swedish sv │ - Yes - Yes Yes -
Turkish tr │ - Yes - Yes Yes -
Chinese zh │ Yes - Yes - - Jieba
Arabic ar │ - Yes Yes Yes Yes - -
German de │ Yes - Yes Yes[1] Yes - -
Greek el │ - Yes Yes Yes Yes - -
English en │ Yes Yes Yes Yes Yes Yes Google Books
Spanish es │ - Yes Yes Yes Yes - -
French fr │ - Yes Yes Yes Yes - -
Indonesian id │ - Yes - Yes Yes - -
Italian it │ - Yes Yes Yes Yes - -
Japanese ja │ - - Yes Yes Yes - -
Malay ms │ - Yes - Yes Yes - -
Dutch nl │ Yes Yes - Yes Yes - -
Polish pl │ - Yes - Yes Yes - -
Portuguese pt │ - Yes Yes Yes Yes - -
Russian ru │ - Yes Yes Yes Yes - -
Swedish sv │ - Yes - Yes Yes - -
Turkish tr │ - Yes - Yes Yes - -
Chinese zh │ Yes - Yes - - - Jieba
Additionally, Korean is marginally supported. You can look up frequencies in
it, but we have too few data sources for it so far:
it, but it will be insufficiently tokenized into words, and we have too few
data sources for it so far:
Language Code SUBTLEX OpenSub LeedsIC Twitter Wpedia
──────────────────┼───────────────────────────────────────
Korean ko │ - - - Yes Yes
Language Code SUBTLEX OpenSub LeedsIC Twitter Wpedia Reddit
──────────────────┼───────────────────────────────────────────────
Korean ko │ - - - Yes Yes -
The 'large' wordlists are available in English, German, Spanish, French, and
Portuguese.
[1] We've counted the frequencies from tweets in German, such as they are, but
you should be aware that German is not a frequently-used language on Twitter.
@@ -179,7 +218,8 @@ wordfreq uses the Python package `regex`, which is a more advanced
implementation of regular expressions than the standard library, to
separate text into tokens that can be counted consistently. `regex`
produces tokens that follow the recommendations in [Unicode
Annex #29, Text Segmentation][uax29].
Annex #29, Text Segmentation][uax29], including the optional rule that
splits words between apostrophes and vowels.
There are language-specific exceptions:
@@ -199,10 +239,10 @@ Because tokenization in the real world is far from consistent, wordfreq will
also try to deal gracefully when you query it with texts that actually break
into multiple tokens:
>>> word_frequency('New York', 'en')
0.0002315934248950231
>>> word_frequency('北京地铁', 'zh') # "Beijing Subway"
3.2187603965715087e-06
>>> zipf_frequency('New York', 'en')
5.31
>>> zipf_frequency('北京地铁', 'zh') # "Beijing Subway"
3.51
The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be. In Chinese,
@@ -216,8 +256,8 @@ frequencies, because that would assume they are statistically unrelated. So if
you give it an uncommon combination of tokens, it will hugely over-estimate
their frequency:
>>> word_frequency('owl-flavored', 'en')
1.3557098723512335e-06
>>> zipf_frequency('owl-flavored', 'en')
3.18
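The arithmetic behind both behaviors is easy to see. For two token frequencies
f1 and f2, half of the harmonic mean is (f1 * f2) / (f1 + f2), which stays
within a factor of two of the rarer token's frequency rather than dropping to
the product f1 * f2 that statistical independence would predict. A sketch
(wordfreq's internal implementation may differ in detail):
def half_harmonic_mean(f1, f2):
    # Half of the harmonic mean 2 * f1 * f2 / (f1 + f2).
    return (f1 * f2) / (f1 + f2)
# Two tokens that each occur once per million words combine to an estimated
# once per two million words (5e-07), not the once-per-trillion (1e-12) that
# independence would give:
half_harmonic_mean(1e-6, 1e-6)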
## License

View File

@@ -34,7 +34,7 @@ if sys.version_info < (3, 4):
setup(
name="wordfreq",
version='1.3',
version='1.4',
maintainer='Luminoso Technologies, Inc.',
maintainer_email='info@luminoso.com',
url='http://github.com/LuminosoInsight/wordfreq/',

View File

@@ -8,6 +8,7 @@ import itertools
import pathlib
import random
import logging
import math
logger = logging.getLogger(__name__)
@@ -146,6 +147,42 @@ def cB_to_freq(cB):
return 10 ** (cB / 100)
def cB_to_zipf(cB):
"""
Convert a word frequency from centibels to the Zipf scale
(see `zipf_to_freq`).
The Zipf scale is related to centibels, the logarithmic unit that wordfreq
uses internally, because the Zipf unit is simply the bel, with a different
zero point. To convert centibels to Zipf, add 900 and divide by 100.
"""
return (cB + 900) / 100
def zipf_to_freq(zipf):
"""
Convert a word frequency from the Zipf scale to a proportion between 0 and
1.
The Zipf scale is a logarithmic frequency scale proposed by Marc Brysbaert,
who compiled the SUBTLEX data. The goal of the Zipf scale is to map
reasonable word frequencies to understandable, small positive numbers.
A word rates as x on the Zipf scale when it occurs 10**x times per billion
words. For example, a word that occurs once per million words is at 3.0 on
the Zipf scale.
"""
return 10 ** zipf / 1e9
def freq_to_zipf(freq):
"""
Convert a word frequency from a proportion between 0 and 1 to the
Zipf scale (see `zipf_to_freq`).
"""
return math.log(freq, 10) + 9
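# A quick worked example of how these scales line up (the values follow
# directly from the formulas above): a frequency of 1e-6, i.e. once per
# million words, is -600 centibels and 3.0 on the Zipf scale.
#
#     cB_to_freq(-600)    -> 1e-06
#     cB_to_zipf(-600)    -> 3.0
#     zipf_to_freq(3.0)   -> 1e-06
#     freq_to_zipf(1e-6)  -> 3.0   (up to floating-point rounding)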
@lru_cache(maxsize=None)
def get_frequency_dict(lang, wordlist='combined', match_cutoff=30):
"""
@@ -202,6 +239,7 @@ def _word_frequency(word, lang, wordlist, minimum):
return max(freq, minimum)
def word_frequency(word, lang, wordlist='combined', minimum=0.):
"""
Get the frequency of `word` in the language with code `lang`, from the
@@ -240,6 +278,33 @@ def word_frequency(word, lang, wordlist='combined', minimum=0.):
return _wf_cache[args]
def zipf_frequency(word, lang, wordlist='combined', minimum=0.):
"""
Get the frequency of `word`, in the language with code `lang`, on the Zipf
scale.
The Zipf scale is a logarithmic frequency scale proposed by Marc Brysbaert,
who compiled the SUBTLEX data. The goal of the Zipf scale is to map
reasonable word frequencies to understandable, small positive numbers.
A word rates as x on the Zipf scale when it occurs 10**x times per billion
words. For example, a word that occurs once per million words is at 3.0 on
the Zipf scale.
Zipf values for reasonable words are between 0 and 8. The value this
function returns will always be at least as large as `minimum`, even for a
word that never appears. The default minimum is 0, representing words
that appear once per billion words or less.
wordfreq internally quantizes its frequencies to centibels, which are
1/100 of a Zipf unit. The output of `zipf_frequency` will be rounded to
the nearest hundredth to match this quantization.
"""
freq_min = zipf_to_freq(minimum)
freq = word_frequency(word, lang, wordlist, freq_min)
return round(freq_to_zipf(freq), 2)
@lru_cache(maxsize=100)
def top_n_list(lang, n, wordlist='combined', ascii_only=False):
"""

40 changed binary files are not shown.

View File

@@ -46,6 +46,9 @@ rule simplify_chinese
rule tokenize_twitter
command = mkdir -p $$(dirname $prefix) && python -m wordfreq_builder.cli.tokenize_twitter $in $prefix
rule tokenize_reddit
command = mkdir -p $$(dirname $prefix) && python -m wordfreq_builder.cli.tokenize_reddit $in $prefix
# To convert the Leeds corpus, look for space-separated lines that start with
# an integer and a decimal. The integer is the rank, which we discard. The
# decimal is the frequency, and the remaining text is the term. Use sed -n
@@ -95,10 +98,10 @@ rule merge_counts
command = python -m wordfreq_builder.cli.merge_counts -o $out -c $cutoff $in
rule freqs2cB
command = python -m wordfreq_builder.cli.freqs_to_cB $in $out
command = python -m wordfreq_builder.cli.freqs_to_cB $in $out -b $buckets
rule cat
command = cat $in > $out
rule extract_reddit
command = bunzip2 -c $in | $JQ -r '.body' | fgrep -v '[deleted]' | sed 's/&gt;/>/g' | sed 's/&lt;/</g' | sed 's/&amp;/\&/g' | gzip -c > $out
command = bunzip2 -c $in | $JQ -r 'select(.score > 0) | .body' | fgrep -v '[deleted]' | sed 's/&gt;/>/g' | sed 's/&lt;/</g' | sed 's/&amp;/\&/g' > $out

View File

@@ -2,12 +2,12 @@ from setuptools import setup
setup(
name="wordfreq_builder",
version='0.1',
version='0.2',
maintainer='Luminoso Technologies, Inc.',
maintainer_email='info@luminoso.com',
url='http://github.com/LuminosoInsight/wordfreq_builder',
platforms=["any"],
description="Turns raw data into word frequency lists",
packages=['wordfreq_builder'],
install_requires=['msgpack-python', 'pycld2']
install_requires=['msgpack-python', 'pycld2', 'langcodes']
)

View File

@@ -6,5 +6,9 @@ if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('filename_in', help='name of input file containing tokens')
parser.add_argument('filename_out', help='name of output file')
parser.add_argument('-b', '--buckets', type=int, default=600,
help='Number of centibel buckets to include (default 600). '
'Increasing this number creates a longer wordlist with '
'rarer words.')
args = parser.parse_args()
freqs_to_cBpack(args.filename_in, args.filename_out)
freqs_to_cBpack(args.filename_in, args.filename_out, cutoff=-(args.buckets))
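# For reference, the bucket count maps directly onto the frequency cutoffs
# described in wordfreq's README: the cutoff passed to freqs_to_cBpack is
# -buckets centibels, and
#     10 ** (-600 / 100)  -> 1e-06  (the 'combined' and 'twitter' threshold)
#     10 ** (-800 / 100)  -> 1e-08  (the 'large' threshold used by the build)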

View File

@@ -2,10 +2,10 @@ from wordfreq_builder.word_counts import read_values, merge_counts, write_wordli
import argparse
def merge_lists(input_names, output_name, cutoff=0):
def merge_lists(input_names, output_name, cutoff=0, max_words=1000000):
count_dicts = []
for input_name in input_names:
values, total = read_values(input_name, cutoff=cutoff, max_size=1000000)
values, total = read_values(input_name, cutoff=cutoff, max_words=max_words)
count_dicts.append(values)
merged = merge_counts(count_dicts)
write_wordlist(merged, output_name)
@@ -17,8 +17,9 @@ if __name__ == '__main__':
help='filename to write the output to')
parser.add_argument('-c', '--cutoff', type=int, default=0,
help='minimum count to read from an input file')
parser.add_argument('-m', '--max-words', type=int, default=1000000,
help='maximum number of words to read from each list')
parser.add_argument('inputs', nargs='+',
help='names of input files to merge')
args = parser.parse_args()
merge_lists(args.inputs, args.output, cutoff=args.cutoff)
merge_lists(args.inputs, args.output, cutoff=args.cutoff, max_words=args.max_words)

View File

@@ -1,13 +1,17 @@
from wordfreq_builder.tokenizers import cld2_reddit_tokenizer, tokenize_by_language
from wordfreq_builder.tokenizers import cld2_surface_tokenizer, tokenize_by_language
import argparse
def reddit_tokenizer(text):
return cld2_surface_tokenizer(text, mode='reddit')
def main():
parser = argparse.ArgumentParser()
parser.add_argument('filename', help='filename of input file containing one comment per line')
parser.add_argument('outprefix', help='prefix of output filenames')
args = parser.parse_args()
tokenize_by_language(args.filename, args.outprefix, tokenizer=cld2_reddit_tokenizer)
tokenize_by_language(args.filename, args.outprefix, tokenizer=reddit_tokenizer)
if __name__ == '__main__':

View File

@@ -41,7 +41,11 @@ CONFIG = {
'subtlex-en': ['en'],
'subtlex-other': ['de', 'nl', 'zh'],
'jieba': ['zh'],
'reddit': ['en'],
# About 99.2% of Reddit is in English. There are pockets of
# conversation in other languages, but we're concerned that they're not
# representative enough for learning general word frequencies.
'reddit': ['en']
},
# Subtlex languages that need to be pre-processed
'wordlist_paths': {
@@ -56,10 +60,12 @@ CONFIG = {
'reddit': 'generated/reddit/reddit_{lang}.{ext}',
'combined': 'generated/combined/combined_{lang}.{ext}',
'combined-dist': 'dist/combined_{lang}.{ext}',
'combined-dist-large': 'dist/large_{lang}.{ext}',
'twitter-dist': 'dist/twitter_{lang}.{ext}',
'jieba-dist': 'dist/jieba_{lang}.{ext}'
},
'min_sources': 2
'min_sources': 2,
'big-lists': ['en', 'fr', 'es', 'pt', 'de']
}

View File

@@ -4,6 +4,8 @@ from wordfreq_builder.config import (
import sys
import pathlib
import itertools
from collections import defaultdict
HEADER = """# This file is automatically generated. Do not edit it.
# You can change its behavior by editing wordfreq_builder/ninja.py,
@@ -155,14 +157,12 @@ def twitter_deps(input_filename, slice_prefix, combined_prefix, slices,
for language in languages:
combined_output = wordlist_filename('twitter', language, 'tokens.txt')
language_inputs = [
'{prefix}.{lang}.txt'.format(
prefix=slice_files[slicenum], lang=language
)
for slicenum in range(slices)
]
add_dep(lines, 'cat', language_inputs, combined_output)
count_file = wordlist_filename('twitter', language, 'counts.txt')
@@ -238,23 +238,40 @@ def jieba_deps(dirname_in, languages):
def reddit_deps(dirname_in, languages):
lines = []
if not languages:
return lines
assert languages == ['en']
processed_files = []
path_in = pathlib.Path(dirname_in)
for filepath in path_in.glob('*/*.bz2'):
base = filepath.name[:-4]
transformed_file = wordlist_filename('reddit', 'en', base + '.txt.gz')
add_dep(lines, 'extract_reddit', str(filepath), transformed_file)
count_file = wordlist_filename('reddit', 'en', base + '.counts.txt')
add_dep(lines, 'count', transformed_file, count_file)
processed_files.append(count_file)
slices = {}
counts_by_language = defaultdict(list)
output_file = wordlist_filename('reddit', 'en', 'counts.txt')
# Extract text from the Reddit comment dumps, and write the results to
# .txt files
for filepath in path_in.glob('*/*.bz2'):
base = filepath.stem
transformed_file = wordlist_filename('reddit', base + '.all', 'txt')
slices[base] = transformed_file
add_dep(lines, 'extract_reddit', str(filepath), transformed_file)
for base in sorted(slices):
transformed_file = slices[base]
language_outputs = []
for language in languages:
filename = wordlist_filename('reddit', base + '.' + language, 'txt')
language_outputs.append(filename)
count_filename = wordlist_filename('reddit', base + '.' + language, 'counts.txt')
add_dep(lines, 'count', filename, count_filename)
counts_by_language[language].append(count_filename)
# find the prefix by constructing a filename, then stripping off
# '.xx.txt' from the end
prefix = wordlist_filename('reddit', base + '.xx', 'txt')[:-7]
add_dep(lines, 'tokenize_reddit', transformed_file, language_outputs,
params={'prefix': prefix},
extra='wordfreq_builder/tokenizers.py')
for language in languages:
output_file = wordlist_filename('reddit', language, 'counts.txt')
add_dep(
lines, 'merge_counts', processed_files, output_file,
lines, 'merge_counts', counts_by_language[language], output_file,
params={'cutoff': 3}
)
return lines
@@ -345,11 +362,19 @@ def combine_lists(languages):
output_cBpack = wordlist_filename(
'combined-dist', language, 'msgpack.gz'
)
output_cBpack_big = wordlist_filename(
'combined-dist-large', language, 'msgpack.gz'
)
add_dep(lines, 'freqs2cB', output_file, output_cBpack,
extra='wordfreq_builder/word_counts.py',
params={'lang': language})
params={'lang': language, 'buckets': 600})
add_dep(lines, 'freqs2cB', output_file, output_cBpack_big,
extra='wordfreq_builder/word_counts.py',
params={'lang': language, 'buckets': 800})
lines.append('default {}'.format(output_cBpack))
if language in CONFIG['big-lists']:
lines.append('default {}'.format(output_cBpack_big))
# Write standalone lists for Twitter frequency
if language in CONFIG['sources']['twitter']:
@@ -358,7 +383,7 @@ def combine_lists(languages):
'twitter-dist', language, 'msgpack.gz')
add_dep(lines, 'freqs2cB', input_file, output_cBpack,
extra='wordfreq_builder/word_counts.py',
params={'lang': language})
params={'lang': language, 'buckets': 600})
lines.append('default {}'.format(output_cBpack))

View File

@@ -2,6 +2,7 @@ from wordfreq import tokenize
from ftfy.fixes import unescape_html
import regex
import pycld2
import langcodes
CLD2_BAD_CHAR_RANGE = "[%s]" % "".join(
[
@@ -26,48 +27,63 @@ URL_RE = regex.compile(r'http(?:s)?://[^) ]*')
MARKDOWN_URL_RESIDUE_RE = regex.compile(r'\]\(\)')
def cld2_surface_tokenizer(text):
"""
Uses CLD2 to detect the language and wordfreq tokenizer to create tokens.
"""
text = unescape_html(text)
text = TWITTER_HANDLE_RE.sub('', text)
text = TCO_RE.sub('', text)
# Low-frequency languages tend to be detected incorrectly by cld2. The
# following are languages that appear in our data with reasonable frequency
# and seem to usually be detected *correctly*. These are
# the languages we'll keep in the Reddit and Twitter results.
#
# This list is larger than the list that wordfreq ultimately generates, so we
# can look here as a source of future data.
lang = cld2_detect_language(text)
# Don't allow tokenization in Chinese when language-detecting, because
# the Chinese tokenizer may not be built yet
if lang == 'zh':
lang = 'en'
tokens = tokenize(text, lang)
return lang, tokens
# Low-frequency languages tend to be detected incorrectly. Keep a limited
# list of languages we're allowed to use here.
KEEP_THESE_LANGUAGES = {
'ar', 'de', 'el', 'en', 'es', 'fr', 'hr', 'id', 'it', 'ja', 'ko', 'ms',
'nl', 'pl', 'pt', 'ro', 'ru', 'sv'
'af', 'ar', 'bs', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es', 'et', 'fi',
'fr', 'gl', 'he', 'hi', 'hr', 'hu', 'id', 'is', 'it', 'ja', 'ko', 'lv',
'ms', 'nl', 'nn', 'no', 'pl', 'pt', 'ro', 'ru', 'sr', 'sv', 'sw', 'tl',
'tr', 'uk', 'vi'
}
# Semi-frequent languages that are excluded by the above:
#
# - Chinese, not because it's detected incorrectly, but because we can't
# handle it until we already have word frequencies
# - Thai (seems to be detected whenever someone uses Thai characters in
# an emoticon)
# - Welsh (which is detected for "ohmygodohmygodohmygod")
# - Turkmen (detected for ASCII art)
# - Irish Gaelic (detected for Cthulhu-related text)
# - Kannada (looks of disapproval)
# - Lao, Tamil, Xhosa, Slovak (various emoticons and Internet memes)
# - Breton (the word "memes" itself)
def cld2_reddit_tokenizer(text):
def cld2_surface_tokenizer(text, mode='twitter'):
"""
A language-detecting tokenizer with special cases for handling text from
Reddit.
Uses CLD2 to detect the language and wordfreq tokenizer to create tokens.
The `mode` can be 'twitter' or 'reddit', which slightly changes the
pre-processing of the text.
"""
text = unescape_html(text)
if mode == 'twitter':
text = TWITTER_HANDLE_RE.sub('', text)
text = TCO_RE.sub('', text)
elif mode == 'reddit':
text = URL_RE.sub('', text)
text = MARKDOWN_URL_RESIDUE_RE.sub(']', text)
lang = cld2_detect_language(text)
if lang not in KEEP_THESE_LANGUAGES:
# Reddit is 99.9% English, so if we detected a rare language, it's
# much more likely that it's actually English.
lang = 'en'
tokens = tokenize(text, lang, include_punctuation=True)
# If the detected language isn't in our pretty generous list of languages,
# return no tokens.
if lang not in KEEP_THESE_LANGUAGES:
return 'xx', []
# cld2's accuracy seems to improve dramatically with at least 50
# bytes of input, so throw away non-English below this length.
if len(text.encode('utf-8')) < 50 and lang != 'en':
return 'xx', []
tokens = tokenize(text, lang)
return lang, tokens
@@ -85,7 +101,12 @@ def cld2_detect_language(text):
# Confidence score: float))
text = CLD2_BAD_CHARS_RE.sub('', text)
return pycld2.detect(text)[2][0][1]
lang = pycld2.detect(text)[2][0][1]
# Normalize the language code: 'iw' becomes 'he', and 'zh-Hant'
# becomes 'zh'
code = langcodes.get(lang).language
return code
def tokenize_by_language(in_filename, out_prefix, tokenizer):
@@ -95,19 +116,17 @@ def tokenize_by_language(in_filename, out_prefix, tokenizer):
Produces output files that are separated by language, with spaces
between the tokens.
"""
out_files = {}
out_files = {
language: open('%s.%s.txt' % (out_prefix, language), 'w', encoding='utf-8')
for language in KEEP_THESE_LANGUAGES
}
with open(in_filename, encoding='utf-8') as in_file:
for line in in_file:
text = line.split('\t')[-1].strip()
language, tokens = tokenizer(text)
if language != 'un':
if language in KEEP_THESE_LANGUAGES:
out_file = out_files[language]
tokenized = ' '.join(tokens)
out_filename = '%s.%s.txt' % (out_prefix, language)
if out_filename in out_files:
out_file = out_files[out_filename]
else:
out_file = open(out_filename, 'w', encoding='utf-8')
out_files[out_filename] = out_file
print(tokenized, file=out_file)
for out_file in out_files.values():
out_file.close()

View File

@@ -36,15 +36,17 @@ def count_tokens(filename):
return counts
def read_values(filename, cutoff=0, max_size=1e8, lang=None):
def read_values(filename, cutoff=0, max_words=1e8, lang=None):
"""
Read words and their frequency or count values from a CSV file. Returns
a dictionary of values and the total of all values.
Only words with a value greater than or equal to `cutoff` are returned.
In addition, only up to `max_words` words are read.
If `cutoff` is greater than 0, the csv file must be sorted by value
in descending order.
If `cutoff` is greater than 0, or `max_words` is smaller than the number of
words in the file, the CSV file must be sorted by value in descending order,
so that the most frequent words are kept.
If `lang` is given, it will apply language-specific tokenization to the
words that it reads.
@@ -55,7 +57,7 @@ def read_values(filename, cutoff=0, max_size=1e8, lang=None):
for key, strval in csv.reader(infile):
val = float(strval)
key = fix_text(key)
if val < cutoff or len(values) >= max_size:
if val < cutoff or len(values) >= max_words:
break
tokens = tokenize(key, lang) if lang is not None else simple_tokenize(key)
for token in tokens: