Mirror of https://github.com/rspeer/wordfreq.git (synced 2024-12-23 09:21:37 +00:00)

Merge pull request #34 from LuminosoInsight/big-list

wordfreq 1.4: some bigger wordlists, better use of language detection

Former-commit-id: e7b34fb655
This commit is contained in: commit 7a55e0ed86

README.md: 124 lines changed
@@ -39,11 +39,18 @@ For example:
 ## Usage
 
 wordfreq provides access to estimates of the frequency with which a word is
-used, in 18 languages (see *Supported languages* below). It loads
-efficiently-packed data structures that contain all words that appear at least
-once per million words.
+used, in 18 languages (see *Supported languages* below).
 
-The most useful function is:
+It provides three kinds of pre-built wordlists:
+
+- `'combined'` lists, containing words that appear at least once per
+  million words, averaged across all data sources.
+- `'twitter'` lists, containing words that appear at least once per
+  million words on Twitter alone.
+- `'large'` lists, containing words that appear at least once per 100
+  million words, averaged across all data sources.
+
+The most straightforward function is:
 
     word_frequency(word, lang, wordlist='combined', minimum=0.0)
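For illustration (not part of the diff), a minimal sketch of choosing among the
three wordlists, assuming the wordfreq 1.4 API exactly as documented in the hunk
above; return values are omitted because they depend on the packaged data:

    >>> from wordfreq import word_frequency
    >>> word_frequency('café', 'fr')                      # 'combined' list (the default)
    >>> word_frequency('café', 'fr', wordlist='twitter')  # Twitter-only frequencies
    >>> word_frequency('café', 'fr', wordlist='large')    # bigger list, 1-per-100-million cutoff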
@@ -64,7 +71,37 @@ frequencies by a million (1e6) to get more readable numbers:
     >>> word_frequency('café', 'fr') * 1e6
     77.62471166286912
 
-The parameters are:
+`zipf_frequency` is a variation on `word_frequency` that aims to return the
+word frequency on a human-friendly logarithmic scale. The Zipf scale was
+proposed by Marc Brysbaert, who created the SUBTLEX lists. The Zipf frequency
+of a word is the base-10 logarithm of the number of times it appears per
+billion words. A word with Zipf value 6 appears once per thousand words, for
+example, and a word with Zipf value 3 appears once per million words.
+
+Reasonable Zipf values are between 0 and 8, but because of the cutoffs
+described above, the minimum Zipf value appearing in these lists is 1.0 for the
+'large' wordlists and 3.0 for all others. We use 0 as the default Zipf value
+for words that do not appear in the given wordlist, although it should mean
+one occurrence per billion words.
+
+    >>> zipf_frequency('the', 'en')
+    7.59
+
+    >>> zipf_frequency('word', 'en')
+    5.34
+
+    >>> zipf_frequency('frequency', 'en')
+    4.44
+
+    >>> zipf_frequency('zipf', 'en')
+    0.0
+
+    >>> zipf_frequency('zipf', 'en', wordlist='large')
+    1.42
+
+
+The parameters to `word_frequency` and `zipf_frequency` are:
 
 - `word`: a Unicode string containing the word to look up. Ideally the word
   is a single token according to our tokenizer, but if not, there is still
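As an aside (not part of the diff), the definition above implies that a Zipf
value is just the word's per-billion frequency on a base-10 log scale; a minimal
sketch of that relationship, assuming the wordfreq 1.4 API:

    >>> import math
    >>> from wordfreq import word_frequency, zipf_frequency
    >>> freq = word_frequency('word', 'en')
    >>> round(math.log10(freq * 1e9), 2)   # expected to match zipf_frequency('word', 'en')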
@@ -73,21 +110,18 @@ The parameters are:
 - `lang`: the BCP 47 or ISO 639 code of the language to use, such as 'en'.
 
 - `wordlist`: which set of word frequencies to use. Current options are
-  'combined', which combines up to five different sources, and
-  'twitter', which returns frequencies observed on Twitter alone.
+  'combined', 'twitter', and 'large'.
 
 - `minimum`: If the word is not in the list or has a frequency lower than
-  `minimum`, return `minimum` instead. In some applications, you'll want
-  to set `minimum=1e-6` to avoid a discontinuity where the list ends, because
-  a frequency of 1e-6 (1 per million) is the threshold for being included in
-  the list at all.
+  `minimum`, return `minimum` instead. You may want to set this to the minimum
+  value contained in the wordlist, to avoid a discontinuity where the wordlist
+  ends.
 
 Other functions:
 
 `tokenize(text, lang)` splits text in the given language into words, in the same
 way that the words in wordfreq's data were counted in the first place. See
-*Tokenization*. Tokenizing Japanese requires the optional dependency `mecab-python3`
-to be installed.
+*Tokenization*.
 
 `top_n_list(lang, n, wordlist='combined')` returns the most common *n* words in
 the list, in descending frequency order.
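For illustration (not part of the diff), a short sketch of the functions and the
`minimum` parameter described above, assuming the wordfreq 1.4 API; outputs are
omitted or given only as expectations:

    >>> from wordfreq import word_frequency, tokenize, top_n_list
    >>> word_frequency('zipf', 'en', minimum=1e-6)   # clamp rare or missing words to the list's cutoff
    >>> tokenize('New York', 'en')                   # e.g. ['new', 'york']
    >>> top_n_list('en', 10)                         # the ten most frequent English words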
@@ -133,6 +167,7 @@ The sources (and the abbreviations we'll use for them) are:
 - **OpenSub**: Data derived from OpenSubtitles but not from SUBTLEX
 - **Twitter**: Messages sampled from Twitter's public stream
 - **Wpedia**: The full text of Wikipedia in 2015
+- **Reddit**: The corpus of Reddit comments through May 2015
 - **Other**: We get additional English frequencies from Google Books Syntactic
   Ngrams 2013, and Chinese frequencies from the frequency dictionary that
   comes with the Jieba tokenizer.
@@ -140,33 +175,37 @@ The sources (and the abbreviations we'll use for them) are:
 The following 17 languages are well-supported, with reasonable tokenization and
 at least 3 different sources of word frequencies:
 
-    Language   Code  SUBTLEX  OpenSub  LeedsIC  Twitter  Wpedia  Other
+    Language   Code  SUBTLEX  OpenSub  LeedsIC  Twitter  Wpedia  Reddit  Other
     ──────────────────┼─────────────────────────────────────────────────────
-    Arabic     ar    │ -        Yes      Yes      Yes      Yes    -
-    German     de    │ Yes      -        Yes      Yes[1]   Yes    -
-    Greek      el    │ -        Yes      Yes      Yes      Yes    -
-    English    en    │ Yes      Yes      Yes      Yes      Yes    Google Books
-    Spanish    es    │ -        Yes      Yes      Yes      Yes    -
-    French     fr    │ -        Yes      Yes      Yes      Yes    -
-    Indonesian id    │ -        Yes      -        Yes      Yes    -
-    Italian    it    │ -        Yes      Yes      Yes      Yes    -
-    Japanese   ja    │ -        -        Yes      Yes      Yes    -
-    Malay      ms    │ -        Yes      -        Yes      Yes    -
-    Dutch      nl    │ Yes      Yes      -        Yes      Yes    -
-    Polish     pl    │ -        Yes      -        Yes      Yes    -
-    Portuguese pt    │ -        Yes      Yes      Yes      Yes    -
-    Russian    ru    │ -        Yes      Yes      Yes      Yes    -
-    Swedish    sv    │ -        Yes      -        Yes      Yes    -
-    Turkish    tr    │ -        Yes      -        Yes      Yes    -
-    Chinese    zh    │ Yes      -        Yes      -        -      Jieba
+    Arabic     ar    │ -        Yes      Yes      Yes      Yes     -       -
+    German     de    │ Yes      -        Yes      Yes[1]   Yes     -       -
+    Greek      el    │ -        Yes      Yes      Yes      Yes     -       -
+    English    en    │ Yes      Yes      Yes      Yes      Yes     Yes     Google Books
+    Spanish    es    │ -        Yes      Yes      Yes      Yes     -       -
+    French     fr    │ -        Yes      Yes      Yes      Yes     -       -
+    Indonesian id    │ -        Yes      -        Yes      Yes     -       -
+    Italian    it    │ -        Yes      Yes      Yes      Yes     -       -
+    Japanese   ja    │ -        -        Yes      Yes      Yes     -       -
+    Malay      ms    │ -        Yes      -        Yes      Yes     -       -
+    Dutch      nl    │ Yes      Yes      -        Yes      Yes     -       -
+    Polish     pl    │ -        Yes      -        Yes      Yes     -       -
+    Portuguese pt    │ -        Yes      Yes      Yes      Yes     -       -
+    Russian    ru    │ -        Yes      Yes      Yes      Yes     -       -
+    Swedish    sv    │ -        Yes      -        Yes      Yes     -       -
+    Turkish    tr    │ -        Yes      -        Yes      Yes     -       -
+    Chinese    zh    │ Yes      -        Yes      -        -       -       Jieba
 
 
 Additionally, Korean is marginally supported. You can look up frequencies in
-it, but we have too few data sources for it so far:
+it, but it will be insufficiently tokenized into words, and we have too few
+data sources for it so far:
 
-    Language   Code  SUBTLEX  OpenSub  LeedsIC  Twitter  Wpedia
-    ──────────────────┼───────────────────────────────────────
-    Korean     ko    │ -        -        -        Yes      Yes
+    Language   Code  SUBTLEX  OpenSub  LeedsIC  Twitter  Wpedia  Reddit
+    ──────────────────┼───────────────────────────────────────────────
+    Korean     ko    │ -        -        -        Yes      Yes     -
 
+The 'large' wordlists are available in English, German, Spanish, French, and
+Portuguese.
+
 [1] We've counted the frequencies from tweets in German, such as they are, but
 you should be aware that German is not a frequently-used language on Twitter.
@@ -179,7 +218,8 @@ wordfreq uses the Python package `regex`, which is a more advanced
 implementation of regular expressions than the standard library, to
 separate text into tokens that can be counted consistently. `regex`
 produces tokens that follow the recommendations in [Unicode
-Annex #29, Text Segmentation][uax29].
+Annex #29, Text Segmentation][uax29], including the optional rule that
+splits words between apostrophes and vowels.
 
 There are language-specific exceptions:
 
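For illustration (not part of the diff), the apostrophe rule mentioned above in
action, assuming the wordfreq 1.4 tokenizer; the output shown is an expectation,
not a value taken from the diff:

    >>> from wordfreq import tokenize
    >>> tokenize("l'heure", 'fr')   # expected to split at the apostrophe: ['l', 'heure']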
@@ -199,10 +239,10 @@ Because tokenization in the real world is far from consistent, wordfreq will
 also try to deal gracefully when you query it with texts that actually break
 into multiple tokens:
 
-    >>> word_frequency('New York', 'en')
-    0.0002315934248950231
-    >>> word_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.2187603965715087e-06
+    >>> zipf_frequency('New York', 'en')
+    5.31
+    >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
+    3.51
 
 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be. In Chinese,
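For illustration (not part of the diff), a sketch of what a "half-harmonic-mean"
combination looks like, assuming the two-token case generalizes to n tokens as
the reciprocal of the summed reciprocals; this helper is hypothetical, not
wordfreq's internal code:

    def combine_freqs(freqs):
        # For two tokens this is half their harmonic mean:
        # 1 / (1/f1 + 1/f2) == harmonic_mean(f1, f2) / 2
        return 1.0 / sum(1.0 / f for f in freqs)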
@@ -216,8 +256,8 @@ frequencies, because that would assume they are statistically unrelated. So if
 you give it an uncommon combination of tokens, it will hugely over-estimate
 their frequency:
 
-    >>> word_frequency('owl-flavored', 'en')
-    1.3557098723512335e-06
+    >>> zipf_frequency('owl-flavored', 'en')
+    3.18
 
 
 ## License
setup.py: 2 lines changed

@@ -34,7 +34,7 @@ if sys.version_info < (3, 4):
 
 setup(
     name="wordfreq",
-    version='1.3',
+    version='1.4',
     maintainer='Luminoso Technologies, Inc.',
     maintainer_email='info@luminoso.com',
     url='http://github.com/LuminosoInsight/wordfreq/',
@@ -8,6 +8,7 @@ import itertools
 import pathlib
 import random
 import logging
+import math
 
 logger = logging.getLogger(__name__)
 
@@ -146,6 +147,42 @@ def cB_to_freq(cB):
     return 10 ** (cB / 100)
 
 
+def cB_to_zipf(cB):
+    """
+    Convert a word frequency from centibels to the Zipf scale
+    (see `zipf_to_freq`).
+
+    The Zipf scale is related to centibels, the logarithmic unit that wordfreq
+    uses internally, because the Zipf unit is simply the bel, with a different
+    zero point. To convert centibels to Zipf, add 900 and divide by 100.
+    """
+    return (cB + 900) / 100
+
+
+def zipf_to_freq(zipf):
+    """
+    Convert a word frequency from the Zipf scale to a proportion between 0 and
+    1.
+
+    The Zipf scale is a logarithmic frequency scale proposed by Marc Brysbaert,
+    who compiled the SUBTLEX data. The goal of the Zipf scale is to map
+    reasonable word frequencies to understandable, small positive numbers.
+
+    A word rates as x on the Zipf scale when it occurs 10**x times per billion
+    words. For example, a word that occurs once per million words is at 3.0 on
+    the Zipf scale.
+    """
+    return 10 ** zipf / 1e9
+
+
+def freq_to_zipf(freq):
+    """
+    Convert a word frequency from a proportion between 0 and 1 to the
+    Zipf scale (see `zipf_to_freq`).
+    """
+    return math.log(freq, 10) + 9
+
+
 @lru_cache(maxsize=None)
 def get_frequency_dict(lang, wordlist='combined', match_cutoff=30):
     """
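For illustration (not part of the diff), a quick check of the conversion helpers
added above, assuming they are importable from the wordfreq package where they
are defined; results are noted as comments rather than asserted outputs:

    >>> from wordfreq import cB_to_zipf, zipf_to_freq, freq_to_zipf
    >>> cB_to_zipf(-300)     # (-300 + 900) / 100 == 6.0
    >>> zipf_to_freq(3.0)    # 10**3 / 1e9 == 1e-06, once per million words
    >>> freq_to_zipf(1e-06)  # approximately 3.0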
@@ -202,6 +239,7 @@ def _word_frequency(word, lang, wordlist, minimum):
 
     return max(freq, minimum)
 
+
 def word_frequency(word, lang, wordlist='combined', minimum=0.):
     """
     Get the frequency of `word` in the language with code `lang`, from the
@@ -240,6 +278,33 @@ def word_frequency(word, lang, wordlist='combined', minimum=0.):
     return _wf_cache[args]
 
 
+def zipf_frequency(word, lang, wordlist='combined', minimum=0.):
+    """
+    Get the frequency of `word`, in the language with code `lang`, on the Zipf
+    scale.
+
+    The Zipf scale is a logarithmic frequency scale proposed by Marc Brysbaert,
+    who compiled the SUBTLEX data. The goal of the Zipf scale is to map
+    reasonable word frequencies to understandable, small positive numbers.
+
+    A word rates as x on the Zipf scale when it occurs 10**x times per billion
+    words. For example, a word that occurs once per million words is at 3.0 on
+    the Zipf scale.
+
+    Zipf values for reasonable words are between 0 and 8. The value this
+    function returns will always be at least as large as `minimum`, even for a
+    word that never appears. The default minimum is 0, representing words
+    that appear once per billion words or less.
+
+    wordfreq internally quantizes its frequencies to centibels, which are
+    1/100 of a Zipf unit. The output of `zipf_frequency` will be rounded to
+    the nearest hundredth to match this quantization.
+    """
+    freq_min = zipf_to_freq(minimum)
+    freq = word_frequency(word, lang, wordlist, freq_min)
+    return round(freq_to_zipf(freq), 2)
+
+
 @lru_cache(maxsize=100)
 def top_n_list(lang, n, wordlist='combined', ascii_only=False):
     """
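For illustration (not part of the diff), the added function above amounts to
converting a `word_frequency` result onto the Zipf scale and rounding to
centibel precision; a hypothetical re-implementation for clarity:

    from wordfreq import word_frequency, zipf_to_freq, freq_to_zipf

    def zipf_by_hand(word, lang, wordlist='combined', minimum=0.0):
        # Mirrors zipf_frequency as added in this commit: convert the minimum
        # to a proportion, look up the frequency, convert back to Zipf.
        freq = word_frequency(word, lang, wordlist, zipf_to_freq(minimum))
        return round(freq_to_zipf(freq), 2)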
The binary wordlist data files changed by this commit are not shown. New files
added:

    wordfreq/data/large_de.msgpack.gz
    wordfreq/data/large_en.msgpack.gz
    wordfreq/data/large_es.msgpack.gz
    wordfreq/data/large_fr.msgpack.gz
    wordfreq/data/large_pt.msgpack.gz
@@ -46,6 +46,9 @@ rule simplify_chinese
 rule tokenize_twitter
   command = mkdir -p $$(dirname $prefix) && python -m wordfreq_builder.cli.tokenize_twitter $in $prefix
 
+rule tokenize_reddit
+  command = mkdir -p $$(dirname $prefix) && python -m wordfreq_builder.cli.tokenize_reddit $in $prefix
+
 # To convert the Leeds corpus, look for space-separated lines that start with
 # an integer and a decimal. The integer is the rank, which we discard. The
 # decimal is the frequency, and the remaining text is the term. Use sed -n
@@ -95,10 +98,10 @@ rule merge_counts
   command = python -m wordfreq_builder.cli.merge_counts -o $out -c $cutoff $in
 
 rule freqs2cB
-  command = python -m wordfreq_builder.cli.freqs_to_cB $in $out
+  command = python -m wordfreq_builder.cli.freqs_to_cB $in $out -b $buckets
 
 rule cat
   command = cat $in > $out
 
 rule extract_reddit
-  command = bunzip2 -c $in | $JQ -r '.body' | fgrep -v '[deleted]' | sed 's/&gt;/>/g' | sed 's/&lt;/</g' | sed 's/&amp;/\&/g' | gzip -c > $out
+  command = bunzip2 -c $in | $JQ -r 'select(.score > 0) | .body' | fgrep -v '[deleted]' | sed 's/&gt;/>/g' | sed 's/&lt;/</g' | sed 's/&amp;/\&/g' > $out
@@ -2,12 +2,12 @@ from setuptools import setup
 
 setup(
     name="wordfreq_builder",
-    version='0.1',
+    version='0.2',
     maintainer='Luminoso Technologies, Inc.',
     maintainer_email='info@luminoso.com',
     url='http://github.com/LuminosoInsight/wordfreq_builder',
     platforms=["any"],
     description="Turns raw data into word frequency lists",
     packages=['wordfreq_builder'],
-    install_requires=['msgpack-python', 'pycld2']
+    install_requires=['msgpack-python', 'pycld2', 'langcodes']
 )
@@ -6,5 +6,9 @@ if __name__ == '__main__':
     parser = argparse.ArgumentParser()
     parser.add_argument('filename_in', help='name of input file containing tokens')
     parser.add_argument('filename_out', help='name of output file')
+    parser.add_argument('-b', '--buckets', type=int, default=600,
+                        help='Number of centibel buckets to include (default 600). '
+                             'Increasing this number creates a longer wordlist with '
+                             'rarer words.')
     args = parser.parse_args()
-    freqs_to_cBpack(args.filename_in, args.filename_out)
+    freqs_to_cBpack(args.filename_in, args.filename_out, cutoff=-(args.buckets))
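As an aside (not part of the diff), the bucket count above appears to set the
frequency floor of the generated list; a minimal sketch of that relationship,
using wordfreq's cB_to_freq shown earlier in this commit:

    >>> from wordfreq import cB_to_freq
    >>> cB_to_freq(-600)   # 1e-06: once per million words (the standard lists)
    >>> cB_to_freq(-800)   # 1e-08: once per 100 million words (the 'large' lists)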
@@ -2,10 +2,10 @@ from wordfreq_builder.word_counts import read_values, merge_counts, write_wordli
 import argparse
 
 
-def merge_lists(input_names, output_name, cutoff=0):
+def merge_lists(input_names, output_name, cutoff=0, max_words=1000000):
     count_dicts = []
     for input_name in input_names:
-        values, total = read_values(input_name, cutoff=cutoff, max_size=1000000)
+        values, total = read_values(input_name, cutoff=cutoff, max_words=max_words)
         count_dicts.append(values)
     merged = merge_counts(count_dicts)
     write_wordlist(merged, output_name)
@@ -17,8 +17,9 @@ if __name__ == '__main__':
                         help='filename to write the output to')
     parser.add_argument('-c', '--cutoff', type=int, default=0,
                         help='minimum count to read from an input file')
+    parser.add_argument('-m', '--max-words', type=int, default=1000000,
+                        help='maximum number of words to read from each list')
     parser.add_argument('inputs', nargs='+',
                         help='names of input files to merge')
     args = parser.parse_args()
-    merge_lists(args.inputs, args.output, cutoff=args.cutoff)
+    merge_lists(args.inputs, args.output, cutoff=args.cutoff, max_words=args.max_words)
 
@@ -1,13 +1,17 @@
-from wordfreq_builder.tokenizers import cld2_reddit_tokenizer, tokenize_by_language
+from wordfreq_builder.tokenizers import cld2_surface_tokenizer, tokenize_by_language
 import argparse
 
 
+def reddit_tokenizer(text):
+    return cld2_surface_tokenizer(text, mode='reddit')
+
+
 def main():
     parser = argparse.ArgumentParser()
     parser.add_argument('filename', help='filename of input file containing one comment per line')
     parser.add_argument('outprefix', help='prefix of output filenames')
     args = parser.parse_args()
-    tokenize_by_language(args.filename, args.outprefix, tokenizer=cld2_reddit_tokenizer)
+    tokenize_by_language(args.filename, args.outprefix, tokenizer=reddit_tokenizer)
 
 
 if __name__ == '__main__':
@@ -41,7 +41,11 @@ CONFIG = {
         'subtlex-en': ['en'],
         'subtlex-other': ['de', 'nl', 'zh'],
         'jieba': ['zh'],
-        'reddit': ['en'],
+
+        # About 99.2% of Reddit is in English. There are pockets of
+        # conversation in other languages, but we're concerned that they're not
+        # representative enough for learning general word frequencies.
+        'reddit': ['en']
     },
     # Subtlex languages that need to be pre-processed
     'wordlist_paths': {
@@ -56,10 +60,12 @@ CONFIG = {
         'reddit': 'generated/reddit/reddit_{lang}.{ext}',
         'combined': 'generated/combined/combined_{lang}.{ext}',
         'combined-dist': 'dist/combined_{lang}.{ext}',
+        'combined-dist-large': 'dist/large_{lang}.{ext}',
        'twitter-dist': 'dist/twitter_{lang}.{ext}',
         'jieba-dist': 'dist/jieba_{lang}.{ext}'
     },
-    'min_sources': 2
+    'min_sources': 2,
+    'big-lists': ['en', 'fr', 'es', 'pt', 'de']
 }
 
@@ -4,6 +4,8 @@ from wordfreq_builder.config import (
 import sys
 import pathlib
 import itertools
+from collections import defaultdict
+
 
 HEADER = """# This file is automatically generated. Do not edit it.
 # You can change its behavior by editing wordfreq_builder/ninja.py,
@@ -155,14 +157,12 @@ def twitter_deps(input_filename, slice_prefix, combined_prefix, slices,
 
     for language in languages:
         combined_output = wordlist_filename('twitter', language, 'tokens.txt')
-
         language_inputs = [
             '{prefix}.{lang}.txt'.format(
                 prefix=slice_files[slicenum], lang=language
             )
             for slicenum in range(slices)
         ]
-
         add_dep(lines, 'cat', language_inputs, combined_output)
 
         count_file = wordlist_filename('twitter', language, 'counts.txt')
@@ -238,23 +238,40 @@ def jieba_deps(dirname_in, languages):
 
 def reddit_deps(dirname_in, languages):
     lines = []
     if not languages:
         return lines
-    assert languages == ['en']
 
-    processed_files = []
     path_in = pathlib.Path(dirname_in)
-    for filepath in path_in.glob('*/*.bz2'):
-        base = filepath.name[:-4]
-        transformed_file = wordlist_filename('reddit', 'en', base + '.txt.gz')
-        add_dep(lines, 'extract_reddit', str(filepath), transformed_file)
-        count_file = wordlist_filename('reddit', 'en', base + '.counts.txt')
-        add_dep(lines, 'count', transformed_file, count_file)
-        processed_files.append(count_file)
+    slices = {}
+    counts_by_language = defaultdict(list)
 
-    output_file = wordlist_filename('reddit', 'en', 'counts.txt')
+    # Extract text from the Reddit comment dumps, and write them to
+    # .txt.gz files
+    for filepath in path_in.glob('*/*.bz2'):
+        base = filepath.stem
+        transformed_file = wordlist_filename('reddit', base + '.all', 'txt')
+        slices[base] = transformed_file
+        add_dep(lines, 'extract_reddit', str(filepath), transformed_file)
+
+    for base in sorted(slices):
+        transformed_file = slices[base]
+        language_outputs = []
+        for language in languages:
+            filename = wordlist_filename('reddit', base + '.' + language, 'txt')
+            language_outputs.append(filename)
+
+            count_filename = wordlist_filename('reddit', base + '.' + language, 'counts.txt')
+            add_dep(lines, 'count', filename, count_filename)
+            counts_by_language[language].append(count_filename)
+
+        # find the prefix by constructing a filename, then stripping off
+        # '.xx.txt' from the end
+        prefix = wordlist_filename('reddit', base + '.xx', 'txt')[:-7]
+        add_dep(lines, 'tokenize_reddit', transformed_file, language_outputs,
+                params={'prefix': prefix},
+                extra='wordfreq_builder/tokenizers.py')
+
+    for language in languages:
+        output_file = wordlist_filename('reddit', language, 'counts.txt')
         add_dep(
-            lines, 'merge_counts', processed_files, output_file,
+            lines, 'merge_counts', counts_by_language[language], output_file,
             params={'cutoff': 3}
         )
     return lines
@@ -345,11 +362,19 @@ def combine_lists(languages):
         output_cBpack = wordlist_filename(
             'combined-dist', language, 'msgpack.gz'
         )
+        output_cBpack_big = wordlist_filename(
+            'combined-dist-large', language, 'msgpack.gz'
+        )
         add_dep(lines, 'freqs2cB', output_file, output_cBpack,
                 extra='wordfreq_builder/word_counts.py',
-                params={'lang': language})
+                params={'lang': language, 'buckets': 600})
+        add_dep(lines, 'freqs2cB', output_file, output_cBpack_big,
+                extra='wordfreq_builder/word_counts.py',
+                params={'lang': language, 'buckets': 800})
 
         lines.append('default {}'.format(output_cBpack))
+        if language in CONFIG['big-lists']:
+            lines.append('default {}'.format(output_cBpack_big))
 
         # Write standalone lists for Twitter frequency
         if language in CONFIG['sources']['twitter']:
@@ -358,7 +383,7 @@ def combine_lists(languages):
                 'twitter-dist', language, 'msgpack.gz')
             add_dep(lines, 'freqs2cB', input_file, output_cBpack,
                     extra='wordfreq_builder/word_counts.py',
-                    params={'lang': language})
+                    params={'lang': language, 'buckets': 600})
 
             lines.append('default {}'.format(output_cBpack))
 
@@ -2,6 +2,7 @@ from wordfreq import tokenize
 from ftfy.fixes import unescape_html
 import regex
 import pycld2
+import langcodes
 
 CLD2_BAD_CHAR_RANGE = "[%s]" % "".join(
     [
@@ -26,48 +27,63 @@ URL_RE = regex.compile(r'http(?:s)?://[^) ]*')
 MARKDOWN_URL_RESIDUE_RE = regex.compile(r'\]\(\)')
 
 
-def cld2_surface_tokenizer(text):
-    """
-    Uses CLD2 to detect the language and wordfreq tokenizer to create tokens.
-    """
-    text = unescape_html(text)
-    text = TWITTER_HANDLE_RE.sub('', text)
-    text = TCO_RE.sub('', text)
+# Low-frequency languages tend to be detected incorrectly by cld2. The
+# following list of languages are languages that appear in our data with any
+# reasonable frequency, and seem to usually be detected *correctly*. These are
+# the languages we'll keep in the Reddit and Twitter results.
+#
+# This list is larger than the list that wordfreq ultimately generates, so we
+# can look here as a source of future data.
 
-    lang = cld2_detect_language(text)
-
-    # Don't allow tokenization in Chinese when language-detecting, because
-    # the Chinese tokenizer may not be built yet
-    if lang == 'zh':
-        lang = 'en'
-
-    tokens = tokenize(text, lang)
-    return lang, tokens
-
-
-# Low-frequency languages tend to be detected incorrectly. Keep a limited
-# list of languages we're allowed to use here.
 KEEP_THESE_LANGUAGES = {
-    'ar', 'de', 'el', 'en', 'es', 'fr', 'hr', 'id', 'it', 'ja', 'ko', 'ms',
-    'nl', 'pl', 'pt', 'ro', 'ru', 'sv'
+    'af', 'ar', 'bs', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es', 'et', 'fi',
+    'fr', 'gl', 'he', 'hi', 'hr', 'hu', 'id', 'is', 'it', 'ja', 'ko', 'lv',
+    'ms', 'nl', 'nn', 'no', 'pl', 'pt', 'ro', 'ru', 'sr', 'sv', 'sw', 'tl',
+    'tr', 'uk', 'vi'
 }
 
+# Semi-frequent languages that are excluded by the above:
+#
+# - Chinese, not because it's detected incorrectly, but because we can't
+#   handle it until we already have word frequencies
+# - Thai (seems to be detected whenever someone uses Thai characters in
+#   an emoticon)
+# - Welsh (which is detected for "ohmygodohmygodohmygod")
+# - Turkmen (detected for ASCII art)
+# - Irish Gaelic (detected for Cthulhu-related text)
+# - Kannada (looks of disapproval)
+# - Lao, Tamil, Xhosa, Slovak (various emoticons and Internet memes)
+# - Breton (the word "memes" itself)
+
 
-def cld2_reddit_tokenizer(text):
+def cld2_surface_tokenizer(text, mode='twitter'):
     """
-    A language-detecting tokenizer with special cases for handling text from
-    Reddit.
+    Uses CLD2 to detect the language and wordfreq tokenizer to create tokens.
+
+    The `mode` can be 'twitter' or 'reddit', which slightly changes the
+    pre-processing of the text.
     """
-    text = URL_RE.sub('', text)
-    text = MARKDOWN_URL_RESIDUE_RE.sub(']', text)
+    text = unescape_html(text)
+    if mode == 'twitter':
+        text = TWITTER_HANDLE_RE.sub('', text)
+        text = TCO_RE.sub('', text)
+    elif mode == 'reddit':
+        text = URL_RE.sub('', text)
+        text = MARKDOWN_URL_RESIDUE_RE.sub(']', text)
 
     lang = cld2_detect_language(text)
-    if lang not in KEEP_THESE_LANGUAGES:
-        # Reddit is 99.9% English, so if we detected a rare language, it's
-        # much more likely that it's actually English.
-        lang = 'en'
 
-    tokens = tokenize(text, lang, include_punctuation=True)
+    # If the detected language isn't in our pretty generous list of languages,
+    # return no tokens.
+    if lang not in KEEP_THESE_LANGUAGES:
+        return 'xx', []
+
+    # cld2's accuracy seems to improve dramatically with at least 50
+    # bytes of input, so throw away non-English below this length.
+    if len(text.encode('utf-8')) < 50 and lang != 'en':
+        return 'xx', []
+
+    tokens = tokenize(text, lang)
     return lang, tokens
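For illustration (not part of the diff), an assumed usage of the merged
tokenizer above from a builder script; the sample strings are hypothetical, and
the return value is a (language, tokens) pair as in the code:

    >>> from wordfreq_builder.tokenizers import cld2_surface_tokenizer
    >>> cld2_surface_tokenizer('This is a comment with a [link](http://example.com)', mode='reddit')
    >>> cld2_surface_tokenizer('This is a tweet @somebody', mode='twitter')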
@@ -85,7 +101,12 @@ def cld2_detect_language(text):
     #   Confidence score: float))
 
     text = CLD2_BAD_CHARS_RE.sub('', text)
-    return pycld2.detect(text)[2][0][1]
+    lang = pycld2.detect(text)[2][0][1]
+
+    # Normalize the language code: 'iw' becomes 'he', and 'zh-Hant'
+    # becomes 'zh'
+    code = langcodes.get(lang).language
+    return code
 
 
 def tokenize_by_language(in_filename, out_prefix, tokenizer):
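For illustration (not part of the diff), the normalization that the hunk above
delegates to langcodes, with the results given only as expectations taken from
the code comment:

    >>> import langcodes
    >>> langcodes.get('iw').language       # expected 'he', per the comment above
    >>> langcodes.get('zh-Hant').language  # expected 'zh'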
@@ -95,19 +116,17 @@ def tokenize_by_language(in_filename, out_prefix, tokenizer):
     Produces output files that are separated by language, with spaces
     between the tokens.
     """
-    out_files = {}
+    out_files = {
+        language: open('%s.%s.txt' % (out_prefix, language), 'w', encoding='utf-8')
+        for language in KEEP_THESE_LANGUAGES
+    }
     with open(in_filename, encoding='utf-8') as in_file:
         for line in in_file:
             text = line.split('\t')[-1].strip()
             language, tokens = tokenizer(text)
-            if language != 'un':
+            if language in KEEP_THESE_LANGUAGES:
+                out_file = out_files[language]
                 tokenized = ' '.join(tokens)
-                out_filename = '%s.%s.txt' % (out_prefix, language)
-                if out_filename in out_files:
-                    out_file = out_files[out_filename]
-                else:
-                    out_file = open(out_filename, 'w', encoding='utf-8')
-                    out_files[out_filename] = out_file
                 print(tokenized, file=out_file)
     for out_file in out_files.values():
         out_file.close()
@@ -36,15 +36,17 @@ def count_tokens(filename):
     return counts
 
 
-def read_values(filename, cutoff=0, max_size=1e8, lang=None):
+def read_values(filename, cutoff=0, max_words=1e8, lang=None):
     """
     Read words and their frequency or count values from a CSV file. Returns
     a dictionary of values and the total of all values.
 
     Only words with a value greater than or equal to `cutoff` are returned.
+    In addition, only up to `max_words` words are read.
 
-    If `cutoff` is greater than 0, the csv file must be sorted by value
-    in descending order.
+    If `cutoff` is greater than 0 or `max_words` is smaller than the list,
+    the csv file must be sorted by value in descending order, so that the
+    most frequent words are kept.
 
     If `lang` is given, it will apply language-specific tokenization to the
     words that it reads.
@@ -55,7 +57,7 @@ def read_values(filename, cutoff=0, max_size=1e8, lang=None):
         for key, strval in csv.reader(infile):
             val = float(strval)
             key = fix_text(key)
-            if val < cutoff or len(values) >= max_size:
+            if val < cutoff or len(values) >= max_words:
                 break
             tokens = tokenize(key, lang) if lang is not None else simple_tokenize(key)
             for token in tokens:
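For illustration (not part of the diff), an assumed call using the renamed
parameter; 'counts.csv' is a hypothetical file name, and the file is expected to
be sorted by value in descending order as the docstring above requires:

    >>> from wordfreq_builder.word_counts import read_values
    >>> values, total = read_values('counts.csv', cutoff=2, max_words=500000, lang='en')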