support Turkish and more Greek; document more

Former-commit-id: d94428d454
Robyn Speer 2015-09-04 00:57:04 -04:00
parent f168c37417
commit 574c383202
5 changed files with 71 additions and 3 deletions

.gitignore vendored
View File

@@ -7,3 +7,5 @@ pip-log.txt
 .coverage
 *~
 wordfreq-data.tar.gz
+.idea
+build.dot

View File

@@ -223,7 +223,11 @@ sources:
 It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK, and
 SUBTLEX-CH, created by Marc Brysbaert et al. and available at
-http://crr.ugent.be/programs-data/subtitle-frequencies. I (Robyn Speer) have
+http://crr.ugent.be/programs-data/subtitle-frequencies. SUBTLEX was first
+published in this paper:
+
+I (Robyn Speer) have
 obtained permission by e-mail from Marc Brysbaert to distribute these wordlists
 in wordfreq, to be used for any purpose, not just for academic use, under these
 conditions:
@@ -237,3 +241,28 @@ Some additional data was collected by a custom application that watches the
 streaming Twitter API, in accordance with Twitter's Developer Agreement &
 Policy. This software gives statistics about words that are commonly used on
 Twitter; it does not display or republish any Twitter content.
+
+## Citations to work that wordfreq is built on
+
+- Brysbaert, M. & New, B. (2009). Moving beyond Kucera and Francis: A Critical
+  Evaluation of Current Word Frequency Norms and the Introduction of a New and
+  Improved Word Frequency Measure for American English. Behavior Research
+  Methods, 41(4), 977-990.
+  http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf
+
+- Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character
+  frequencies based on film subtitles. PLoS One, 5(6), e10729.
+  http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729
+
+- Davis, M. (2012). Unicode text segmentation. Unicode Standard Annex, 29.
+  http://unicode.org/reports/tr29/
+
+- Kudo, T. (2005). MeCab: Yet another part-of-speech and morphological
+  analyzer.
+  http://mecab.sourceforge.net/
+
+- van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014).
+  SUBTLEX-UK: A new and improved word frequency database for British English.
+  The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.
+  http://www.tandfonline.com/doi/pdf/10.1080/17470218.2013.850521

View File

@@ -65,6 +65,15 @@ def simple_tokenize(text):
     return [token.strip("'").casefold() for token in TOKEN_RE.findall(text)]
 
 
+def turkish_tokenize(text):
+    """
+    Like `simple_tokenize`, but modifies i's so that they case-fold correctly
+    in Turkish.
+    """
+    text = unicodedata.normalize('NFC', text).replace('İ', 'i').replace('I', 'ı')
+    return [token.strip("'").casefold() for token in TOKEN_RE.findall(text)]
+
+
 def remove_arabic_marks(text):
     """
     Remove decorations from Arabic words:
@@ -90,6 +99,8 @@ def tokenize(text, lang):
     - Chinese or Japanese texts that aren't identified as the appropriate
       language will only split on punctuation and script boundaries, giving
       you untokenized globs of characters that probably represent many words.
+    - Turkish will use a different case-folding procedure, so that capital
+      I and İ map to ı and i respectively.
     - All other languages will be tokenized using a regex that mostly
       implements the Word Segmentation section of Unicode Annex #29.
       See `simple_tokenize` for details.
@@ -107,6 +118,9 @@ def tokenize(text, lang):
         from wordfreq.mecab import mecab_tokenize
         return mecab_tokenize(text)
 
+    if lang == 'tr':
+        return turkish_tokenize(text)
+
     if lang == 'ar':
         text = remove_arabic_marks(unicodedata.normalize('NFKC', text))

View File

@@ -161,3 +161,27 @@ longer represents the words 'don' and 'won', as we assume most of their
 frequency comes from "don't" and "won't". Words that turned into similarly
 common words, however, were left alone: this list doesn't represent "can't"
 because the word was left as "can".
+
+### SUBTLEX
+
+Marc Brysbaert gave us permission by e-mail to use the SUBTLEX word lists in
+wordfreq and derived works without the "academic use" restriction, under the
+following reasonable conditions:
+
+- Wordfreq and code derived from it must credit the SUBTLEX authors.
+  (See the citations in the top-level `README.md` file.)
+- It must remain clear that SUBTLEX is freely available data.
+
+`data/source-lists/subtlex` contains the following files:
+
+- `subtlex.en-US.txt`, which was downloaded from [here][subtlex-us],
+  extracted, and converted from ISO-8859-1 to UTF-8
+- `subtlex.en-GB.txt`, which was exported as tab-separated UTF-8
+  from [this Excel file][subtlex-uk]
+- `subtlex.zh.txt`, which was downloaded and extracted from
+  [here][subtlex-ch]
+
+[subtlex-us]: http://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus/subtlexus5.zip
+[subtlex-uk]: http://crr.ugent.be/papers/SUBTLEX-UK_all.xlsx
+[subtlex-ch]: http://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexch/subtlexch131210.zip
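The ISO-8859-1 to UTF-8 conversion mentioned for `subtlex.en-US.txt` is a plain re-encoding step. A minimal sketch (an illustrative helper, not the actual build script; the function name is an assumption):

```python
# Illustrative sketch: re-encode the raw bytes of a downloaded word list
# from ISO-8859-1 (Latin-1) to UTF-8. Not the actual wordfreq build code.

def latin1_to_utf8(raw_bytes):
    """Decode ISO-8859-1 bytes and re-encode them as UTF-8."""
    return raw_bytes.decode('iso-8859-1').encode('utf-8')

# In ISO-8859-1, 'é' is the single byte 0xE9; in UTF-8 it is 0xC3 0xA9.
print(latin1_to_utf8(b'caf\xe9\t1234'))  # b'caf\xc3\xa9\t1234'
```

The same effect can be had from the command line with `iconv -f ISO-8859-1 -t UTF-8`.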

View File

@@ -14,8 +14,7 @@ CONFIG = {
     ],
     'wikipedia': [
         'ar', 'de', 'en', 'el', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
-        'pt', 'ru'
-        # consider adding 'tr'
+        'pt', 'ru', 'tr'
     ],
     'opensubtitles': [
         # All languages where the most common word in OpenSubtitles