support Turkish and more Greek; document more

Rob Speer 2015-09-04 00:57:04 -04:00
parent 45d871a815
commit d94428d454
5 changed files with 71 additions and 3 deletions

.gitignore
View File

@@ -7,3 +7,5 @@ pip-log.txt
.coverage
*~
wordfreq-data.tar.gz
.idea
build.dot

View File

@@ -223,7 +223,11 @@ sources:
It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK, and
SUBTLEX-CH, created by Marc Brysbaert et al. and available at
http://crr.ugent.be/programs-data/subtitle-frequencies. I (Rob Speer) have
http://crr.ugent.be/programs-data/subtitle-frequencies. SUBTLEX was first
published in this paper:
I (Rob Speer) have
obtained permission by e-mail from Marc Brysbaert to distribute these wordlists
in wordfreq, to be used for any purpose, not just for academic use, under these
conditions:
@@ -237,3 +241,28 @@ Some additional data was collected by a custom application that watches the
streaming Twitter API, in accordance with Twitter's Developer Agreement &
Policy. This software gives statistics about words that are commonly used on
Twitter; it does not display or republish any Twitter content.

## Citations to work that wordfreq is built on

- Brysbaert, M. & New, B. (2009). Moving beyond Kucera and Francis: A Critical
  Evaluation of Current Word Frequency Norms and the Introduction of a New and
  Improved Word Frequency Measure for American English. Behavior Research
  Methods, 41 (4), 977-990.
  http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf

- Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character
  frequencies based on film subtitles. PLoS One, 5(6), e10729.
  http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729

- Davis, M. (2012). Unicode text segmentation. Unicode Standard Annex, 29.
  http://unicode.org/reports/tr29/

- Kudo, T. (2005). Mecab: Yet another part-of-speech and morphological
  analyzer.
  http://mecab.sourceforge.net/

- van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014).
  SUBTLEX-UK: A new and improved word frequency database for British English.
  The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.
  http://www.tandfonline.com/doi/pdf/10.1080/17470218.2013.850521

View File

@@ -65,6 +65,15 @@ def simple_tokenize(text):
    return [token.strip("'").casefold() for token in TOKEN_RE.findall(text)]


def turkish_tokenize(text):
    """
    Like `simple_tokenize`, but modifies i's so that they case-fold correctly
    in Turkish.
    """
    text = unicodedata.normalize('NFC', text).replace('İ', 'i').replace('I', 'ı')
    return [token.strip("'").casefold() for token in TOKEN_RE.findall(text)]


def remove_arabic_marks(text):
    """
    Remove decorations from Arabic words:
@@ -90,6 +99,8 @@ def tokenize(text, lang):
    - Chinese or Japanese texts that aren't identified as the appropriate
      language will only split on punctuation and script boundaries, giving
      you untokenized globs of characters that probably represent many words.
    - Turkish will use a different case-folding procedure, so that capital
      I and İ map to ı and i respectively.
    - All other languages will be tokenized using a regex that mostly
      implements the Word Segmentation section of Unicode Annex #29.
      See `simple_tokenize` for details.
@@ -107,6 +118,9 @@ def tokenize(text, lang):
        from wordfreq.mecab import mecab_tokenize
        return mecab_tokenize(text)

    if lang == 'tr':
        return turkish_tokenize(text)

    if lang == 'ar':
        text = remove_arabic_marks(unicodedata.normalize('NFKC', text))
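For context on the new `turkish_tokenize` and the `tr` branch in `tokenize`: Python's `str.casefold()` applies Unicode's default full case folding, which mishandles the Turkish dotted and dotless I. The sketch below is independent of wordfreq and only borrows the normalize/replace line from the diff; the helper name `turkish_casefold` is invented for illustration.

```python
import unicodedata

# Unicode's default case folding is wrong for Turkish in both directions:
# 'İ' (U+0130) folds to 'i' + U+0307 (combining dot above), and 'I' folds
# to plain 'i' rather than dotless 'ı'.
print('İSTANBUL'.casefold() == 'istanbul')   # False: 'i' + U+0307 + 'stanbul'
print('ISPARTA'.casefold())                  # 'isparta'; Turkish expects 'ısparta'


def turkish_casefold(text):
    # The same pre-mapping used by turkish_tokenize in the diff: rewrite both
    # capital I's before the ordinary casefold() runs.
    text = unicodedata.normalize('NFC', text).replace('İ', 'i').replace('I', 'ı')
    return text.casefold()


print(turkish_casefold('İSTANBUL'))  # istanbul
print(turkish_casefold('ISPARTA'))   # ısparta
```

With the dispatch above, `tokenize('KIRMIZI', 'tr')` should therefore come back as `['kırmızı']`, while other languages keep the default `casefold()` behavior.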

View File

@@ -161,3 +161,27 @@ longer represents the words 'don' and 'won', as we assume most of their
frequency comes from "don't" and "won't". Words that turned into similarly
common words, however, were left alone: this list doesn't represent "can't"
because the word was left as "can".

### SUBTLEX

Marc Brysbaert gave us permission by e-mail to use the SUBTLEX word lists in
wordfreq and derived works without the "academic use" restriction, under the
following reasonable conditions:

- Wordfreq and code derived from it must credit the SUBTLEX authors.
  (See the citations in the top-level `README.md` file.)
- It must remain clear that SUBTLEX is freely available data.

`data/source-lists/subtlex` contains the following files:

- `subtlex.en-US.txt`, which was downloaded from [here][subtlex-us],
  extracted, and converted from ISO-8859-1 to UTF-8 (see the sketch after
  this list)
- `subtlex.en-GB.txt`, which was exported as tab-separated UTF-8
  from [this Excel file][subtlex-uk]
- `subtlex.zh.txt`, which was downloaded and extracted from
  [here][subtlex-ch]

[subtlex-us]: http://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus/subtlexus5.zip
[subtlex-uk]: http://crr.ugent.be/papers/SUBTLEX-UK_all.xlsx
[subtlex-ch]: http://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexch/subtlexch131210.zip
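The ISO-8859-1 to UTF-8 step for `subtlex.en-US.txt` is a plain re-encode. A minimal sketch in Python, where the input filename is a placeholder for whatever the downloaded archive extracts to and only the output path comes from the list above:

```python
# Re-encode the extracted SUBTLEX-US word list from ISO-8859-1 (Latin-1)
# to UTF-8. 'SUBTLEX-US.txt' is a placeholder input name; the output path
# matches the file listed above.
with open('SUBTLEX-US.txt', encoding='iso-8859-1') as infile, \
        open('data/source-lists/subtlex/subtlex.en-US.txt', 'w',
             encoding='utf-8') as outfile:
    outfile.write(infile.read())
```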

View File

@@ -14,8 +14,7 @@ CONFIG = {
    ],
    'wikipedia': [
        'ar', 'de', 'en', 'el', 'es', 'fr', 'id', 'it', 'ja', 'ko', 'ms', 'nl',
        'pt', 'ru'
        # consider adding 'tr'
        'pt', 'ru', 'tr'
    ],
    'opensubtitles': [
        # All languages where the most common word in OpenSubtitles