Use SUBTLEX for German, but OpenSubtitles for Greek

For German and Greek, SUBTLEX and Hermit Dave's OpenSubtitles wordlists
turn out to be derived from the same source data. I compared how well each
had processed the data, and chose SUBTLEX for German and Dave's wordlist
for Greek.
Rob Speer 2015-09-04 15:52:21 -04:00
parent a47497c908
commit 77c60c29b0
2 changed files with 10 additions and 5 deletions

@@ -129,7 +129,7 @@ at least 3 different sources of word frequencies:
 ──────────────────┼──────────────────────────────────────────────────
 Arabic         ar │ -       -       Yes     Yes     Yes     Yes
 German         de │ -       Yes     Yes     Yes     Yes[1]  Yes
-Greek          el │ -       Yes     Yes     Yes     Yes     Yes
+Greek          el │ -       -       Yes     Yes     Yes     Yes
 English        en │ Yes     Yes     Yes     Yes     Yes     Yes
 Spanish        es │ -       -       Yes     Yes     Yes     Yes
 French         fr │ -       -       Yes     Yes     Yes     Yes
@@ -252,6 +252,10 @@ Twitter; it does not display or republish any Twitter content.
   Methods, 41 (4), 977-990.
   http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf
 
+- Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A. M., Bölte, J., & Böhl, A.
+  (2015). The word frequency effect. Experimental Psychology.
+  http://econtent.hogrefe.com/doi/abs/10.1027/1618-3169/a000123?journalCode=zea
+
 - Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character
   frequencies based on film subtitles. PLoS One, 5(6), e10729.
   http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729

@@ -23,9 +23,10 @@ CONFIG = {
             'pt', 'ru', 'tr'
         ],
         'opensubtitles': [
-            # All languages where the most common word in OpenSubtitles
-            # appears at least 5000 times
-            'ar', 'bg', 'bs', 'ca', 'cs', 'da', 'de', 'el', 'en', 'es', 'et',
+            # This list includes languages where the most common word in
+            # OpenSubtitles appears at least 5000 times. However, we exclude
+            # German, where SUBTLEX has done better processing of the same data.
+            'ar', 'bg', 'bs', 'ca', 'cs', 'da', 'el', 'en', 'es', 'et',
             'fa', 'fi', 'fr', 'he', 'hr', 'hu', 'id', 'is', 'it', 'lt', 'lv',
             'mk', 'ms', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sq',
             'sr', 'sv', 'tr', 'uk', 'zh'
@@ -39,7 +40,7 @@ CONFIG = {
             # Russian, Spanish, and (Simplified) Chinese.
         ],
         'subtlex-en': ['en'],
-        'subtlex-other': ['de', 'el', 'nl', 'zh'],
+        'subtlex-other': ['de', 'nl', 'zh'],
     },
     # Subtlex languages that need to be pre-processed
     'wordlist_paths': {
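
A quick way to sanity-check the effect of this change is to ask, for each language, which subtitle-derived sources still list it. The sketch below is illustrative only: `SOURCES`, `sources_for`, and `subtitle_overlap` are hypothetical names, and the dictionary copies just the three post-commit lists visible in the diff rather than the project's full config.

```python
# Hypothetical excerpt: the three source lists as they appear after this commit.
# Other sources in the real config are intentionally omitted.
SOURCES = {
    'opensubtitles': [
        'ar', 'bg', 'bs', 'ca', 'cs', 'da', 'el', 'en', 'es', 'et',
        'fa', 'fi', 'fr', 'he', 'hr', 'hu', 'id', 'is', 'it', 'lt', 'lv',
        'mk', 'ms', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sq',
        'sr', 'sv', 'tr', 'uk', 'zh',
    ],
    'subtlex-en': ['en'],
    'subtlex-other': ['de', 'nl', 'zh'],
}


def sources_for(language, sources=SOURCES):
    """Return the sorted names of the sources that list `language`."""
    return sorted(name for name, langs in sources.items() if language in langs)


def subtitle_overlap(sources=SOURCES):
    """Return languages listed under both OpenSubtitles and a SUBTLEX source."""
    subtlex = set(sources['subtlex-en']) | set(sources['subtlex-other'])
    return sorted(subtlex & set(sources['opensubtitles']))


if __name__ == '__main__':
    print('de:', sources_for('de'))
    print('el:', sources_for('el'))
    print('in both:', subtitle_overlap())
```

With the lists above, German resolves only to `subtlex-other` and Greek only to `opensubtitles`, which matches the intent stated in the commit message. Languages that still appear under both kinds of source are reported by `subtitle_overlap`.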