update the README, citing OpenSubtitles 2016

Robyn Speer 2017-01-06 19:04:40 -05:00
parent 86f22e8523
commit 3cb3c38f47


@@ -264,21 +264,17 @@ least 3 different sources of word frequencies:
Chinese zh [3] 6 Yes │ Yes - Yes Yes Yes Yes - Jieba
[1] Bosnian, Croatian, and Serbian use the same underlying word list, because
they are mutually intelligible and have large amounts of vocabulary in common.
This word list can also be accessed with the language code `sh`.
We list them separately to emphasize that the word list is appropriate for
looking up frequencies in any of those languages, even though the idea of a
unified Serbo-Croatian language is losing popularity. Lookups in `sr` or `sh`
will also automatically unify Cyrillic and Latin spellings.
they share most of their vocabulary and grammar, they were once considered the
same language, and language detection cannot distinguish them. This word list
can also be accessed with the language code `sh`.
[2] The Norwegian text we have is specifically written in Norwegian Bokmål, so
we give it the language code 'nb'. We would use 'nn' for Nynorsk, but there
isn't enough data to include it in wordfreq.
we give it the language code 'nb' instead of the vaguer code 'no'. We would use
'nn' for Nynorsk, but there isn't enough data to include it in wordfreq.
[3] This data represents text written in both Simplified and Traditional
Chinese. (SUBTLEX is mostly Simplified, for example, while Wikipedia is mostly
Traditional.) The characters are mapped to one another so they can use the same
underlying word frequency list.
Chinese, with primarily Mandarin Chinese vocabulary. See "Multi-script
languages" below.
Some languages provide 'large' wordlists, including words with a Zipf frequency
between 1.0 and 3.0. These are available in 12 languages that are covered by
@@ -336,6 +332,55 @@ their frequency:
3.26
## Multi-script languages
Two of the languages we support, Serbian and Chinese, are written in multiple
scripts. To avoid spurious differences in word frequencies, we automatically
transliterate the characters in these languages when looking up their words.
Serbian text written in Cyrillic letters is automatically converted to Latin
letters, using standard Serbian transliteration, when the requested language is
`sr` or `sh`. If you request the word list as `hr` (Croatian) or `bs`
(Bosnian), no transliteration will occur.
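As a rough illustration of the Cyrillic-to-Latin step, here is a minimal sketch
using only the Python standard library. The function name
`serbian_cyrillic_to_latin` and the table are written for this example and are
not wordfreq's actual implementation; the sketch also lowercases its input to
keep the table short.

```python
# Illustrative table for standard Serbian Cyrillic-to-Latin transliteration.
# This is a sketch for explanation only, not wordfreq's internal code.
CYRILLIC_TO_LATIN = str.maketrans({
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
})


def serbian_cyrillic_to_latin(text):
    """Lowercase the text and replace Serbian Cyrillic letters with Latin ones."""
    return text.lower().translate(CYRILLIC_TO_LATIN)


# Both spellings end up as the same Latin-script string.
print(serbian_cyrillic_to_latin("Србија"))
print(serbian_cyrillic_to_latin("srbija"))
```

Text that is already in Latin letters passes through unchanged, so both
spellings of a word map to the same lookup key.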
Chinese text is converted internally to a representation we call
"Oversimplified Chinese", where all Traditional Chinese characters are replaced
with their Simplified Chinese equivalent, *even if* they would not be written
that way in context. This representation lets us use a straightforward mapping
that matches both Traditional and Simplified words, unifying their frequencies
when appropriate, and does not appear to create clashes between unrelated words.
Enumerating the Chinese wordlist will produce some unfamiliar words, because
people don't actually write in Oversimplified Chinese, and because in
practice Traditional and Simplified Chinese also have different word usage.
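The idea of a context-free character mapping can be sketched in a few lines.
The five-character table below is a hypothetical stand-in for the real
mapping, which is far larger, and `oversimplify` is a name chosen for this
example rather than a wordfreq function. The counts are made up.

```python
from collections import Counter

# Hypothetical miniature table; a real Traditional-to-Simplified table
# covers thousands of characters.
TRADITIONAL_TO_SIMPLIFIED = str.maketrans({
    "國": "国", "語": "语", "學": "学", "說": "说", "乾": "干",
})


def oversimplify(text):
    """Replace every mapped Traditional character, ignoring context."""
    return text.translate(TRADITIONAL_TO_SIMPLIFIED)


# Made-up counts, only to show how both spellings land on a single key.
raw_counts = {"國語": 3, "国语": 5}
unified = Counter()
for word, count in raw_counts.items():
    unified[oversimplify(word)] += count

print(unified["国语"])
print(oversimplify("乾坤"))
```

Both spellings of 國語/国语 are counted under one key. The last line shows the
"oversimplified" effect: 乾 is always mapped to 干, although Simplified Chinese
text keeps 乾 in the word 乾坤.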
## Similar, overlapping, and varying languages
As much as we would like to give each language its own distinct code and its
own distinct word list with distinct source data, there aren't actually sharp
boundaries between languages.
Sometimes, it's convenient to pretend that the boundaries between
languages coincide with national borders, following the maxim that "a language
is a dialect with an army and a navy" (Max Weinreich). This gets complicated
when the linguistic situation and the political situation diverge.
Moreover, some of our data sources rely on language detection, which of course
has no idea which country the writer of the text belongs to.
So we've had to make some arbitrary decisions about how to represent the
fuzzier language boundaries, such as those within Chinese, Malay, and
Croatian/Bosnian/Serbian. See [Language Log][] for some firsthand reports of
the mutual intelligibility or unintelligibility of languages.
[Language Log]: http://languagelog.ldc.upenn.edu/nll/?p=12633
Smoothing over our arbitrary decisions is the fact that we use the `langcodes`
module to find the best match for a language code. If you ask for word
frequencies in `cmn-Hans` (the fully specific language code for Mandarin in
Simplified Chinese), you will get the `zh` wordlist, for example.
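To make the fallback concrete, here is a deliberately simplified stand-in for
language-code matching: it discards script and region subtags and maps a
specific language to its macrolanguage. It does not use `langcodes` and is
much cruder than what that module does; `SUPPORTED`, `MACROLANGUAGE`, and
`simple_match` are hypothetical names for this sketch.

```python
# Hypothetical set of available word lists (a small subset for illustration).
SUPPORTED = {"zh", "sh", "nb"}

# Tiny illustrative table: Mandarin (cmn) belongs to the Chinese macrolanguage.
MACROLANGUAGE = {"cmn": "zh"}


def simple_match(tag):
    """Return the supported code for a language tag, or None if there is none."""
    # Keep only the primary language subtag, e.g. "cmn" from "cmn-Hans".
    language = tag.replace("_", "-").split("-")[0].lower()
    # Fall back to the macrolanguage when one is listed.
    language = MACROLANGUAGE.get(language, language)
    return language if language in SUPPORTED else None


print(simple_match("cmn-Hans"))
print(simple_match("zh-Hant"))
```

Both calls resolve to `zh` in this sketch, which mirrors the behavior
described for `cmn-Hans`.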
## License
`wordfreq` is freely redistributable under the MIT license (see
@@ -363,6 +408,10 @@ sources:
- Wikipedia, the free encyclopedia (http://www.wikipedia.org)
It contains data from OPUS OpenSubtitles 2016
(http://opus.lingfil.uu.se/OpenSubtitles2016.php), whose data originates from
the OpenSubtitles project (http://www.opensubtitles.org/).
It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK,
SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al.
(see citations below) and available at
@@ -457,6 +506,11 @@ The same citation in BibTex format:
analyzer.
http://mecab.sourceforge.net/
- Lison, P. and Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large
Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th
International Conference on Language Resources and Evaluation (LREC 2016).
http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf
- van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014).
SUBTLEX-UK: A new and improved word frequency database for British English.
The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.