mirror of
https://github.com/rspeer/wordfreq.git
synced 2024-12-23 09:21:37 +00:00
update the README, citing OpenSubtitles 2016
parent 86f22e8523
commit 3cb3c38f47
86
README.md
@@ -230,7 +230,7 @@ least 3 different sources of word frequencies:
 Language    Code    #  Large?   WP    Subs  News  Books Web   Twit. Redd. Misc.
 ──────────────────────────────┼────────────────────────────────────────────────
 Arabic      ar      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
-Bosnian     bs[1]   3         │ Yes   Yes   -     -     -     Yes   -     -
+Bosnian     bs [1]  3         │ Yes   Yes   -     -     -     Yes   -     -
 Bulgarian   bg      3  -      │ Yes   Yes   -     -     -     Yes   -     -
 Catalan     ca      4  -      │ Yes   Yes   Yes   -     -     Yes   -     -
 Czech       cs      3  -      │ Yes   Yes   -     -     -     Yes   -     -
@@ -244,41 +244,37 @@ least 3 different sources of word frequencies:
 French      fr      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
 Hebrew      he      4  -      │ Yes   Yes   -     Yes   -     Yes   -     -
 Hindi       hi      3  -      │ Yes   -     -     -     -     Yes   Yes   -
-Croatian    hr[1]   3         │ Yes   Yes   -     -     -     Yes   -     -
+Croatian    hr [1]  3         │ Yes   Yes   -     -     -     Yes   -     -
 Hungarian   hu      3  -      │ Yes   Yes   -     -     Yes   -     -     -
 Indonesian  id      3  -      │ Yes   Yes   -     -     -     Yes   -     -
 Italian     it      7  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   Yes   -
 Japanese    ja      5  Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -
 Korean      ko      4  -      │ Yes   Yes   -     -     -     Yes   Yes   -
 Malay       ms      3  -      │ Yes   Yes   -     -     -     Yes   -     -
-Norwegian   nb[2]   4  -      │ Yes   Yes   -     -     -     Yes   Yes   -
+Norwegian   nb [2]  4  -      │ Yes   Yes   -     -     -     Yes   Yes   -
 Dutch       nl      4  Yes    │ Yes   Yes   Yes   -     -     Yes   -     -
 Polish      pl      5  Yes    │ Yes   Yes   Yes   -     -     Yes   Yes   -
 Portuguese  pt      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
 Romanian    ro      3  -      │ Yes   Yes   -     -     -     Yes   -     -
 Russian     ru      6  Yes    │ Yes   Yes   Yes   Yes   Yes   Yes   -     -
-Serbian     sr[1]   3  -      │ Yes   Yes   -     -     -     Yes   -     -
+Serbian     sr [1]  3  -      │ Yes   Yes   -     -     -     Yes   -     -
 Swedish     sv      4  -      │ Yes   Yes   -     -     -     Yes   Yes   -
 Turkish     tr      3  -      │ Yes   Yes   -     -     -     Yes   -     -
 Ukrainian   uk      4  -      │ Yes   Yes   -     -     -     Yes   Yes   -
-Chinese     zh[3]   6  Yes    │ Yes   -     Yes   Yes   Yes   Yes   -     Jieba
+Chinese     zh [3]  6  Yes    │ Yes   -     Yes   Yes   Yes   Yes   -     Jieba

 [1] Bosnian, Croatian, and Serbian use the same underlying word list, because
-they are mutually intelligible and have large amounts of vocabulary in common.
-This word list can also be accessed with the language code `sh`.
-We list them separately to emphasize that the word list is appropriate for
-looking up frequencies in any of those languages, even though the idea of a
-unified Serbo-Croatian language is losing popularity. Lookups in `sr` or `sh`
-will also automatically unify Cyrillic and Latin spellings.
+they share most of their vocabulary and grammar, they were once considered the
+same language, and language detection cannot distinguish them. This word list
+can also be accessed with the language code `sh`.

 [2] The Norwegian text we have is specifically written in Norwegian Bokmål, so
-we give it the language code 'nb'. We would use 'nn' for Nynorsk, but there
-isn't enough data to include it in wordfreq.
+we give it the language code 'nb' instead of the vaguer code 'no'. We would use
+'nn' for Nynorsk, but there isn't enough data to include it in wordfreq.

 [3] This data represents text written in both Simplified and Traditional
-Chinese. (SUBTLEX is mostly Simplified, for example, while Wikipedia is mostly
-Traditional.) The characters are mapped to one another so they can use the same
-underlying word frequency list.
+Chinese, with primarily Mandarin Chinese vocabulary. See "Multi-script
+languages" below.

 Some languages provide 'large' wordlists, including words with a Zipf frequency
 between 1.0 and 3.0. These are available in 12 languages that are covered by
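Note on the closing context lines of this hunk: the 'large' wordlists are described by a Zipf frequency range. The following is a minimal sketch of what that range means, assuming the usual definition of the Zipf scale (the base-10 logarithm of a word's occurrences per billion words). The helper names `proportion_to_zipf` and `zipf_to_proportion` are made up for this note and are not taken from the README.

```python
import math


def proportion_to_zipf(proportion: float) -> float:
    """Convert a word's share of all tokens (0 < p <= 1) to the Zipf scale.

    Assumption: Zipf = log10(occurrences per billion words),
    which is the same as log10(proportion) + 9.
    """
    if not 0.0 < proportion <= 1.0:
        raise ValueError("proportion must be in (0, 1]")
    return math.log10(proportion) + 9.0


def zipf_to_proportion(zipf: float) -> float:
    """Inverse of proportion_to_zipf."""
    return 10.0 ** (zipf - 9.0)


# The range quoted in the README text: Zipf 1.0 up to Zipf 3.0.
low = zipf_to_proportion(1.0)   # about one occurrence per 100 million words
high = zipf_to_proportion(3.0)  # about one occurrence per million words
print(f"Zipf 1.0 -> {low:.1e}, Zipf 3.0 -> {high:.1e}")
```

Under that assumption, the words that only appear in a 'large' list are the ones seen between roughly once per 100 million and once per million words.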
@@ -336,6 +332,55 @@ their frequency:
 3.26


+## Multi-script languages
+
+Two of the languages we support, Serbian and Chinese, are written in multiple
+scripts. To avoid spurious differences in word frequencies, we automatically
+transliterate the characters in these languages when looking up their words.
+
+Serbian text written in Cyrillic letters is automatically converted to Latin
+letters, using standard Serbian transliteration, when the requested language is
+`sr` or `sh`. If you request the word list as `hr` (Croatian) or `bs`
+(Bosnian), no transliteration will occur.
+
+Chinese text is converted internally to a representation we call
+"Oversimplified Chinese", where all Traditional Chinese characters are replaced
+with their Simplified Chinese equivalent, *even if* they would not be written
+that way in context. This representation lets us use a straightforward mapping
+that matches both Traditional and Simplified words, unifying their frequencies
+when appropriate, and does not appear to create clashes between unrelated words.
+
+Enumerating the Chinese wordlist will produce some unfamiliar words, because
+people don't actually write in Oversimplified Chinese, and because in
+practice Traditional and Simplified Chinese also have different word usage.
+
+
+## Similar, overlapping, and varying languages
+
+As much as we would like to give each language its own distinct code and its
+own distinct word list with distinct source data, there aren't actually sharp
+boundaries between languages.
+
+Sometimes, it's convenient to pretend that the boundaries between
+languages coincide with national borders, following the maxim that "a language
+is a dialect with an army and a navy" (Max Weinreich). This gets complicated
+when the linguistic situation and the political situation diverge.
+Moreover, some of our data sources rely on language detection, which of course
+has no idea which country the writer of the text belongs to.
+
+So we've had to make some arbitrary decisions about how to represent the
+fuzzier language boundaries, such as those within Chinese, Malay, and
+Croatian/Bosnian/Serbian. See [Language Log][] for some firsthand reports of
+the mutual intelligibility or unintelligibility of languages.
+
+[Language Log]: http://languagelog.ldc.upenn.edu/nll/?p=12633
+
+Smoothing over our arbitrary decisions is the fact that we use the `langcodes`
+module to find the best match for a language code. If you ask for word
+frequencies in `cmn-Hans` (the fully specific language code for Mandarin in
+Simplified Chinese), you will get the `zh` wordlist, for example.
+
+
 ## License

 `wordfreq` is freely redistributable under the MIT license (see
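Note on the "Multi-script languages" section added in this hunk: the per-character normalization it describes can be sketched as below. This is an illustration only, not wordfreq's implementation. The function name `normalize_for_lookup` and both tables are hypothetical; the Serbian table is the standard Cyrillic-to-Latin alphabet correspondence, and the Chinese table is a three-character sample standing in for a full Traditional-to-Simplified mapping.

```python
# Serbian Cyrillic -> Latin for the 30 lowercase letters of the alphabet.
_SR_LOWER = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}
# Add uppercase forms; digraphs become title case (for example "Љ" -> "Lj").
_SR_TABLE = str.maketrans(
    {**_SR_LOWER, **{k.upper(): v.capitalize() for k, v in _SR_LOWER.items()}}
)

# Tiny illustrative sample of Traditional -> Simplified characters (assumption:
# a real table would cover thousands of characters).
_ZH_SAMPLE = str.maketrans({"漢": "汉", "語": "语", "國": "国"})


def normalize_for_lookup(text: str, lang: str) -> str:
    """Map text to a single script before a frequency lookup (hypothetical)."""
    if lang in ("sr", "sh"):
        # Cyrillic is unified with Latin only for these two codes.
        return text.translate(_SR_TABLE)
    if lang == "zh":
        # Every character is replaced regardless of context.
        return text.translate(_ZH_SAMPLE)
    # Other codes, including "hr" and "bs", are left untouched.
    return text


print(normalize_for_lookup("београд", "sr"))
print(normalize_for_lookup("београд", "hr"))
print(normalize_for_lookup("漢語", "zh"))
```

Replacing characters without looking at context is what makes the Chinese mapping "oversimplified": it is cheap and deterministic, at the cost of producing forms nobody would actually write.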
@@ -363,6 +408,10 @@ sources:

 - Wikipedia, the free encyclopedia (http://www.wikipedia.org)

+It contains data from OPUS OpenSubtitles 2016
+(http://opus.lingfil.uu.se/OpenSubtitles2016.php), whose data originates from
+the OpenSubtitles project (http://www.opensubtitles.org/).
+
 It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK,
 SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al.
 (see citations below) and available at
@@ -457,6 +506,11 @@ The same citation in BibTex format:
   analyzer.
   http://mecab.sourceforge.net/

+- Lison, P. and Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large
+  Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th
+  International Conference on Language Resources and Evaluation (LREC 2016).
+  http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf
+
 - van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014).
   SUBTLEX-UK: A new and improved word frequency database for British English.
   The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.