add SUBTLEX to the readme

Former-commit-id: e6a2886a66
Robyn Speer 2015-09-03 18:56:56 -04:00
parent f66d03b1b9
commit d267e0967c

@@ -121,31 +121,33 @@ of word usage on different topics at different levels of formality. The sources
 - **Twitter**: Messages sampled from Twitter's public stream
 - **Wikipedia**: The full text of Wikipedia in 2015
 
-The following 12 languages are well-supported, using at least 3 different sources
-of word frequencies:
+The following 12 languages are well-supported, with reasonable tokenization and
+at least 3 different sources of word frequencies:
 
-Language   Code     GBooks   LeedsIC  OpenSub  Twitter  Wikipedia
-──────────────────┼──────────────────────────────────────────
-Arabic     ar     │ -        Yes      Yes      Yes      Yes
-German     de     │ -        Yes      Yes      Yes[1]   Yes
-English    en     │ Yes      Yes      Yes      Yes      Yes
-Spanish    es     │ -        Yes      Yes      Yes      Yes
-French     fr     │ -        Yes      Yes      Yes      Yes
-Indonesian id     │ -        -        Yes      Yes      Yes
-Italian    it     │ -        Yes      Yes      Yes      Yes
-Japanese   ja     │ -        Yes      -        Yes      Yes
-Malay      ms     │ -        -        Yes      Yes      Yes
-Dutch      nl     │ -        -        Yes      Yes      Yes
-Portuguese pt     │ -        Yes      Yes      Yes      Yes
-Russian    ru     │ -        Yes      Yes      Yes      Yes
+Language   Code     GBooks   SUBTLEX  LeedsIC  OpenSub  Twitter  Wikipedia
+──────────────────┼──────────────────────────────────────────────────
+Arabic     ar     │ -        -        Yes      Yes      Yes      Yes
+German     de     │ -        -        Yes      Yes      Yes[1]   Yes
+English    en     │ Yes      Yes      Yes      Yes      Yes      Yes
+Spanish    es     │ -        -        Yes      Yes      Yes      Yes
+French     fr     │ -        -        Yes      Yes      Yes      Yes
+Indonesian id     │ -        -        -        Yes      Yes      Yes
+Italian    it     │ -        -        Yes      Yes      Yes      Yes
+Japanese   ja     │ -        -        Yes      -        Yes      Yes
+Malay      ms     │ -        -        -        Yes      Yes      Yes
+Dutch      nl     │ -        -        -        Yes      Yes      Yes
+Portuguese pt     │ -        -        Yes      Yes      Yes      Yes
+Russian    ru     │ -        -        Yes      Yes      Yes      Yes
 
-These 3 languages are only marginally supported so far:
+These 3 languages are only marginally supported so far, either because
+they have too few data sources, or in the case of Chinese because we are
+lacking tokenization support for it:
 
-Language   Code     GBooks   LeedsIC  OpenSub  Twitter  Wikipedia
-──────────────────┼──────────────────────────────────────────
-Greek      el     │ -        Yes      Yes      -        -
-Korean     ko     │ -        -        -        Yes      Yes
-Chinese    zh     │ -        Yes      Yes      -        -
+Language   Code     GBooks   SUBTLEX  LeedsIC  OpenSub  Twitter  Wikipedia
+──────────────────┼──────────────────────────────────────────────────
+Greek      el     │ -        -        Yes      Yes      -        -
+Korean     ko     │ -        -        -        -        Yes      Yes
+Chinese    zh     │ -        Yes      Yes      Yes      -        -
 
 [1] We've counted the frequencies from tweets in German, such as they are, but
 you should be aware that German is not a frequently-used language on Twitter.
@@ -219,6 +221,18 @@ sources:
 
 - Wikipedia, the free encyclopedia (http://www.wikipedia.org)
 
+It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK, and
+SUBTLEX-CH, created by Marc Brysbaert et al. and available at
+http://crr.ugent.be/programs-data/subtitle-frequencies. I (Robyn Speer) have
+obtained permission by e-mail from Marc Brysbaert to distribute these wordlists
+in wordfreq, to be used for any purpose, not just for academic use, under these
+conditions:
+
+- Wordfreq and code derived from it must credit the SUBTLEX authors.
+- It must remain clear that SUBTLEX is freely available data.
+
+These terms are similar to the Creative Commons Attribution-ShareAlike license.
+
 Some additional data was collected by a custom application that watches the
 streaming Twitter API, in accordance with Twitter's Developer Agreement &
 Policy. This software gives statistics about words that are commonly used on