expand list of sources and supported languages

Former-commit-id: d9a1c34d00
Robyn Speer 2015-09-04 01:03:36 -04:00
parent 574c383202
commit 3cb4dd777e

@@ -118,6 +118,7 @@ of word usage on different topics at different levels of formality. The sources
 - **GBooks**: Google Books Ngrams 2013
 - **LeedsIC**: The Leeds Internet Corpus
 - **OpenSub**: OpenSubtitles
+- **SUBTLEX**: The SUBTLEX word frequency lists
 - **Twitter**: Messages sampled from Twitter's public stream
 - **Wikipedia**: The full text of Wikipedia in 2015
@@ -128,6 +129,7 @@ at least 3 different sources of word frequencies:
 ──────────────────┼──────────────────────────────────────────────────
 Arabic     ar     │ -        -        Yes      Yes      Yes      Yes
 German     de     │ -        -        Yes      Yes      Yes[1]   Yes
+Greek      el     │ -        -        Yes      Yes      Yes      Yes
 English    en     │ Yes      Yes      Yes      Yes      Yes      Yes
 Spanish    es     │ -        -        Yes      Yes      Yes      Yes
 French     fr     │ -        -        Yes      Yes      Yes      Yes
@@ -138,14 +140,14 @@ at least 3 different sources of word frequencies:
 Dutch      nl     │ -        -        -        Yes      Yes      Yes
 Portuguese pt     │ -        -        Yes      Yes      Yes      Yes
 Russian    ru     │ -        -        Yes      Yes      Yes      Yes
+Turkish    tr     │ -        -        -        Yes      Yes      Yes
-These 3 languages are only marginally supported so far, either because
-they have too few data sources, or in the case of Chinese because we are
-lacking tokenization support for it:
+These languages are only marginally supported so far. We have too few data
+sources so far in Korean (feel free to suggest some), and we are lacking
+tokenization support for Chinese.
 Language   Code     GBooks   SUBTLEX  LeedsIC  OpenSub  Twitter  Wikipedia
 ──────────────────┼──────────────────────────────────────────────────
-Greek      el     │ -        -        Yes      Yes      -        -
 Korean     ko     │ -        -        -        -        Yes      Yes
 Chinese    zh     │ -        Yes      Yes      Yes      -        -