From d9a1c34d009d7ecefe9d0761f590761bbf08a9df Mon Sep 17 00:00:00 2001
From: Rob Speer
Date: Fri, 4 Sep 2015 01:03:36 -0400
Subject: [PATCH] expand list of sources and supported languages

---
 README.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 9a584f9..b509301 100644
--- a/README.md
+++ b/README.md
@@ -118,6 +118,7 @@ of word usage on different topics at different levels of formality. The sources
 - **GBooks**: Google Books Ngrams 2013
 - **LeedsIC**: The Leeds Internet Corpus
 - **OpenSub**: OpenSubtitles
+- **SUBTLEX**: The SUBTLEX word frequency lists
 - **Twitter**: Messages sampled from Twitter's public stream
 - **Wikipedia**: The full text of Wikipedia in 2015
 
@@ -128,6 +129,7 @@ at least 3 different sources of word frequencies:
     ──────────────────┼──────────────────────────────────────────────────
     Arabic     ar     │ -      -       Yes     Yes     Yes     Yes
     German     de     │ -      -       Yes     Yes     Yes[1]  Yes
+    Greek      el     │ -      -       Yes     Yes     Yes     Yes
     English    en     │ Yes    Yes     Yes     Yes     Yes     Yes
     Spanish    es     │ -      -       Yes     Yes     Yes     Yes
     French     fr     │ -      -       Yes     Yes     Yes     Yes
@@ -138,14 +140,14 @@ at least 3 different sources of word frequencies:
     Dutch      nl     │ -      -       -       Yes     Yes     Yes
     Portuguese pt     │ -      -       Yes     Yes     Yes     Yes
     Russian    ru     │ -      -       Yes     Yes     Yes     Yes
+    Turkish    tr     │ -      -       -       Yes     Yes     Yes
 
-These 3 languages are only marginally supported so far, either because
-they have too few data sources, or in the case of Chinese because we are
-lacking tokenization support for it:
+These languages are only marginally supported so far. We have too few data
+sources so far in Korean (feel free to suggest some), and we are lacking
+tokenization support for Chinese.
 
     Language   Code     GBooks SUBTLEX LeedsIC OpenSub Twitter Wikipedia
     ──────────────────┼──────────────────────────────────────────────────
-    Greek      el     │ -      -       Yes     Yes     -       -
     Korean     ko     │ -      -       -       -       Yes     Yes
     Chinese    zh     │ -      Yes     Yes     Yes     -       -
 