From e6a2886a66714d818c43cac51fffd9c4195c21b6 Mon Sep 17 00:00:00 2001
From: Rob Speer
Date: Thu, 3 Sep 2015 18:56:56 -0400
Subject: [PATCH] add SUBTLEX to the readme

---
 README.md | 58 ++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 36 insertions(+), 22 deletions(-)

diff --git a/README.md b/README.md
index c0eb421..d95bae3 100644
--- a/README.md
+++ b/README.md
@@ -121,31 +121,33 @@ of word usage on different topics at different levels of formality. The sources
 - **Twitter**: Messages sampled from Twitter's public stream
 - **Wikipedia**: The full text of Wikipedia in 2015
 
-The following 12 languages are well-supported, using at least 3 different sources
-of word frequencies:
+The following 12 languages are well-supported, with reasonable tokenization and
+at least 3 different sources of word frequencies:
 
-    Language    Code    GBooks  LeedsIC OpenSub Twitter Wikipedia
-    ──────────────────┼──────────────────────────────────────────
-    Arabic      ar    │ -       Yes     Yes     Yes     Yes
-    German      de    │ -       Yes     Yes     Yes[1]  Yes
-    English     en    │ Yes     Yes     Yes     Yes     Yes
-    Spanish     es    │ -       Yes     Yes     Yes     Yes
-    French      fr    │ -       Yes     Yes     Yes     Yes
-    Indonesian  id    │ -       -       Yes     Yes     Yes
-    Italian     it    │ -       Yes     Yes     Yes     Yes
-    Japanese    ja    │ -       Yes     -       Yes     Yes
-    Malay       ms    │ -       -       Yes     Yes     Yes
-    Dutch       nl    │ -       -       Yes     Yes     Yes
-    Portuguese  pt    │ -       Yes     Yes     Yes     Yes
-    Russian     ru    │ -       Yes     Yes     Yes     Yes
+    Language    Code    GBooks  SUBTLEX LeedsIC OpenSub Twitter Wikipedia
+    ──────────────────┼──────────────────────────────────────────────────
+    Arabic      ar    │ -       -       Yes     Yes     Yes     Yes
+    German      de    │ -       -       Yes     Yes     Yes[1]  Yes
+    English     en    │ Yes     Yes     Yes     Yes     Yes     Yes
+    Spanish     es    │ -       -       Yes     Yes     Yes     Yes
+    French      fr    │ -       -       Yes     Yes     Yes     Yes
+    Indonesian  id    │ -       -       -       Yes     Yes     Yes
+    Italian     it    │ -       -       Yes     Yes     Yes     Yes
+    Japanese    ja    │ -       -       Yes     -       Yes     Yes
+    Malay       ms    │ -       -       -       Yes     Yes     Yes
+    Dutch       nl    │ -       -       -       Yes     Yes     Yes
+    Portuguese  pt    │ -       -       Yes     Yes     Yes     Yes
+    Russian     ru    │ -       -       Yes     Yes     Yes     Yes
 
-These 3 languages are only marginally supported so far:
+These 3 languages are only marginally supported so far, either because
+they have too few data sources, or in the case of Chinese because we are
+lacking tokenization support for it:
 
-    Language    Code    GBooks  LeedsIC OpenSub Twitter Wikipedia
-    ──────────────────┼──────────────────────────────────────────
-    Greek       el    │ -       Yes     Yes     -       -
-    Korean      ko    │ -       -       -       Yes     Yes
-    Chinese     zh    │ -       Yes     Yes     -       -
+    Language    Code    GBooks  SUBTLEX LeedsIC OpenSub Twitter Wikipedia
+    ──────────────────┼──────────────────────────────────────────────────
+    Greek       el    │ -       -       Yes     Yes     -       -
+    Korean      ko    │ -       -       -       -       Yes     Yes
+    Chinese     zh    │ -       Yes     Yes     Yes     -       -
 
 [1] We've counted the frequencies from tweets in German, such as they are, but
 you should be aware that German is not a frequently-used language on Twitter.
@@ -219,6 +221,18 @@ sources:
 
 - Wikipedia, the free encyclopedia (http://www.wikipedia.org)
 
+It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK, and
+SUBTLEX-CH, created by Marc Brysbaert et al. and available at
+http://crr.ugent.be/programs-data/subtitle-frequencies. I (Rob Speer) have
+obtained permission by e-mail from Marc Brysbaert to distribute these wordlists
+in wordfreq, to be used for any purpose, not just for academic use, under these
+conditions:
+
+- Wordfreq and code derived from it must credit the SUBTLEX authors.
+- It must remain clear that SUBTLEX is freely available data.
+
+These terms are similar to the Creative Commons Attribution-ShareAlike license.
+
 Some additional data was collected by a custom application that watches the
 streaming Twitter API, in accordance with Twitter's Developer Agreement &
 Policy. This software gives statistics about words that are commonly used on