mirror of https://github.com/rspeer/wordfreq.git
synced 2024-12-23 17:31:41 +00:00
parent f66d03b1b9
commit d267e0967c
58
README.md
@@ -121,31 +121,33 @@ of word usage on different topics at different levels of formality. The sources
 - **Twitter**: Messages sampled from Twitter's public stream
 - **Wikipedia**: The full text of Wikipedia in 2015
 
-The following 12 languages are well-supported, using at least 3 different sources
-of word frequencies:
+The following 12 languages are well-supported, with reasonable tokenization and
+at least 3 different sources of word frequencies:
 
-    Language    Code    GBooks  LeedsIC  OpenSub  Twitter  Wikipedia
-    ──────────────────┼──────────────────────────────────────────
-    Arabic      ar    │ -       Yes      Yes      Yes      Yes
-    German      de    │ -       Yes      Yes      Yes[1]   Yes
-    English     en    │ Yes     Yes      Yes      Yes      Yes
-    Spanish     es    │ -       Yes      Yes      Yes      Yes
-    French      fr    │ -       Yes      Yes      Yes      Yes
-    Indonesian  id    │ -       -        Yes      Yes      Yes
-    Italian     it    │ -       Yes      Yes      Yes      Yes
-    Japanese    ja    │ -       Yes      -        Yes      Yes
-    Malay       ms    │ -       -        Yes      Yes      Yes
-    Dutch       nl    │ -       -        Yes      Yes      Yes
-    Portuguese  pt    │ -       Yes      Yes      Yes      Yes
-    Russian     ru    │ -       Yes      Yes      Yes      Yes
+    Language    Code    GBooks  SUBTLEX  LeedsIC  OpenSub  Twitter  Wikipedia
+    ──────────────────┼──────────────────────────────────────────────────
+    Arabic      ar    │ -       -        Yes      Yes      Yes      Yes
+    German      de    │ -       -        Yes      Yes      Yes[1]   Yes
+    English     en    │ Yes     Yes      Yes      Yes      Yes      Yes
+    Spanish     es    │ -       -        Yes      Yes      Yes      Yes
+    French      fr    │ -       -        Yes      Yes      Yes      Yes
+    Indonesian  id    │ -       -        -        Yes      Yes      Yes
+    Italian     it    │ -       -        Yes      Yes      Yes      Yes
+    Japanese    ja    │ -       -        Yes      -        Yes      Yes
+    Malay       ms    │ -       -        -        Yes      Yes      Yes
+    Dutch       nl    │ -       -        -        Yes      Yes      Yes
+    Portuguese  pt    │ -       -        Yes      Yes      Yes      Yes
+    Russian     ru    │ -       -        Yes      Yes      Yes      Yes
 
-These 3 languages are only marginally supported so far:
+These 3 languages are only marginally supported so far, either because
+they have too few data sources, or in the case of Chinese because we are
+lacking tokenization support for it:
 
-    Language    Code    GBooks  LeedsIC  OpenSub  Twitter  Wikipedia
-    ──────────────────┼──────────────────────────────────────────
-    Greek       el    │ -       Yes      Yes      -        -
-    Korean      ko    │ -       -        -        Yes      Yes
-    Chinese     zh    │ -       Yes      Yes      -        -
+    Language    Code    GBooks  SUBTLEX  LeedsIC  OpenSub  Twitter  Wikipedia
+    ──────────────────┼──────────────────────────────────────────────────
+    Greek       el    │ -       -        Yes      Yes      -        -
+    Korean      ko    │ -       -        -        -        Yes      Yes
+    Chinese     zh    │ -       Yes      Yes      Yes      -        -
 
 [1] We've counted the frequencies from tweets in German, such as they are, but
 you should be aware that German is not a frequently-used language on Twitter.
@@ -219,6 +221,18 @@ sources:
 
 - Wikipedia, the free encyclopedia (http://www.wikipedia.org)
 
+It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK, and
+SUBTLEX-CH, created by Marc Brysbaert et al. and available at
+http://crr.ugent.be/programs-data/subtitle-frequencies. I (Robyn Speer) have
+obtained permission by e-mail from Marc Brysbaert to distribute these wordlists
+in wordfreq, to be used for any purpose, not just for academic use, under these
+conditions:
+
+- Wordfreq and code derived from it must credit the SUBTLEX authors.
+- It must remain clear that SUBTLEX is freely available data.
+
+These terms are similar to the Creative Commons Attribution-ShareAlike license.
+
 Some additional data was collected by a custom application that watches the
 streaming Twitter API, in accordance with Twitter's Developer Agreement &
 Policy. This software gives statistics about words that are commonly used on
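
As a rough illustration of the "at least 3 different sources" criterion in the updated README text, the Python sketch below transcribes the new availability tables from this diff and counts sources per language. The names `SOURCES`, `AVAILABILITY`, `LACKS_TOKENIZATION`, `sources_for`, and `is_well_supported` are hypothetical and are not part of wordfreq's API. The tokenization set only encodes the README's remark about Chinese; it is an assumption made for illustration, not how wordfreq itself decides support.

```python
# Column order of the updated tables in this diff.
SOURCES = ["GBooks", "SUBTLEX", "LeedsIC", "OpenSub", "Twitter", "Wikipedia"]

# One character per source, in SOURCES order: "Y" = "Yes", "-" = not available.
# Transcribed by hand from the new side of the diff (hypothetical encoding).
AVAILABILITY = {
    "ar": "--YYYY",
    "de": "--YYYY",
    "en": "YYYYYY",
    "es": "--YYYY",
    "fr": "--YYYY",
    "id": "---YYY",
    "it": "--YYYY",
    "ja": "--Y-YY",
    "ms": "---YYY",
    "nl": "---YYY",
    "pt": "--YYYY",
    "ru": "--YYYY",
    "el": "--YY--",
    "ko": "----YY",
    "zh": "-YYY--",
}

# Assumption: the README names Chinese as the language lacking tokenization support.
LACKS_TOKENIZATION = {"zh"}


def sources_for(code):
    """Return the names of the sources marked "Yes" for a language code."""
    flags = AVAILABILITY[code]
    return [name for name, flag in zip(SOURCES, flags) if flag == "Y"]


def is_well_supported(code, min_sources=3):
    """Apply the README's stated criteria: tokenization plus enough sources."""
    if code in LACKS_TOKENIZATION:
        return False
    return len(sources_for(code)) >= min_sources


well_supported = sorted(c for c in AVAILABILITY if is_well_supported(c))
marginal = sorted(c for c in AVAILABILITY if not is_well_supported(c))

if __name__ == "__main__":
    print("well-supported:", well_supported)
    print("marginal:", marginal)
```

Under these assumptions, Chinese has three sources once SUBTLEX is added but still lands in the marginal group because of tokenization, while Greek and Korean land there because each has only two sources.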