mirror of https://github.com/rspeer/wordfreq.git
synced 2024-12-23 17:31:41 +00:00
parent 0c7527140c
commit cecf852040
43
README.md
@@ -167,6 +167,7 @@ The sources (and the abbreviations we'll use for them) are:
 - **OpenSub**: Data derived from OpenSubtitles but not from SUBTLEX
 - **Twitter**: Messages sampled from Twitter's public stream
 - **Wpedia**: The full text of Wikipedia in 2015
+- **Reddit**: The corpus of Reddit comments through May 2015
 - **Other**: We get additional English frequencies from Google Books Syntactic
 Ngrams 2013, and Chinese frequencies from the frequency dictionary that
 comes with the Jieba tokenizer.
@@ -174,35 +175,37 @@ The sources (and the abbreviations we'll use for them) are:
 The following 17 languages are well-supported, with reasonable tokenization and
 at least 3 different sources of word frequencies:
 
-Language    Code    SUBTLEX OpenSub LeedsIC Twitter Wpedia  Other
+Language    Code    SUBTLEX OpenSub LeedsIC Twitter Wpedia  Reddit  Other
 ──────────────────┼─────────────────────────────────────────────────────
-Arabic      ar    │ -       Yes     Yes     Yes     Yes     -
+Arabic      ar    │ -       Yes     Yes     Yes     Yes     -       -
-German      de    │ Yes     -       Yes     Yes[1]  Yes     -
+German      de    │ Yes     -       Yes     Yes[1]  Yes     Yes     -
-Greek       el    │ -       Yes     Yes     Yes     Yes     -
+Greek       el    │ -       Yes     Yes     Yes     Yes     -       -
-English     en    │ Yes     Yes     Yes     Yes     Yes     Google Books
+English     en    │ Yes     Yes     Yes     Yes     Yes     Yes     Google Books
-Spanish     es    │ -       Yes     Yes     Yes     Yes     -
+Spanish     es    │ -       Yes     Yes     Yes     Yes     Yes     -
-French      fr    │ -       Yes     Yes     Yes     Yes     -
+French      fr    │ -       Yes     Yes     Yes     Yes     -       -
-Indonesian  id    │ -       Yes     -       Yes     Yes     -
+Indonesian  id    │ -       Yes     -       Yes     Yes     -       -
-Italian     it    │ -       Yes     Yes     Yes     Yes     -
+Italian     it    │ -       Yes     Yes     Yes     Yes     -       -
-Japanese    ja    │ -       -       Yes     Yes     Yes     -
+Japanese    ja    │ -       -       Yes     Yes     Yes     -       -
-Malay       ms    │ -       Yes     -       Yes     Yes     -
+Malay       ms    │ -       Yes     -       Yes     Yes     -       -
-Dutch       nl    │ Yes     Yes     -       Yes     Yes     -
+Dutch       nl    │ Yes     Yes     -       Yes     Yes     -       -
-Polish      pl    │ -       Yes     -       Yes     Yes     -
+Polish      pl    │ -       Yes     -       Yes     Yes     -       -
-Portuguese  pt    │ -       Yes     Yes     Yes     Yes     -
+Portuguese  pt    │ -       Yes     Yes     Yes     Yes     -       -
-Russian     ru    │ -       Yes     Yes     Yes     Yes     -
+Russian     ru    │ -       Yes     Yes     Yes     Yes     -       -
-Swedish     sv    │ -       Yes     -       Yes     Yes     -
+Swedish     sv    │ -       Yes     -       Yes     Yes     Yes     -
-Turkish     tr    │ -       Yes     -       Yes     Yes     -
+Turkish     tr    │ -       Yes     -       Yes     Yes     -       -
-Chinese     zh    │ Yes     -       Yes     -       -       Jieba
+Chinese     zh    │ Yes     -       Yes     -       -       -       Jieba
 
 
 Additionally, Korean is marginally supported. You can look up frequencies in
-it, but we have too few data sources for it so far:
+it, but it will be insufficiently tokenized into words, and we have too few
+data sources for it so far:
 
 Language    Code    SUBTLEX OpenSub LeedsIC Twitter Wpedia
 ──────────────────┼───────────────────────────────────────
 Korean      ko    │ -       -       -       Yes     Yes
 
-The 'large' wordlists are available in English, Spanish, French, and Portuguese.
+The 'large' wordlists are available in English, German, Spanish, French, and
+Portuguese.
 
 [1] We've counted the frequencies from tweets in German, such as they are, but
 you should be aware that German is not a frequently-used language on Twitter.
Binary file not shown.