mirror of https://github.com/rspeer/wordfreq.git (synced 2024-12-23 17:31:41 +00:00)
expand list of sources and supported languages
parent d94428d454
commit d9a1c34d00
10 README.md
@@ -118,6 +118,7 @@ of word usage on different topics at different levels of formality. The sources
- **GBooks**: Google Books Ngrams 2013
- **LeedsIC**: The Leeds Internet Corpus
- **OpenSub**: OpenSubtitles
- **SUBTLEX**: The SUBTLEX word frequency lists
- **Twitter**: Messages sampled from Twitter's public stream
- **Wikipedia**: The full text of Wikipedia in 2015

@@ -128,6 +129,7 @@ at least 3 different sources of word frequencies:
──────────────────┼──────────────────────────────────────────────────
Arabic      ar    │ -        -        Yes      Yes      Yes      Yes
German      de    │ -        -        Yes      Yes      Yes[1]   Yes
Greek       el    │ -        -        Yes      Yes      Yes      Yes
English     en    │ Yes      Yes      Yes      Yes      Yes      Yes
Spanish     es    │ -        -        Yes      Yes      Yes      Yes
French      fr    │ -        -        Yes      Yes      Yes      Yes
@@ -138,14 +140,14 @@ at least 3 different sources of word frequencies:
Dutch       nl    │ -        -        -        Yes      Yes      Yes
Portuguese  pt    │ -        -        Yes      Yes      Yes      Yes
Russian     ru    │ -        -        Yes      Yes      Yes      Yes
Turkish     tr    │ -        -        -        Yes      Yes      Yes

These 3 languages are only marginally supported so far, either because
they have too few data sources, or in the case of Chinese because we are
lacking tokenization support for it:
These languages are only marginally supported so far. We have too few data
sources so far in Korean (feel free to suggest some), and we are lacking
tokenization support for Chinese.

Language    Code    GBooks   SUBTLEX  LeedsIC  OpenSub  Twitter  Wikipedia
──────────────────┼──────────────────────────────────────────────────
Greek       el    │ -        -        Yes      Yes      -        -
Korean      ko    │ -        -        -        -        Yes      Yes
Chinese     zh    │ -        Yes      Yes      Yes      -        -
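
The "at least 3 different sources" rule in the hunk context can be checked against the table rows with a short script. This is only a sketch: `parse_row` and `source_counts` are hypothetical helpers, not part of wordfreq, and it assumes every row uses the column order of the header shown above (GBooks, SUBTLEX, LeedsIC, OpenSub, Twitter, Wikipedia). Only a few rows from the diff are included.

```python
# Hypothetical helpers (not part of wordfreq) that count the sources
# marked "Yes" in README table rows copied from this diff.

# Assumption: all rows share the column order of the header in the diff.
SOURCES = ["GBooks", "SUBTLEX", "LeedsIC", "OpenSub", "Twitter", "Wikipedia"]

# A sample of rows from the tables above (whitespace is not significant).
ROWS = """\
Arabic ar │ - - Yes Yes Yes Yes
German de │ - - Yes Yes Yes[1] Yes
English en │ Yes Yes Yes Yes Yes Yes
Dutch nl │ - - - Yes Yes Yes
Korean ko │ - - - - Yes Yes
Chinese zh │ - Yes Yes Yes - -
"""


def parse_row(line):
    """Return (language_code, [source names marked Yes]) for one table row."""
    left, right = line.split("│")
    _name, code = left.split()
    cells = right.split()
    # "Yes[1]" carries a footnote marker but still counts as a source.
    sources = [s for s, cell in zip(SOURCES, cells) if cell.startswith("Yes")]
    return code, sources


def source_counts(rows):
    """Map each language code to the number of sources it has."""
    return {code: len(srcs)
            for code, srcs in map(parse_row, rows.strip().splitlines())}


if __name__ == "__main__":
    for code, n in source_counts(ROWS).items():
        print(code, n, "sources")
```

Chinese reaches 3 sources in this count yet remains in the marginal table, which matches the README text: the limiting factor there is tokenization support, not the number of sources.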