readme update: web text comes from OSCAR

Robyn Speer 2021-04-15 14:45:29 -04:00 committed by GitHub
parent b13d35e503
commit c244ff0d10


@@ -153,8 +153,7 @@ come from multiple sources:
 - **Subtitles**, from OPUS OpenSubtitles 2018 and SUBTLEX
 - **News**, from NewsCrawl 2014 and GlobalVoices
 - **Books**, from Google Books Ngrams 2012
-- **Web** text, from ParaCrawl, the Leeds Internet Corpus, and the MOKK
-  Hungarian Webcorpus
+- **Web** text, from OSCAR
 - **Twitter**, representing short-form social media
 - **Reddit**, representing potentially longer Internet comments
 - **Miscellaneous** word frequencies: in Chinese, we import a free wordlist