mirror of
https://github.com/rspeer/wordfreq.git
synced 2024-12-23 17:31:41 +00:00
New data import from exquisite-corpus
Significant changes in this data include: - Added ParaCrawl, a multilingual Web crawl, as a data source. This supplements the Leeds Web crawl with more modern data. ParaCrawl seems to provide a more balanced sample of Web pages than Common Crawl, which we once considered adding, but found that its data heavily overrepresented TripAdvisor and Urban Dictionary in a way that was very apparent in the word frequencies. ParaCrawl has a fairly subtle impact on the top terms, mostly boosting the frequencies of numbers and months. - Fixes to inconsistencies where words from different sources were going through different processing steps. As a result of these inconsistencies, some word lists contained words that couldn't actually be looked up because they would be normalized to something else. All words should now go through the aggressive normalization of `lossy_tokenize`. - Fixes to inconsistencies regarding what counts as a word. Non-punctuation, non-emoji symbols such as `=` were slipping through in some cases but not others. - As a result of the new data, Latvian becomes a supported language and Czech gets promoted to a 'large' language.
This commit is contained in:
parent
6235d88869
commit
d5f7335d90
File diff suppressed because it is too large
Load Diff
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Loading…
Reference in New Issue
Block a user