wordfreq/wordfreq
Robyn Speer d5f7335d90 New data import from exquisite-corpus (2018-06-12 17:22:43 -04:00)
Significant changes in this data include:

- Added ParaCrawl, a multilingual Web crawl, as a data source.
  This supplements the Leeds Web crawl with more modern data.

  ParaCrawl seems to provide a more balanced sample of Web pages than
  Common Crawl, which we once considered adding but decided against:
  its data heavily overrepresented TripAdvisor and Urban Dictionary, in
  a way that was very apparent in the word frequencies.

  ParaCrawl has a fairly subtle impact on the top terms, mostly boosting
  the frequencies of numbers and months.

- Fixes to inconsistencies where words from different sources were going
  through different processing steps. As a result of these
  inconsistencies, some word lists contained words that couldn't
  actually be looked up because they would be normalized to something
  else.

  All words should now go through the aggressive normalization of
  `lossy_tokenize`, so the stored form of a word and the form used for
  lookup always match (see the sketch after this list).

- Fixes to inconsistencies regarding what counts as a word.
  Non-punctuation, non-emoji symbols such as `=` were slipping through
  in some cases but not others (see the second sketch after this list).

- As a result of the new data, Latvian becomes a supported language and
  Czech gets promoted to a 'large' language.
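
To illustrate the normalization fix, here is a minimal sketch against wordfreq's public API. The input strings and printed values are illustrative assumptions, and the exact normalizations applied by `lossy_tokenize` can vary between wordfreq versions.

```python
# Minimal sketch: the word lists and the lookup path now share the same
# aggressive normalization (lossy_tokenize), so a listed word can always
# be found again. Import paths assume wordfreq's public API.
from wordfreq import lossy_tokenize, word_frequency

for text in ["Café", "１２３", "naïve"]:
    tokens = lossy_tokenize(text, "en")
    print(text, "->", tokens)
    for token in tokens:
        # word_frequency() normalizes its input the same way the lists
        # were built, so the normalized token is what actually gets
        # looked up.
        print("   ", token, word_frequency(token, "en"))
```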
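
For the fix about what counts as a word, a quick check of the tokenizer (again a sketch using the public `tokenize` function, not code from this commit) shows that stray symbols such as `=` are not returned as tokens, so after this change they should not appear in the word lists either:

```python
# Sketch: with the default include_punctuation=False, symbols like "="
# and "+" are not word tokens and are dropped from the output.
from wordfreq import tokenize

print(tokenize("x = y + 2", "en"))
```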
data New data import from exquisite-corpus 2018-06-12 17:22:43 -04:00
__init__.py More explicit error message for a missing wordlist 2018-03-14 15:10:27 -04:00
chinese.py Traditional Chinese should be preserved through tokenization 2018-03-08 18:08:55 -05:00
language_info.py Handle Japanese edge cases in simple_tokenize 2018-04-26 15:53:07 -04:00
mecab.py Allow MeCab to work in Japanese or Korean without the other 2016-08-19 11:41:35 -04:00
preprocess.py fix az-Latn transliteration, and test 2018-03-08 16:47:36 -05:00
tokens.py Handle Japanese edge cases in simple_tokenize 2018-04-26 15:53:07 -04:00
transliterate.py fix az-Latn transliteration, and test 2018-03-08 16:47:36 -05:00