wordfreq/wordfreq
Robyn Speer d5f7335d90 New data import from exquisite-corpus (2018-06-12 17:22:43 -04:00)
Significant changes in this data include:

- Added ParaCrawl, a multilingual Web crawl, as a data source.
  This supplements the Leeds Web crawl with more modern data.

  ParaCrawl seems to provide a more balanced sample of Web pages than
  Common Crawl, which we once considered adding but decided against:
  its data heavily overrepresented TripAdvisor and Urban Dictionary, in
  a way that was very apparent in the word frequencies.

  ParaCrawl has a fairly subtle impact on the top terms, mostly boosting
  the frequencies of numbers and months.

- Fixes to inconsistencies where words from different sources were going
  through different processing steps. As a result of these
  inconsistencies, some word lists contained words that couldn't
  actually be looked up because they would be normalized to something
  else.

  All words should now go through the aggressive normalization of
  `lossy_tokenize`, so the stored form of a word and the form used for
  lookup always match (see the sketch after this list).

- Fixes to inconsistencies regarding what counts as a word.
  Non-punctuation, non-emoji symbols such as `=` were slipping through
  in some cases but not others (see the second sketch after this list).

- As a result of the new data, Latvian becomes a supported language and
  Czech gets promoted to a 'large' language.
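
To illustrate the normalization fix, here is a minimal sketch against wordfreq's public API. The input strings and printed values are illustrative assumptions, and the exact normalizations applied by `lossy_tokenize` can vary between wordfreq versions.

```python
# Minimal sketch: the word lists and the lookup path now share the same
# aggressive normalization (lossy_tokenize), so a listed word can always
# be found again. Import paths assume wordfreq's public API.
from wordfreq import lossy_tokenize, word_frequency

for text in ["Café", "１２３", "naïve"]:
    tokens = lossy_tokenize(text, "en")
    print(text, "->", tokens)
    for token in tokens:
        # word_frequency() normalizes its input the same way the lists
        # were built, so the normalized token is what actually gets
        # looked up.
        print("   ", token, word_frequency(token, "en"))
```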
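
For the fix about what counts as a word, a quick check of the tokenizer (again a sketch using the public `tokenize` function, not code from this commit) shows that stray symbols such as `=` are not returned as tokens, so after this change they should not appear in the word lists either:

```python
# Sketch: with the default include_punctuation=False, symbols like "="
# and "+" are not word tokens and are dropped from the output.
from wordfreq import tokenize

print(tokenize("x = y + 2", "en"))
```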
data New data import from exquisite-corpus 2018-06-12 17:22:43 -04:00
__init__.py More explicit error message for a missing wordlist 2018-03-14 15:10:27 -04:00
chinese.py Traditional Chinese should be preserved through tokenization 2018-03-08 18:08:55 -05:00
language_info.py Handle Japanese edge cases in simple_tokenize 2018-04-26 15:53:07 -04:00
mecab.py Allow MeCab to work in Japanese or Korean without the other 2016-08-19 11:41:35 -04:00
preprocess.py fix az-Latn transliteration, and test 2018-03-08 16:47:36 -05:00
tokens.py Handle Japanese edge cases in simple_tokenize 2018-04-26 15:53:07 -04:00
transliterate.py fix az-Latn transliteration, and test 2018-03-08 16:47:36 -05:00