Significant changes in this data include:
- Added ParaCrawl, a multilingual Web crawl, as a data source.
This supplements the Leeds Web crawl with more modern data.
ParaCrawl seems to provide a more balanced sample of Web pages than
Common Crawl, which we once considered adding, but found that its data
heavily overrepresented TripAdvisor and Urban Dictionary in a way that
was very apparent in the word frequencies.
ParaCrawl has a fairly subtle impact on the top terms, mostly boosting
the frequencies of numbers and months.
- Fixes to inconsistencies where words from different sources were going
through different processing steps. As a result of these
inconsistencies, some word lists contained words that couldn't
actually be looked up because they would be normalized to something
else.
All words should now go through the aggressive normalization of
`lossy_tokenize` (see the sketch after this list).
- Fixes to inconsistencies regarding what counts as a word.
Non-punctuation, non-emoji symbols such as `=` were slipping through
in some cases but not others.
- As a result of the new data, Latvian becomes a supported language and
Czech gets promoted to a 'large' language.
We don't need to set it to anything other than 80 for now, but we will
need to if we try to distinguish three kinds of Chinese (zh-Hans,
zh-Hant, and unified zh-Hani).
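
To make the normalization fixes above concrete (both the `lossy_tokenize` step and the stricter definition of a word), here is a minimal sketch of how a candidate word-list entry can be checked. The `list_entry` helper is hypothetical, not part of wordfreq or exquisite-corpus, and the exact normalized forms (case folding, digit handling, and so on) depend on the wordfreq version; the point is only that an entry must already be in the form that lookups normalize to, and that symbol-only strings produce no tokens at all.

```python
from wordfreq import lossy_tokenize  # in some versions: wordfreq.tokens

def list_entry(word, lang):
    """Return the form of `word` that belongs in a word list, or None.

    An entry is only useful if it is already in its normalized form;
    symbol-only strings such as '=' produce no tokens, and strings
    that split into several tokens don't belong in the list either.
    """
    tokens = lossy_tokenize(word, lang)
    if len(tokens) != 1:
        return None
    return tokens[0]

for candidate in ["Flanders", "=", "2016"]:
    # 'Flanders' comes back case-folded, '=' yields no entry, and the
    # digits of '2016' may be smashed into a generic numeric form,
    # depending on the wordfreq version.
    print(repr(candidate), "->", repr(list_entry(candidate, "en")))
```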
This is the result of re-running exquisite-corpus via wordfreq 2. The
frequencies for most languages were identical. Small changes that move
words by a few places in the list appeared in Chinese, Japanese, and
Korean. There are also even smaller changes in Bengali and Hindi.
The source of the CJK change is that Roman letters are case-folded
_before_ Jieba or MeCab tokenization, which changes their output in a
few cases.
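
Here is a sketch of that ordering for Chinese, calling Jieba directly. The two helpers are hypothetical and are not the exquisite-corpus pipeline; they only show why case-folding before segmentation can give different tokens than case-folding afterwards.

```python
import jieba  # the Chinese tokenizer used for the zh word counts

def fold_then_tokenize(text):
    # New order: case-fold Roman letters first, then segment.
    return jieba.lcut(text.casefold())

def tokenize_then_fold(text):
    # Old order: segment first, lower-case each token afterwards.
    return [token.casefold() for token in jieba.lcut(text)]

sample = "我的iPhone坏了"
print(fold_then_tokenize(sample))
print(tokenize_then_fold(sample))
# For most strings the two orders agree, which matches the observation
# that only a few frequencies shifted in the regenerated data.
```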
In Hindi, one word changed frequency in the top 500. In Bengali, none of
those words changed frequency, but the data file is still different.
I'm not sure I have as solid an explanation here, except that these
languages use the regex tokenizer, and we just updated the `regex`
dependency, which could affect some edge cases in these languages.
* Tokenize by graphemes, not codepoints (see the sketch after this list)
* Add more documentation to TOKEN_RE
* Remove extra line break
* Update docstring - Brahmic scripts are no longer an exception
* Approve using version 2017.07.28 of regex
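
For the 'tokenize by graphemes, not codepoints' change above, here is a small illustration using the `regex` module's `\X` pattern, which matches extended grapheme clusters. This is only a sketch of the concept, not wordfreq's actual TOKEN_RE.

```python
import regex  # the third-party 'regex' module, not the stdlib 're'

# A Devanagari word: each consonant plus its vowel sign or nasalization
# mark is one grapheme cluster built from several codepoints, so the
# two ways of splitting disagree.
word = "हिंदी"

codepoints = list(word)
graphemes = regex.findall(r"\X", word)

print(len(codepoints), codepoints)  # one item per codepoint
print(len(graphemes), graphemes)    # one item per user-perceived character

# Matching grapheme clusters keeps combining marks attached to their
# base characters, which is why Brahmic scripts no longer need to be
# treated as an exception.
```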
This changes the version from 1.4.2 to 1.5. Things done in this update include:
* Include Common Crawl; support 11 more languages
* New frequency-merging strategy
* New sources: Chinese from Wikipedia (mostly Traditional characters), and a big Dutch word list
* Remove low-quality sources: Greek Twitter (too often, kaomoji were detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is dropped as an available language, and Greek is no longer a 'large' language after all.
* Add Korean tokenization, and include MeCab files in data
* Remove marks from more languages
* Deal with commas and cedillas in Turkish and Romanian (see the sketch after this list)
* add a Hebrew tokenization test
* fix terminology in docstrings about abjad scripts
* combine Japanese and Korean tokenization into the same function
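
Regarding the commas-and-cedillas item a few bullets up: Romanian's correct letters use a comma below (ș, ț), but legacy-encoded text often contains the cedilla forms (ş, ţ), while Turkish genuinely uses the cedilla ş. The sketch below shows the kind of per-language cleanup involved; the names and the direction of the mapping are assumptions for illustration, not wordfreq's actual code.

```python
# Map the legacy cedilla forms to the comma-below letters Romanian
# actually uses; map stray comma-below letters back to the cedilla
# form for Turkish.
ROMANIAN_FIX = str.maketrans({
    "\u015e": "\u0218",  # Ş (cedilla) -> Ș (comma below)
    "\u015f": "\u0219",  # ş (cedilla) -> ș (comma below)
    "\u0162": "\u021a",  # Ţ (cedilla) -> Ț (comma below)
    "\u0163": "\u021b",  # ţ (cedilla) -> ț (comma below)
})
TURKISH_FIX = str.maketrans({
    "\u0218": "\u015e",  # Ș (comma below) -> Ş (cedilla)
    "\u0219": "\u015f",  # ș (comma below) -> ş (cedilla)
})

def normalize_s_t(text, lang):
    """Hypothetical helper: pick one consistent letter form per language."""
    if lang == "ro":
        return text.translate(ROMANIAN_FIX)
    if lang == "tr":
        return text.translate(TURKISH_FIX)
    return text

print(normalize_s_t("ace\u015fti", "ro"))  # cedilla ş becomes comma-below ș
print(normalize_s_t("i\u0219te", "tr"))    # stray comma-below ș becomes ş
```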