The msgpack readme explains: "Default value of strict_map_key is changed to
True to avoid hashdos. You need to pass strict_map_key=False if you have data
which contain map keys which type is not bytes or str."
chinese.py loads SIMPLIFIED_MAP from disk. Since it is a str.translate
dictionary, its keys are integers (Unicode code points), not strings. And
since it's a dictionary we created ourselves, there's no hashdos concern,
so we can load it with strict_map_key=False.
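A minimal sketch of the situation described above, assuming a small made-up translation table rather than the real SIMPLIFIED_MAP file. The `roundtrip` helper and the sample characters are illustrative only. The import is guarded so the sketch still runs where msgpack is not installed.

```python
# A str.translate table is keyed by integer code points, not by strings.
table = {ord("語"): "语", ord("體"): "体"}

try:
    import msgpack  # third-party package: pip install msgpack
except ImportError:
    msgpack = None


def roundtrip(mapping):
    """Pack and unpack a mapping with msgpack, allowing non-str map keys.

    Hypothetical helper for illustration; not part of chinese.py.
    """
    if msgpack is None:
        # msgpack is unavailable, so there is nothing to serialize.
        return dict(mapping)
    packed = msgpack.packb(mapping)
    # With the newer default (strict_map_key=True), integer keys are rejected
    # on load, so we opt out explicitly for data we produced ourselves.
    return msgpack.unpackb(packed, raw=False, strict_map_key=False)


loaded = roundtrip(table)
simplified = "語體".translate(loaded)
```

Opting out is only reasonable here because the file is our own data; for untrusted input the strict default is the safer choice.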
I'm starting a new Python environment on a new Ubuntu installation. You
never know when a huge yak will show up and demand to be shaved.
I tried following the directions in the README and found that a couple
of steps were missing, so I've added them.
Following those steps appears to install the MeCab Korean dictionary in
`/usr/lib/x86_64-linux-gnu/mecab/dic`, which was not among the paths we
were checking, so I've added it as a search path.
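The search-path lookup can be sketched roughly as follows. This is an assumption about the general shape, not the package's actual code: `find_dictionary`, `SEARCH_PATHS`, and the dictionary name `mecab-ko-dic` are illustrative, and only the last path in the list comes from the note above.

```python
import os

# Candidate directories to look in. Only the last entry is taken from the
# note above; the first two are illustrative placeholders.
SEARCH_PATHS = [
    "/usr/lib/mecab/dic",
    "/usr/local/lib/mecab/dic",
    "/usr/lib/x86_64-linux-gnu/mecab/dic",
]


def find_dictionary(names, search_paths=SEARCH_PATHS):
    """Return the first existing directory `<search_path>/<name>`.

    Paths are tried in order, and within each path the names are tried in
    order. Raises FileNotFoundError if nothing matches.
    """
    for base in search_paths:
        for name in names:
            candidate = os.path.join(base, name)
            if os.path.isdir(candidate):
                return candidate
    raise FileNotFoundError(
        "No MeCab dictionary named %r found in %r" % (names, search_paths)
    )
```

On a machine set up as described, `find_dictionary(["mecab-ko-dic"])` would be expected to return a path under the new directory, assuming the dictionary is installed under that name.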
Significant changes in this data include:
- Added ParaCrawl, a multilingual Web crawl, as a data source.
This supplements the Leeds Web crawl with more modern data.
ParaCrawl seems to provide a more balanced sample of Web pages than
Common Crawl. We once considered adding Common Crawl, but found that its
data heavily overrepresented TripAdvisor and Urban Dictionary in a way
that was very apparent in the word frequencies.
ParaCrawl has a fairly subtle impact on the top terms, mostly boosting
the frequencies of numbers and months.
- Fixes to inconsistencies where words from different sources were going
through different processing steps. As a result of these
inconsistencies, some word lists contained words that couldn't
actually be looked up because they would be normalized to something
else.
All words should now go through the aggressive normalization of
`lossy_tokenize`.
- Fixes to inconsistencies regarding what counts as a word.
Non-punctuation, non-emoji symbols such as `=` were slipping through
in some cases but not others.
- As a result of the new data, Latvian becomes a supported language and
Czech gets promoted to a 'large' language.
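To illustrate the two consistency fixes above, here is a rough sketch of why every source needs to pass through the same normalization and the same "is this a word" check before counts are merged. The `normalize` and `is_word` functions are simplified, hypothetical stand-ins, not the real `lossy_tokenize`; in particular, this sketch drops emoji, which the real rules keep.

```python
import re
import unicodedata


def normalize(token):
    """Simplified stand-in for lossy normalization (an assumption):
    NFKC-normalize, casefold, and replace each digit in a run of two or
    more digits with 0."""
    token = unicodedata.normalize("NFKC", token).casefold()
    return re.sub(r"\d{2,}", lambda m: "0" * len(m.group()), token)


def is_word(token):
    """Treat a token as a word if it contains at least one letter or
    number (Unicode categories L* or N*). A bare symbol such as '=' has
    category Sm and is rejected."""
    return any(unicodedata.category(ch)[0] in "LN" for ch in token)


def build_wordlist(counts):
    """Merge raw counts under their normalized form, dropping non-words."""
    merged = {}
    for token, count in counts.items():
        key = normalize(token)
        if is_word(key):
            merged[key] = merged.get(key, 0) + count
    return merged


raw_counts = {"Paris": 5, "paris": 3, "2019": 4, "1984": 1, "=": 7}
wordlist = build_wordlist(raw_counts)
```

If one source skipped `normalize`, the list would hold an entry for "Paris" that a lookup could never reach, because the query itself is normalized to "paris" first.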