Subtle changes to CJK frequencies

This is the result of re-running exquisite-corpus via wordfreq 2.  The
frequencies for most languages were identical. Small changes that move
words by a few places in the list appeared in Chinese, Japanese, and
Korean. There are also even smaller changes in Bengali and Hindi.

The source of the CJK change is that Roman letters are case-folded
_before_ Jieba or MeCab tokenization, which changes their output in a
few cases.

In Hindi, one word changed frequency in the top 500. In Bengali, none of
those words changed frequency, but the data file is still different.
I'm not sure I have such a solid explanation here, except that these
languages use the regex tokenizer, and we just updated the regex
dependency, which could affect some edge cases of these languages.
This commit is contained in:
Robyn Speer 2018-03-14 11:18:24 -04:00
parent 0cb36aa74f
commit d68d4baad2
47 changed files with 1 additions and 1 deletions

View File

@ -257,7 +257,7 @@ into multiple tokens:
>>> zipf_frequency('New York', 'en')
5.35
>>> zipf_frequency('北京地铁', 'zh') # "Beijing Subway"
3.55
3.54
The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be. In Chinese,

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.