Commit Graph

9 Commits

Author SHA1 Message Date
Elia Robyn Lake
bf05b1b1dc estimate the freq distribution of numbers 2022-03-10 18:33:42 -05:00
Elia Robyn Speer
c2a9fe03f1 use ftfy's uncurl_quotes in lossy_tokenize 2021-09-02 17:47:47 +00:00
Robyn Speer
08816a21d1 Remove Malayalam; support for it isn't ready
There are Unicode normalization problems with Malayalam -- as best I understand
it, Unicode simply neglected to include normalization forms for Malayalam "chillu"
characters even though they changed how they're represented in Unicode 5.1 and
again in Unicode 9.

The result is that words that print the same end up with multiple entries, with
different codepoint sequences that don't normalize to each other.

I certainly don't know how to resolve this, and it would need to be resolved to
have something that we could reasonably call Malayalam word frequencies.
2021-03-30 14:10:58 -04:00
Robyn Speer
90f0e0a88e Update table, remove Galician (only two sources) 2021-03-30 13:17:36 -04:00
Robyn Speer
8777ad0811 remove Swahili, the data isn't reliable 2021-03-29 18:15:58 -04:00
Robyn Speer
ec48c0a123 update data and tests for 2.5 2021-03-29 16:18:08 -04:00
Robyn Speer
7a32b56c1c Round frequencies to 3 significant digits 2018-06-18 15:21:33 -04:00
Robyn Speer
42efcfc1ad relax the test that assumed the Chinese list has few ASCII words 2018-06-15 16:29:15 -04:00
Robyn Speer
ad0f046f47 fixes to tests, including that 'test.py' wasn't found by pytest 2018-06-15 15:48:41 -04:00