Remove Malayalam; support for it isn't ready

There are Unicode normalization problems with Malayalam. As best I understand
it, Unicode never defined normalization mappings for the Malayalam "chillu"
characters, even though their representation changed in Unicode 5.1 and again
in Unicode 9.

The result is that words that print the same end up with multiple entries, with
different codepoint sequences that don't normalize to each other.
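
A minimal sketch of the problem, using only Python's standard `unicodedata`
module. It assumes the two encodings involved are the atomic chillu code point
(U+0D7B, chillu N) and the older consonant + virama + zero-width-joiner
sequence; the helper names are made up for illustration and are not part of
wordfreq. A Latin pair is included for contrast, since that one does normalize.

```python
import unicodedata

# Two encodings of the Malayalam chillu N, which fonts typically draw alike:
ATOMIC = "\u0d7b"                # atomic chillu code point (Unicode 5.1)
SEQUENCE = "\u0d28\u0d4d\u200d"  # older style: NA + VIRAMA + ZERO WIDTH JOINER

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalized_forms(text):
    """Return the text under each of the four standard normalization forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


def ever_equal(a, b):
    """True if any standard normalization form maps a and b to the same string."""
    forms_a, forms_b = normalized_forms(a), normalized_forms(b)
    return any(forms_a[form] == forms_b[form] for form in FORMS)


# Contrast case: precomposed "é" and "e" + combining acute do unify.
print(ever_equal("\u00e9", "e\u0301"))  # True
# The two chillu encodings stay distinct under every form.
print(ever_equal(ATOMIC, SEQUENCE))     # False
```

Any word containing a chillu can therefore show up under two or more keys in
a frequency table, with its count split between them.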

I certainly don't know how to resolve this, and it would need to be resolved to
have something that we could reasonably call Malayalam word frequencies.
Robyn Speer 2021-03-30 14:08:04 -04:00
parent 90f0e0a88e
commit 08816a21d1
3 changed files with 0 additions and 2 deletions


@@ -193,7 +193,6 @@ least 3 different sources of word frequencies:
 Lithuanian lt 3 - │ Yes Yes - - Yes - - -
 Macedonian mk 5 Yes │ Yes Yes Yes - Yes Yes - -
 Malay ms 3 - │ Yes Yes - - - Yes - -
-Malayalam ml 3 - │ Yes Yes - - - Yes - -
 Norwegian nb [2] 5 Yes │ Yes Yes - - Yes Yes Yes -
 Persian fa 4 - │ Yes Yes - - Yes Yes - -
 Polish pl 6 Yes │ Yes Yes Yes - Yes Yes Yes -


@@ -83,7 +83,6 @@ def test_most_common_words():
     assert get_most_common('lt') == 'ir'
     assert get_most_common('lv') == 'un'
     assert get_most_common('mk') == 'на'
-    assert get_most_common('ml') == 'ഒരു'
     assert get_most_common('ms') == 'yang'
     assert get_most_common('nb') == 'i'
     assert get_most_common('nl') == 'de'

Binary file not shown.