Remove Malayalam; support for it isn't ready

There are Unicode normalization problems with Malayalam -- as best I understand it, Unicode simply neglected to include normalization forms for Malayalam "chillu" characters even though they changed how they're represented in Unicode 5.1 and again in Unicode 9. The result is that words that print the same end up with multiple entries, with different codepoint sequences that don't normalize to each other. I certainly don't know how to resolve this, and it would need to be resolved to have something that we could reasonably call Malayalam word frequencies.
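
A minimal sketch of the mismatch described above, using only Python's standard `unicodedata` module. It compares two encodings of the chillu "n": the older sequence NA + VIRAMA + ZERO WIDTH JOINER, and the atomic code point U+0D7B added in Unicode 5.1. The helper name `differs_under_all_forms` is hypothetical and is not part of wordfreq; this illustrates the problem only and does not attempt a fix.

```python
import unicodedata

# Older encoding of chillu-n: NA (U+0D28) + VIRAMA (U+0D4D) + ZWJ (U+200D)
LEGACY_CHILLU_N = "\u0D28\u0D4D\u200D"
# Atomic encoding added in Unicode 5.1: MALAYALAM LETTER CHILLU N (U+0D7B)
ATOMIC_CHILLU_N = "\u0D7B"


def differs_under_all_forms(a, b):
    """Map each standard normalization form to True if a and b stay distinct."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {
        form: unicodedata.normalize(form, a) != unicodedata.normalize(form, b)
        for form in forms
    }


if __name__ == "__main__":
    for form, distinct in differs_under_all_forms(
        LEGACY_CHILLU_N, ATOMIC_CHILLU_N
    ).items():
        print(form, "distinct" if distinct else "equal")
```

Because the two strings remain different under every standard form, a frequency table keyed on normalized text would count them as separate words.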
2024-12-23 09:21:37 +00:00 · 2021-03-30 14:08:04 -04:00 · 08816a21d1
commit 08816a21d1
parent 90f0e0a88e
3 changed files with 0 additions and 2 deletions
--- a/README.md
+++ b/README.md
@@ -193,7 +193,6 @@ least 3 different sources of word frequencies:
    Lithuanian  lt      3  -      │ Yes   Yes   -     -     Yes   -     -     -
    Macedonian  mk      5  Yes    │ Yes   Yes   Yes   -     Yes   Yes   -     -
    Malay       ms      3  -      │ Yes   Yes   -     -     -     Yes   -     -
-   Malayalam   ml      3  -      │ Yes   Yes   -     -     -     Yes   -     -
    Norwegian   nb [2]  5  Yes    │ Yes   Yes   -     -     Yes   Yes   Yes   -
    Persian     fa      4  -      │ Yes   Yes   -     -     Yes   Yes   -     -
    Polish      pl      6  Yes    │ Yes   Yes   Yes   -     Yes   Yes   Yes   -
--- a/tests/test_general.py
+++ b/tests/test_general.py
@@ -83,7 +83,6 @@ def test_most_common_words():
    assert get_most_common('lt') == 'ir'
    assert get_most_common('lv') == 'un'
    assert get_most_common('mk') == 'на'
-   assert get_most_common('ml') == 'ഒരു'
    assert get_most_common('ms') == 'yang'
    assert get_most_common('nb') == 'i'
    assert get_most_common('nl') == 'de'
--- a/wordfreq/data/small_ml.msgpack.gz
+++ b/wordfreq/data/small_ml.msgpack.gz