wordfreq/tests/test_korean.py

from wordfreq import tokenize, word_frequency
import pytest


def test_tokens():
    # The Korean tokenizer should split this phrase into its two component tokens.
    assert tokenize("감사합니다", "ko") == ["감사", "합니다"]


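def test_combination():
    gamsa_freq = word_frequency("감사", "ko")
    habnida_freq = word_frequency("합니다", "ko")

    # A phrase made of the same token twice should have half that token's frequency.
    assert word_frequency("감사감사", "ko") == pytest.approx(gamsa_freq / 2, rel=0.01)
    # More generally, the reciprocal of a phrase's frequency should equal the sum
    # of the reciprocals of its tokens' frequencies (checked to within 1%).
    assert 1.0 / word_frequency("감사합니다", "ko") == pytest.approx(
        1.0 / gamsa_freq + 1.0 / habnida_freq, rel=0.01
    )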