diff --git a/CHANGELOG.md b/CHANGELOG.md
index dd5cbf3..3297924 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -13,7 +13,7 @@ estimated distribution that allows for Benford's law (lower numbers are more
 frequent) and a special frequency distribution for 4-digit numbers that look
 like years (2010 is more frequent than 1020).

-Relatedly:
+More changes related to digits:

 - Functions such as `iter_wordlist` and `top_n_list` no longer return
   multi-digit numbers (they used to return them in their "smashed" form, such
@@ -23,6 +23,15 @@ Relatedly:
   instead in a place that's internal to the `word_frequency` function, so we
   can look at the values of the digits before they're replaced.

+Other changes:
+
+- wordfreq is now developed using `poetry` as its package manager, and with
+  `pyproject.toml` as the source of configuration instead of `setup.py`.
+
+- The minimum version of Python supported is 3.7.
+
+- Type information is exported using `py.typed`.
+
 ## Version 2.5.1 (2021-09-02)

 - Import ftfy and use its `uncurl_quotes` method to turn curly quotes into
diff --git a/Jenkinsfile b/Jenkinsfile
deleted file mode 100644
index aca4866..0000000
--- a/Jenkinsfile
+++ /dev/null
@@ -1,4 +0,0 @@
-wheelJob(
-    upstream: [ 'wheelhouse-init' ],
-    extras: [ 'mecab', 'jieba' ]
-)
diff --git a/README.md b/README.md
index d11268f..751716f 100644
--- a/README.md
+++ b/README.md
@@ -11,7 +11,7 @@ in the usual way, either by getting it from pip:

     pip3 install wordfreq

-or by getting the repository and installing it using [poetry][]:
+or by getting the repository and installing it for development, using [poetry][]:

     poetry install

@@ -23,8 +23,8 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.

 ## Usage

 wordfreq provides access to estimates of the frequency with which a word is
-used, in 36 languages (see *Supported languages* below). It uses many different
-data sources, not just one corpus.
+used, in over 40 languages (see *Supported languages* below). It uses many
+different data sources, not just one corpus.

 It provides both 'small' and 'large' wordlists:

@@ -144,8 +144,8 @@ as `##` or `####` or `#.#####`, with `#` standing in for digits. (For compatibil
 with earlier versions of wordfreq, our stand-in character is actually `0`.)
 This is the same form of aggregation that the word2vec vocabulary does.

-Single-digit numbers are unaffected by this "binning" process; "0" through "9" have
-their own entries in each language's wordlist.
+Single-digit numbers are unaffected by this process; "0" through "9" have their own
+entries in each language's wordlist.

 When asked for the frequency of a token containing multiple digits, we multiply
 the frequency of that aggregated entry by a distribution estimating the frequency

@@ -158,10 +158,10 @@ The first digits are assigned probabilities by Benford's law, and years are assi
 probabilities from a distribution that peaks at the "present". I explored this
 in a Twitter thread at .

-The part of this distribution representing the "present" is not strictly a peak;
-it's a 20-year-long plateau from 2019 to 2039. (2019 is the last time Google Books
-Ngrams was updated, and 2039 is a time by which I will probably have figured out
-a new distribution.)
+The part of this distribution representing the "present" is not strictly a peak and
+doesn't move forward with time as the present does. Instead, it's a 20-year-long
+plateau from 2019 to 2039. (2019 is the last time Google Books Ngrams was updated,
+and 2039 is a time by which I will probably have figured out a new distribution.)

 Some examples:

@@ -172,7 +172,7 @@ Some examples:
     >>> word_frequency("1022", "en")
     1.28e-07

-Aside from years, the distribution does **not** care about the meaning of the numbers:
+Aside from years, the distribution does not care about the meaning of the numbers:

     >>> word_frequency("90210", "en")
     3.34e-10

@@ -419,19 +419,16 @@ As much as we would like to give each language its own distinct code and its
 own distinct word list with distinct source data, there aren't actually sharp
 boundaries between languages.

-Sometimes, it's convenient to pretend that the boundaries between
-languages coincide with national borders, following the maxim that "a language
-is a dialect with an army and a navy" (Max Weinreich). This gets complicated
-when the linguistic situation and the political situation diverge.
-Moreover, some of our data sources rely on language detection, which of course
-has no idea which country the writer of the text belongs to.
+Sometimes, it's convenient to pretend that the boundaries between languages
+coincide with national borders, following the maxim that "a language is a
+dialect with an army and a navy" (Max Weinreich). This gets complicated when the
+linguistic situation and the political situation diverge. Moreover, some of our
+data sources rely on language detection, which of course has no idea which
+country the writer of the text belongs to.

 So we've had to make some arbitrary decisions about how to represent the
 fuzzier language boundaries, such as those within Chinese, Malay, and
-Croatian/Bosnian/Serbian. See [Language Log][] for some firsthand reports of
-the mutual intelligibility or unintelligibility of languages.
-
-[Language Log]: http://languagelog.ldc.upenn.edu/nll/?p=12633
+Croatian/Bosnian/Serbian.

 Smoothing over our arbitrary decisions is the fact that we use the `langcodes`
 module to find the best match for a language code. If you ask for word

@@ -446,6 +443,9 @@ the 'cjk' feature:

     pip install wordfreq[cjk]

+You can put `wordfreq[cjk]` in a list of dependencies, such as the
+`[tool.poetry.dependencies]` section of your own project.
+
 Tokenizing Chinese depends on the `jieba` package, tokenizing Japanese depends
 on `mecab-python3` and `ipadic`, and tokenizing Korean depends on
 `mecab-python3` and `mecab-ko-dic`.
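(The README section above describes how `word_frequency` redistributes the frequency of a binned digit entry such as `0000` across specific digit sequences. The following is only an illustrative sketch of that idea: the real constants and the year distribution are internal to wordfreq's `word_frequency`, and `benford_weight`/`digit_weight` are hypothetical helper names, not part of the wordfreq API.)

    import math

    def benford_weight(leading_digit: int) -> float:
        # Benford's law: P(leading digit = d) = log10(1 + 1/d), for d in 1..9
        return math.log10(1 + 1 / leading_digit)

    def digit_weight(digits: str) -> float:
        # Rough weight for one specific digit string within its length bin:
        # Benford's law for the leading digit, uniform over the remaining digits.
        # (wordfreq's real distribution also special-cases 4-digit year-like
        # numbers with a plateau around the present, 2019-2039; omitted here.)
        leading = int(digits[0])
        leading_weight = 0.1 if leading == 0 else benford_weight(leading)
        return leading_weight * 0.1 ** (len(digits) - 1)

    # The frequency reported for a token like "2022" is then roughly the
    # frequency of the binned "0000" entry times a weight like
    # digit_weight("2022"), which is why "2022" comes out more frequent
    # than "1022" in the examples above.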
diff --git a/poetry.lock b/poetry.lock
index 0dcc02e..0c8b4ea 100644
--- a/poetry.lock
+++ b/poetry.lock
@@ -523,10 +523,15 @@ python-versions = ">=3.7"
 docs = ["sphinx", "jaraco.packaging (>=8.2)", "rst.linker (>=1.9)"]
 testing = ["pytest (>=6)", "pytest-checkdocs (>=2.4)", "pytest-flake8", "pytest-cov", "pytest-enabler (>=1.0.1)", "jaraco.itertools", "func-timeout", "pytest-black (>=0.3.7)", "pytest-mypy"]

+[extras]
+cjk = []
+jieba = []
+mecab = []
+
 [metadata]
 lock-version = "1.1"
 python-versions = "^3.7"
-content-hash = "8507a13e0c8c79c30e911cc5f32bdc35284304246ae50531917df6197d7dcab8"
+content-hash = "4c478694ae5eb8b3d54b635d9dc6928922ba6315c72c5061674ec0ae1068f359"

 [metadata.files]
 appnope = [
diff --git a/pyproject.toml b/pyproject.toml
index b83d9ac..50f4b9c 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -5,6 +5,7 @@ description = "Look up the frequencies of words in many languages, based on many
 authors = ["Robyn Speer <rspeer@arborelia.net>"]
 license = "MIT"
 readme = "README.md"
+homepage = "https://github.com/rspeer/wordfreq/"

 [tool.poetry.dependencies]
 python = "^3.7"
@@ -25,6 +26,11 @@ black = "^22.1.0"
 flake8 = "^4.0.1"
 types-setuptools = "^57.4.9"

+[tool.poetry.extras]
+cjk = ["mecab-python3", "ipadic", "mecab-ko-dic", "jieba >= 0.42"]
+mecab = ["mecab-python3", "ipadic", "mecab-ko-dic"]
+jieba = ["jieba >= 0.42"]
+
 [build-system]
 requires = ["poetry-core>=1.0.0"]
 build-backend = "poetry.core.masonry.api"
diff --git a/setup.cfg b/setup.cfg
deleted file mode 100644
index b7e4789..0000000
--- a/setup.cfg
+++ /dev/null
@@ -1,2 +0,0 @@
-[aliases]
-test=pytest
diff --git a/setup.py b/setup.py
deleted file mode 100755
index acda219..0000000
--- a/setup.py
+++ /dev/null
@@ -1,65 +0,0 @@
-#!/usr/bin/env python
-from setuptools import setup
-import sys
-import os
-
-if sys.version_info[0] < 3:
-    print("Sorry, but wordfreq no longer supports Python 2.")
-    sys.exit(1)
-
-
-classifiers = [
-    'Intended Audience :: Developers',
-    'Intended Audience :: Science/Research',
-    'License :: OSI Approved :: MIT License',
-    'Natural Language :: English',
-    'Operating System :: MacOS',
-    'Operating System :: Microsoft :: Windows',
-    'Operating System :: POSIX',
-    'Operating System :: Unix',
-    'Programming Language :: Python :: 3',
-    'Topic :: Scientific/Engineering',
-    'Topic :: Software Development',
-    'Topic :: Text Processing :: Linguistic',
-]
-
-current_dir = os.path.dirname(__file__)
-README_contents = open(os.path.join(current_dir, 'README.md'),
-                       encoding='utf-8').read()
-doclines = README_contents.split("\n")
-dependencies = [
-    'msgpack >= 1.0', 'langcodes >= 3.0', 'regex >= 2020.04.04', 'ftfy >= 3.0'
-]
-
-setup(
-    name="wordfreq",
-    version='3.0.0',
-    maintainer='Robyn Speer',
-    maintainer_email='rspeer@arborelia.net',
-    url='http://github.com/rspeer/wordfreq/',
-    license="MIT",
-    platforms=["any"],
-    description=doclines[0],
-    classifiers=classifiers,
-    long_description=README_contents,
-    long_description_content_type='text/markdown',
-    packages=['wordfreq'],
-    python_requires='>=3.7',
-    include_package_data=True,
-    install_requires=dependencies,
-
-    # mecab-python3 is required for looking up Japanese or Korean word
-    # frequencies. It's not listed under 'install_requires' because wordfreq
-    # should be usable in other languages without it.
-    #
-    # Similarly, jieba is required for Chinese word frequencies.
-    extras_require={
-        # previous names for extras
-        'mecab': ['mecab-python3', 'ipadic', 'mecab-ko-dic'],
-        'jieba': ['jieba >= 0.42'],
-
-        # get them all at once
-        'cjk': ['mecab-python3', 'ipadic', 'mecab-ko-dic', 'jieba >= 0.42']
-    },
-    tests_require=['pytest', 'mecab-python3', 'jieba >= 0.42', 'ipadic', 'mecab-ko-dic'],
-)
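(The `extras_require` block removed from setup.py above maps onto the new `[tool.poetry.extras]` table in pyproject.toml, so `pip install wordfreq[cjk]` keeps working. The following is a usage sketch, not part of this patch, for checking that the optional CJK dependencies were actually installed; the example words are arbitrary.)

    from wordfreq import tokenize, word_frequency

    # These calls need the 'cjk' extra: jieba for Chinese, and mecab-python3
    # with ipadic / mecab-ko-dic for Japanese and Korean. Without the extra,
    # wordfreq raises an error for these languages.
    print(tokenize("谢谢谢谢", "zh"))
    print(word_frequency("谢谢", "zh"))
    print(tokenize("おはようございます", "ja"))
    print(tokenize("감사합니다", "ko"))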