Mirror of https://github.com/rspeer/wordfreq.git

packaging updates

Commit 71f2757b8b (parent f893435b75)

CHANGELOG.md (11 changes)

@@ -13,7 +13,7 @@ estimated distribution that allows for Benford's law (lower numbers are more
 frequent) and a special frequency distribution for 4-digit numbers that look
 like years (2010 is more frequent than 1020).
 
-Relatedly:
+More changes related to digits:
 
 - Functions such as `iter_wordlist` and `top_n_list` no longer return
   multi-digit numbers (they used to return them in their "smashed" form, such
@@ -23,6 +23,15 @@ Relatedly:
 instead in a place that's internal to the `word_frequency` function, so we can
 look at the values of the digits before they're replaced.
 
+Other changes:
+
+- wordfreq is now developed using `poetry` as its package manager, and with
+  `pyproject.toml` as the source of configuration instead of `setup.py`.
+
+- The minimum version of Python supported is 3.7.
+
+- Type information is exported using `py.typed`.
+
 ## Version 2.5.1 (2021-09-02)
 
 - Import ftfy and use its `uncurl_quotes` method to turn curly quotes into
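
As a quick illustration of the digit changes above, a short sketch against
wordfreq's public API (`top_n_list` and `word_frequency` are real functions;
the exact frequencies depend on the wordlist data in your version):

    from wordfreq import top_n_list, word_frequency

    # Wordlist iteration no longer yields multi-digit "smashed" numbers...
    assert not any(w.isdigit() and len(w) > 1 for w in top_n_list("en", 10000))

    # ...but word_frequency still estimates them via the digit distribution:
    print(word_frequency("2010", "en"))  # year-like, relatively frequent
    print(word_frequency("1020", "en"))  # same digits, much rarer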

Jenkinsfile (vendored, 4 changes)

@@ -1,4 +0,0 @@
-wheelJob(
-    upstream: [ 'wheelhouse-init' ],
-    extras: [ 'mecab', 'jieba' ]
-)

README.md (40 changes)

@@ -11,7 +11,7 @@ in the usual way, either by getting it from pip:
 
     pip3 install wordfreq
 
-or by getting the repository and installing it using [poetry][]:
+or by getting the repository and installing it for development, using [poetry][]:
 
     poetry install
 
@@ -23,8 +23,8 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
 ## Usage
 
 wordfreq provides access to estimates of the frequency with which a word is
-used, in 36 languages (see *Supported languages* below). It uses many different
-data sources, not just one corpus.
+used, in over 40 languages (see *Supported languages* below). It uses many
+different data sources, not just one corpus.
 
 It provides both 'small' and 'large' wordlists:
 
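
For a sense of the basic API described here, a minimal usage sketch
(`word_frequency` and `zipf_frequency` are real functions; the values they
return are approximate and vary by version):

    >>> from wordfreq import word_frequency, zipf_frequency
    >>> word_frequency('the', 'en')   # proportion of tokens, roughly 0.05
    >>> zipf_frequency('the', 'en')   # the same estimate on the log Zipf scale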
@@ -144,8 +144,8 @@ as `##` or `####` or `#.#####`, with `#` standing in for digits. (For compatibility
 with earlier versions of wordfreq, our stand-in character is actually `0`.) This
 is the same form of aggregation that the word2vec vocabulary does.
 
-Single-digit numbers are unaffected by this "binning" process; "0" through "9" have
-their own entries in each language's wordlist.
+Single-digit numbers are unaffected by this process; "0" through "9" have their own
+entries in each language's wordlist.
 
 When asked for the frequency of a token containing multiple digits, we multiply
 the frequency of that aggregated entry by a distribution estimating the frequency
@@ -158,10 +158,10 @@ The first digits are assigned probabilities by Benford's law, and years are assigned
 probabilities from a distribution that peaks at the "present". I explored this in
 a Twitter thread at <https://twitter.com/r_speer/status/1493715982887571456>.
 
-The part of this distribution representing the "present" is not strictly a peak;
-it's a 20-year-long plateau from 2019 to 2039. (2019 is the last time Google Books
-Ngrams was updated, and 2039 is a time by which I will probably have figured out
-a new distribution.)
+The part of this distribution representing the "present" is not strictly a peak and
+doesn't move forward with time as the present does. Instead, it's a 20-year-long
+plateau from 2019 to 2039. (2019 is the last time Google Books Ngrams was updated,
+and 2039 is a time by which I will probably have figured out a new distribution.)
 
 Some examples:
 
@@ -172,7 +172,7 @@ Some examples:
     >>> word_frequency("1022", "en")
     1.28e-07
 
-Aside from years, the distribution does **not** care about the meaning of the numbers:
+Aside from years, the distribution does not care about the meaning of the numbers:
 
     >>> word_frequency("90210", "en")
     3.34e-10
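
The Benford component mentioned above is the textbook formula: a leading digit
d gets probability log10(1 + 1/d). A minimal illustration (this is the standard
formula, not wordfreq's exact internal distribution, which also includes the
year plateau):

    import math

    def benford(d: int) -> float:
        # P(leading digit = d) = log10(1 + 1/d), for d in 1..9
        return math.log10(1 + 1 / d)

    print({d: round(benford(d), 3) for d in range(1, 10)})
    # 1 -> ~0.301 down to 9 -> ~0.046: lower leading digits are more frequent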
@@ -419,19 +419,16 @@ As much as we would like to give each language its own distinct code and its
 own distinct word list with distinct source data, there aren't actually sharp
 boundaries between languages.
 
-Sometimes, it's convenient to pretend that the boundaries between
-languages coincide with national borders, following the maxim that "a language
-is a dialect with an army and a navy" (Max Weinreich). This gets complicated
-when the linguistic situation and the political situation diverge.
-Moreover, some of our data sources rely on language detection, which of course
-has no idea which country the writer of the text belongs to.
+Sometimes, it's convenient to pretend that the boundaries between languages
+coincide with national borders, following the maxim that "a language is a
+dialect with an army and a navy" (Max Weinreich). This gets complicated when the
+linguistic situation and the political situation diverge. Moreover, some of our
+data sources rely on language detection, which of course has no idea which
+country the writer of the text belongs to.
 
 So we've had to make some arbitrary decisions about how to represent the
 fuzzier language boundaries, such as those within Chinese, Malay, and
-Croatian/Bosnian/Serbian. See [Language Log][] for some firsthand reports of
-the mutual intelligibility or unintelligibility of languages.
+Croatian/Bosnian/Serbian.
 
-[Language Log]: http://languagelog.ldc.upenn.edu/nll/?p=12633
-
 Smoothing over our arbitrary decisions is the fact that we use the `langcodes`
 module to find the best match for a language code. If you ask for word
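
In practice, the `langcodes` matching means a regional variant resolves to the
closest supported wordlist. A sketch, assuming `en-GB` matches the `en` list
(which the best-match lookup should arrange):

    >>> from wordfreq import word_frequency
    >>> word_frequency('colour', 'en-GB') == word_frequency('colour', 'en')
    True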
@@ -446,6 +443,9 @@ the 'cjk' feature:
 
     pip install wordfreq[cjk]
 
+You can put `wordfreq[cjk]` in a list of dependencies, such as the
+`[tool.poetry.dependencies]` list of your own project.
+
 Tokenizing Chinese depends on the `jieba` package, tokenizing Japanese depends
 on `mecab-python3` and `ipadic`, and tokenizing Korean depends on `mecab-python3`
 and `mecab-ko-dic`.
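
With the `cjk` extra installed, tokenization routes through those packages. A
small sketch using `wordfreq.tokenize` (a real function; the example strings
are arbitrary):

    >>> from wordfreq import tokenize
    >>> tokenize('谢谢你', 'zh')              # segmented by jieba
    >>> tokenize('おはようございます', 'ja')  # segmented by MeCab + ipadic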

poetry.lock (generated, 7 changes)

@@ -523,10 +523,15 @@ python-versions = ">=3.7"
 docs = ["sphinx", "jaraco.packaging (>=8.2)", "rst.linker (>=1.9)"]
 testing = ["pytest (>=6)", "pytest-checkdocs (>=2.4)", "pytest-flake8", "pytest-cov", "pytest-enabler (>=1.0.1)", "jaraco.itertools", "func-timeout", "pytest-black (>=0.3.7)", "pytest-mypy"]
 
+[extras]
+cjk = []
+jieba = []
+mecab = []
+
 [metadata]
 lock-version = "1.1"
 python-versions = "^3.7"
-content-hash = "8507a13e0c8c79c30e911cc5f32bdc35284304246ae50531917df6197d7dcab8"
+content-hash = "4c478694ae5eb8b3d54b635d9dc6928922ba6315c72c5061674ec0ae1068f359"
 
 [metadata.files]
 appnope = [

pyproject.toml

@@ -5,6 +5,7 @@ description = "Look up the frequencies of words in many languages, based on many
 authors = ["Robyn Speer <rspeer@arborelia.net>"]
 license = "MIT"
 readme = "README.md"
+homepage = "https://github.com/rspeer/wordfreq/"
 
 [tool.poetry.dependencies]
 python = "^3.7"
@@ -25,6 +26,11 @@ black = "^22.1.0"
 flake8 = "^4.0.1"
 types-setuptools = "^57.4.9"
 
+[tool.poetry.extras]
+cjk = ["mecab-python3", "ipadic", "mecab-ko-dic", "jieba >= 0.42"]
+mecab = ["mecab-python3", "ipadic", "mecab-ko-dic"]
+jieba = ["jieba >= 0.42"]
+
 [build-system]
 requires = ["poetry-core>=1.0.0"]
 build-backend = "poetry.core.masonry.api"
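
These extras keep the old setup.py extra names working under poetry. A minimal
sketch of how downstream code might check that the optional CJK dependencies
landed (the import names are the real ones; the helper itself is hypothetical):

    import importlib.util

    def cjk_support() -> dict:
        # jieba covers Chinese; mecab-python3 (imported as MeCab) covers
        # Japanese and Korean.
        return {
            "jieba": importlib.util.find_spec("jieba") is not None,
            "mecab": importlib.util.find_spec("MeCab") is not None,
        }

    print(cjk_support())  # {'jieba': True, 'mecab': True} after `pip install wordfreq[cjk]`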

setup.py (65 changes)

@@ -1,65 +0,0 @@
-#!/usr/bin/env python
-from setuptools import setup
-import sys
-import os
-
-if sys.version_info[0] < 3:
-    print("Sorry, but wordfreq no longer supports Python 2.")
-    sys.exit(1)
-
-
-classifiers = [
-    'Intended Audience :: Developers',
-    'Intended Audience :: Science/Research',
-    'License :: OSI Approved :: MIT License',
-    'Natural Language :: English',
-    'Operating System :: MacOS',
-    'Operating System :: Microsoft :: Windows',
-    'Operating System :: POSIX',
-    'Operating System :: Unix',
-    'Programming Language :: Python :: 3',
-    'Topic :: Scientific/Engineering',
-    'Topic :: Software Development',
-    'Topic :: Text Processing :: Linguistic',
-]
-
-current_dir = os.path.dirname(__file__)
-README_contents = open(os.path.join(current_dir, 'README.md'),
-                       encoding='utf-8').read()
-doclines = README_contents.split("\n")
-dependencies = [
-    'msgpack >= 1.0', 'langcodes >= 3.0', 'regex >= 2020.04.04', 'ftfy >= 3.0'
-]
-
-setup(
-    name="wordfreq",
-    version='3.0.0',
-    maintainer='Robyn Speer',
-    maintainer_email='rspeer@arborelia.net',
-    url='http://github.com/rspeer/wordfreq/',
-    license="MIT",
-    platforms=["any"],
-    description=doclines[0],
-    classifiers=classifiers,
-    long_description=README_contents,
-    long_description_content_type='text/markdown',
-    packages=['wordfreq'],
-    python_requires='>=3.7',
-    include_package_data=True,
-    install_requires=dependencies,
-
-    # mecab-python3 is required for looking up Japanese or Korean word
-    # frequencies. It's not listed under 'install_requires' because wordfreq
-    # should be usable in other languages without it.
-    #
-    # Similarly, jieba is required for Chinese word frequencies.
-    extras_require={
-        # previous names for extras
-        'mecab': ['mecab-python3', 'ipadic', 'mecab-ko-dic'],
-        'jieba': ['jieba >= 0.42'],
-
-        # get them all at once
-        'cjk': ['mecab-python3', 'ipadic', 'mecab-ko-dic', 'jieba >= 0.42']
-    },
-    tests_require=['pytest', 'mecab-python3', 'jieba >= 0.42', 'ipadic', 'mecab-ko-dic'],
-)