packaging updates

Elia Robyn Lake 2022-03-11 10:43:37 -05:00
parent f893435b75
commit 71f2757b8b
7 changed files with 42 additions and 93 deletions

CHANGELOG.md

@@ -13,7 +13,7 @@ estimated distribution that allows for Benford's law (lower numbers are more
frequent) and a special frequency distribution for 4-digit numbers that look
like years (2010 is more frequent than 1020).
Relatedly:
More changes related to digits:
- Functions such as `iter_wordlist` and `top_n_list` no longer return
multi-digit numbers (they used to return them in their "smashed" form, such
@@ -23,6 +23,15 @@ Relatedly:
instead in a place that's internal to the `word_frequency` function, so we can
look at the values of the digits before they're replaced.
Other changes:
- wordfreq is now developed using `poetry` as its package manager, and with
`pyproject.toml` as the source of configuration instead of `setup.py`.
- The minimum version of Python supported is 3.7.
- Type information is exported using `py.typed`.
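
For illustration, the digit changes described above can be observed through wordfreq's public functions; this is only a sketch assuming the 3.0 API, and its output isn't reproduced here:

    from wordfreq import top_n_list, word_frequency

    # Wordlist listings should no longer contain multi-digit numbers
    # (previously they appeared in "smashed" form, such as "0000").
    top_words = top_n_list("en", 10000)
    print([w for w in top_words if w.isdigit() and len(w) > 1])

    # Looking up a multi-digit number still works: the digits are replaced
    # internally by `word_frequency`, which returns an estimated frequency.
    print(word_frequency("2022", "en"))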
## Version 2.5.1 (2021-09-02)
- Import ftfy and use its `uncurl_quotes` method to turn curly quotes into

Jenkinsfile (vendored)

@@ -1,4 +0,0 @@
wheelJob(
    upstream: [ 'wheelhouse-init' ],
    extras: [ 'mecab', 'jieba' ]
)

README.md

@@ -11,7 +11,7 @@ in the usual way, either by getting it from pip:
pip3 install wordfreq
or by getting the repository and installing it using [poetry][]:
or by getting the repository and installing it for development, using [poetry][]:
poetry install
@@ -23,8 +23,8 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
## Usage
wordfreq provides access to estimates of the frequency with which a word is
used, in 36 languages (see *Supported languages* below). It uses many different
data sources, not just one corpus.
used, in over 40 languages (see *Supported languages* below). It uses many
different data sources, not just one corpus.
It provides both 'small' and 'large' wordlists:
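
For a quick feel for the API, here is a minimal usage sketch; it assumes the `word_frequency`, `zipf_frequency`, and `wordlist` options documented further down, and the printed values depend on the installed data rather than being fixed:

    from wordfreq import word_frequency, zipf_frequency

    # Frequency as a proportion of the language's words (between 0 and 1)
    print(word_frequency('the', 'en'))
    print(word_frequency('café', 'fr'))

    # The same estimate on the human-friendly Zipf scale (roughly 0 to 8)
    print(zipf_frequency('frequency', 'en'))

    # Ask for the 'large' wordlist explicitly instead of the default
    print(word_frequency('zymurgy', 'en', wordlist='large'))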
@@ -144,8 +144,8 @@ as `##` or `####` or `#.#####`, with `#` standing in for digits. (For compatibility
with earlier versions of wordfreq, our stand-in character is actually `0`.) This
is the same form of aggregation that the word2vec vocabulary does.
Single-digit numbers are unaffected by this "binning" process; "0" through "9" have
their own entries in each language's wordlist.
Single-digit numbers are unaffected by this process; "0" through "9" have their own
entries in each language's wordlist.
When asked for the frequency of a token containing multiple digits, we multiply
the frequency of that aggregated entry by a distribution estimating the frequency
@@ -158,10 +158,10 @@ The first digits are assigned probabilities by Benford's law, and years are assigned
probabilities from a distribution that peaks at the "present". I explored this in
a Twitter thread at <https://twitter.com/r_speer/status/1493715982887571456>.
The part of this distribution representing the "present" is not strictly a peak;
it's a 20-year-long plateau from 2019 to 2039. (2019 is the last time Google Books
Ngrams was updated, and 2039 is a time by which I will probably have figured out
a new distribution.)
The part of this distribution representing the "present" is not strictly a peak and
doesn't move forward with time as the present does. Instead, it's a 20-year-long
plateau from 2019 to 2039. (2019 is the last time Google Books Ngrams was updated,
and 2039 is a time by which I will probably have figured out a new distribution.)
Some examples:
@@ -172,7 +172,7 @@ Some examples:
>>> word_frequency("1022", "en")
1.28e-07
Aside from years, the distribution does **not** care about the meaning of the numbers:
Aside from years, the distribution does not care about the meaning of the numbers:
>>> word_frequency("90210", "en")
3.34e-10
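
One way to see the year plateau and the Benford-like behavior is to print a few estimates side by side; this is only a sketch, and the values will vary with the installed wordlists:

    from wordfreq import word_frequency

    # Year-like numbers near the "present" plateau, a historical year,
    # and a 4-digit number that doesn't look like a year
    for token in ['2022', '2035', '1022', '8567']:
        print(token, word_frequency(token, 'en'))

    # Longer numbers are estimated from their digit pattern alone;
    # a famous ZIP code gets no special treatment
    print(word_frequency('90210', 'en'))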
@@ -419,19 +419,16 @@ As much as we would like to give each language its own distinct code and its
own distinct word list with distinct source data, there aren't actually sharp
boundaries between languages.
Sometimes, it's convenient to pretend that the boundaries between
languages coincide with national borders, following the maxim that "a language
is a dialect with an army and a navy" (Max Weinreich). This gets complicated
when the linguistic situation and the political situation diverge.
Moreover, some of our data sources rely on language detection, which of course
has no idea which country the writer of the text belongs to.
Sometimes, it's convenient to pretend that the boundaries between languages
coincide with national borders, following the maxim that "a language is a
dialect with an army and a navy" (Max Weinreich). This gets complicated when the
linguistic situation and the political situation diverge. Moreover, some of our
data sources rely on language detection, which of course has no idea which
country the writer of the text belongs to.
So we've had to make some arbitrary decisions about how to represent the
fuzzier language boundaries, such as those within Chinese, Malay, and
Croatian/Bosnian/Serbian. See [Language Log][] for some firsthand reports of
the mutual intelligibility or unintelligibility of languages.
[Language Log]: http://languagelog.ldc.upenn.edu/nll/?p=12633
Croatian/Bosnian/Serbian.
Smoothing over our arbitrary decisions is the fact that we use the `langcodes`
module to find the best match for a language code. If you ask for word
@@ -446,6 +443,9 @@ the 'cjk' feature:
pip install wordfreq[cjk]
You can put `wordfreq[cjk]` in a list of dependencies, such as the
`[tool.poetry.dependencies]` list of your own project.
Tokenizing Chinese depends on the `jieba` package, tokenizing Japanese depends
on `mecab-python3` and `ipadic`, and tokenizing Korean depends on `mecab-python3`
and `mecab-ko-dic`.
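
As a sketch of what the extra enables, assuming `wordfreq[cjk]` is installed and using wordfreq's `tokenize` function:

    from wordfreq import tokenize, word_frequency

    # Chinese tokenization goes through jieba
    print(tokenize('谢谢你', 'zh'))

    # Japanese and Korean lookups go through MeCab, using mecab-python3
    # with ipadic or mecab-ko-dic; they need the extra to be installed
    print(word_frequency('おはよう', 'ja'))
    print(word_frequency('감사합니다', 'ko'))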

poetry.lock (generated)

@@ -523,10 +523,15 @@ python-versions = ">=3.7"
docs = ["sphinx", "jaraco.packaging (>=8.2)", "rst.linker (>=1.9)"]
testing = ["pytest (>=6)", "pytest-checkdocs (>=2.4)", "pytest-flake8", "pytest-cov", "pytest-enabler (>=1.0.1)", "jaraco.itertools", "func-timeout", "pytest-black (>=0.3.7)", "pytest-mypy"]
[extras]
cjk = []
jieba = []
mecab = []
[metadata]
lock-version = "1.1"
python-versions = "^3.7"
content-hash = "8507a13e0c8c79c30e911cc5f32bdc35284304246ae50531917df6197d7dcab8"
content-hash = "4c478694ae5eb8b3d54b635d9dc6928922ba6315c72c5061674ec0ae1068f359"
[metadata.files]
appnope = [

pyproject.toml

@@ -5,6 +5,7 @@ description = "Look up the frequencies of words in many languages, based on many
authors = ["Robyn Speer <rspeer@arborelia.net>"]
license = "MIT"
readme = "README.md"
homepage = "https://github.com/rspeer/wordfreq/"
[tool.poetry.dependencies]
python = "^3.7"
@@ -25,6 +26,11 @@ black = "^22.1.0"
flake8 = "^4.0.1"
types-setuptools = "^57.4.9"
[tool.poetry.extras]
cjk = ["mecab-python3", "ipadic", "mecab-ko-dic", "jieba >= 0.42"]
mecab = ["mecab-python3", "ipadic", "mecab-ko-dic"]
jieba = ["jieba >= 0.42"]
[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"

setup.cfg

@@ -1,2 +0,0 @@
[aliases]
test=pytest

setup.py

@@ -1,65 +0,0 @@
#!/usr/bin/env python
from setuptools import setup
import sys
import os
if sys.version_info[0] < 3:
    print("Sorry, but wordfreq no longer supports Python 2.")
    sys.exit(1)

classifiers = [
    'Intended Audience :: Developers',
    'Intended Audience :: Science/Research',
    'License :: OSI Approved :: MIT License',
    'Natural Language :: English',
    'Operating System :: MacOS',
    'Operating System :: Microsoft :: Windows',
    'Operating System :: POSIX',
    'Operating System :: Unix',
    'Programming Language :: Python :: 3',
    'Topic :: Scientific/Engineering',
    'Topic :: Software Development',
    'Topic :: Text Processing :: Linguistic',
]

current_dir = os.path.dirname(__file__)
README_contents = open(os.path.join(current_dir, 'README.md'),
                       encoding='utf-8').read()
doclines = README_contents.split("\n")
dependencies = [
    'msgpack >= 1.0', 'langcodes >= 3.0', 'regex >= 2020.04.04', 'ftfy >= 3.0'
]

setup(
    name="wordfreq",
    version='3.0.0',
    maintainer='Robyn Speer',
    maintainer_email='rspeer@arborelia.net',
    url='http://github.com/rspeer/wordfreq/',
    license="MIT",
    platforms=["any"],
    description=doclines[0],
    classifiers=classifiers,
    long_description=README_contents,
    long_description_content_type='text/markdown',
    packages=['wordfreq'],
    python_requires='>=3.7',
    include_package_data=True,
    install_requires=dependencies,
    # mecab-python3 is required for looking up Japanese or Korean word
    # frequencies. It's not listed under 'install_requires' because wordfreq
    # should be usable in other languages without it.
    #
    # Similarly, jieba is required for Chinese word frequencies.
    extras_require={
        # previous names for extras
        'mecab': ['mecab-python3', 'ipadic', 'mecab-ko-dic'],
        'jieba': ['jieba >= 0.42'],
        # get them all at once
        'cjk': ['mecab-python3', 'ipadic', 'mecab-ko-dic', 'jieba >= 0.42']
    },
    tests_require=['pytest', 'mecab-python3', 'jieba >= 0.42', 'ipadic', 'mecab-ko-dic'],
)