Mirror of https://github.com/rspeer/wordfreq.git

packaging updates

Commit 71f2757b8b (parent f893435b75)

CHANGELOG.md (11 changes)

@@ -13,7 +13,7 @@ estimated distribution that allows for Benford's law (lower numbers are more
 frequent) and a special frequency distribution for 4-digit numbers that look
 like years (2010 is more frequent than 1020).
 
-Relatedly:
+More changes related to digits:
 
 - Functions such as `iter_wordlist` and `top_n_list` no longer return
   multi-digit numbers (they used to return them in their "smashed" form, such
@@ -23,6 +23,15 @@ Relatedly:
 instead in a place that's internal to the `word_frequency` function, so we can
 look at the values of the digits before they're replaced.
 
+Other changes:
+
+- wordfreq is now developed using `poetry` as its package manager, and with
+  `pyproject.toml` as the source of configuration instead of `setup.py`.
+
+- The minimum version of Python supported is 3.7.
+
+- Type information is exported using `py.typed`.
+
 ## Version 2.5.1 (2021-09-02)
 
 - Import ftfy and use its `uncurl_quotes` method to turn curly quotes into
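
As a quick illustration of the digit changes above, a short sketch against
wordfreq's public API (`top_n_list` and `word_frequency` are real functions;
the exact frequencies depend on the wordlist data in your version):

    from wordfreq import top_n_list, word_frequency

    # Wordlist iteration no longer yields multi-digit "smashed" numbers...
    assert not any(w.isdigit() and len(w) > 1 for w in top_n_list("en", 10000))

    # ...but word_frequency still estimates them via the digit distribution:
    print(word_frequency("2010", "en"))  # year-like, relatively frequent
    print(word_frequency("1020", "en"))  # same digits, much rarer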

Jenkinsfile (vendored, 4 changes)

@@ -1,4 +0,0 @@
-wheelJob(
-    upstream: [ 'wheelhouse-init' ],
-    extras: [ 'mecab', 'jieba' ]
-)

README.md (40 changes)

@@ -11,7 +11,7 @@ in the usual way, either by getting it from pip:
 
     pip3 install wordfreq
 
-or by getting the repository and installing it using [poetry][]:
+or by getting the repository and installing it for development, using [poetry][]:
 
     poetry install
 
@@ -23,8 +23,8 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
 ## Usage
 
 wordfreq provides access to estimates of the frequency with which a word is
-used, in 36 languages (see *Supported languages* below). It uses many different
-data sources, not just one corpus.
+used, in over 40 languages (see *Supported languages* below). It uses many
+different data sources, not just one corpus.
 
 It provides both 'small' and 'large' wordlists:
 
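
For a sense of the basic API described here, a minimal usage sketch
(`word_frequency` and `zipf_frequency` are real functions; the values they
return are approximate and vary by version):

    >>> from wordfreq import word_frequency, zipf_frequency
    >>> word_frequency('the', 'en')   # proportion of tokens, roughly 0.05
    >>> zipf_frequency('the', 'en')   # the same estimate on the log Zipf scale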
@@ -144,8 +144,8 @@ as `##` or `####` or `#.#####`, with `#` standing in for digits. (For compatibility
 with earlier versions of wordfreq, our stand-in character is actually `0`.) This
 is the same form of aggregation that the word2vec vocabulary does.
 
-Single-digit numbers are unaffected by this "binning" process; "0" through "9" have
-their own entries in each language's wordlist.
+Single-digit numbers are unaffected by this process; "0" through "9" have their own
+entries in each language's wordlist.
 
 When asked for the frequency of a token containing multiple digits, we multiply
 the frequency of that aggregated entry by a distribution estimating the frequency
@@ -158,10 +158,10 @@ The first digits are assigned probabilities by Benford's law, and years are assigned
 probabilities from a distribution that peaks at the "present". I explored this in
 a Twitter thread at <https://twitter.com/r_speer/status/1493715982887571456>.
 
-The part of this distribution representing the "present" is not strictly a peak;
-it's a 20-year-long plateau from 2019 to 2039. (2019 is the last time Google Books
-Ngrams was updated, and 2039 is a time by which I will probably have figured out
-a new distribution.)
+The part of this distribution representing the "present" is not strictly a peak and
+doesn't move forward with time as the present does. Instead, it's a 20-year-long
+plateau from 2019 to 2039. (2019 is the last time Google Books Ngrams was updated,
+and 2039 is a time by which I will probably have figured out a new distribution.)
 
 Some examples:
 
@@ -172,7 +172,7 @@ Some examples:
     >>> word_frequency("1022", "en")
     1.28e-07
 
-Aside from years, the distribution does **not** care about the meaning of the numbers:
+Aside from years, the distribution does not care about the meaning of the numbers:
 
     >>> word_frequency("90210", "en")
     3.34e-10
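
The Benford component mentioned above is the textbook formula: a leading digit
d gets probability log10(1 + 1/d). A minimal illustration (this is the standard
formula, not wordfreq's exact internal distribution, which also includes the
year plateau):

    import math

    def benford(d: int) -> float:
        # P(leading digit = d) = log10(1 + 1/d), for d in 1..9
        return math.log10(1 + 1 / d)

    print({d: round(benford(d), 3) for d in range(1, 10)})
    # 1 -> ~0.301 down to 9 -> ~0.046: lower leading digits are more frequent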
@@ -419,19 +419,16 @@ As much as we would like to give each language its own distinct code and its
 own distinct word list with distinct source data, there aren't actually sharp
 boundaries between languages.
 
-Sometimes, it's convenient to pretend that the boundaries between
-languages coincide with national borders, following the maxim that "a language
-is a dialect with an army and a navy" (Max Weinreich). This gets complicated
-when the linguistic situation and the political situation diverge.
-Moreover, some of our data sources rely on language detection, which of course
-has no idea which country the writer of the text belongs to.
+Sometimes, it's convenient to pretend that the boundaries between languages
+coincide with national borders, following the maxim that "a language is a
+dialect with an army and a navy" (Max Weinreich). This gets complicated when the
+linguistic situation and the political situation diverge. Moreover, some of our
+data sources rely on language detection, which of course has no idea which
+country the writer of the text belongs to.
 
 So we've had to make some arbitrary decisions about how to represent the
 fuzzier language boundaries, such as those within Chinese, Malay, and
-Croatian/Bosnian/Serbian. See [Language Log][] for some firsthand reports of
-the mutual intelligibility or unintelligibility of languages.
+Croatian/Bosnian/Serbian.
 
-[Language Log]: http://languagelog.ldc.upenn.edu/nll/?p=12633
-
 Smoothing over our arbitrary decisions is the fact that we use the `langcodes`
 module to find the best match for a language code. If you ask for word
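
In practice, the `langcodes` matching means a regional variant resolves to the
closest supported wordlist. A sketch, assuming `en-GB` matches the `en` list
(which the best-match lookup should arrange):

    >>> from wordfreq import word_frequency
    >>> word_frequency('colour', 'en-GB') == word_frequency('colour', 'en')
    True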
@@ -446,6 +443,9 @@ the 'cjk' feature:
 
     pip install wordfreq[cjk]
 
+You can put `wordfreq[cjk]` in a list of dependencies, such as the
+`[tool.poetry.dependencies]` list of your own project.
+
 Tokenizing Chinese depends on the `jieba` package, tokenizing Japanese depends
 on `mecab-python3` and `ipadic`, and tokenizing Korean depends on `mecab-python3`
 and `mecab-ko-dic`.
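
With the `cjk` extra installed, tokenization routes through those packages. A
small sketch using `wordfreq.tokenize` (a real function; the example strings
are arbitrary):

    >>> from wordfreq import tokenize
    >>> tokenize('谢谢你', 'zh')              # segmented by jieba
    >>> tokenize('おはようございます', 'ja')  # segmented by MeCab + ipadic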

poetry.lock (generated, 7 changes)

@@ -523,10 +523,15 @@ python-versions = ">=3.7"
 docs = ["sphinx", "jaraco.packaging (>=8.2)", "rst.linker (>=1.9)"]
 testing = ["pytest (>=6)", "pytest-checkdocs (>=2.4)", "pytest-flake8", "pytest-cov", "pytest-enabler (>=1.0.1)", "jaraco.itertools", "func-timeout", "pytest-black (>=0.3.7)", "pytest-mypy"]
 
+[extras]
+cjk = []
+jieba = []
+mecab = []
+
 [metadata]
 lock-version = "1.1"
 python-versions = "^3.7"
-content-hash = "8507a13e0c8c79c30e911cc5f32bdc35284304246ae50531917df6197d7dcab8"
+content-hash = "4c478694ae5eb8b3d54b635d9dc6928922ba6315c72c5061674ec0ae1068f359"
 
 [metadata.files]
 appnope = [

pyproject.toml

@@ -5,6 +5,7 @@ description = "Look up the frequencies of words in many languages, based on many
 authors = ["Robyn Speer <rspeer@arborelia.net>"]
 license = "MIT"
 readme = "README.md"
+homepage = "https://github.com/rspeer/wordfreq/"
 
 [tool.poetry.dependencies]
 python = "^3.7"
@@ -25,6 +26,11 @@ black = "^22.1.0"
 flake8 = "^4.0.1"
 types-setuptools = "^57.4.9"
 
+[tool.poetry.extras]
+cjk = ["mecab-python3", "ipadic", "mecab-ko-dic", "jieba >= 0.42"]
+mecab = ["mecab-python3", "ipadic", "mecab-ko-dic"]
+jieba = ["jieba >= 0.42"]
+
 [build-system]
 requires = ["poetry-core>=1.0.0"]
 build-backend = "poetry.core.masonry.api"
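
These extras keep the old setup.py extra names working under poetry. A minimal
sketch of how downstream code might check that the optional CJK dependencies
landed (the import names are the real ones; the helper itself is hypothetical):

    import importlib.util

    def cjk_support() -> dict:
        # jieba covers Chinese; mecab-python3 (imported as MeCab) covers
        # Japanese and Korean.
        return {
            "jieba": importlib.util.find_spec("jieba") is not None,
            "mecab": importlib.util.find_spec("MeCab") is not None,
        }

    print(cjk_support())  # {'jieba': True, 'mecab': True} after `pip install wordfreq[cjk]`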

setup.py (65 changes)

@@ -1,65 +0,0 @@
-#!/usr/bin/env python
-from setuptools import setup
-import sys
-import os
-
-if sys.version_info[0] < 3:
-    print("Sorry, but wordfreq no longer supports Python 2.")
-    sys.exit(1)
-
-
-classifiers = [
-    'Intended Audience :: Developers',
-    'Intended Audience :: Science/Research',
-    'License :: OSI Approved :: MIT License',
-    'Natural Language :: English',
-    'Operating System :: MacOS',
-    'Operating System :: Microsoft :: Windows',
-    'Operating System :: POSIX',
-    'Operating System :: Unix',
-    'Programming Language :: Python :: 3',
-    'Topic :: Scientific/Engineering',
-    'Topic :: Software Development',
-    'Topic :: Text Processing :: Linguistic',
-]
-
-current_dir = os.path.dirname(__file__)
-README_contents = open(os.path.join(current_dir, 'README.md'),
-                       encoding='utf-8').read()
-doclines = README_contents.split("\n")
-dependencies = [
-    'msgpack >= 1.0', 'langcodes >= 3.0', 'regex >= 2020.04.04', 'ftfy >= 3.0'
-]
-
-setup(
-    name="wordfreq",
-    version='3.0.0',
-    maintainer='Robyn Speer',
-    maintainer_email='rspeer@arborelia.net',
-    url='http://github.com/rspeer/wordfreq/',
-    license="MIT",
-    platforms=["any"],
-    description=doclines[0],
-    classifiers=classifiers,
-    long_description=README_contents,
-    long_description_content_type='text/markdown',
-    packages=['wordfreq'],
-    python_requires='>=3.7',
-    include_package_data=True,
-    install_requires=dependencies,
-
-    # mecab-python3 is required for looking up Japanese or Korean word
-    # frequencies. It's not listed under 'install_requires' because wordfreq
-    # should be usable in other languages without it.
-    #
-    # Similarly, jieba is required for Chinese word frequencies.
-    extras_require={
-        # previous names for extras
-        'mecab': ['mecab-python3', 'ipadic', 'mecab-ko-dic'],
-        'jieba': ['jieba >= 0.42'],
-
-        # get them all at once
-        'cjk': ['mecab-python3', 'ipadic', 'mecab-ko-dic', 'jieba >= 0.42']
-    },
-    tests_require=['pytest', 'mecab-python3', 'jieba >= 0.42', 'ipadic', 'mecab-ko-dic'],
-)