mirror of https://github.com/rspeer/wordfreq.git
packaging updates
This commit is contained in:
parent 318097264f
commit 0fc775636b
CHANGELOG.md (11 changes)
@@ -13,7 +13,7 @@ estimated distribution that allows for Benford's law (lower numbers are more
 frequent) and a special frequency distribution for 4-digit numbers that look
 like years (2010 is more frequent than 1020).

-Relatedly:
+More changes related to digits:

 - Functions such as `iter_wordlist` and `top_n_list` no longer return
   multi-digit numbers (they used to return them in their "smashed" form, such
@@ -23,6 +23,15 @@ Relatedly:
 instead in a place that's internal to the `word_frequency` function, so we can
 look at the values of the digits before they're replaced.

+Other changes:
+
+- wordfreq is now developed using `poetry` as its package manager, and with
+  `pyproject.toml` as the source of configuration instead of `setup.py`.
+
+- The minimum version of Python supported is 3.7.
+
+- Type information is exported using `py.typed`.
+
 ## Version 2.5.1 (2021-09-02)

 - Import ftfy and use its `uncurl_quotes` method to turn curly quotes into
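To illustrate the digit handling described in this changelog entry, here is a minimal sketch against wordfreq's public API; the printed values depend on the installed wordlists and are not asserted here.

    from wordfreq import top_n_list, word_frequency

    # Multi-digit tokens are replaced with the stand-in digit inside
    # word_frequency, so year-like numbers can be weighted as years.
    print(word_frequency("2010", "en"))   # year-like, relatively frequent
    print(word_frequency("1020", "en"))   # same digits, much rarer

    # iter_wordlist and top_n_list no longer yield "smashed" multi-digit
    # entries such as "0000".
    print([w for w in top_n_list("en", 1000) if len(w) > 1 and w.isdigit()])  # expect []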
Jenkinsfile (vendored, 4 changes)
@@ -1,4 +0,0 @@
-wheelJob(
-    upstream: [ 'wheelhouse-init' ],
-    extras: [ 'mecab', 'jieba' ]
-)
README.md (40 changes)
@@ -11,7 +11,7 @@ in the usual way, either by getting it from pip:

     pip3 install wordfreq

-or by getting the repository and installing it using [poetry][]:
+or by getting the repository and installing it for development, using [poetry][]:

     poetry install

@@ -23,8 +23,8 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
 ## Usage

 wordfreq provides access to estimates of the frequency with which a word is
-used, in 36 languages (see *Supported languages* below). It uses many different
-data sources, not just one corpus.
+used, in over 40 languages (see *Supported languages* below). It uses many
+different data sources, not just one corpus.

 It provides both 'small' and 'large' wordlists:

@@ -144,8 +144,8 @@ as `##` or `####` or `#.#####`, with `#` standing in for digits. (For compatibility
 with earlier versions of wordfreq, our stand-in character is actually `0`.) This
 is the same form of aggregation that the word2vec vocabulary does.

-Single-digit numbers are unaffected by this "binning" process; "0" through "9" have
-their own entries in each language's wordlist.
+Single-digit numbers are unaffected by this process; "0" through "9" have their own
+entries in each language's wordlist.

 When asked for the frequency of a token containing multiple digits, we multiply
 the frequency of that aggregated entry by a distribution estimating the frequency
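As a rough sketch of the binning described above (not wordfreq's internal code), replacing every digit with the stand-in character `0` is enough to reproduce the aggregated forms:

    import re

    def smash_digits(token: str) -> str:
        # Replace each ASCII digit with the stand-in '0', producing the
        # aggregated wordlist form of a numeric token.
        return re.sub(r"[0-9]", "0", token)

    print(smash_digits("1022"))     # '0000'
    print(smash_digits("3.14159"))  # '0.00000'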
@@ -158,10 +158,10 @@ The first digits are assigned probabilities by Benford's law, and years are assigned
 probabilities from a distribution that peaks at the "present". I explored this in
 a Twitter thread at <https://twitter.com/r_speer/status/1493715982887571456>.

-The part of this distribution representing the "present" is not strictly a peak;
-it's a 20-year-long plateau from 2019 to 2039. (2019 is the last time Google Books
-Ngrams was updated, and 2039 is a time by which I will probably have figured out
-a new distribution.)
+The part of this distribution representing the "present" is not strictly a peak and
+doesn't move forward with time as the present does. Instead, it's a 20-year-long
+plateau from 2019 to 2039. (2019 is the last time Google Books Ngrams was updated,
+and 2039 is a time by which I will probably have figured out a new distribution.)

 Some examples:

@@ -172,7 +172,7 @@ Some examples:
     >>> word_frequency("1022", "en")
     1.28e-07

-Aside from years, the distribution does **not** care about the meaning of the numbers:
+Aside from years, the distribution does not care about the meaning of the numbers:

     >>> word_frequency("90210", "en")
     3.34e-10
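For a sense of how such a distribution can be built, here is a minimal sketch that assigns Benford's-law probabilities to the first digit and treats the remaining digits as uniform; it is a simplification of what wordfreq does, and in particular it ignores the special handling of year-like numbers:

    import math

    def benford_first_digit(d: int) -> float:
        # Benford's law: P(first digit = d) = log10(1 + 1/d), for d in 1..9.
        return math.log10(1 + 1 / d)

    def digit_sequence_prob(digits: str) -> float:
        # First digit weighted by Benford's law, remaining digits uniform.
        return benford_first_digit(int(digits[0])) * (1 / 10) ** (len(digits) - 1)

    print(digit_sequence_prob("1022"))  # higher: leading 1 is the most common
    print(digit_sequence_prob("9021"))  # lower: leading 9 is the least common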
@@ -419,19 +419,16 @@ As much as we would like to give each language its own distinct code and its
 own distinct word list with distinct source data, there aren't actually sharp
 boundaries between languages.

-Sometimes, it's convenient to pretend that the boundaries between
-languages coincide with national borders, following the maxim that "a language
-is a dialect with an army and a navy" (Max Weinreich). This gets complicated
-when the linguistic situation and the political situation diverge.
-Moreover, some of our data sources rely on language detection, which of course
-has no idea which country the writer of the text belongs to.
+Sometimes, it's convenient to pretend that the boundaries between languages
+coincide with national borders, following the maxim that "a language is a
+dialect with an army and a navy" (Max Weinreich). This gets complicated when the
+linguistic situation and the political situation diverge. Moreover, some of our
+data sources rely on language detection, which of course has no idea which
+country the writer of the text belongs to.

 So we've had to make some arbitrary decisions about how to represent the
 fuzzier language boundaries, such as those within Chinese, Malay, and
-Croatian/Bosnian/Serbian. See [Language Log][] for some firsthand reports of
-the mutual intelligibility or unintelligibility of languages.
-
-[Language Log]: http://languagelog.ldc.upenn.edu/nll/?p=12633
+Croatian/Bosnian/Serbian.

 Smoothing over our arbitrary decisions is the fact that we use the `langcodes`
 module to find the best match for a language code. If you ask for word
@@ -446,6 +443,9 @@ the 'cjk' feature:

     pip install wordfreq[cjk]

+You can put `wordfreq[cjk]` in a list of dependencies, such as the
+`[tool.poetry.dependencies]` list of your own project.
+
 Tokenizing Chinese depends on the `jieba` package, tokenizing Japanese depends
 on `mecab-python3` and `ipadic`, and tokenizing Korean depends on `mecab-python3`
 and `mecab-ko-dic`.
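For example, a downstream project using Poetry could declare the dependency with the extra roughly like this (the version constraint here is illustrative, not taken from this commit):

    [tool.poetry.dependencies]
    python = "^3.7"
    wordfreq = { version = "^3.0", extras = ["cjk"] }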
poetry.lock (generated, 7 changes)
@@ -523,10 +523,15 @@ python-versions = ">=3.7"
 docs = ["sphinx", "jaraco.packaging (>=8.2)", "rst.linker (>=1.9)"]
 testing = ["pytest (>=6)", "pytest-checkdocs (>=2.4)", "pytest-flake8", "pytest-cov", "pytest-enabler (>=1.0.1)", "jaraco.itertools", "func-timeout", "pytest-black (>=0.3.7)", "pytest-mypy"]

+[extras]
+cjk = []
+jieba = []
+mecab = []
+
 [metadata]
 lock-version = "1.1"
 python-versions = "^3.7"
-content-hash = "8507a13e0c8c79c30e911cc5f32bdc35284304246ae50531917df6197d7dcab8"
+content-hash = "4c478694ae5eb8b3d54b635d9dc6928922ba6315c72c5061674ec0ae1068f359"

 [metadata.files]
 appnope = [
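The changed `content-hash` tracks the dependency configuration in `pyproject.toml`, so the new extras change it; if you edit `pyproject.toml` yourself, regenerating the lock file is the usual way to update it:

    # Re-resolve dependencies and rewrite poetry.lock, including its content-hash
    poetry lock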
pyproject.toml (6 changes)

@@ -5,6 +5,7 @@ description = "Look up the frequencies of words in many languages, based on many
 authors = ["Robyn Speer <rspeer@arborelia.net>"]
 license = "MIT"
 readme = "README.md"
+homepage = "https://github.com/rspeer/wordfreq/"

 [tool.poetry.dependencies]
 python = "^3.7"
@@ -25,6 +26,11 @@ black = "^22.1.0"
 flake8 = "^4.0.1"
 types-setuptools = "^57.4.9"

+[tool.poetry.extras]
+cjk = ["mecab-python3", "ipadic", "mecab-ko-dic", "jieba >= 0.42"]
+mecab = ["mecab-python3", "ipadic", "mecab-ko-dic"]
+jieba = ["jieba >= 0.42"]
+
 [build-system]
 requires = ["poetry-core>=1.0.0"]
 build-backend = "poetry.core.masonry.api"
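These extras can be selected at development time with Poetry's `--extras`/`-E` option; a typical invocation (not part of this commit) would be:

    # Install wordfreq for development with the CJK tokenizers included
    poetry install -E cjk

    # Or select the narrower extras individually
    poetry install -E mecab -E jieba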
setup.py (65 changes)
@@ -1,65 +0,0 @@
-#!/usr/bin/env python
-from setuptools import setup
-import sys
-import os
-
-if sys.version_info[0] < 3:
-    print("Sorry, but wordfreq no longer supports Python 2.")
-    sys.exit(1)
-
-
-classifiers = [
-    'Intended Audience :: Developers',
-    'Intended Audience :: Science/Research',
-    'License :: OSI Approved :: MIT License',
-    'Natural Language :: English',
-    'Operating System :: MacOS',
-    'Operating System :: Microsoft :: Windows',
-    'Operating System :: POSIX',
-    'Operating System :: Unix',
-    'Programming Language :: Python :: 3',
-    'Topic :: Scientific/Engineering',
-    'Topic :: Software Development',
-    'Topic :: Text Processing :: Linguistic',
-]
-
-current_dir = os.path.dirname(__file__)
-README_contents = open(os.path.join(current_dir, 'README.md'),
-                       encoding='utf-8').read()
-doclines = README_contents.split("\n")
-dependencies = [
-    'msgpack >= 1.0', 'langcodes >= 3.0', 'regex >= 2020.04.04', 'ftfy >= 3.0'
-]
-
-setup(
-    name="wordfreq",
-    version='3.0.0',
-    maintainer='Robyn Speer',
-    maintainer_email='rspeer@arborelia.net',
-    url='http://github.com/rspeer/wordfreq/',
-    license="MIT",
-    platforms=["any"],
-    description=doclines[0],
-    classifiers=classifiers,
-    long_description=README_contents,
-    long_description_content_type='text/markdown',
-    packages=['wordfreq'],
-    python_requires='>=3.7',
-    include_package_data=True,
-    install_requires=dependencies,
-
-    # mecab-python3 is required for looking up Japanese or Korean word
-    # frequencies. It's not listed under 'install_requires' because wordfreq
-    # should be usable in other languages without it.
-    #
-    # Similarly, jieba is required for Chinese word frequencies.
-    extras_require={
-        # previous names for extras
-        'mecab': ['mecab-python3', 'ipadic', 'mecab-ko-dic'],
-        'jieba': ['jieba >= 0.42'],
-
-        # get them all at once
-        'cjk': ['mecab-python3', 'ipadic', 'mecab-ko-dic', 'jieba >= 0.42']
-    },
-    tests_require=['pytest', 'mecab-python3', 'jieba >= 0.42', 'ipadic', 'mecab-ko-dic'],
-)