638 Commits

Author SHA1 Message Date
Elia Robyn Lake
6fc77b4b29 simplify deps by updating pytest 2022-10-25 14:25:00 -04:00
Elia Robyn Lake
59103c52b9 update citation 2022-10-25 14:24:32 -04:00
Elia Robyn Lake
ba424e6c2d update to Apache license 2022-10-25 14:20:23 -04:00
Elia Robyn Lake
d8f10b7fb8 fix tox setup, test on python 3.11 2022-10-25 13:59:13 -04:00
Elia Robyn Lake
b24094b726 update changelog 2022-09-26 17:54:52 -04:00
Elia Robyn Lake (Robyn Speer)
287a7602a5 Merge pull request #97 from synapticarbors/patch-1
Include license file in source distribution
2022-09-26 17:48:55 -04:00
Elia Robyn Lake
f722248f4d Merge branch 'master' of github.com:rspeer/wordfreq 2022-09-26 17:47:59 -04:00
Elia Robyn Lake
f1926de486 fix version dependency of regex 2022-09-26 17:47:06 -04:00
Elia Robyn Lake (Robyn Speer)
a6cf11f94d Merge pull request #105 from xxyzz/add-optional-deps
Add extras packages to `tool.poetry.dependencies` in pyproject.toml
2022-09-26 17:46:38 -04:00
xxyzz
fae9f5843d Add extras packages to tool.poetry.dependencies in pyproject.toml
Extras dependencies need to be added as optional dependencies, otherwise they
won't be installed.

Document: https://python-poetry.org/docs/pyproject/#extras
Poetry GitHub issue: https://github.com/python-poetry/poetry/issues/5604
2022-09-25 13:26:17 +08:00
Elia Robyn Lake
19535d08ef move mypy to dev dependencies 2022-04-01 12:11:39 -04:00
Elia Robyn Lake
71f2757b8b packaging updates 2022-03-11 10:43:37 -05:00
Elia Robyn Lake
f893435b75 documentation updates 2022-03-10 19:22:53 -05:00
Elia Robyn Lake
981fab53aa add py.typed 2022-03-10 19:16:38 -05:00
Elia Robyn Lake
ed7dccbf8b update version and documentation 2022-03-10 19:12:45 -05:00
Elia Robyn Lake
bf05b1b1dc estimate the freq distribution of numbers 2022-03-10 18:33:42 -05:00
Elia Robyn Lake
4e373750e8 move notes to self into notes/ 2022-03-09 17:22:36 -05:00
Elia Robyn Lake
f800ff9bcc work on rel. frequencies of numbers, and other features 2022-02-18 11:33:28 -05:00
Elia Robyn Lake
ef4d6fe0df run black 2022-02-08 18:27:18 -05:00
Elia Robyn Lake
3c4819e7e5 update packaging, try to handle digits better 2022-02-08 18:24:36 -05:00
Joshua Adelman
60f7baba5d Include license file in source distribution 2021-10-19 15:30:59 -04:00
Elia Robyn Speer
2361606b3a fix merge conflict markers in setup 2021-09-02 21:49:49 +00:00
Elia Robyn Speer
b60ac1b803 Merge remote-tracking branch 'origin/apostrophe-consistency' 2021-09-02 18:13:53 +00:00
Elia Robyn Speer
c2a9fe03f1 use ftfy's uncurl_quotes in lossy_tokenize 2021-09-02 17:47:47 +00:00
Robyn Speer
6f1f626f1b update email address 2021-08-23 17:46:34 -04:00
Robyn Speer
c244ff0d10 readme update: web text comes from OSCAR 2021-04-15 14:45:29 -04:00
Sara Jewett
b13d35e503 Merge pull request #91 from LuminosoInsight/data-update-2.5
Version 2.5, incorporating OSCAR data
2021-04-15 14:32:10 -04:00
Robyn Speer
16122083b3 XC was built without Russian Web data; reflect this in the table
The Russian sub-corpus of OSCAR is corrupted, so we skipped over it in
the exquisite-corpus build.
2021-04-14 14:28:12 -04:00
Robyn Speer
b6614c1a33 Merge branch 'data-update-2.5' of github.com:LuminosoInsight/wordfreq into data-update-2.5 2021-04-14 14:26:54 -04:00
Robyn Speer
08816a21d1 Remove Malayalam; support for it isn't ready
There are Unicode normalization problems with Malayalam -- as best I understand
it, Unicode simply neglected to include normalization forms for Malayalam "chillu"
characters even though they changed how they're represented in Unicode 5.1 and
again in Unicode 9.

The result is that words that print the same end up with multiple entries, with
different codepoint sequences that don't normalize to each other.

I certainly don't know how to resolve this, and it would need to be resolved to
have something that we could reasonably call Malayalam word frequencies.
2021-03-30 14:10:58 -04:00
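
The normalization problem described in the commit message above can be illustrated with Python's standard `unicodedata` module. This is only a sketch, not code from the repository: the helper name `spellings_match` and the choice of chillu-N as the example are assumptions made for illustration. It compares the single-code-point chillu with the older NA + VIRAMA + ZERO WIDTH JOINER spelling under each standard normalization form.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Chillu-N as a single code point (MALAYALAM LETTER CHILLU N).
ATOMIC_CHILLU_N = "\u0d7b"
# The older spelling of the same glyph: NA + VIRAMA + ZERO WIDTH JOINER.
LEGACY_CHILLU_N = "\u0d28\u0d4d\u200d"


def spellings_match(a: str, b: str) -> bool:
    """Return True if any standard normalization form maps a and b to the same string.

    Hypothetical helper, written for this illustration only.
    """
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in FORMS
    )


# Sanity check: precomposed and decomposed "é" do normalize to each other.
print(spellings_match("\u00e9", "e\u0301"))

# The two chillu spellings stay distinct under every form, so a word list
# keyed on normalized text would hold separate entries for them.
print(spellings_match(ATOMIC_CHILLU_N, LEGACY_CHILLU_N))
```

A frequency table built from text that mixes both spellings would therefore split one word's counts across several keys, which is the situation the commit message gives as the reason for removing Malayalam.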
Robyn Speer
90f0e0a88e Update table, remove Galician (only two sources) 2021-03-30 13:17:36 -04:00
Robyn Speer
9bab1024b7 add OSCAR citation 2021-03-30 12:56:10 -04:00
Robyn Speer
fea45fd501 Merge remote-tracking branch 'origin/master' into data-update-2.5 2021-03-30 12:53:09 -04:00
Robyn Speer
8777ad0811 remove Swahili, the data isn't reliable 2021-03-29 18:15:58 -04:00
Robyn Speer
00e60df106 Merge branch 'master' into data-update-2.5 2021-03-29 16:42:24 -04:00
Robyn Speer
fc5c4cdda8 small documentation fixes 2021-03-29 16:41:47 -04:00
Robyn Speer
ec48c0a123 update data and tests for 2.5 2021-03-29 16:18:08 -04:00
Lance Nathan
32093d9efc Merge pull request #89 from LuminosoInsight/dependencies-and-tokens
Rework CJK dependencies and fix a tokenization bug
2021-02-23 15:15:17 -05:00
Robyn Speer
168bb2a6ed fix version, update instructions and changelog 2021-02-18 18:25:16 -05:00
Robyn Speer
de636a804e Use Python packages to find dictionaries for MeCab 2021-02-18 18:18:06 -05:00
Robyn Speer
ed23bf3ebe specifically test that the long sequence underflows to 0 2021-02-18 15:09:31 -05:00
Robyn Speer
75a56b68fb change math for INFERRED_SPACE_FACTOR to not overflow 2021-02-18 14:44:39 -05:00
Lance Nathan
7318f58df9 Merge pull request #88 from LuminosoInsight/version2.4
work with langcodes 3.0, without language_data
2021-02-09 17:36:09 -05:00
Robyn Speer
ad3a5c533f work with langcodes 3.0, without language_data 2021-02-09 17:27:22 -05:00
Robyn Speer
53b1ee2fa0 Merge pull request #84 from LuminosoInsight/add-initial-vowels
Update the "initial vowels" in French/Catalan
2021-02-03 13:47:30 -05:00
Lance Nathan
a31deec580 Update the "initial vowels" in French/Catalan
User LBeaudoux observed (https://github.com/LuminosoInsight/wordfreq/pull/82)
that "Œ and œ should be considered as vowels that might appear at the start of
a word in French".  Further investigation of the French wordfreq list revealed
words in the data starting with other vowels (such as d'yvonne, d'åland, l'ïle,
d'özil).  This PR is a combination of LBeaudoux's PR and the latter fact.

(The updated regex is also used for Catalan, but should have no actual effect.
To the best of our understanding, "y" appears in Catalan only in the digraph
"ny" and in foreign words--the Catalan wordlist contains "york", "by", "city",
several English names, and so forth, but no real Catalan words starting with
"y"; cf "ioga", "iogurt".  The wordlist in fact contained "l'fbi" and "l'nba",
but no cases of "l'" followed by a vowel like the ones found in French.)
2020-10-08 12:23:22 -04:00
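
The idea behind the commit above, treating a wider set of letters as word-initial vowels after an elided article, can be sketched with a small regular expression. This is not the regex used in wordfreq: the vowel set, the pattern, and the function name `split_elision` are all assumptions chosen to cover the examples quoted in the commit message.

```python
import re

# Hypothetical set of letters that may start a word after an elided prefix.
# It includes the letters called out in the commit message: œ, y, å, ï, ö.
INITIAL_VOWELS = "aeiouyàâåéèêîïôöœ"

# A short alphabetic prefix ending in an apostrophe, followed by a word
# that begins with one of the vowels above.
ELISION_RE = re.compile(rf"^([a-z]+')([{INITIAL_VOWELS}].*)$", re.IGNORECASE)


def split_elision(token: str) -> list[str]:
    """Split "l'ïle" into ["l'", "ïle"]; leave other tokens unchanged."""
    match = ELISION_RE.match(token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]


for word in ["d'yvonne", "d'åland", "l'ïle", "d'özil", "l'œuvre", "l'fbi"]:
    print(word, split_elision(word))
```

With this sketch, a token such as "l'fbi" is left whole because "f" is not in the vowel set, while the French examples from the commit message are split after the apostrophe.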
Robyn Speer
c8229a5378 update the changelog 2020-10-01 16:12:41 -04:00
Robyn Speer
fd0ac9a272 update README examples 2020-10-01 16:05:43 -04:00
Robyn Speer
8c00a3c500 updated frequency data 2020-09-30 17:56:12 -04:00
Robyn Speer
ad02d96f1b update dependencies and test for consistent results 2020-09-08 16:03:33 -04:00