627 Commits

Author SHA1 Message Date
Elia Robyn Lake
f3074a67be move mypy to dev dependencies 2022-04-01 12:11:39 -04:00
Elia Robyn Lake
0fc775636b packaging updates 2022-03-11 10:43:37 -05:00
Elia Robyn Lake
318097264f documentation updates 2022-03-10 19:22:53 -05:00
Elia Robyn Lake
2738737293 add py.typed 2022-03-10 19:16:38 -05:00
Elia Robyn Lake
2563eb8d72 update version and documentation 2022-03-10 19:12:45 -05:00
Elia Robyn Lake
5d6a41499b estimate the freq distribution of numbers 2022-03-10 18:33:42 -05:00
Elia Robyn Lake
a01110604b move notes to self into notes/ 2022-03-09 17:22:36 -05:00
Elia Robyn Lake
342c1d0f0e work on rel. frequencies of numbers, and other features 2022-02-18 11:33:28 -05:00
Elia Robyn Lake
538145c05c run black 2022-02-08 18:27:18 -05:00
Elia Robyn Lake
91195c793d update packaging, try to handle digits better 2022-02-08 18:24:36 -05:00
Elia Robyn Speer
11a3138cea fix merge conflict markers in setup 2021-09-02 21:49:49 +00:00
Elia Robyn Speer
cc4f39d8c2 Merge remote-tracking branch 'origin/apostrophe-consistency' 2021-09-02 18:13:53 +00:00
Elia Robyn Speer
dc9585766a use ftfy's uncurl_quotes in lossy_tokenize 2021-09-02 17:47:47 +00:00
Robyn Speer
af847699f6 update email address 2021-08-23 17:46:34 -04:00
Robyn Speer
64bbcbd51b readme update: web text comes from OSCAR 2021-04-15 14:45:29 -04:00
Sara Jewett
c56e633d53 Merge pull request #91 from LuminosoInsight/data-update-2.5
Version 2.5, incorporating OSCAR data
2021-04-15 14:32:10 -04:00
Robyn Speer
2417ea0d39 XC was built without Russian Web data; reflect this in the table
The Russian sub-corpus of OSCAR is corrupted, so we skipped over it in
the exquisite-corpus build.
2021-04-14 14:28:12 -04:00
Robyn Speer
81bb9f4338 Merge branch 'data-update-2.5' of github.com:LuminosoInsight/wordfreq into data-update-2.5 2021-04-14 14:26:54 -04:00
Robyn Speer
f885a60bf0 Remove Malayalam; support for it isn't ready
There are Unicode normalization problems with Malayalam -- as best I understand
it, Unicode simply neglected to include normalization forms for Malayalam "chillu"
characters even though they changed how they're represented in Unicode 5.1 and
again in Unicode 9.

The result is that words that print the same end up with multiple entries, with
different codepoint sequences that don't normalize to each other.

I certainly don't know how to resolve this, and it would need to be resolved to
have something that we could reasonably call Malayalam word frequencies.
2021-03-30 14:10:58 -04:00
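The normalization problem described in the commit message above can be reproduced with Python's standard unicodedata module. This is only an illustrative sketch: the choice of chillu-n and the variable names are assumptions made for the example, not code taken from wordfreq.

```python
import unicodedata

# Two encodings of Malayalam chillu-n that are meant to display the same way.
atomic = "\u0d7b"                # MALAYALAM LETTER CHILLU N (atomic form, Unicode 5.1)
sequence = "\u0d28\u0d4d\u200d"  # NA + VIRAMA + ZERO WIDTH JOINER (older convention)

# None of the four standard normalization forms maps one encoding to the other,
# so a word list keyed on normalized text keeps both as separate entries.
forms = ("NFC", "NFD", "NFKC", "NFKD")
still_distinct = {
    form: unicodedata.normalize(form, atomic) != unicodedata.normalize(form, sequence)
    for form in forms
}

for form, distinct in still_distinct.items():
    print(form, "keeps the two spellings distinct:", distinct)
```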
Robyn Speer
08b6cea451 Update table, remove Galician (only two sources) 2021-03-30 13:17:36 -04:00
Robyn Speer
8fd3d77e4f add OSCAR citation 2021-03-30 12:56:10 -04:00
Robyn Speer
efdf110351 Merge remote-tracking branch 'origin/master' into data-update-2.5 2021-03-30 12:53:09 -04:00
Robyn Speer
cb78887446 remove Swahili, the data isn't reliable 2021-03-29 18:15:58 -04:00
Robyn Speer
ec2e148f8e Merge branch 'master' into data-update-2.5 2021-03-29 16:42:24 -04:00
Robyn Speer
4263f1af14 small documentation fixes 2021-03-29 16:41:47 -04:00
Robyn Speer
d1949a486a update data and tests for 2.5 2021-03-29 16:18:08 -04:00
Lance Nathan
4c0b29f460 Merge pull request #89 from LuminosoInsight/dependencies-and-tokens
Rework CJK dependencies and fix a tokenization bug
2021-02-23 15:15:17 -05:00
Robyn Speer
d99ac1051a fix version, update instructions and changelog 2021-02-18 18:25:16 -05:00
Robyn Speer
2cc58d68ad Use Python packages to find dictionaries for MeCab 2021-02-18 18:18:06 -05:00
Robyn Speer
6b97d093b6 specifically test that the long sequence underflows to 0 2021-02-18 15:09:31 -05:00
Robyn Speer
bd57b64d00 change math for INFERRED_SPACE_FACTOR to not overflow 2021-02-18 14:44:39 -05:00
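The two commits above mention overflow and underflow without showing the arithmetic. The sketch below is one plausible reading, not the project's actual code: the function names, the per-token penalty formula, and the factor value of 10.0 are all assumptions. It shows why dividing by a large power can raise an error in Python while multiplying by a negative power quietly underflows to 0.

```python
import math

def penalty_by_division(freq, n_tokens, factor=10.0):
    # factor ** (n_tokens - 1) is computed first; for long sequences it exceeds
    # the float range and Python raises OverflowError.
    return freq / factor ** (n_tokens - 1)

def penalty_by_negative_power(freq, n_tokens, factor=10.0):
    # factor ** -(n_tokens - 1) shrinks toward 0.0 instead of overflowing,
    # so very long sequences simply get a frequency of 0.
    return freq * factor ** -(n_tokens - 1)

short_a = penalty_by_division(1e-3, 3)
short_b = penalty_by_negative_power(1e-3, 3)
print(math.isclose(short_a, short_b))      # both approaches agree for short inputs

long_result = penalty_by_negative_power(1e-3, 400)
print(long_result)                         # underflows to 0.0

try:
    penalty_by_division(1e-3, 400)
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)
```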
Lance Nathan
02c3cbe3fb Merge pull request #88 from LuminosoInsight/version2.4
work with langcodes 3.0, without language_data
2021-02-09 17:36:09 -05:00
Robyn Speer
f71acec2d7 work with langcodes 3.0, without language_data 2021-02-09 17:27:22 -05:00
Robyn Speer
7a742499a4 Merge pull request #84 from LuminosoInsight/add-initial-vowels
Update the "initial vowels" in French/Catalan
2021-02-03 13:47:30 -05:00
Lance Nathan
917bcdebaa Update the "initial vowels" in French/Catalan
User LBeaudoux observed (https://github.com/LuminosoInsight/wordfreq/pull/82)
that "Œ and œ should be considered as vowels that might appear at the start of
a word in French".  Further investigation of the French wordfreq list revealed
words in the data starting with other vowels (such as d'yvonne, d'åland, l'ïle,
d'özil).  This PR is a combination of LBeaudoux's PR and the latter fact.

(The updated regex is also used for Catalan, but should have no actual effect.
To the best of our understanding, "y" appears in Catalan only in the digraph
"ny" and in foreign words--the Catalan wordlist contains "york", "by", "city",
several English names, and so forth, but no real Catalan words starting with
"y"; cf. "ioga", "iogurt".  The wordlist in fact contained "l'fbi" and "l'nba",
but no cases of "l'" followed by a vowel like the ones found in French.)
2020-10-08 12:23:22 -04:00
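To make the "initial vowels" idea in the commit message above concrete, here is a minimal sketch using Python's re module. The vowel class, the pattern, and the split_elision helper are hypothetical names written for this example; the project's real expression may differ. The prefix is limited to one or two letters so that words such as "aujourd'hui" are left alone.

```python
import re

# Hypothetical vowel class for illustration, including the œ, y, å, ï and ö
# cases mentioned in the commit message.
INITIAL_VOWELS = "aehiouyàâåéèêîïôöûœ"

# An elided prefix such as l' or d', followed by a word that starts with a vowel.
ELISION = re.compile(r"^(\w\w?')([" + INITIAL_VOWELS + r"]\w*)$")

def split_elision(token):
    """Split "l'œuvre" into ["l'", "œuvre"]; return other tokens unchanged."""
    match = ELISION.match(token)
    return list(match.groups()) if match else [token]

for word in ["d'yvonne", "d'özil", "l'œuvre", "l'fbi", "aujourd'hui"]:
    print(word, split_elision(word))
```

Under these assumptions, "l'fbi" is not split because "f" is not in the vowel class, which matches the observation that such entries are not vowel-initial cases.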
Robyn Speer
a8915d67f7 update the changelog 2020-10-01 16:12:41 -04:00
Robyn Speer
5986342bc6 update README examples 2020-10-01 16:05:43 -04:00
Robyn Speer
fa98f0b2f6 updated frequency data 2020-09-30 17:56:12 -04:00
Robyn Speer
174ecf580a update dependencies and test for consistent results 2020-09-08 16:03:33 -04:00
Lance Nathan
e3f87d4aed Merge pull request #77 from LuminosoInsight/regex-apostrophe-fix
Fix regex's inconsistent word breaking around apostrophes
2020-04-28 16:19:40 -04:00
Robyn Speer
becf94f767 update version and changelog 2020-04-28 15:24:24 -04:00
Robyn Speer
96e7792a4a fix regex's inconsistent word breaking around apostrophes 2020-04-28 15:19:56 -04:00
Robyn Speer
3b7382d770 update CHANGELOG for 2.3.1 2020-04-22 11:12:02 -04:00
Robyn Speer
59f4a08920 packaging fix: require msgpack >= 1.0 2020-04-22 11:10:03 -04:00
Lance Nathan
af22c03609 Merge pull request #75 from LuminosoInsight/language-match-update
use langcodes 2.0 and deprecate 'match_cutoff'
2020-04-20 14:48:58 -04:00
Robyn Speer
258670b823 update changelog for 2.3 2020-04-16 15:51:20 -04:00
Robyn Speer
3aeeeb64c7 use langcodes 2.0 and deprecate 'match_cutoff' 2020-04-16 14:09:30 -04:00
Moss Collum
33bfb1409d Merge pull request #74 from LuminosoInsight/msgpack-1.0-bugfix
Fix code affected by a breaking change in msgpack 1.0
2020-02-28 13:05:37 -05:00
Lance Nathan
86e988b838 Fix code affected by a breaking change in msgpack 1.0
The msgpack readme explains: "Default value of strict_map_key is changed to
True to avoid hashdos. You need to pass strict_map_key=False if you have data
which contain map keys which type is not bytes or str."

chinese.py loads SIMPLIFIED_MAP from disk.  Since it is a str.translate
dictionary, its keys are numbers.  And since it's a dictionary we created
ourselves, there's no hashdos concern, so we can load it with
strict_map_key=False.
2020-02-28 13:02:45 -05:00
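The reasoning in the commit message above can be sketched as follows. The two-character mapping is a made-up stand-in for SIMPLIFIED_MAP, and the msgpack round trip runs only if the msgpack package happens to be installed; treat it as an illustration of the strict_map_key option rather than the code in chinese.py.

```python
# A str.translate table maps code points (integers) to replacements, so its
# keys are ints rather than str or bytes.
table = str.maketrans({"愛": "爱", "國": "国"})  # stand-in for SIMPLIFIED_MAP
keys_are_ints = all(isinstance(key, int) for key in table)
simplified = "愛國".translate(table)
print(keys_are_ints, simplified)

try:
    import msgpack  # third-party; optional for this sketch
except ImportError:
    msgpack = None

if msgpack is not None:
    packed = msgpack.packb(table)
    # With msgpack >= 1.0, integer map keys are rejected unless
    # strict_map_key=False is passed. That is acceptable here because
    # the data is our own, not untrusted input.
    restored = msgpack.unpackb(packed, raw=False, strict_map_key=False)
    print(restored == table)
```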
Lance Nathan
401889d7c8 Merge pull request #73 from LuminosoInsight/add-mailmap
Add a mailmap
2019-12-18 13:59:36 -05:00