Commit Graph

620 Commits

Author SHA1 Message Date
Elia Robyn Lake
f800ff9bcc work on rel. frequencies of numbers, and other features 2022-02-18 11:33:28 -05:00
Elia Robyn Lake
ef4d6fe0df run black 2022-02-08 18:27:18 -05:00
Elia Robyn Lake
3c4819e7e5 update packaging, try to handle digits better 2022-02-08 18:24:36 -05:00
Elia Robyn Speer
2361606b3a fix merge conflict markers in setup 2021-09-02 21:49:49 +00:00
Elia Robyn Speer
b60ac1b803 Merge remote-tracking branch 'origin/apostrophe-consistency' 2021-09-02 18:13:53 +00:00
Elia Robyn Speer
c2a9fe03f1 use ftfy's uncurl_quotes in lossy_tokenize 2021-09-02 17:47:47 +00:00
Robyn Speer
6f1f626f1b update email address 2021-08-23 17:46:34 -04:00
Robyn Speer
c244ff0d10 readme update: web text comes from OSCAR 2021-04-15 14:45:29 -04:00
Sara Jewett
b13d35e503 Merge pull request #91 from LuminosoInsight/data-update-2.5
Version 2.5, incorporating OSCAR data
2021-04-15 14:32:10 -04:00
Robyn Speer
16122083b3 XC was built without Russian Web data; reflect this in the table
The Russian sub-corpus of OSCAR is corrupted, so we skipped over it in
the exquisite-corpus build.
2021-04-14 14:28:12 -04:00
Robyn Speer
b6614c1a33 Merge branch 'data-update-2.5' of github.com:LuminosoInsight/wordfreq into data-update-2.5 2021-04-14 14:26:54 -04:00
Robyn Speer
08816a21d1 Remove Malayalam; support for it isn't ready
There are Unicode normalization problems with Malayalam -- as best I understand
it, Unicode simply neglected to include normalization forms for Malayalam "chillu"
characters even though they changed how they're represented in Unicode 5.1 and
again in Unicode 9.

The result is that words that print the same end up with multiple entries, with
different codepoint sequences that don't normalize to each other.

I certainly don't know how to resolve this, and it would need to be resolved to
have something that we could reasonably call Malayalam word frequencies.
2021-03-30 14:10:58 -04:00
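The normalization problem described in this commit can be illustrated with Python's standard `unicodedata` module. This is a minimal sketch; the choice of chillu N as the example character is an assumption for illustration and is not taken from the commit. It compares the atomic code point U+0D7B with the older sequence NA + VIRAMA + ZERO WIDTH JOINER.

```python
import unicodedata

# Two encodings of the Malayalam chillu N that render alike in most fonts.
# The choice of chillu N is an illustrative assumption, not from the commit.
ATOMIC_CHILLU_N = "\u0d7b"                # atomic code point (Unicode 5.1+)
LEGACY_CHILLU_N = "\u0d28\u0d4d\u200d"    # NA + VIRAMA + ZERO WIDTH JOINER


def normalized_forms(text):
    """Return the text under all four standard Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}


atomic = normalized_forms(ATOMIC_CHILLU_N)
legacy = normalized_forms(LEGACY_CHILLU_N)

# No normalization form maps one spelling onto the other, so a frequency
# table keyed on normalized strings keeps them as separate entries.
still_distinct = all(atomic[form] != legacy[form] for form in atomic)
print(still_distinct)
```

Because the two spellings never converge, counts for what a reader sees as one word would be split across several keys.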
Robyn Speer
90f0e0a88e Update table, remove Galician (only two sources) 2021-03-30 13:17:36 -04:00
Robyn Speer
9bab1024b7 add OSCAR citation 2021-03-30 12:56:10 -04:00
Robyn Speer
fea45fd501 Merge remote-tracking branch 'origin/master' into data-update-2.5 2021-03-30 12:53:09 -04:00
Robyn Speer
8777ad0811 remove Swahili, the data isn't reliable 2021-03-29 18:15:58 -04:00
Robyn Speer
00e60df106 Merge branch 'master' into data-update-2.5 2021-03-29 16:42:24 -04:00
Robyn Speer
fc5c4cdda8 small documentation fixes 2021-03-29 16:41:47 -04:00
Robyn Speer
ec48c0a123 update data and tests for 2.5 2021-03-29 16:18:08 -04:00
Lance Nathan
32093d9efc Merge pull request #89 from LuminosoInsight/dependencies-and-tokens
Rework CJK dependencies and fix a tokenization bug
2021-02-23 15:15:17 -05:00
Robyn Speer
168bb2a6ed fix version, update instructions and changelog 2021-02-18 18:25:16 -05:00
Robyn Speer
de636a804e Use Python packages to find dictionaries for MeCab 2021-02-18 18:18:06 -05:00
Robyn Speer
ed23bf3ebe specifically test that the long sequence underflows to 0 2021-02-18 15:09:31 -05:00
Robyn Speer
75a56b68fb change math for INFERRED_SPACE_FACTOR to not overflow 2021-02-18 14:44:39 -05:00
Lance Nathan
7318f58df9 Merge pull request #88 from LuminosoInsight/version2.4
work with langcodes 3.0, without language_data
2021-02-09 17:36:09 -05:00
Robyn Speer
ad3a5c533f work with langcodes 3.0, without language_data 2021-02-09 17:27:22 -05:00
Robyn Speer
53b1ee2fa0 Merge pull request #84 from LuminosoInsight/add-initial-vowels
Update the "initial vowels" in French/Catalan
2021-02-03 13:47:30 -05:00
Lance Nathan
a31deec580 Update the "initial vowels" in French/Catalan
User LBeaudoux observed (https://github.com/LuminosoInsight/wordfreq/pull/82)
that "Œ and œ should be considered as vowels that might appear at the start of
a word in French".  Further investigation of the French wordfreq list revealed
words in the data starting with other vowels (such as d'yvonne, d'åland, l'ïle,
d'özil).  This PR is a combination of LBeaudoux's PR and the latter fact.

(The updated regex is also used for Catalan, but should have no actual effect.
To the best of our understanding, "y" appears in Catalan only in the digraph
"ny" and in foreign words--the Catalan wordlist contains "york", "by", "city",
several English names, and so forth, but no real Catalan words starting with
"y"; cf "ioga", "iogurt".  The wordlist in fact contained "l'fbi" and "l'nba",
but no cases of "l'" followed by a vowel like the ones found in French.)
2020-10-08 12:23:22 -04:00
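The idea behind the "initial vowels" change can be sketched roughly as follows. The vowel set, the pattern, and the helper name `strip_elision` are all hypothetical and written only for illustration; the project's actual regex may look quite different.

```python
import re

# Hypothetical set of word-initial vowels; the project's real expression may
# differ. It includes the additions the commit mentions: y, å, ï, ö, œ.
INITIAL_VOWELS = "aeiouyàâäåéèêëîïôöœùûü"

# A short elided prefix (such as l' or d') followed by a vowel-initial word.
ELISION_RE = re.compile(rf"^(\w{{1,2}}')([{INITIAL_VOWELS}]\w*)$", re.IGNORECASE)


def strip_elision(token):
    """Drop an elided prefix when the rest of the token starts with a vowel."""
    match = ELISION_RE.match(token)
    return match.group(2) if match else token


for token in ["d'yvonne", "l'œuvre", "d'özil", "l'fbi"]:
    print(token, "->", strip_elision(token))
```

In this sketch, a token such as "l'fbi" is left untouched because the part after the apostrophe does not start with one of the listed vowels.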
Robyn Speer
c8229a5378 update the changelog 2020-10-01 16:12:41 -04:00
Robyn Speer
fd0ac9a272 update README examples 2020-10-01 16:05:43 -04:00
Robyn Speer
8c00a3c500 updated frequency data 2020-09-30 17:56:12 -04:00
Robyn Speer
ad02d96f1b update dependencies and test for consistent results 2020-09-08 16:03:33 -04:00
Lance Nathan
ca4681b361 Merge pull request #77 from LuminosoInsight/regex-apostrophe-fix
Fix regex's inconsistent word breaking around apostrophes
2020-04-28 16:19:40 -04:00
Robyn Speer
0ff812a711 update version and changelog 2020-04-28 15:24:24 -04:00
Robyn Speer
13ce4606b2 fix regex's inconsistent word breaking around apostrophes 2020-04-28 15:19:56 -04:00
Robyn Speer
86ae2a610f update CHANGELOG for 2.3.1 2020-04-22 11:12:02 -04:00
Robyn Speer
26b4175f3b packaging fix: require msgpack >= 1.0 2020-04-22 11:10:03 -04:00
Lance Nathan
7c537134ae Merge pull request #75 from LuminosoInsight/language-match-update
use langcodes 2.0 and deprecate 'match_cutoff'
2020-04-20 14:48:58 -04:00
Robyn Speer
d45bcf97de update changelog for 2.3 2020-04-16 15:51:20 -04:00
Robyn Speer
bf795e6d6c use langcodes 2.0 and deprecate 'match_cutoff' 2020-04-16 14:09:30 -04:00
Moss Collum
40443c9a3b Merge pull request #74 from LuminosoInsight/msgpack-1.0-bugfix
Fix code affected by a breaking change in msgpack 1.0
2020-02-28 13:05:37 -05:00
Lance Nathan
45a002c1e1 Fix code affected by a breaking change in msgpack 1.0
The msgpack readme explains: "Default value of strict_map_key is changed to
True to avoid hashdos. You need to pass strict_map_key=False if you have data
which contain map keys which type is not bytes or str."

chinese.py loads SIMPLIFIED_MAP from disk.  Since it is a str.translate
dictionary, its keys are numbers.  And since it's a dictionary we created
ourselves, there's no hashdos concern, so we can load it with
strict_map_key=False.
2020-02-28 13:02:45 -05:00
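A small sketch of why a `str.translate` dictionary has numeric keys, using only the standard library. The two-entry table here is a hypothetical stand-in and does not reproduce the contents of SIMPLIFIED_MAP.

```python
# str.maketrans builds the kind of table that str.translate expects.
# This two-entry table is a hypothetical stand-in for SIMPLIFIED_MAP.
table = str.maketrans({"愛": "爱", "龍": "龙"})

# The keys are integer code points, not str or bytes.
key_types = {type(key) for key in table}

# Translating looks up each character's code point in the table.
simplified = "愛龍".translate(table)

print(key_types, simplified)
```

Since every key is an `int`, a serialized copy of such a table falls under the case the msgpack readme describes, which is why the commit loads it with `strict_map_key=False`.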
Lance Nathan
e043ebb481 Merge pull request #73 from LuminosoInsight/add-mailmap
Add a mailmap
2019-12-18 13:59:36 -05:00
Robyn Speer
feab8b77fb add a mailmap 2019-12-18 13:52:22 -05:00
Lance Nathan
5f085b2c17 Merge pull request #71 from LuminosoInsight/pytest-fixes
Fix a deprecation warning by using raw strings
2019-08-14 16:25:42 -04:00
Robyn Speer
7690bd5b49 fix a deprecation warning by using raw strings 2019-07-16 17:27:14 -04:00
Lance Nathan
832d8f2fdd Merge pull request #70 from LuminosoInsight/pytest-fixes
Fixes to scripts that accidentally run during tests
2019-04-16 11:41:27 -04:00
Robyn Speer
3d02a88b14 Protect top_n from running on import 2019-04-16 11:33:22 -04:00
Robyn Speer
17b1537f2f ignore the 'scripts' dir when collecting tests 2019-02-20 17:21:07 -05:00
Moss Collum
90bbacb5cb Merge pull request #69 from LuminosoInsight/revert-68-pytest-jenkins
Revert "Build with Pytest on Jenkins"
2019-02-13 18:11:57 -05:00