Robyn Speer
c244ff0d10
readme update: web text comes from OSCAR
2021-04-15 14:45:29 -04:00
Sara Jewett
b13d35e503
Merge pull request #91 from LuminosoInsight/data-update-2.5
...
Version 2.5, incorporating OSCAR data
2021-04-15 14:32:10 -04:00
Robyn Speer
16122083b3
XC was built without Russian Web data; reflect this in the table
...
The Russian sub-corpus of OSCAR is corrupted, so we skipped over it in
the exquisite-corpus build.
2021-04-14 14:28:12 -04:00
Robyn Speer
b6614c1a33
Merge branch 'data-update-2.5' of github.com:LuminosoInsight/wordfreq into data-update-2.5
2021-04-14 14:26:54 -04:00
Robyn Speer
08816a21d1
Remove Malayalam; support for it isn't ready
...
There are Unicode normalization problems with Malayalam -- as best I understand
it, Unicode simply neglected to include normalization forms for Malayalam "chillu"
characters even though they changed how they're represented in Unicode 5.1 and
again in Unicode 9.
The result is that words that print the same end up with multiple entries, with
different codepoint sequences that don't normalize to each other.
I certainly don't know how to resolve this, and it would need to be resolved to
have something that we could reasonably call Malayalam word frequencies.
2021-03-30 14:10:58 -04:00
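The normalization problem described in this commit can be reproduced with the standard library alone. The sketch below is an illustration, not code from the repository: it compares chillu-n as the atomic character U+0D7B (added in Unicode 5.1) with the older spelling NA + VIRAMA + ZWJ, and checks whether any of the four Unicode normalization forms maps one to the other. The helper names are hypothetical.

```python
import unicodedata

# Chillu-n as a single atomic code point (introduced in Unicode 5.1).
ATOMIC_CHILLU_N = "\u0d7b"
# The older representation: NA + VIRAMA + ZERO WIDTH JOINER.
SEQUENCE_CHILLU_N = "\u0d28\u0d4d\u200d"

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalized_forms(text):
    """Return the text under each of the four normalization forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


def ever_equal(a, b):
    """True if some normalization form maps both strings to the same result."""
    forms_a, forms_b = normalized_forms(a), normalized_forms(b)
    return any(forms_a[form] == forms_b[form] for form in FORMS)


if __name__ == "__main__":
    # Both spellings render as the same glyph in most fonts, yet no
    # normalization form unifies them, so a word list keyed on the raw
    # strings would count them as separate entries.
    print(ever_equal(ATOMIC_CHILLU_N, SEQUENCE_CHILLU_N))
```

Unifying the two spellings would take a custom mapping layered on top of normalization, which is the unresolved part the commit message refers to.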
Robyn Speer
90f0e0a88e
Update table, remove Galician (only two sources)
2021-03-30 13:17:36 -04:00
Robyn Speer
9bab1024b7
add OSCAR citation
2021-03-30 12:56:10 -04:00
Robyn Speer
fea45fd501
Merge remote-tracking branch 'origin/master' into data-update-2.5
2021-03-30 12:53:09 -04:00
Robyn Speer
8777ad0811
remove Swahili, the data isn't reliable
2021-03-29 18:15:58 -04:00
Robyn Speer
00e60df106
Merge branch 'master' into data-update-2.5
2021-03-29 16:42:24 -04:00
Robyn Speer
fc5c4cdda8
small documentation fixes
2021-03-29 16:41:47 -04:00
Robyn Speer
ec48c0a123
update data and tests for 2.5
2021-03-29 16:18:08 -04:00
Lance Nathan
32093d9efc
Merge pull request #89 from LuminosoInsight/dependencies-and-tokens
...
Rework CJK dependencies and fix a tokenization bug
2021-02-23 15:15:17 -05:00
Robyn Speer
168bb2a6ed
fix version, update instructions and changelog
2021-02-18 18:25:16 -05:00
Robyn Speer
de636a804e
Use Python packages to find dictionaries for MeCab
2021-02-18 18:18:06 -05:00
Robyn Speer
ed23bf3ebe
specifically test that the long sequence underflows to 0
2021-02-18 15:09:31 -05:00
Robyn Speer
75a56b68fb
change math for INFERRED_SPACE_FACTOR to not overflow
2021-02-18 14:44:39 -05:00
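The two commits above concern floating-point overflow versus underflow. The sketch below shows one plausible shape of that problem; it is not the repository's actual code. The value 10.0 for INFERRED_SPACE_FACTOR and both function names are assumptions made for illustration. Raising a float to a large positive power raises OverflowError in Python, whereas raising it to a large negative power quietly underflows to 0.0.

```python
# Assumed value, for illustration only.
INFERRED_SPACE_FACTOR = 10.0


def penalty_by_division(freq, n_tokens):
    """Divide by factor ** (n - 1); the power overflows for long sequences."""
    return freq / INFERRED_SPACE_FACTOR ** (n_tokens - 1)


def penalty_by_negative_power(freq, n_tokens):
    """Multiply by factor ** -(n - 1); the power underflows to 0.0 instead."""
    return freq * INFERRED_SPACE_FACTOR ** -(n_tokens - 1)


if __name__ == "__main__":
    # For short sequences the two formulations agree up to rounding.
    print(penalty_by_division(1.0, 3), penalty_by_negative_power(1.0, 3))
    # For a very long sequence only the second formulation returns a value.
    print(penalty_by_negative_power(1e-3, 1000))
```

With the second formulation, an absurdly long token sequence yields a frequency of 0 rather than an exception, which matches what the test commit above says it checks.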
Lance Nathan
7318f58df9
Merge pull request #88 from LuminosoInsight/version2.4
...
work with langcodes 3.0, without language_data
2021-02-09 17:36:09 -05:00
Robyn Speer
ad3a5c533f
work with langcodes 3.0, without language_data
2021-02-09 17:27:22 -05:00
Robyn Speer
53b1ee2fa0
Merge pull request #84 from LuminosoInsight/add-initial-vowels
...
Update the "initial vowels" in French/Catalan
2021-02-03 13:47:30 -05:00
Lance Nathan
a31deec580
Update the "initial vowels" in French/Catalan
...
User LBeaudoux observed (https://github.com/LuminosoInsight/wordfreq/pull/82 )
that "Œ and œ should be considered as vowels that might appear at the start of
a word in French". Further investigation of the French wordfreq list revealed
words in the data starting with other vowels (such as d'yvonne, d'åland, l'ïle,
d'özil). This PR is a combination of LBeaudoux's PR and the latter fact.
(The updated regex is also used for Catalan, but should have no actual effect.
To the best of our understanding, "y" appears in Catalan only in the digraph
"ny" and in foreign words--the Catalan wordlist contains "york", "by", "city",
several English names, and so forth, but no real Catalan words starting with
"y"; cf "ioga", "iogurt". The wordlist in fact contained "l'fbi" and "l'nba",
but no cases of "l'" followed by a vowel like the ones found in French.)
2020-10-08 12:23:22 -04:00
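As a rough illustration of what an "initial vowels" check does, the sketch below splits an elided article or preposition from the following word only when that word starts with a character from a vowel class. The vowel class, the regex, and the function name are hypothetical and are not the expression used in wordfreq; the class simply includes the extra letters mentioned in the commit message (y, å, ï, ö, œ).

```python
import re

# Hypothetical vowel class, including the letters named in the commit message.
INITIAL_VOWELS = "aehiouyàâåéèêëîïôöùûœ"

# One or two letters, an apostrophe, then a lookahead for an initial vowel.
ELISION = re.compile(rf"^(\w{{1,2}})'(?=[{INITIAL_VOWELS}])", re.IGNORECASE)


def split_elision(token):
    """Split "l'œil" into ["l'", "œil"]; leave other tokens whole."""
    match = ELISION.match(token)
    if match is None:
        return [token]
    return [token[:match.end()], token[match.end():]]


if __name__ == "__main__":
    for token in ["l'œil", "d'özil", "d'yvonne", "l'fbi"]:
        print(token, split_elision(token))
```

Under this sketch "l'fbi" stays in one piece because "f" is not in the vowel class, which mirrors the Catalan observation in the commit message.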
Robyn Speer
c8229a5378
update the changelog
2020-10-01 16:12:41 -04:00
Robyn Speer
fd0ac9a272
update README examples
2020-10-01 16:05:43 -04:00
Robyn Speer
8c00a3c500
updated frequency data
2020-09-30 17:56:12 -04:00
Lance Nathan
ca4681b361
Merge pull request #77 from LuminosoInsight/regex-apostrophe-fix
...
Fix regex's inconsistent word breaking around apostrophes
2020-04-28 16:19:40 -04:00
Robyn Speer
0ff812a711
update version and changelog
2020-04-28 15:24:24 -04:00
Robyn Speer
13ce4606b2
fix regex's inconsistent word breaking around apostrophes
2020-04-28 15:19:56 -04:00
Robyn Speer
86ae2a610f
update CHANGELOG for 2.3.1
2020-04-22 11:12:02 -04:00
Robyn Speer
26b4175f3b
packaging fix: require msgpack >= 1.0
2020-04-22 11:10:03 -04:00
Lance Nathan
7c537134ae
Merge pull request #75 from LuminosoInsight/language-match-update
...
use langcodes 2.0 and deprecate 'match_cutoff'
2020-04-20 14:48:58 -04:00
Robyn Speer
d45bcf97de
update changelog for 2.3
2020-04-16 15:51:20 -04:00
Robyn Speer
bf795e6d6c
use langcodes 2.0 and deprecate 'match_cutoff'
2020-04-16 14:09:30 -04:00
Moss Collum
40443c9a3b
Merge pull request #74 from LuminosoInsight/msgpack-1.0-bugfix
...
Fix code affected by a breaking change in msgpack 1.0
2020-02-28 13:05:37 -05:00
Lance Nathan
45a002c1e1
Fix code affected by a breaking change in msgpack 1.0
...
The msgpack readme explains: "Default value of strict_map_key is changed to
True to avoid hashdos. You need to pass strict_map_key=False if you have data
which contain map keys which type is not bytes or str."
chinese.py loads SIMPLIFIED_MAP from disk. Since it is a str.translate
dictionary, its keys are numbers. And since it's a dictionary we created
ourselves, there's no hashdos concern, so we can load it with
strict_map_key=False.
2020-02-28 13:02:45 -05:00
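To see why the keys of a str.translate dictionary are numbers, consider the minimal sketch below. The two-entry mapping is a made-up stand-in, not the contents of SIMPLIFIED_MAP, and the msgpack load itself is left out so that the example needs only the standard library.

```python
# str.translate looks characters up by integer code point, so a translation
# table is a dict whose keys are ints rather than str or bytes.
# Hypothetical stand-in for a traditional-to-simplified mapping.
EXAMPLE_MAP = {
    ord("漢"): "汉",
    ord("語"): "语",
}


def simplify(text, table=EXAMPLE_MAP):
    """Apply a code-point-keyed translation table to the text."""
    return text.translate(table)


if __name__ == "__main__":
    print(simplify("漢語"))
    # Every key is an int. Per the msgpack readme quoted above, a map with
    # such keys is rejected on load unless strict_map_key=False is passed.
    print(all(isinstance(key, int) for key in EXAMPLE_MAP))
```

str.maketrans builds tables of the same shape, so a mapping created that way has the same integer keys when it is serialized.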
Lance Nathan
e043ebb481
Merge pull request #73 from LuminosoInsight/add-mailmap
...
Add a mailmap
2019-12-18 13:59:36 -05:00
Robyn Speer
feab8b77fb
add a mailmap
2019-12-18 13:52:22 -05:00
Lance Nathan
5f085b2c17
Merge pull request #71 from LuminosoInsight/pytest-fixes
...
Fix a deprecation warning by using raw strings
2019-08-14 16:25:42 -04:00
Robyn Speer
7690bd5b49
fix a deprecation warning by using raw strings
2019-07-16 17:27:14 -04:00
Lance Nathan
832d8f2fdd
Merge pull request #70 from LuminosoInsight/pytest-fixes
...
Fixes to scripts that accidentally run during tests
2019-04-16 11:41:27 -04:00
Robyn Speer
3d02a88b14
Protect top_n from running on import
2019-04-16 11:33:22 -04:00
Robyn Speer
17b1537f2f
ignore the 'scripts' dir when collecting tests
2019-02-20 17:21:07 -05:00
Moss Collum
90bbacb5cb
Merge pull request #69 from LuminosoInsight/revert-68-pytest-jenkins
...
Revert "Build with Pytest on Jenkins"
2019-02-13 18:11:57 -05:00
Moss Collum
50ea040d65
Revert "Build with Pytest on Jenkins"
2019-02-13 18:11:44 -05:00
Lance Nathan
f467504835
Merge pull request #68 from LuminosoInsight/pytest-jenkins
...
Build with Pytest on Jenkins
2019-02-13 17:57:16 -05:00
Moss Collum
e014f1abf7
Build with Pytest on Jenkins
2019-02-13 17:56:20 -05:00
Robyn Speer
a3834180c9
update changelog for v2.2.1
2019-02-05 15:58:10 -05:00
Lance Nathan
96b9808550
Merge pull request #66 from LuminosoInsight/update-msgpack-call
...
Update msgpack parameter
2019-02-05 11:17:07 -05:00
Robyn Speer
dd72051929
update msgpack call in scripts/make_chinese_mapping
2019-02-05 11:16:22 -05:00
Robyn Speer
61a1604b38
update encoding='utf-8' to raw=False
2019-02-04 14:57:38 -05:00
Moss Collum
65a6a89993
Add Jenkinsfile to drive internal build scripts
2019-02-01 19:05:35 -05:00