Lance Nathan
4c0b29f460
Merge pull request #89 from LuminosoInsight/dependencies-and-tokens
...
Rework CJK dependencies and fix a tokenization bug
2021-02-23 15:15:17 -05:00
Robyn Speer
d99ac1051a
fix version, update instructions and changelog
2021-02-18 18:25:16 -05:00
Robyn Speer
2cc58d68ad
Use Python packages to find dictionaries for MeCab
2021-02-18 18:18:06 -05:00
Robyn Speer
6b97d093b6
specifically test that the long sequence underflows to 0
2021-02-18 15:09:31 -05:00
Robyn Speer
bd57b64d00
change math for INFERRED_SPACE_FACTOR to not overflow
2021-02-18 14:44:39 -05:00
Lance Nathan
02c3cbe3fb
Merge pull request #88 from LuminosoInsight/version2.4
...
work with langcodes 3.0, without language_data
2021-02-09 17:36:09 -05:00
Robyn Speer
f71acec2d7
work with langcodes 3.0, without language_data
2021-02-09 17:27:22 -05:00
Robyn Speer
7a742499a4
Merge pull request #84 from LuminosoInsight/add-initial-vowels
...
Update the "initial vowels" in French/Catalan
2021-02-03 13:47:30 -05:00
Lance Nathan
917bcdebaa
Update the "initial vowels" in French/Catalan
...
User LBeaudoux observed (https://github.com/LuminosoInsight/wordfreq/pull/82)
that "Œ and œ should be considered as vowels that might appear at the start of
a word in French". Further investigation of the French wordfreq list revealed
words in the data starting with other vowels (such as d'yvonne, d'åland, l'ïle,
d'özil). This PR is a combination of LBeaudoux's PR and the latter fact.
(The updated regex is also used for Catalan, but should have no actual effect.
To the best of our understanding, "y" appears in Catalan only in the digraph
"ny" and in foreign words--the Catalan wordlist contains "york", "by", "city",
several English names, and so forth, but no real Catalan words starting with
"y"; cf. "ioga", "iogurt". The wordlist did in fact contain "l'fbi" and "l'nba",
but no cases of "l'" followed by a vowel like the ones found in French.)
2020-10-08 12:23:22 -04:00
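The apostrophe handling this commit describes can be sketched roughly as below. This is a minimal illustration, not wordfreq's actual code: the names INITIAL_VOWELS, ELISION and split_elision are hypothetical, the vowel class is an abbreviated assumption covering only the examples quoted in the message, and tokens are assumed to be lowercased already.

```python
import re

# Assumed, abbreviated vowel class (not the real wordfreq expression).
# "h" is included because French also elides before a mute h ("l'homme").
INITIAL_VOWELS = "aehiouyàâåéèêîïôöœ"

# Hypothetical pattern: an elided prefix ending in an apostrophe, followed by
# a word that starts with one of the vowels above.
ELISION = re.compile(rf"^(\w+')([{INITIAL_VOWELS}]\w*)$")


def split_elision(token):
    """Split "d'yvonne" into ["d'", "yvonne"]; leave other tokens whole."""
    match = ELISION.match(token)
    if match is None:
        return [token]
    return [match.group(1), match.group(2)]
```

Under these assumptions, "d'yvonne" and "l'œuvre" are split after the apostrophe, while "l'fbi" stays whole because "f" is not in the vowel class.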
Robyn Speer
a8915d67f7
update the changelog
2020-10-01 16:12:41 -04:00
Robyn Speer
5986342bc6
update README examples
2020-10-01 16:05:43 -04:00
Robyn Speer
fa98f0b2f6
updated frequency data
2020-09-30 17:56:12 -04:00
Lance Nathan
e3f87d4aed
Merge pull request #77 from LuminosoInsight/regex-apostrophe-fix
...
Fix regex's inconsistent word breaking around apostrophes
2020-04-28 16:19:40 -04:00
Robyn Speer
becf94f767
update version and changelog
2020-04-28 15:24:24 -04:00
Robyn Speer
96e7792a4a
fix regex's inconsistent word breaking around apostrophes
2020-04-28 15:19:56 -04:00
Robyn Speer
3b7382d770
update CHANGELOG for 2.3.1
2020-04-22 11:12:02 -04:00
Robyn Speer
59f4a08920
packaging fix: require msgpack >= 1.0
2020-04-22 11:10:03 -04:00
Lance Nathan
af22c03609
Merge pull request #75 from LuminosoInsight/language-match-update
...
use langcodes 2.0 and deprecate 'match_cutoff'
2020-04-20 14:48:58 -04:00
Robyn Speer
258670b823
update changelog for 2.3
2020-04-16 15:51:20 -04:00
Robyn Speer
3aeeeb64c7
use langcodes 2.0 and deprecate 'match_cutoff'
2020-04-16 14:09:30 -04:00
Moss Collum
33bfb1409d
Merge pull request #74 from LuminosoInsight/msgpack-1.0-bugfix
...
Fix code affected by a breaking change in msgpack 1.0
2020-02-28 13:05:37 -05:00
Lance Nathan
86e988b838
Fix code affected by a breaking change in msgpack 1.0
...
The msgpack readme explains: "Default value of strict_map_key is changed to
True to avoid hashdos. You need to pass strict_map_key=False if you have data
which contain map keys which type is not bytes or str."
chinese.py loads SIMPLIFIED_MAP from disk. Since it is a str.translate
dictionary, its keys are numbers. And since it's a dictionary we created
ourselves, there's no hashdos concern, so we can load it with
strict_map_key=False.
2020-02-28 13:02:45 -05:00
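A small sketch of why the keys are numbers and how such a table can be round-tripped. The two-entry table is a made-up stand-in for SIMPLIFIED_MAP, and the msgpack import is guarded only so the first half runs without the package; the strict_map_key=False argument is the one named in the commit message.

```python
# Hypothetical two-entry table; the real SIMPLIFIED_MAP is far larger.
# str.translate looks characters up by code point, so the keys are ints.
table = {ord("漢"): "汉", ord("語"): "语"}
simplified = "漢語".translate(table)

try:
    import msgpack
except ImportError:  # msgpack is optional for this sketch
    msgpack = None

if msgpack is not None:
    packed = msgpack.packb(table)
    # With msgpack >= 1.0 the default strict_map_key=True rejects the int
    # keys, so it has to be turned off explicitly for trusted data like this.
    restored = msgpack.unpackb(packed, raw=False, strict_map_key=False)
```

Disabling the check is reasonable here only because the file is produced by the project itself rather than received from an untrusted source.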
Lance Nathan
401889d7c8
Merge pull request #73 from LuminosoInsight/add-mailmap
...
Add a mailmap
2019-12-18 13:59:36 -05:00
Robyn Speer
f91cdb3e9b
add a mailmap
2019-12-18 13:52:22 -05:00
Lance Nathan
cea8dcbea9
Merge pull request #71 from LuminosoInsight/pytest-fixes
...
Fix a deprecation warning by using raw strings
2019-08-14 16:25:42 -04:00
Robyn Speer
55e72977a7
fix a deprecation warning by using raw strings
2019-07-16 17:27:14 -04:00
Lance Nathan
170e3c6536
Merge pull request #70 from LuminosoInsight/pytest-fixes
...
Fixes to scripts that accidentally run during tests
2019-04-16 11:41:27 -04:00
Robyn Speer
1f61c9b27a
Protect top_n from running on import
2019-04-16 11:33:22 -04:00
Robyn Speer
bb1bd50c44
ignore the 'scripts' dir when collecting tests
2019-02-20 17:21:07 -05:00
Moss Collum
a17587dcbb
Merge pull request #69 from LuminosoInsight/revert-68-pytest-jenkins
...
Revert "Build with Pytest on Jenkins"
2019-02-13 18:11:57 -05:00
Moss Collum
26cbb5a7c8
Revert "Build with Pytest on Jenkins"
2019-02-13 18:11:44 -05:00
Lance Nathan
53ec5d87d2
Merge pull request #68 from LuminosoInsight/pytest-jenkins
...
Build with Pytest on Jenkins
2019-02-13 17:57:16 -05:00
Moss Collum
92c3ca0a66
Build with Pytest on Jenkins
2019-02-13 17:56:20 -05:00
Robyn Speer
0931f1297d
update changelog for v2.2.1
2019-02-05 15:58:10 -05:00
Lance Nathan
1442ee044d
Merge pull request #66 from LuminosoInsight/update-msgpack-call
...
Update msgpack parameter
2019-02-05 11:17:07 -05:00
Robyn Speer
36fd42ca08
update msgpack call in scripts/make_chinese_mapping
2019-02-05 11:16:22 -05:00
Robyn Speer
c7a14cd4ab
update encoding='utf-8' to raw=False
2019-02-04 14:57:38 -05:00
Moss Collum
0b69118558
Add Jenkinsfile to drive internal build scripts
2019-02-01 19:05:35 -05:00
Robyn Speer
4cd7b4bada
Allow a wider range of 'regex' versions
...
The behavior of segmentation shouldn't change within this range, and it
includes the version currently used by SpaCy.
2018-10-25 11:07:55 -04:00
Lance Nathan
fa8be1962b
Merge pull request #62 from LuminosoInsight/name-update
...
Update my name and the Zenodo citation
2018-10-03 17:30:47 -04:00
Robyn Speer
51ca052b62
Update my name and the Zenodo citation
2018-10-03 17:27:10 -04:00
Lance Nathan
bc12599010
Merge pull request #60 from LuminosoInsight/gender-neutral-at
...
Recognize "@" in gender-neutral word endings as part of the token
2018-07-24 18:16:31 -04:00
Rob Speer
d9fc6ec42c
update the changelog for version 2.2
2018-07-23 16:38:39 -04:00
Rob Speer
0644c8920a
Update README to describe @ tokenization
2018-07-23 11:21:44 -04:00
Rob Speer
d06a6a48c5
include data from xc rebuild
2018-07-15 01:01:35 -04:00
Rob Speer
b2d242e8bf
Recognize "@" in gender-neutral word endings as part of the token
2018-07-03 13:22:56 -04:00
Rob Speer
ca9cf7d90f
update the CHANGELOG for MeCab fix
2018-06-26 11:31:03 -04:00
Lance Nathan
3961a28973
Merge pull request #59 from LuminosoInsight/korean-install-fixes
...
Korean install fixes
2018-06-26 11:08:06 -04:00
Lance Nathan
a619ba6457
Merge pull request #58 from LuminosoInsight/significant-figures
...
Round wordfreq output to 3 sig. figs, and update documentation
2018-06-25 18:53:39 -04:00
Rob Speer
676686fda1
Fix instructions and search path for mecab-ko-dic
...
I'm starting a new Python environment on a new Ubuntu installation. You
never know when a huge yak will show up and demand to be shaved.
I tried following the directions in the README, and found that a couple
of steps were missing. I've added those.
When you follow those steps, the MeCab Korean dictionary appears to be
installed in `/usr/lib/x86_64-linux-gnu/mecab/dic`, which was not among
the paths we were checking, so I've added it as a search path.
2018-06-21 15:56:54 -04:00
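The search-path idea in this commit can be sketched as an ordered lookup over candidate directories. This is an illustration only, not the project's code: find_mecab_dictionary and SEARCH_PATHS are hypothetical names, and the second path in the list is an assumed extra location included just to show the fallback order.

```python
import os


def find_mecab_dictionary(names, search_paths):
    """Return the first existing directory path/name, trying paths in order."""
    for path in search_paths:
        for name in names:
            candidate = os.path.join(path, name)
            if os.path.isdir(candidate):
                return candidate
    raise OSError(
        "No MeCab dictionary named %s found in %s" % (names, search_paths)
    )


# The first entry is the location named in the commit message; the second is
# an assumed extra location, listed only to show the ordered fallback.
SEARCH_PATHS = [
    "/usr/lib/x86_64-linux-gnu/mecab/dic",
    "/usr/local/lib/mecab/dic",
]
```

A call such as find_mecab_dictionary(["mecab-ko-dic"], SEARCH_PATHS) returns the first matching directory and raises OSError when the dictionary is not installed in any listed location.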