Commit Graph

585 Commits

Author SHA1 Message Date
Robyn Speer
3b7382d770 update CHANGELOG for 2.3.1 2020-04-22 11:12:02 -04:00
Robyn Speer
59f4a08920 packaging fix: require msgpack >= 1.0 2020-04-22 11:10:03 -04:00
Lance Nathan
af22c03609
Merge pull request #75 from LuminosoInsight/language-match-update
use langcodes 2.0 and deprecate 'match_cutoff'
2020-04-20 14:48:58 -04:00
Robyn Speer
258670b823 update changelog for 2.3 2020-04-16 15:51:20 -04:00
Robyn Speer
3aeeeb64c7 use langcodes 2.0 and deprecate 'match_cutoff' 2020-04-16 14:09:30 -04:00
Moss Collum
33bfb1409d
Merge pull request #74 from LuminosoInsight/msgpack-1.0-bugfix
Fix code affected by a breaking change in msgpack 1.0
2020-02-28 13:05:37 -05:00
Lance Nathan
86e988b838 Fix code affected by a breaking change in msgpack 1.0
The msgpack readme explains: "Default value of strict_map_key is changed to
True to avoid hashdos. You need to pass strict_map_key=False if you have data
which contain map keys which type is not bytes or str."

chinese.py loads SIMPLIFIED_MAP from disk.  Since it is a str.translate
dictionary, its keys are numbers.  And since it's a dictionary we created
ourselves, there's no hashdos concern, so we can load it with
strict_map_key=False.
2020-02-28 13:02:45 -05:00
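
The point about numeric keys in the message above can be checked with the standard library alone. The sketch below is not the code from chinese.py: `simplified_map` is a hypothetical two-entry stand-in for SIMPLIFIED_MAP, used only to show that a `str.translate` table is keyed by integer code points, which is the kind of key that msgpack 1.0 rejects unless `strict_map_key=False` is passed.

```python
# Hypothetical two-entry mapping; the real SIMPLIFIED_MAP is assumed to be far larger.
simplified_map = str.maketrans({"漢": "汉", "語": "语"})

# str.maketrans converts single-character keys into integer code points,
# so every key in the resulting table is an int rather than a str or bytes.
key_types = {type(key) for key in simplified_map}

# The table is then applied with str.translate.
converted = "漢語".translate(simplified_map)
print(key_types, converted)
```

Because the table is built locally rather than received from an untrusted source, relaxing the key check when loading it carries none of the hashdos risk that the stricter default guards against.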
Lance Nathan
401889d7c8
Merge pull request #73 from LuminosoInsight/add-mailmap
Add a mailmap
2019-12-18 13:59:36 -05:00
Robyn Speer
f91cdb3e9b add a mailmap 2019-12-18 13:52:22 -05:00
Lance Nathan
cea8dcbea9
Merge pull request #71 from LuminosoInsight/pytest-fixes
Fix a deprecation warning by using raw strings
2019-08-14 16:25:42 -04:00
Robyn Speer
55e72977a7 fix a deprecation warning by using raw strings 2019-07-16 17:27:14 -04:00
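
As an illustration of the kind of change this commit describes, here is a minimal sketch. The pattern and the `find_numbers` helper are hypothetical and do not come from the repository. In a non-raw string literal, a sequence such as `\d` is an invalid string escape, which Python reports with a warning; a raw string passes the backslash through to the regex engine unchanged.

```python
import re

# The r prefix keeps "\d" as a literal backslash followed by "d",
# so the string parser has no invalid escape sequence to warn about.
DIGITS = re.compile(r"\d+")


def find_numbers(text):
    """Return every run of digits found in the text."""
    return DIGITS.findall(text)


print(find_numbers("v2.3.1 on 2020-04-22"))
```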
Lance Nathan
170e3c6536
Merge pull request #70 from LuminosoInsight/pytest-fixes
Fixes to scripts that accidentally run during tests
2019-04-16 11:41:27 -04:00
Robyn Speer
1f61c9b27a Protect top_n from running on import 2019-04-16 11:33:22 -04:00
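
The usual way to keep a script from running when a test collector imports it is a `__main__` guard. The sketch below assumes that approach; the `top_n` and `main` functions here are hypothetical stand-ins, not the contents of the actual script.

```python
from collections import Counter


def top_n(words, n=3):
    """Return the n most common words with their counts."""
    return Counter(words).most_common(n)


def main():
    sample = "the cat and the dog and the bird".split()
    for word, count in top_n(sample):
        print(word, count)


# Runs only when the file is executed as a script, not when it is imported.
if __name__ == "__main__":
    main()
```

With the guard in place, importing the module only defines the functions, so test collection has no side effects.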
Robyn Speer
bb1bd50c44 ignore the 'scripts' dir when collecting tests 2019-02-20 17:21:07 -05:00
Moss Collum
a17587dcbb
Merge pull request #69 from LuminosoInsight/revert-68-pytest-jenkins
Revert "Build with Pytest on Jenkins"
2019-02-13 18:11:57 -05:00
Moss Collum
26cbb5a7c8
Revert "Build with Pytest on Jenkins" 2019-02-13 18:11:44 -05:00
Lance Nathan
53ec5d87d2
Merge pull request #68 from LuminosoInsight/pytest-jenkins
Build with Pytest on Jenkins
2019-02-13 17:57:16 -05:00
Moss Collum
92c3ca0a66
Build with Pytest on Jenkins 2019-02-13 17:56:20 -05:00
Robyn Speer
0931f1297d update changelog for v2.2.1 2019-02-05 15:58:10 -05:00
Lance Nathan
1442ee044d
Merge pull request #66 from LuminosoInsight/update-msgpack-call
Update msgpack parameter
2019-02-05 11:17:07 -05:00
Robyn Speer
36fd42ca08 update msgpack call in scripts/make_chinese_mapping 2019-02-05 11:16:22 -05:00
Robyn Speer
c7a14cd4ab update encoding='utf-8' to raw=False 2019-02-04 14:57:38 -05:00
Moss Collum
0b69118558 Add Jenkinsfile to drive internal build scripts 2019-02-01 19:05:35 -05:00
Robyn Speer
4cd7b4bada Allow a wider range of 'regex' versions
The behavior of segmentation shouldn't change within this range, and it
includes the version currently used by SpaCy.
2018-10-25 11:07:55 -04:00
Lance Nathan
fa8be1962b
Merge pull request #62 from LuminosoInsight/name-update
Update my name and the Zenodo citation
2018-10-03 17:30:47 -04:00
Robyn Speer
51ca052b62 Update my name and the Zenodo citation 2018-10-03 17:27:10 -04:00
Lance Nathan
bc12599010
Merge pull request #60 from LuminosoInsight/gender-neutral-at
Recognize "@" in gender-neutral word endings as part of the token
2018-07-24 18:16:31 -04:00
Rob Speer
d9fc6ec42c update the changelog for version 2.2 2018-07-23 16:38:39 -04:00
Rob Speer
0644c8920a Update README to describe @ tokenization 2018-07-23 11:21:44 -04:00
Rob Speer
d06a6a48c5 include data from xc rebuild 2018-07-15 01:01:35 -04:00
Rob Speer
b2d242e8bf Recognize "@" in gender-neutral word endings as part of the token 2018-07-03 13:22:56 -04:00
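
To make the idea concrete, here is a rough sketch of keeping "@" inside a token when it appears in a gender-neutral ending. The regular expression and the `tokenize` helper are assumptions made for illustration, not the pattern the library uses: the sketch accepts "@" or "@s" only at the end of a word, so an address such as user@example.com is still split.

```python
import re

# A run of word characters, optionally followed by "@" or "@s",
# provided no further word character follows (so "user@example" is not joined).
TOKEN = re.compile(r"\w+(?:@s?(?!\w))?")


def tokenize(text):
    """Split text into tokens, keeping a word-final '@' or '@s' attached."""
    return TOKEN.findall(text)


print(tokenize("l@s amig@s"))
print(tokenize("user@example.com"))
```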
Rob Speer
ca9cf7d90f update the CHANGELOG for MeCab fix 2018-06-26 11:31:03 -04:00
Lance Nathan
3961a28973
Merge pull request #59 from LuminosoInsight/korean-install-fixes
Korean install fixes
2018-06-26 11:08:06 -04:00
Lance Nathan
a619ba6457
Merge pull request #58 from LuminosoInsight/significant-figures
Round wordfreq output to 3 sig. figs, and update documentation
2018-06-25 18:53:39 -04:00
Rob Speer
676686fda1 Fix instructions and search path for mecab-ko-dic
I'm starting a new Python environment on a new Ubuntu installation. You
never know when a huge yak will show up and demand to be shaved.

I tried following the directions in the README, and found that a couple
of steps were missing. I've added those.

When you follow those steps, it appears to install the MeCab Korean
dictionary in `/usr/lib/x86_64-linux-gnu/mecab/dic`, which was none
of the paths we were checking, so I've added that as a search path.
2018-06-21 15:56:54 -04:00
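
Adding a search path, as described above, amounts to extending a list of candidate directories that is checked in order. The sketch below shows that pattern under stated assumptions: `find_dictionary` and `SEARCH_PATHS` are hypothetical names, and only the one path quoted in the commit message is taken from the source.

```python
import os


def find_dictionary(candidates):
    """Return the first candidate that is an existing directory, or None."""
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None


# Only this path comes from the commit message; a real list would hold several.
SEARCH_PATHS = ["/usr/lib/x86_64-linux-gnu/mecab/dic"]
found = find_dictionary(SEARCH_PATHS)
print(found)
```

Returning None instead of raising lets the caller decide how to report a missing dictionary.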
Rob Speer
5e05c942ac doctest the README 2018-06-18 17:11:42 -04:00
Rob Speer
1dc763c9c5 update README and CHANGELOG 2018-06-18 15:21:43 -04:00
Rob Speer
c3b32b3c4a Round frequencies to 3 significant digits 2018-06-18 15:21:33 -04:00
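
For reference, rounding to three significant digits can be done with the `g` format specifier. This is a minimal sketch of the technique, not the library's own implementation, and `round_sig` is a hypothetical helper name.

```python
def round_sig(value, digits=3):
    """Round a float to the given number of significant digits."""
    if value == 0:
        return 0.0
    # The "g" presentation type keeps `digits` significant digits.
    return float(f"{value:.{digits}g}")


print(round_sig(0.00012345))
print(round_sig(1234.5))
```

Unlike `round(value, 3)`, which keeps three decimal places, this keeps three significant digits regardless of magnitude, which matters for very small word frequencies.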
Lance Nathan
0911e90ba0
Merge pull request #57 from LuminosoInsight/version2.1
Version 2.1
2018-06-18 12:06:47 -04:00
Rob Speer
2b85a1cef2 update table in README: Dutch has 5 sources 2018-06-18 11:43:52 -04:00
Rob Speer
52aae3459d fix typo in previous changelog entry 2018-06-18 10:52:28 -04:00
Rob Speer
2f6b87c86b relax the test that assumed the Chinese list has few ASCII words 2018-06-15 16:29:15 -04:00
Rob Speer
57f676f4a6 fixes to tests, including that 'test.py' wasn't found by pytest 2018-06-15 15:48:41 -04:00
Rob Speer
93e3e03c60 update tests to include new languages
Also, it's easy to say `>=` in pytest
2018-06-12 17:55:44 -04:00
Rob Speer
93ddc192d8 bump version to 2.1; add test requirement for pytest 2018-06-12 17:48:24 -04:00
Rob Speer
ff4f7bf3f6 Merge remote-tracking branch 'origin/pytest' into version2.1 2018-06-12 17:46:48 -04:00
Rob Speer
db43e0e25c New data import from exquisite-corpus
Significant changes in this data include:

- Added ParaCrawl, a multilingual Web crawl, as a data source.
  This supplements the Leeds Web crawl with more modern data.

  ParaCrawl seems to provide a more balanced sample of Web pages than
  Common Crawl, which we once considered adding, but found that its data
  heavily overrepresented TripAdvisor and Urban Dictionary in a way that
  was very apparent in the word frequencies.

  ParaCrawl has a fairly subtle impact on the top terms, mostly boosting
  the frequencies of numbers and months.

- Fixes to inconsistencies where words from different sources were going
  through different processing steps. As a result of these
  inconsistencies, some word lists contained words that couldn't
  actually be looked up because they would be normalized to something
  else.

  All words should now go through the aggressive normalization of
  `lossy_tokenize`.

- Fixes to inconsistencies regarding what counts as a word.
  Non-punctuation, non-emoji symbols such as `=` were slipping through
  in some cases but not others.

- As a result of the new data, Latvian becomes a supported language and
  Czech gets promoted to a 'large' language.
2018-06-12 17:22:43 -04:00
Rob Speer
96a01b9685 port remaining tests to pytest 2018-06-01 16:40:51 -04:00
Rob Speer
863d5be522 port test.py and test_chinese.py to pytest 2018-06-01 16:33:06 -04:00
Rob Speer
8fcae9978e Use data from fixed XC build - mostly changes Chinese 2018-05-30 13:09:20 -04:00