Commit Graph

638 Commits

Author SHA1 Message Date
Lance Nathan
ca4681b361 Merge pull request #77 from LuminosoInsight/regex-apostrophe-fix
Fix regex's inconsistent word breaking around apostrophes
2020-04-28 16:19:40 -04:00
Robyn Speer
0ff812a711 update version and changelog 2020-04-28 15:24:24 -04:00
Robyn Speer
13ce4606b2 fix regex's inconsistent word breaking around apostrophes 2020-04-28 15:19:56 -04:00
Robyn Speer
86ae2a610f update CHANGELOG for 2.3.1 2020-04-22 11:12:02 -04:00
Robyn Speer
26b4175f3b packaging fix: require msgpack >= 1.0 2020-04-22 11:10:03 -04:00
Lance Nathan
7c537134ae Merge pull request #75 from LuminosoInsight/language-match-update
use langcodes 2.0 and deprecate 'match_cutoff'
2020-04-20 14:48:58 -04:00
Robyn Speer
d45bcf97de update changelog for 2.3 2020-04-16 15:51:20 -04:00
Robyn Speer
bf795e6d6c use langcodes 2.0 and deprecate 'match_cutoff' 2020-04-16 14:09:30 -04:00
Moss Collum
40443c9a3b Merge pull request #74 from LuminosoInsight/msgpack-1.0-bugfix
Fix code affected by a breaking change in msgpack 1.0
2020-02-28 13:05:37 -05:00
Lance Nathan
45a002c1e1 Fix code affected by a breaking change in msgpack 1.0
The msgpack readme explains: "Default value of strict_map_key is changed to
True to avoid hashdos. You need to pass strict_map_key=False if you have data
which contain map keys which type is not bytes or str."

chinese.py loads SIMPLIFIED_MAP from disk.  Since it is a str.translate
dictionary, its keys are numbers.  And since it's a dictionary we created
ourselves, there's no hashdos concern, so we can load it with
strict_map_key=False.
2020-02-28 13:02:45 -05:00
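
A minimal sketch of why a `str.translate` dictionary has numeric keys, which is what makes `strict_map_key=False` necessary when such a table is loaded with msgpack 1.0. The mapping below is a tiny made-up stand-in, not the real SIMPLIFIED_MAP, and `simplify` is a hypothetical helper name.

```python
# Hypothetical stand-in for a traditional-to-simplified mapping; the real
# SIMPLIFIED_MAP in chinese.py is loaded from disk and is not reproduced here.
toy_map = str.maketrans({"漢": "汉", "語": "语"})

# str.maketrans converts single-character keys to integer code points,
# so a serialized translate table has map keys that are neither str nor bytes.
assert all(isinstance(key, int) for key in toy_map)


def simplify(text: str) -> str:
    """Apply the toy mapping; characters without an entry pass through."""
    return text.translate(toy_map)
```

Because the table is created locally rather than received from an untrusted source, relaxing the key check does not open the hash-collision concern the msgpack readme describes.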
Lance Nathan
e043ebb481 Merge pull request #73 from LuminosoInsight/add-mailmap
Add a mailmap
2019-12-18 13:59:36 -05:00
Robyn Speer
feab8b77fb add a mailmap 2019-12-18 13:52:22 -05:00
Lance Nathan
5f085b2c17 Merge pull request #71 from LuminosoInsight/pytest-fixes
Fix a deprecation warning by using raw strings
2019-08-14 16:25:42 -04:00
Robyn Speer
7690bd5b49 fix a deprecation warning by using raw strings 2019-07-16 17:27:14 -04:00
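
For context, a small sketch of the general technique. The commit does not say which patterns were affected, so the pattern and function name here are hypothetical. In a normal string literal, an unrecognized escape such as `\d` triggers a deprecation warning in recent Python versions; a raw string passes the backslash through to the regex engine unchanged.

```python
import re

# Raw string: the backslash reaches the regex engine as-is, so Python never
# tries to interpret "\d" as a string escape sequence.
DIGITS = re.compile(r"\d+")


def find_numbers(text: str) -> list:
    """Return every run of digits in the text (illustrative helper)."""
    return DIGITS.findall(text)
```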
Lance Nathan
832d8f2fdd Merge pull request #70 from LuminosoInsight/pytest-fixes
Fixes to scripts that accidentally run during tests
2019-04-16 11:41:27 -04:00
Robyn Speer
3d02a88b14 Protect top_n from running on import 2019-04-16 11:33:22 -04:00
Robyn Speer
17b1537f2f ignore the 'scripts' dir when collecting tests 2019-02-20 17:21:07 -05:00
Moss Collum
90bbacb5cb Merge pull request #69 from LuminosoInsight/revert-68-pytest-jenkins
Revert "Build with Pytest on Jenkins"
2019-02-13 18:11:57 -05:00
Moss Collum
50ea040d65 Revert "Build with Pytest on Jenkins" 2019-02-13 18:11:44 -05:00
Lance Nathan
f467504835 Merge pull request #68 from LuminosoInsight/pytest-jenkins
Build with Pytest on Jenkins
2019-02-13 17:57:16 -05:00
Moss Collum
e014f1abf7 Build with Pytest on Jenkins 2019-02-13 17:56:20 -05:00
Robyn Speer
a3834180c9 update changelog for v2.2.1 2019-02-05 15:58:10 -05:00
Lance Nathan
96b9808550 Merge pull request #66 from LuminosoInsight/update-msgpack-call
Update msgpack parameter
2019-02-05 11:17:07 -05:00
Robyn Speer
dd72051929 update msgpack call in scripts/make_chinese_mapping 2019-02-05 11:16:22 -05:00
Robyn Speer
61a1604b38 update encoding='utf-8' to raw=False 2019-02-04 14:57:38 -05:00
Moss Collum
65a6a89993 Add Jenkinsfile to drive internal build scripts 2019-02-01 19:05:35 -05:00
Robyn Speer
d30183a7d7 Allow a wider range of 'regex' versions
The behavior of segmentation shouldn't change within this range, and it
includes the version currently used by SpaCy.
2018-10-25 11:07:55 -04:00
Lance Nathan
c1fe37bab5 Merge pull request #62 from LuminosoInsight/name-update
Update my name and the Zenodo citation
2018-10-03 17:30:47 -04:00
Robyn Speer
563e8f7444 Update my name and the Zenodo citation 2018-10-03 17:27:10 -04:00
Lance Nathan
2f8600e975 Merge pull request #60 from LuminosoInsight/gender-neutral-at
Recognize "@" in gender-neutral word endings as part of the token
2018-07-24 18:16:31 -04:00
Robyn Speer
287df17a71 update the changelog for version 2.2 2018-07-23 16:38:39 -04:00
Robyn Speer
f73406c69a Update README to describe @ tokenization 2018-07-23 11:21:44 -04:00
Robyn Speer
86b928f967 include data from xc rebuild 2018-07-15 01:01:35 -04:00
Robyn Speer
65692c3d81 Recognize "@" in gender-neutral word endings as part of the token 2018-07-03 13:22:56 -04:00
Robyn Speer
7bf69595bb update the CHANGELOG for MeCab fix 2018-06-26 11:31:03 -04:00
Lance Nathan
0149e9ec7f Merge pull request #59 from LuminosoInsight/korean-install-fixes
Korean install fixes
2018-06-26 11:08:06 -04:00
Lance Nathan
79caa526c3 Merge pull request #58 from LuminosoInsight/significant-figures
Round wordfreq output to 3 sig. figs, and update documentation
2018-06-25 18:53:39 -04:00
Robyn Speer
830157d8e4 Fix instructions and search path for mecab-ko-dic
I'm starting a new Python environment on a new Ubuntu installation. You
never know when a huge yak will show up and demand to be shaved.

I tried following the directions in the README, and found that a couple
of steps were missing. I've added those.

When you follow those steps, it appears to install the MeCab Korean
dictionary in `/usr/lib/x86_64-linux-gnu/mecab/dic`, which was none
of the paths we were checking, so I've added that as a search path.
2018-06-21 15:56:54 -04:00
Robyn Speer
fdf064b234 doctest the README 2018-06-18 17:11:42 -04:00
Robyn Speer
c6552f923f update README and CHANGELOG 2018-06-18 15:21:43 -04:00
Robyn Speer
7a32b56c1c Round frequencies to 3 significant digits 2018-06-18 15:21:33 -04:00
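
One common way to round to 3 significant digits is sketched below. This is an illustration under assumptions, not necessarily how the commit implements it; `round_sig` is a hypothetical name.

```python
import math


def round_sig(value: float, digits: int = 3) -> float:
    """Round value to the given number of significant digits."""
    if value == 0:
        return 0.0
    # Position of the leading digit: 0 for 1-9, -3 for 0.001-0.009, and so on.
    exponent = math.floor(math.log10(abs(value)))
    # Keep (digits - 1) places after the leading digit.
    return round(value, digits - 1 - exponent)
```

For example, `round_sig(0.00123456)` gives `0.00123`, and `round_sig(123456.0)` gives `123000.0`.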
Lance Nathan
a95b360563 Merge pull request #57 from LuminosoInsight/version2.1
Version 2.1
2018-06-18 12:06:47 -04:00
Robyn Speer
39a1308770 update table in README: Dutch has 5 sources 2018-06-18 11:43:52 -04:00
Robyn Speer
0280f82496 fix typo in previous changelog entry 2018-06-18 10:52:28 -04:00
Robyn Speer
42efcfc1ad relax the test that assumed the Chinese list has few ASCII words 2018-06-15 16:29:15 -04:00
Robyn Speer
ad0f046f47 fixes to tests, including that 'test.py' wasn't found by pytest 2018-06-15 15:48:41 -04:00
Robyn Speer
a975bcedae update tests to include new languages
Also, it's easy to say `>=` in pytest
2018-06-12 17:55:44 -04:00
Robyn Speer
4b7e3d9655 bump version to 2.1; add test requirement for pytest 2018-06-12 17:48:24 -04:00
Robyn Speer
3259c4a375 Merge remote-tracking branch 'origin/pytest' into version2.1 2018-06-12 17:46:48 -04:00
Robyn Speer
d5f7335d90 New data import from exquisite-corpus
Significant changes in this data include:

- Added ParaCrawl, a multilingual Web crawl, as a data source.
  This supplements the Leeds Web crawl with more modern data.

  ParaCrawl seems to provide a more balanced sample of Web pages than
  Common Crawl. We once considered adding Common Crawl, but found that its
  data heavily overrepresented TripAdvisor and Urban Dictionary in a way
  that was very apparent in the word frequencies.

  ParaCrawl has a fairly subtle impact on the top terms, mostly boosting
  the frequencies of numbers and months.

- Fixes to inconsistencies where words from different sources were going
  through different processing steps. As a result of these
  inconsistencies, some word lists contained words that couldn't
  actually be looked up because they would be normalized to something
  else.

  All words should now go through the aggressive normalization of
  `lossy_tokenize`.

- Fixes to inconsistencies regarding what counts as a word.
  Non-punctuation, non-emoji symbols such as `=` were slipping through
  in some cases but not others.

- As a result of the new data, Latvian becomes a supported language and
  Czech gets promoted to a 'large' language.
2018-06-12 17:22:43 -04:00
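
The lookup problem described in the commit above can be sketched as follows. Everything here is hypothetical: `toy_normalize` is a simple stand-in and not `lossy_tokenize`, and the frequencies are made-up numbers. The point is only that a key which changes under the normalizer can never be reached by a normalized lookup.

```python
import unicodedata


def toy_normalize(word: str) -> str:
    """Hypothetical stand-in for an aggressive normalizer."""
    return unicodedata.normalize("NFKC", word).casefold()


def lookup(freqs: dict, word: str) -> float:
    """Look a word up the way a reader would: normalize first."""
    return freqs.get(toy_normalize(word), 0.0)


def unreachable_keys(freqs: dict) -> list:
    """Keys that normalization changes, so no lookup can ever hit them."""
    return sorted(key for key in freqs if toy_normalize(key) != key)


# Made-up word list built from two sources that were processed differently.
inconsistent = {"paris": 1e-4, "Paris": 2e-4}
```

Here `unreachable_keys(inconsistent)` returns `["Paris"]`: the entry exists in the list, but `lookup(inconsistent, "Paris")` normalizes to `"paris"` and returns that entry instead.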