wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 09:21:37 +00:00

Author	SHA1	Message	Date
Robyn Speer	de636a804e	Use Python packages to find dictionaries for MeCab	2021-02-18 18:18:06 -05:00
Robyn Speer	ed23bf3ebe	specifically test that the long sequence underflows to 0	2021-02-18 15:09:31 -05:00
Robyn Speer	75a56b68fb	change math for INFERRED_SPACE_FACTOR to not overflow	2021-02-18 14:44:39 -05:00
Lance Nathan	7318f58df9	Merge pull request #88 from LuminosoInsight/version2.4 work with langcodes 3.0, without language_data	2021-02-09 17:36:09 -05:00
Robyn Speer	ad3a5c533f	work with langcodes 3.0, without language_data	2021-02-09 17:27:22 -05:00
Robyn Speer	53b1ee2fa0	Merge pull request #84 from LuminosoInsight/add-initial-vowels Update the "initial vowels" in French/Catalan	2021-02-03 13:47:30 -05:00
Lance Nathan	a31deec580	Update the "initial vowels" in French/Catalan User LBeaudoux observed (https://github.com/LuminosoInsight/wordfreq/pull/82) that "Œ and œ should be considered as vowels that might appear at the start of a word in French". Further investigation of the French wordfreq list revealed words in the data starting with other vowels (such as d'yvonne, d'åland, l'ïle, d'özil). This PR is a combination of LBeaudoux's PR and the latter fact. (The updated regex is also used for Catalan, but should have no actual effect. To the best of our understanding, "y" appears in Catalan only in the digraph "ny" and in foreign words--the Catalan wordlist contains "york", "by", "city", several English names, and so forth, but no real Catalan words starting with "y"; cf "ioga", "iogurt". The wordlist in fact contained "l'fbi" and "l'nba", but no cases of "l'" followed by a vowel like the ones found in French.)	2020-10-08 12:23:22 -04:00
Robyn Speer	c8229a5378	update the changelog	2020-10-01 16:12:41 -04:00
Robyn Speer	fd0ac9a272	update README examples	2020-10-01 16:05:43 -04:00
Robyn Speer	8c00a3c500	updated frequency data	2020-09-30 17:56:12 -04:00
Lance Nathan	ca4681b361	Merge pull request #77 from LuminosoInsight/regex-apostrophe-fix Fix regex's inconsistent word breaking around apostrophes	2020-04-28 16:19:40 -04:00
Robyn Speer	0ff812a711	update version and changelog	2020-04-28 15:24:24 -04:00
Robyn Speer	13ce4606b2	fix regex's inconsistent word breaking around apostrophes	2020-04-28 15:19:56 -04:00
Robyn Speer	86ae2a610f	update CHANGELOG for 2.3.1	2020-04-22 11:12:02 -04:00
Robyn Speer	26b4175f3b	packaging fix: require msgpack >= 1.0	2020-04-22 11:10:03 -04:00
Lance Nathan	7c537134ae	Merge pull request #75 from LuminosoInsight/language-match-update use langcodes 2.0 and deprecate 'match_cutoff'	2020-04-20 14:48:58 -04:00
Robyn Speer	d45bcf97de	update changelog for 2.3	2020-04-16 15:51:20 -04:00
Robyn Speer	bf795e6d6c	use langcodes 2.0 and deprecate 'match_cutoff'	2020-04-16 14:09:30 -04:00
Moss Collum	40443c9a3b	Merge pull request #74 from LuminosoInsight/msgpack-1.0-bugfix Fix code affected by a breaking change in msgpack 1.0	2020-02-28 13:05:37 -05:00
Lance Nathan	45a002c1e1	Fix code affected by a breaking change in msgpack 1.0 The msgpack readme explains: "Default value of strict_map_key is changed to True to avoid hashdos. You need to pass strict_map_key=False if you have data which contain map keys which type is not bytes or str." chinese.py loads SIMPLIFIED_MAP from disk. Since it is a str.translate dictionary, its keys are numbers. And since it's a dictionary we created ourselves, there's no hashdos concern, so we can load it with strict_map_key=False.	2020-02-28 13:02:45 -05:00
Lance Nathan	e043ebb481	Merge pull request #73 from LuminosoInsight/add-mailmap Add a mailmap	2019-12-18 13:59:36 -05:00
Robyn Speer	feab8b77fb	add a mailmap	2019-12-18 13:52:22 -05:00
Lance Nathan	5f085b2c17	Merge pull request #71 from LuminosoInsight/pytest-fixes Fix a deprecation warning by using raw strings	2019-08-14 16:25:42 -04:00
Robyn Speer	7690bd5b49	fix a deprecation warning by using raw strings	2019-07-16 17:27:14 -04:00
Lance Nathan	832d8f2fdd	Merge pull request #70 from LuminosoInsight/pytest-fixes Fixes to scripts that accidentally run during tests	2019-04-16 11:41:27 -04:00
Robyn Speer	3d02a88b14	Protect top_n from running on import	2019-04-16 11:33:22 -04:00
Robyn Speer	17b1537f2f	ignore the 'scripts' dir when collecting tests	2019-02-20 17:21:07 -05:00
Moss Collum	90bbacb5cb	Merge pull request #69 from LuminosoInsight/revert-68-pytest-jenkins Revert "Build with Pytest on Jenkins"	2019-02-13 18:11:57 -05:00
Moss Collum	50ea040d65	Revert "Build with Pytest on Jenkins"	2019-02-13 18:11:44 -05:00
Lance Nathan	f467504835	Merge pull request #68 from LuminosoInsight/pytest-jenkins Build with Pytest on Jenkins	2019-02-13 17:57:16 -05:00
Moss Collum	e014f1abf7	Build with Pytest on Jenkins	2019-02-13 17:56:20 -05:00
Robyn Speer	a3834180c9	update changelog for v2.2.1	2019-02-05 15:58:10 -05:00
Lance Nathan	96b9808550	Merge pull request #66 from LuminosoInsight/update-msgpack-call Update msgpack parameter	2019-02-05 11:17:07 -05:00
Robyn Speer	dd72051929	update msgpack call in scripts/make_chinese_mapping	2019-02-05 11:16:22 -05:00
Robyn Speer	61a1604b38	update encoding='utf-8' to raw=False	2019-02-04 14:57:38 -05:00
Moss Collum	65a6a89993	Add Jenkinsfile to drive internal build scripts	2019-02-01 19:05:35 -05:00
Robyn Speer	d30183a7d7	Allow a wider range of 'regex' versions The behavior of segmentation shouldn't change within this range, and it includes the version currently used by SpaCy.	2018-10-25 11:07:55 -04:00
Lance Nathan	c1fe37bab5	Merge pull request #62 from LuminosoInsight/name-update Update my name and the Zenodo citation	2018-10-03 17:30:47 -04:00
Robyn Speer	563e8f7444	Update my name and the Zenodo citation	2018-10-03 17:27:10 -04:00
Lance Nathan	2f8600e975	Merge pull request #60 from LuminosoInsight/gender-neutral-at Recognize "@" in gender-neutral word endings as part of the token	2018-07-24 18:16:31 -04:00
Robyn Speer	287df17a71	update the changelog for version 2.2	2018-07-23 16:38:39 -04:00
Robyn Speer	f73406c69a	Update README to describe @ tokenization	2018-07-23 11:21:44 -04:00
Robyn Speer	86b928f967	include data from xc rebuild	2018-07-15 01:01:35 -04:00
Robyn Speer	65692c3d81	Recognize "@" in gender-neutral word endings as part of the token	2018-07-03 13:22:56 -04:00
Robyn Speer	7bf69595bb	update the CHANGELOG for MeCab fix	2018-06-26 11:31:03 -04:00
Lance Nathan	0149e9ec7f	Merge pull request #59 from LuminosoInsight/korean-install-fixes Korean install fixes	2018-06-26 11:08:06 -04:00
Lance Nathan	79caa526c3	Merge pull request #58 from LuminosoInsight/significant-figures Round wordfreq output to 3 sig. figs, and update documentation	2018-06-25 18:53:39 -04:00
Robyn Speer	830157d8e4	Fix instructions and search path for mecab-ko-dic I'm starting a new Python environment on a new Ubuntu installation. You never know when a huge yak will show up and demand to be shaved. I tried following the directions in the README, and found that a couple of steps were missing. I've added those. When you follow those steps, it appears to install the MeCab Korean dictionary in `/usr/lib/x86_64-linux-gnu/mecab/dic`, which was none of the paths we were checking, so I've added that as a search path.	2018-06-21 15:56:54 -04:00
Robyn Speer	fdf064b234	doctest the README	2018-06-18 17:11:42 -04:00
Robyn Speer	c6552f923f	update README and CHANGELOG	2018-06-18 15:21:43 -04:00

598 Commits