wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 09:21:37 +00:00

Author	SHA1	Message	Date
Lance Nathan	4c0b29f460	Merge pull request #89 from LuminosoInsight/dependencies-and-tokens Rework CJK dependencies and fix a tokenization bug	2021-02-23 15:15:17 -05:00
Robyn Speer	d99ac1051a	fix version, update instructions and changelog	2021-02-18 18:25:16 -05:00
Robyn Speer	2cc58d68ad	Use Python packages to find dictionaries for MeCab	2021-02-18 18:18:06 -05:00
Robyn Speer	6b97d093b6	specifically test that the long sequence underflows to 0	2021-02-18 15:09:31 -05:00
Robyn Speer	bd57b64d00	change math for INFERRED_SPACE_FACTOR to not overflow	2021-02-18 14:44:39 -05:00
Lance Nathan	02c3cbe3fb	Merge pull request #88 from LuminosoInsight/version2.4 work with langcodes 3.0, without language_data	2021-02-09 17:36:09 -05:00
Robyn Speer	f71acec2d7	work with langcodes 3.0, without language_data	2021-02-09 17:27:22 -05:00
Robyn Speer	7a742499a4	Merge pull request #84 from LuminosoInsight/add-initial-vowels Update the "initial vowels" in French/Catalan	2021-02-03 13:47:30 -05:00
Lance Nathan	917bcdebaa	Update the "initial vowels" in French/Catalan User LBeaudoux observed (https://github.com/LuminosoInsight/wordfreq/pull/82) that "Œ and œ should be considered as vowels that might appear at the start of a word in French". Further investigation of the French wordfreq list revealed words in the data starting with other vowels (such as d'yvonne, d'åland, l'ïle, d'özil). This PR is a combination of LBeaudoux's PR and the latter fact. (The updated regex is also used for Catalan, but should have no actual effect. To the best of our understanding, "y" appears in Catalan only in the digraph "ny" and in foreign words--the Catalan wordlist contains "york", "by", "city", several English names, and so forth, but no real Catalan words starting with "y"; cf "ioga", "iogurt". The wordlist in fact contained "l'fbi" and "l'nba", but no cases of "l'" followed by a vowel like the ones found in French.)	2020-10-08 12:23:22 -04:00
Robyn Speer	a8915d67f7	update the changelog	2020-10-01 16:12:41 -04:00
Robyn Speer	5986342bc6	update README examples	2020-10-01 16:05:43 -04:00
Robyn Speer	fa98f0b2f6	updated frequency data	2020-09-30 17:56:12 -04:00
Lance Nathan	e3f87d4aed	Merge pull request #77 from LuminosoInsight/regex-apostrophe-fix Fix regex's inconsistent word breaking around apostrophes	2020-04-28 16:19:40 -04:00
Robyn Speer	becf94f767	update version and changelog	2020-04-28 15:24:24 -04:00
Robyn Speer	96e7792a4a	fix regex's inconsistent word breaking around apostrophes	2020-04-28 15:19:56 -04:00
Robyn Speer	3b7382d770	update CHANGELOG for 2.3.1	2020-04-22 11:12:02 -04:00
Robyn Speer	59f4a08920	packaging fix: require msgpack >= 1.0	2020-04-22 11:10:03 -04:00
Lance Nathan	af22c03609	Merge pull request #75 from LuminosoInsight/language-match-update use langcodes 2.0 and deprecate 'match_cutoff'	2020-04-20 14:48:58 -04:00
Robyn Speer	258670b823	update changelog for 2.3	2020-04-16 15:51:20 -04:00
Robyn Speer	3aeeeb64c7	use langcodes 2.0 and deprecate 'match_cutoff'	2020-04-16 14:09:30 -04:00
Moss Collum	33bfb1409d	Merge pull request #74 from LuminosoInsight/msgpack-1.0-bugfix Fix code affected by a breaking change in msgpack 1.0	2020-02-28 13:05:37 -05:00
Lance Nathan	86e988b838	Fix code affected by a breaking change in msgpack 1.0 The msgpack readme explains: "Default value of strict_map_key is changed to True to avoid hashdos. You need to pass strict_map_key=False if you have data which contain map keys which type is not bytes or str." chinese.py loads SIMPLIFIED_MAP from disk. Since it is a str.translate dictionary, its keys are numbers. And since it's a dictionary we created ourselves, there's no hashdos concern, so we can load it with strict_map_key=False.	2020-02-28 13:02:45 -05:00
Lance Nathan	401889d7c8	Merge pull request #73 from LuminosoInsight/add-mailmap Add a mailmap	2019-12-18 13:59:36 -05:00
Robyn Speer	f91cdb3e9b	add a mailmap	2019-12-18 13:52:22 -05:00
Lance Nathan	cea8dcbea9	Merge pull request #71 from LuminosoInsight/pytest-fixes Fix a deprecation warning by using raw strings	2019-08-14 16:25:42 -04:00
Robyn Speer	55e72977a7	fix a deprecation warning by using raw strings	2019-07-16 17:27:14 -04:00
Lance Nathan	170e3c6536	Merge pull request #70 from LuminosoInsight/pytest-fixes Fixes to scripts that accidentally run during tests	2019-04-16 11:41:27 -04:00
Robyn Speer	1f61c9b27a	Protect top_n from running on import	2019-04-16 11:33:22 -04:00
Robyn Speer	bb1bd50c44	ignore the 'scripts' dir when collecting tests	2019-02-20 17:21:07 -05:00
Moss Collum	a17587dcbb	Merge pull request #69 from LuminosoInsight/revert-68-pytest-jenkins Revert "Build with Pytest on Jenkins"	2019-02-13 18:11:57 -05:00
Moss Collum	26cbb5a7c8	Revert "Build with Pytest on Jenkins"	2019-02-13 18:11:44 -05:00
Lance Nathan	53ec5d87d2	Merge pull request #68 from LuminosoInsight/pytest-jenkins Build with Pytest on Jenkins	2019-02-13 17:57:16 -05:00
Moss Collum	92c3ca0a66	Build with Pytest on Jenkins	2019-02-13 17:56:20 -05:00
Robyn Speer	0931f1297d	update changelog for v2.2.1	2019-02-05 15:58:10 -05:00
Lance Nathan	1442ee044d	Merge pull request #66 from LuminosoInsight/update-msgpack-call Update msgpack parameter	2019-02-05 11:17:07 -05:00
Robyn Speer	36fd42ca08	update msgpack call in scripts/make_chinese_mapping	2019-02-05 11:16:22 -05:00
Robyn Speer	c7a14cd4ab	update encoding='utf-8' to raw=False	2019-02-04 14:57:38 -05:00
Moss Collum	0b69118558	Add Jenkinsfile to drive internal build scripts	2019-02-01 19:05:35 -05:00
Robyn Speer	4cd7b4bada	Allow a wider range of 'regex' versions The behavior of segmentation shouldn't change within this range, and it includes the version currently used by SpaCy.	2018-10-25 11:07:55 -04:00
Lance Nathan	fa8be1962b	Merge pull request #62 from LuminosoInsight/name-update Update my name and the Zenodo citation	2018-10-03 17:30:47 -04:00
Robyn Speer	51ca052b62	Update my name and the Zenodo citation	2018-10-03 17:27:10 -04:00
Lance Nathan	bc12599010	Merge pull request #60 from LuminosoInsight/gender-neutral-at Recognize "@" in gender-neutral word endings as part of the token	2018-07-24 18:16:31 -04:00
Rob Speer	d9fc6ec42c	update the changelog for version 2.2	2018-07-23 16:38:39 -04:00
Rob Speer	0644c8920a	Update README to describe @ tokenization	2018-07-23 11:21:44 -04:00
Rob Speer	d06a6a48c5	include data from xc rebuild	2018-07-15 01:01:35 -04:00
Rob Speer	b2d242e8bf	Recognize "@" in gender-neutral word endings as part of the token	2018-07-03 13:22:56 -04:00
Rob Speer	ca9cf7d90f	update the CHANGELOG for MeCab fix	2018-06-26 11:31:03 -04:00
Lance Nathan	3961a28973	Merge pull request #59 from LuminosoInsight/korean-install-fixes Korean install fixes	2018-06-26 11:08:06 -04:00
Lance Nathan	a619ba6457	Merge pull request #58 from LuminosoInsight/significant-figures Round wordfreq output to 3 sig. figs, and update documentation	2018-06-25 18:53:39 -04:00
Rob Speer	676686fda1	Fix instructions and search path for mecab-ko-dic I'm starting a new Python environment on a new Ubuntu installation. You never know when a huge yak will show up and demand to be shaved. I tried following the directions in the README, and found that a couple of steps were missing. I've added those. When you follow those steps, it appears to install the MeCab Korean dictionary in `/usr/lib/x86_64-linux-gnu/mecab/dic`, which was none of the paths we were checking, so I've added that as a search path.	2018-06-21 15:56:54 -04:00
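
Commit 86e988b838 above notes that SIMPLIFIED_MAP is a str.translate dictionary and that its keys are therefore numbers. A minimal sketch of why that holds, using only the standard library; the two-character mapping here is a hypothetical stand-in, not the data wordfreq ships:

```python
# Hypothetical two-entry table for illustration only; the real SIMPLIFIED_MAP
# is loaded from disk by chinese.py, per the commit message.
simplified_map = str.maketrans({"漢": "汉", "語": "语"})

# str.translate tables are keyed by Unicode code points (ints), not by strings.
key_types = {type(key) for key in simplified_map}

# Applying the table replaces each mapped character.
converted = "漢語".translate(simplified_map)

print(key_types)   # the only key type present is int
print(converted)
```

Integer keys like these are what, per the msgpack readme quoted in that commit, require strict_map_key=False when such a map is loaded.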

600 Commits