Commit Graph

62 Commits

Author SHA1 Message Date
Elia Robyn Speer
11a3138cea fix merge conflict markers in setup 2021-09-02 21:49:49 +00:00
Elia Robyn Speer
cc4f39d8c2 Merge remote-tracking branch 'origin/apostrophe-consistency' 2021-09-02 18:13:53 +00:00
Elia Robyn Speer
dc9585766a use ftfy's uncurl_quotes in lossy_tokenize 2021-09-02 17:47:47 +00:00
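For context: ftfy's uncurl_quotes replaces curly quotation marks with straight ones, so differently typed apostrophes map to the same token text. A minimal sketch of the effect, separate from wordfreq's lossy_tokenize itself:

    from ftfy.fixes import uncurl_quotes

    # Curly and straight apostrophes normalize to the same text, so
    # "l’heure" and "l'heure" are counted as the same token.
    print(uncurl_quotes("l’heure"))  # => "l'heure"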
Robyn Speer
af847699f6 update email address 2021-08-23 17:46:34 -04:00
Robyn Speer
ec2e148f8e Merge branch 'master' into data-update-2.5 2021-03-29 16:42:24 -04:00
Robyn Speer
4263f1af14 small documentation fixes 2021-03-29 16:41:47 -04:00
Robyn Speer
d1949a486a update data and tests for 2.5 2021-03-29 16:18:08 -04:00
Robyn Speer
d99ac1051a fix version, update instructions and changelog 2021-02-18 18:25:16 -05:00
Robyn Speer
2cc58d68ad Use Python packages to find dictionaries for MeCab 2021-02-18 18:18:06 -05:00
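The approach, sketched here with the pip-installable ipadic dictionary package (the exact packages wordfreq resolves are not shown in this commit):

    import MeCab
    import ipadic  # a MeCab dictionary distributed as a Python package

    # Rather than searching system paths for a dictionary, let the
    # installed package tell MeCab where its data lives.
    tagger = MeCab.Tagger(ipadic.MECAB_ARGS)
    print(tagger.parse("形態素解析"))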
Robyn Speer
f71acec2d7 work with langcodes 3.0, without language_data 2021-02-09 17:27:22 -05:00
Robyn Speer
a8915d67f7 update the changelog 2020-10-01 16:12:41 -04:00
Robyn Speer
174ecf580a update dependencies and test for consistent results 2020-09-08 16:03:33 -04:00
Robyn Speer
becf94f767 update version and changelog 2020-04-28 15:24:24 -04:00
Robyn Speer
59f4a08920 packaging fix: require msgpack >= 1.0 2020-04-22 11:10:03 -04:00
Robyn Speer
3aeeeb64c7 use langcodes 2.0 and deprecate 'match_cutoff' 2020-04-16 14:09:30 -04:00
Lance Nathan
86e988b838 Fix code affected by a breaking change in msgpack 1.0
The msgpack readme explains: "Default value of strict_map_key is changed to
True to avoid hashdos. You need to pass strict_map_key=False if you have data
which contain map keys which type is not bytes or str."

chinese.py loads SIMPLIFIED_MAP from disk. Since it is a str.translate
dictionary, its keys are integers (codepoints). And since it's a dictionary
we created ourselves, there's no hashdos concern, so we can load it with
strict_map_key=False.
2020-02-28 13:02:45 -05:00
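A minimal sketch of such a load, using a hypothetical file path rather than wordfreq's actual loader:

    import msgpack

    # The map's keys are integer codepoints (it is fed to str.translate),
    # which msgpack >= 1.0 rejects by default; the file is one we built
    # ourselves, so relaxing strict_map_key is safe here.
    with open("simplified_map.msgpack", "rb") as f:  # hypothetical path
        SIMPLIFIED_MAP = msgpack.load(f, strict_map_key=False)

    print("漢語".translate(SIMPLIFIED_MAP))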
Robyn Speer
4cd7b4bada Allow a wider range of 'regex' versions
The behavior of segmentation shouldn't change within this range, and it
includes the version currently used by spaCy.
2018-10-25 11:07:55 -04:00
Robyn Speer
51ca052b62 Update my name and the Zenodo citation 2018-10-03 17:27:10 -04:00
Rob Speer
0644c8920a Update README to describe @ tokenization 2018-07-23 11:21:44 -04:00
Rob Speer
93ddc192d8 bump version to 2.1; add test requirement for pytest 2018-06-12 17:48:24 -04:00
Rob Speer
aa91e1f291 Packaging updates for the new PyPI
I _almost_ got the description and long_description right for 2.0.1. I
even checked it on the test server. But I didn't notice that I was
handling the first line of README.md specially, and ended up setting the
project description to "wordfreq is a Python library for looking up the
frequencies of words in many".

It'll be right in the next version.
2018-05-01 17:16:53 -04:00
Rob Speer
3ec92a8952 Handle Japanese edge cases in simple_tokenize 2018-04-26 15:53:07 -04:00
Rob Speer
a42cf312ef setup: update version number and dependencies 2018-03-08 16:26:24 -05:00
Rob Speer
aadb19c9a3 bump version to 1.7.0, belatedly 2018-02-28 15:15:47 -05:00
Rob Speer
db56528fb6 update msgpack-python dependency to msgpack 2018-02-28 15:14:51 -05:00
Rob Speer
dcef5813b3 Tokenize by graphemes, not codepoints (#50)
* Tokenize by graphemes, not codepoints

* Add more documentation to TOKEN_RE

* Remove extra line break

* Update docstring - Brahmic scripts are no longer an exception

* approve using version 2017.07.28 of regex
2017-08-08 11:35:28 -04:00
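The key tool is the regex module's \X pattern, which matches one extended grapheme cluster; a standalone illustration, much simpler than wordfreq's actual TOKEN_RE:

    import regex  # the third-party 'regex' module, not the stdlib 're'

    # \X matches an extended grapheme cluster, so a base character and its
    # combining marks stay together instead of being split into codepoints.
    word = "cafe\u0301"  # 'café' with a combining acute accent: 5 codepoints
    print(regex.findall(r"\X", word))  # ['c', 'a', 'f', 'é'] (4 graphemes)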
Rob Speer
37b4914970 v1.6.1: depend on langcodes 1.4 2017-05-10 13:26:23 -04:00
Rob Speer
803ebc25bb Update documentation and bump version to 1.6 2017-01-05 19:18:06 -05:00
Rob Speer
752c90c8a5 eh, this is still version 1.5.2, not 1.6 2016-12-05 18:58:33 -05:00
Rob Speer
a92c805a82 fix tokenization of words like "l'heure" 2016-12-05 18:54:51 -05:00
Rob Speer
fb5a55de7e bump version to 1.5.1 2016-08-19 11:42:29 -04:00
Rob Speer
9758c69ff0 Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5.  Things done in this update include:

* include Common Crawl; support 11 more languages

* new frequency-merging strategy

* New sources: Chinese from Wikipedia (mostly Traditional), and a large Dutch wordlist

* Remove lower-quality sources, namely Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is dropped as an available language, and Greek turns out not to be a 'large' language after all.

* Add Korean tokenization, and include MeCab files in data

* Remove marks from more languages

* Deal with commas and cedillas in Turkish and Romanian
2016-07-28 19:23:17 -04:00
Rob Speer
a0893af82e Tokenization in Korean, plus abjad languages (#38)
* Remove marks from more languages

* Add Korean tokenization, and include MeCab files in data

* add a Hebrew tokenization test

* fix terminology in docstrings about abjad scripts

* combine Japanese and Korean tokenization into the same function
2016-07-15 15:10:25 -04:00
Rob Speer
ac24b8eab4 Fix tokenization of SE Asian and South Asian scripts (#37)
2016-07-01 18:00:57 -04:00
Rob Speer
28028115c2 bump version to 1.4
2016-03-24 16:29:29 -04:00
slibs63
258f5088e9 Merge pull request #30 from LuminosoInsight/add-reddit
Add English data from Reddit corpus
2016-01-14 15:52:39 -05:00
Sara Jewett
7b6f88b059 Specify encoding when dealing with files
2015-12-23 15:49:13 -05:00
Rob Speer
9a1b00ba0c rebuild data files
2015-11-30 17:06:39 -05:00
Rob Speer
91cc82f76d tokenize Chinese using jieba and our own frequencies
2015-09-05 03:16:56 -04:00
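Roughly the idea: jieba accepts a replacement dictionary of "word frequency" lines, so wordfreq-derived frequencies can drive the segmentation. The path below is hypothetical:

    import jieba

    # Point jieba at a custom frequency dictionary instead of its default.
    jieba.set_dictionary("wordfreq_chinese_dict.txt")  # hypothetical path
    print(jieba.lcut("今天天气很好"))  # e.g. ['今天', '天气', '很', '好']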
Rob Speer
1f5c828642 bump to version 1.1
2015-08-25 17:44:52 -04:00
Rob Speer
f4cf46ab9c Use the regex implementation of Unicode segmentation
2015-08-24 17:11:08 -04:00
Rob Speer
4350bc3ed7 put back the freqs_to_cBpack cutoff; prepare for 1.0
2015-07-28 18:01:12 -04:00
Rob Speer
19e74e91c6 declare 'mecab' as an extra
2015-07-02 17:11:51 -04:00
Rob Speer
5d0d5f7cd2 declare that tests require mecab-python3
2015-07-02 11:29:11 -04:00
Rob Speer
66ad6f882e add Twitter-specific wordlists
2015-07-01 17:49:33 -04:00
Rob Speer
c9c7e49465 bump version number
2015-06-30 14:54:13 -04:00
Rob Speer
9a46b80028 clearer error on py2
2015-05-28 14:05:11 -04:00
Rob Speer
51f4e4c826 add installation instructions to the readme
2015-05-28 14:02:12 -04:00
Rob Speer
c953fc1626 update README, another setup fix
2015-05-13 04:09:34 -04:00
Rob Speer
5cbc0d0f94 update dependencies
2015-05-12 12:30:01 -04:00