wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Robyn Speer	174ecf580a	update dependencies and test for consistent results	2020-09-08 16:03:33 -04:00
Robyn Speer	becf94f767	update version and changelog	2020-04-28 15:24:24 -04:00
Robyn Speer	59f4a08920	packaging fix: require msgpack >= 1.0	2020-04-22 11:10:03 -04:00
Robyn Speer	3aeeeb64c7	use langcodes 2.0 and deprecate 'match_cutoff'	2020-04-16 14:09:30 -04:00
Lance Nathan	86e988b838	Fix code affected by a breaking change in msgpack 1.0. The msgpack readme explains: "Default value of strict_map_key is changed to True to avoid hashdos. You need to pass strict_map_key=False if you have data which contain map keys which type is not bytes or str." chinese.py loads SIMPLIFIED_MAP from disk. Since it is a str.translate dictionary, its keys are numbers. And since it's a dictionary we created ourselves, there's no hashdos concern, so we can load it with strict_map_key=False.	2020-02-28 13:02:45 -05:00
Robyn Speer	4cd7b4bada	Allow a wider range of 'regex' versions. The behavior of segmentation shouldn't change within this range, and it includes the version currently used by SpaCy.	2018-10-25 11:07:55 -04:00
Robyn Speer	51ca052b62	Update my name and the Zenodo citation	2018-10-03 17:27:10 -04:00
Rob Speer	0644c8920a	Update README to describe @ tokenization	2018-07-23 11:21:44 -04:00
Rob Speer	93ddc192d8	bump version to 2.1; add test requirement for pytest	2018-06-12 17:48:24 -04:00
Rob Speer	aa91e1f291	Packaging updates for the new PyPI. I _almost_ got the description and long_description right for 2.0.1. I even checked it on the test server. But I didn't notice that I was handling the first line of README.md specially, and ended up setting the project description to "wordfreq is a Python library for looking up the frequencies of words in many". It'll be right in the next version.	2018-05-01 17:16:53 -04:00
Rob Speer	3ec92a8952	Handle Japanese edge cases in simple_tokenize	2018-04-26 15:53:07 -04:00
Rob Speer	a42cf312ef	setup: update version number and dependencies	2018-03-08 16:26:24 -05:00
Rob Speer	aadb19c9a3	bump version to 1.7.0, belatedly	2018-02-28 15:15:47 -05:00
Rob Speer	db56528fb6	update msgpack-python dependency to msgpack	2018-02-28 15:14:51 -05:00
Rob Speer	dcef5813b3	Tokenize by graphemes, not codepoints (#50): Tokenize by graphemes, not codepoints; Add more documentation to TOKEN_RE; Remove extra line break; Update docstring (Brahmic scripts are no longer an exception); approve using version 2017.07.28 of regex	2017-08-08 11:35:28 -04:00
Rob Speer	37b4914970	v1.6.1: depend on langcodes 1.4	2017-05-10 13:26:23 -04:00
Rob Speer	803ebc25bb	Update documentation and bump version to 1.6	2017-01-05 19:18:06 -05:00
Rob Speer	752c90c8a5	eh, this is still version 1.5.2, not 1.6	2016-12-05 18:58:33 -05:00
Rob Speer	a92c805a82	fix tokenization of words like "l'heure"	2016-12-05 18:54:51 -05:00
Rob Speer	fb5a55de7e	bump version to 1.5.1	2016-08-19 11:42:29 -04:00
Rob Speer	9758c69ff0	Add Common Crawl data and more languages (#39). This changes the version from 1.4.2 to 1.5. Things done in this update include: include Common Crawl and support 11 more languages; new frequency-merging strategy; new sources (Chinese from Wikipedia (mostly Trad.), Dutch big list); remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl, which results in dropping Ukrainian as an available language and causing Greek to not be a 'large' language after all; add Korean tokenization, and include MeCab files in data; remove marks from more languages; deal with commas and cedillas in Turkish and Romanian. Former-commit-id: `e6a8f028e3`	2016-07-28 19:23:17 -04:00
Rob Speer	a0893af82e	Tokenization in Korean, plus abjad languages (#38): Remove marks from more languages; Add Korean tokenization, and include MeCab files in data; add a Hebrew tokenization test; fix terminology in docstrings about abjad scripts; combine Japanese and Korean tokenization into the same function. Former-commit-id: `fec6eddcc3`	2016-07-15 15:10:25 -04:00
Rob Speer	ac24b8eab4	Fix tokenization of SE Asian and South Asian scripts (#37) Former-commit-id: `270f6c7ca6`	2016-07-01 18:00:57 -04:00
Rob Speer	28028115c2	bump version to 1.4 Former-commit-id: `1df97a579e`	2016-03-24 16:29:29 -04:00
slibs63	258f5088e9	Merge pull request #30 from LuminosoInsight/add-reddit: Add English data from Reddit corpus Former-commit-id: `d18fee3d78`	2016-01-14 15:52:39 -05:00
Sara Jewett	7b6f88b059	Specify encoding when dealing with files Former-commit-id: `37f9e12b93`	2015-12-23 15:49:13 -05:00
Rob Speer	9a1b00ba0c	rebuild data files Former-commit-id: `2dcf368481`	2015-11-30 17:06:39 -05:00
Rob Speer	91cc82f76d	tokenize Chinese using jieba and our own frequencies Former-commit-id: `2327f2e4d6`	2015-09-05 03:16:56 -04:00
Rob Speer	1f5c828642	bump to version 1.1 Former-commit-id: `694c28d5e4`	2015-08-25 17:44:52 -04:00
Rob Speer	f4cf46ab9c	Use the regex implementation of Unicode segmentation Former-commit-id: `95998205ad`	2015-08-24 17:11:08 -04:00
Rob Speer	4350bc3ed7	put back the freqs_to_cBpack cutoff; prepare for 1.0 Former-commit-id: `c5708b24e4`	2015-07-28 18:01:12 -04:00
Rob Speer	19e74e91c6	declare 'mecab' as an extra Former-commit-id: `a69ea5ad52`	2015-07-02 17:11:51 -04:00
Rob Speer	5d0d5f7cd2	declare that tests require mecab-python3 Former-commit-id: `7b4ebd1805`	2015-07-02 11:29:11 -04:00
Rob Speer	66ad6f882e	add Twitter-specific wordlists Former-commit-id: `7e3066d3fc`	2015-07-01 17:49:33 -04:00
Rob Speer	c9c7e49465	bump version number Former-commit-id: `053f372ebc`	2015-06-30 14:54:13 -04:00
Rob Speer	9a46b80028	clearer error on py2 Former-commit-id: `ed19d79c5a`	2015-05-28 14:05:11 -04:00
Rob Speer	51f4e4c826	add installation instructions to the readme Former-commit-id: `0f4ca80026`	2015-05-28 14:02:12 -04:00
Rob Speer	c953fc1626	update README, another setup fix Former-commit-id: `dd41e61c57`	2015-05-13 04:09:34 -04:00
Rob Speer	5cbc0d0f94	update dependencies Former-commit-id: `f13cca4d81`	2015-05-12 12:30:01 -04:00
Rob Speer	6f61cac4cb	restore missing line in setup.py Former-commit-id: `bb18f741e2`	2015-05-12 12:24:18 -04:00
Rob Speer	1c65cb9f14	add new data files from wordfreq_builder Former-commit-id: `35aec061de`	2015-05-11 18:45:47 -04:00
Rob Speer	9cd6f7c5c5	WIP: burn stuff down Former-commit-id: `9b63e54471`	2015-05-08 15:28:52 -04:00
Rob Speer	732c932ac7	v0.7: make a proper Dutch 'surfaces' list Former-commit-id: `873ace87db`	2015-04-30 13:01:24 -04:00
Rob Speer	63b465c767	Don't download the DB if the right version is already there Former-commit-id: `e931062b5a`	2013-10-31 14:12:04 -04:00
Rob Speer	8c3e8f9eb4	try being really nonspecific about functools32 versions Former-commit-id: `c1564908f2`	2013-10-31 14:06:06 -04:00
Rob Speer	676cba640f	be less specific about the functools32 version Former-commit-id: `2542cf9e35`	2013-10-31 14:02:40 -04:00
Rob Speer	40102a3f63	Normalize words when storing them or looking them up.	2013-10-30 14:59:57 -04:00
Lance	ce07c881c5	Another Py3 change, this one for functools32	2013-10-30 12:06:41 -04:00
Rob Speer	ca5b3e2f5d	Implement the data uploady downloady stuff in setup.	2013-10-29 16:44:13 -04:00
Rob Speer	bc00bb3a8b	prepare to write custom commands in setup.py	2013-10-29 12:43:41 -04:00
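
The msgpack note in commit 86e988b838 rests on the fact that a str.translate table is keyed by integer code points rather than strings. A minimal sketch of that point, using only the standard library; the one-entry mapping is a made-up stand-in for SIMPLIFIED_MAP, not wordfreq's actual data, and the names `table`, `key_types`, and `simplified` are illustrative only:

```python
# Hypothetical one-entry stand-in for the SIMPLIFIED_MAP that chinese.py loads.
table = str.maketrans({"體": "体"})

# str.maketrans converts single-character keys to integer code points,
# so every key in the resulting dictionary is an int, not a str.
key_types = {type(key) for key in table}

# Applying the table maps the traditional character to its simplified form.
simplified = "身體".translate(table)
```

Integer keys like these are what the commit message says msgpack 1.0 rejects by default, hence the strict_map_key=False it describes when loading the map.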

51 Commits