Commit Graph

57 Commits

Author SHA1 Message Date
Robyn Speer
00e60df106 Merge branch 'master' into data-update-2.5 2021-03-29 16:42:24 -04:00
Robyn Speer
fc5c4cdda8 small documentation fixes 2021-03-29 16:41:47 -04:00
Robyn Speer
ec48c0a123 update data and tests for 2.5 2021-03-29 16:18:08 -04:00
Robyn Speer
168bb2a6ed fix version, update instructions and changelog 2021-02-18 18:25:16 -05:00
Robyn Speer
de636a804e Use Python packages to find dictionaries for MeCab 2021-02-18 18:18:06 -05:00
Robyn Speer
ad3a5c533f work with langcodes 3.0, without language_data 2021-02-09 17:27:22 -05:00
Robyn Speer
c8229a5378 update the changelog 2020-10-01 16:12:41 -04:00
Robyn Speer
0ff812a711 update version and changelog 2020-04-28 15:24:24 -04:00
Robyn Speer
26b4175f3b packaging fix: require msgpack >= 1.0 2020-04-22 11:10:03 -04:00
Robyn Speer
bf795e6d6c use langcodes 2.0 and deprecate 'match_cutoff' 2020-04-16 14:09:30 -04:00
Lance Nathan
45a002c1e1 Fix code affected by a breaking change in msgpack 1.0
The msgpack readme explains: "Default value of strict_map_key is changed to
True to avoid hashdos. You need to pass strict_map_key=False if you have data
which contain map keys which type is not bytes or str."

chinese.py loads SIMPLIFIED_MAP from disk.  Since it is a str.translate
dictionary, its keys are integers (Unicode codepoints).  And since it's a
dictionary we created ourselves, there's no hashdos concern, so we can load
it with strict_map_key=False.
2020-02-28 13:02:45 -05:00
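
(For illustration: a minimal sketch of the fix this commit describes. The helper name and file path are hypothetical; only the strict_map_key=False argument comes from the commit message itself.)

```python
import msgpack

# Hypothetical loader. SIMPLIFIED_MAP is a str.translate table, so its
# keys are integer codepoints; msgpack >= 1.0 rejects non-str/bytes map
# keys unless strict_map_key=False. The file is built by this project
# itself, so disabling the hashdos guard is safe here.
def load_simplified_map(path):
    with open(path, 'rb') as f:
        return msgpack.unpackb(f.read(), strict_map_key=False)
```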
Robyn Speer
d30183a7d7 Allow a wider range of 'regex' versions
The behavior of segmentation shouldn't change within this range, and it
includes the version currently used by spaCy.
2018-10-25 11:07:55 -04:00
Robyn Speer
563e8f7444 Update my name and the Zenodo citation 2018-10-03 17:27:10 -04:00
Robyn Speer
f73406c69a Update README to describe @ tokenization 2018-07-23 11:21:44 -04:00
Robyn Speer
4b7e3d9655 bump version to 2.1; add test requirement for pytest 2018-06-12 17:48:24 -04:00
Robyn Speer
8907423147 Packaging updates for the new PyPI
I _almost_ got the description and long_description right for 2.0.1. I
even checked it on the test server. But I didn't notice that I was
handling the first line of README.md specially, and ended up setting the
project description to "wordfreq is a Python library for looking up the
frequencies of words in many".

It'll be right in the next version.
2018-05-01 17:16:53 -04:00
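
(For illustration: a sketch of the packaging pattern that avoids the truncation described above. The names and version number are hypothetical, not the project's actual setup.py.)

```python
from setuptools import setup

# Hypothetical: pass the whole README as long_description and keep a
# hand-written one-line summary as description, instead of slicing the
# first line out of README.md.
with open('README.md', encoding='utf-8') as f:
    readme = f.read()

setup(
    name='wordfreq',
    version='2.0.2',  # hypothetical next version
    description='A library for looking up the frequencies of words '
                'in many languages.',
    long_description=readme,
    long_description_content_type='text/markdown',
)
```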
Robyn Speer
666f7e51fa Handle Japanese edge cases in simple_tokenize 2018-04-26 15:53:07 -04:00
Robyn Speer
a4d9614e39 setup: update version number and dependencies 2018-03-08 16:26:24 -05:00
Robyn Speer
208559ae1e bump version to 1.7.0, belatedly 2018-02-28 15:15:47 -05:00
Robyn Speer
98cb47c774 update msgpack-python dependency to msgpack 2018-02-28 15:14:51 -05:00
Robyn Speer
9dac967ca3 Tokenize by graphemes, not codepoints (#50)
* Tokenize by graphemes, not codepoints

* Add more documentation to TOKEN_RE

* Remove extra line break

* Update docstring - Brahmic scripts are no longer an exception

* approve using version 2017.07.28 of regex
2017-08-08 11:35:28 -04:00
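
(For illustration: a standalone example of the grapheme-vs-codepoint distinction this PR addresses, using the regex module's \X. This is not the actual TOKEN_RE.)

```python
import re
import regex
import unicodedata

# 'ngá' in NFD form stores the accented vowel as a base letter plus a
# combining mark: four codepoints, but three user-perceived characters.
word = unicodedata.normalize('NFD', 'ngá')
print(re.findall(r'.', word))      # 4 matches: the combining mark splits off
print(regex.findall(r'\X', word))  # 3 matches: \X spans grapheme clusters
```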
Robyn Speer
aa3ed23282 v1.6.1: depend on langcodes 1.4 2017-05-10 13:26:23 -04:00
Robyn Speer
39e459ac71 Update documentation and bump version to 1.6 2017-01-05 19:18:06 -05:00
Robyn Speer
82eba05f2d eh, this is still version 1.5.2, not 1.6 2016-12-05 18:58:33 -05:00
Robyn Speer
596368ac6e fix tokenization of words like "l'heure" 2016-12-05 18:54:51 -05:00
Robyn Speer
aa880bcd84 bump version to 1.5.1 2016-08-19 11:42:29 -04:00
Robyn Speer
2a41d4dc5e Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5.  Things done in this update include:

* include Common Crawl; support 11 more languages

* new frequency-merging strategy

* New sources: Chinese from Wikipedia (mostly Traditional characters), and a big Dutch wordlist

* Remove some low-quality sources, namely Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is dropped as an available language, and Greek no longer qualifies as a 'large' language.

* Add Korean tokenization, and include MeCab files in data

* Remove marks from more languages

* Deal with commas and cedillas in Turkish and Romanian
2016-07-28 19:23:17 -04:00
Robyn Speer
0a2bfb2710 Tokenization in Korean, plus abjad languages (#38)
* Remove marks from more languages

* Add Korean tokenization, and include MeCab files in data

* add a Hebrew tokenization test

* fix terminology in docstrings about abjad scripts

* combine Japanese and Korean tokenization into the same function
2016-07-15 15:10:25 -04:00
Robyn Speer
3155cf27e6 Fix tokenization of SE Asian and South Asian scripts (#37)
2016-07-01 18:00:57 -04:00
Robyn Speer
460fbb84fd bump version to 1.4
2016-03-24 16:29:29 -04:00
slibs63
927d4f45a4 Merge pull request #30 from LuminosoInsight/add-reddit
Add English data from Reddit corpus
2016-01-14 15:52:39 -05:00
Sara Jewett
42d209cbe2 Specify encoding when dealing with files
2015-12-23 15:49:13 -05:00
Robyn Speer
23949a4512 rebuild data files
2015-11-30 17:06:39 -05:00
Robyn Speer
a4554fb87c tokenize Chinese using jieba and our own frequencies
2015-09-05 03:16:56 -04:00
Robyn Speer
6f10e71d29 bump to version 1.1
2015-08-25 17:44:52 -04:00
Robyn Speer
8795525372 Use the regex implementation of Unicode segmentation
2015-08-24 17:11:08 -04:00
Robyn Speer
3ff0f30218 put back the freqs_to_cBpack cutoff; prepare for 1.0
2015-07-28 18:01:12 -04:00
Robyn Speer
090cfa7088 declare 'mecab' as an extra
2015-07-02 17:11:51 -04:00
Robyn Speer
83939020d0 declare that tests require mecab-python3
2015-07-02 11:29:11 -04:00
Robyn Speer
215eafc50b add Twitter-specific wordlists
2015-07-01 17:49:33 -04:00
Robyn Speer
4c2b766f46 bump version number
2015-06-30 14:54:13 -04:00
Robyn Speer
2dc3d82a98 clearer error on py2
2015-05-28 14:05:11 -04:00
Robyn Speer
a3cc8d403c add installation instructions to the readme
2015-05-28 14:02:12 -04:00
Robyn Speer
7c6cf84749 update README, another setup fix
2015-05-13 04:09:34 -04:00
Robyn Speer
c1edefa419 update dependencies
2015-05-12 12:30:01 -04:00
Robyn Speer
fd4df8d1eb restore missing line in setup.py
2015-05-12 12:24:18 -04:00
Robyn Speer
aa0e844b81 add new data files from wordfreq_builder
2015-05-11 18:45:47 -04:00
Robyn Speer
f92598b13d WIP: burn stuff down
2015-05-08 15:28:52 -04:00
Robyn Speer
cb6b2a8002 v0.7: make a proper Dutch 'surfaces' list
2015-04-30 13:01:24 -04:00
Robyn Speer
351378e318 Don't download the DB if the right version is already there
2013-10-31 14:12:04 -04:00