wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-25 02:05:24 +00:00

Author	SHA1	Message	Date
Rob Speer	752c90c8a5	eh, this is still version 1.5.2, not 1.6	2016-12-05 18:58:33 -05:00
Rob Speer	a92c805a82	fix tokenization of words like "l'heure"	2016-12-05 18:54:51 -05:00
Rob Speer	fb5a55de7e	bump version to 1.5.1	2016-08-19 11:42:29 -04:00
Rob Speer	9758c69ff0	Add Common Crawl data and more languages (#39) This changes the version from 1.4.2 to 1.5. Things done in this update include: * include Common Crawl; support 11 more languages * new frequency-merging strategy * New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list * Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all. * Add Korean tokenization, and include MeCab files in data * Remove marks from more languages * Deal with commas and cedillas in Turkish and Romanian Former-commit-id: `e6a8f028e3`	2016-07-28 19:23:17 -04:00
Rob Speer	a0893af82e	Tokenization in Korean, plus abjad languages (#38) * Remove marks from more languages * Add Korean tokenization, and include MeCab files in data * add a Hebrew tokenization test * fix terminology in docstrings about abjad scripts * combine Japanese and Korean tokenization into the same function Former-commit-id: `fec6eddcc3`	2016-07-15 15:10:25 -04:00
Rob Speer	ac24b8eab4	Fix tokenization of SE Asian and South Asian scripts (#37) Former-commit-id: `270f6c7ca6`	2016-07-01 18:00:57 -04:00
Rob Speer	28028115c2	bump version to 1.4 Former-commit-id: `1df97a579e`	2016-03-24 16:29:29 -04:00
slibs63	258f5088e9	Merge pull request #30 from LuminosoInsight/add-reddit Add English data from Reddit corpus Former-commit-id: `d18fee3d78`	2016-01-14 15:52:39 -05:00
Sara Jewett	7b6f88b059	Specify encoding when dealing with files Former-commit-id: `37f9e12b93`	2015-12-23 15:49:13 -05:00
Rob Speer	9a1b00ba0c	rebuild data files Former-commit-id: `2dcf368481`	2015-11-30 17:06:39 -05:00
Rob Speer	91cc82f76d	tokenize Chinese using jieba and our own frequencies Former-commit-id: `2327f2e4d6`	2015-09-05 03:16:56 -04:00
Rob Speer	1f5c828642	bump to version 1.1 Former-commit-id: `694c28d5e4`	2015-08-25 17:44:52 -04:00
Rob Speer	f4cf46ab9c	Use the regex implementation of Unicode segmentation Former-commit-id: `95998205ad`	2015-08-24 17:11:08 -04:00
Rob Speer	4350bc3ed7	put back the freqs_to_cBpack cutoff; prepare for 1.0 Former-commit-id: `c5708b24e4`	2015-07-28 18:01:12 -04:00
Rob Speer	19e74e91c6	declare 'mecab' as an extra Former-commit-id: `a69ea5ad52`	2015-07-02 17:11:51 -04:00
Rob Speer	5d0d5f7cd2	declare that tests require mecab-python3 Former-commit-id: `7b4ebd1805`	2015-07-02 11:29:11 -04:00
Rob Speer	66ad6f882e	add Twitter-specific wordlists Former-commit-id: `7e3066d3fc`	2015-07-01 17:49:33 -04:00
Rob Speer	c9c7e49465	bump version number Former-commit-id: `053f372ebc`	2015-06-30 14:54:13 -04:00
Rob Speer	9a46b80028	clearer error on py2 Former-commit-id: `ed19d79c5a`	2015-05-28 14:05:11 -04:00
Rob Speer	51f4e4c826	add installation instructions to the readme Former-commit-id: `0f4ca80026`	2015-05-28 14:02:12 -04:00
Rob Speer	c953fc1626	update README, another setup fix Former-commit-id: `dd41e61c57`	2015-05-13 04:09:34 -04:00
Rob Speer	5cbc0d0f94	update dependencies Former-commit-id: `f13cca4d81`	2015-05-12 12:30:01 -04:00
Rob Speer	6f61cac4cb	restore missing line in setup.py Former-commit-id: `bb18f741e2`	2015-05-12 12:24:18 -04:00
Rob Speer	1c65cb9f14	add new data files from wordfreq_builder Former-commit-id: `35aec061de`	2015-05-11 18:45:47 -04:00
Rob Speer	9cd6f7c5c5	WIP: burn stuff down Former-commit-id: `9b63e54471`	2015-05-08 15:28:52 -04:00
Rob Speer	732c932ac7	v0.7: make a proper Dutch 'surfaces' list Former-commit-id: `873ace87db`	2015-04-30 13:01:24 -04:00
Rob Speer	63b465c767	Don't download the DB if the right version is already there Former-commit-id: `e931062b5a`	2013-10-31 14:12:04 -04:00
Rob Speer	8c3e8f9eb4	try being really nonspecific about functools32 versions Former-commit-id: `c1564908f2`	2013-10-31 14:06:06 -04:00
Rob Speer	676cba640f	be less specific about the functools32 version Former-commit-id: `2542cf9e35`	2013-10-31 14:02:40 -04:00
Rob Speer	40102a3f63	Normalize words when storing them or looking them up.	2013-10-30 14:59:57 -04:00
Lance	ce07c881c5	Another Py3 change, this one for functools32	2013-10-30 12:06:41 -04:00
Rob Speer	ca5b3e2f5d	Implement the data uploady downloady stuff in setup.	2013-10-29 16:44:13 -04:00
Rob Speer	bc00bb3a8b	prepare to write custom commands in setup.py	2013-10-29 12:43:41 -04:00
Rob Speer	709ca6be66	Initial version. Noticeably missing: data files or any way to get them.	2013-10-28 19:26:44 -04:00

34 Commits