wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Robyn Speer	d6d2eac920	fix logic of apostrophe-fixing Former-commit-id: `c4c1af8213`	2015-09-08 13:47:58 -04:00
Robyn Speer	523806d6db	fix '--language' option definition Former-commit-id: `912171f8e7`	2015-09-08 13:27:20 -04:00
Robyn Speer	099d90b700	Avoid Chinese tokenizer when building Former-commit-id: `77a9b5c55b`	2015-09-08 12:59:03 -04:00
Robyn Speer	3fa14ded28	language-specific frequency reading; fix 't in English Former-commit-id: `9071defb33`	2015-09-08 12:49:21 -04:00
Robyn Speer	1b35ff6b4c	Merge branch 'apostrophe-fix' into chinese-scripts. Conflicts: wordfreq_builder/wordfreq_builder/word_counts.py Former-commit-id: `20f2828d0a`	2015-09-08 12:29:00 -04:00
Robyn Speer	319c3abaab	WIP: fix apostrophe trimming Former-commit-id: `e39d345c4b`	2015-09-08 12:28:28 -04:00
Robyn Speer	c1f27d3095	update the README for Chinese Former-commit-id: `d576e3294b`	2015-09-05 03:42:54 -04:00
Robyn Speer	a4554fb87c	tokenize Chinese using jieba and our own frequencies Former-commit-id: `2327f2e4d6`	2015-09-05 03:16:56 -04:00
Robyn Speer	7d1c2e72e4	WIP: Traditional Chinese Former-commit-id: `7906a671ea`	2015-09-04 18:52:37 -04:00
Robyn Speer	e77c2dbca8	add Polish and Swedish to README Former-commit-id: `3c3371a9ff`	2015-09-04 17:10:40 -04:00
Robyn Speer	5b9b2d2d02	add Polish and Swedish, which have sufficient data Former-commit-id: `447d7e5134`	2015-09-04 17:10:40 -04:00
Robyn Speer	f7a4e2c444	update data files Former-commit-id: `25edaad962`	2015-09-04 17:00:55 -04:00
Robyn Speer	4704131e13	add tests for Turkish Former-commit-id: `fc93c8dc9c`	2015-09-04 17:00:05 -04:00
Robyn Speer	a75a95658b	We can put the cutoff back now. I took it out when a step in the English SUBTLEX process was outputting frequencies instead of counts, but I've fixed that now. Former-commit-id: `5c7a7ea83e`	2015-09-04 16:16:52 -04:00
Robyn Speer	f330d6d130	remove subtlex-gr from README Former-commit-id: `56318a3ca3`	2015-09-04 16:11:46 -04:00
Robyn Speer	032fea27c3	add more citations Former-commit-id: `8196643509`	2015-09-04 15:57:40 -04:00
Robyn Speer	8277b34571	Use SUBTLEX for German, but OpenSubtitles for Greek. In German and Greek, SUBTLEX and Hermit Dave turn out to have been working from the same source data. I looked at the quality of how they processed the data, and chose SUBTLEX for German, and Dave's wordlist for Greek. Former-commit-id: `77c60c29b0`	2015-09-04 15:52:21 -04:00
Robyn Speer	69d65dfda3	update data files (without the CLD2 fix yet) Former-commit-id: `a47497c908`	2015-09-04 14:58:20 -04:00
Robyn Speer	a69b66b210	Exclude angle brackets from CLD2 detection Former-commit-id: `0d3ee869c1`	2015-09-04 14:56:06 -04:00
Robyn Speer	37e510345d	update README with additional SUBTLEX support Former-commit-id: `81bbe663fb`	2015-09-04 13:23:33 -04:00
Robyn Speer	d0ada70355	add more SUBTLEX and fix its build rules Former-commit-id: `34474939f2`	2015-09-04 12:37:35 -04:00
Robyn Speer	8035df998a	update the data Former-commit-id: `c11e3b7a9d`	2015-09-04 02:07:50 -04:00
Robyn Speer	14136d2a01	Note on next languages to support Former-commit-id: `531db64288`	2015-09-04 01:50:15 -04:00
Robyn Speer	3cb4dd777e	expand list of sources and supported languages Former-commit-id: `d9a1c34d00`	2015-09-04 01:03:36 -04:00
Robyn Speer	574c383202	support Turkish and more Greek; document more Former-commit-id: `d94428d454`	2015-09-04 00:57:04 -04:00
Robyn Speer	f168c37417	Merge branch 'add-subtlex' into greek-and-turkish Former-commit-id: `45d871a815`	2015-09-03 23:26:14 -04:00
Robyn Speer	76c751652e	refer to merge_freqs command correctly Former-commit-id: `40d82541ba`	2015-09-03 23:25:46 -04:00
Robyn Speer	3446a393c5	expand Greek and enable Turkish in config Former-commit-id: `a3daba81eb`	2015-09-03 23:23:31 -04:00
Robyn Speer	d267e0967c	add SUBTLEX to the readme Former-commit-id: `e6a2886a66`	2015-09-03 18:56:56 -04:00
Robyn Speer	f66d03b1b9	Add SUBTLEX as a source of English and Chinese data. Meanwhile, fix up the dependency graph thingy. It's actually kind of legible now. Former-commit-id: `2d58ba94f2`	2015-09-03 18:13:13 -04:00
Robyn Speer	42a7d5a439	Merge pull request #24 from LuminosoInsight/manifest: Remove the no-longer-existent .txt files from the MANIFEST. Former-commit-id: `07228fdf1d`	2015-09-02 14:34:17 -04:00
Andrew Lin	2089090151	Remove the no-longer-existent .txt files from the MANIFEST. Former-commit-id: `db41bc7902`	2015-09-02 14:27:15 -04:00
Andrew Lin	4e8c15cb71	Merge pull request #23 from LuminosoInsight/readme: Put documentation and examples in the README Former-commit-id: `e43b5ebf7b`	2015-08-28 17:59:17 -04:00
Robyn Speer	942761d2f6	fix heading Former-commit-id: `00a2812907`	2015-08-28 17:49:38 -04:00
Robyn Speer	7bdffaae5c	fix list formatting Former-commit-id: `93f44683c5`	2015-08-28 17:49:07 -04:00
Robyn Speer	247d7c6579	update the build diagram and its script Former-commit-id: `5def3a7897`	2015-08-28 17:47:04 -04:00
Robyn Speer	44c655d9a6	improve README with function documentation and examples Former-commit-id: `2370287539`	2015-08-28 17:45:50 -04:00
Andrew Lin	9fedede771	Merge pull request #22 from LuminosoInsight/standard-tokenizer: Use a more standard Unicode tokenizer Former-commit-id: `e6d9b36203`	2015-08-27 11:56:19 -04:00
Robyn Speer	4edfab23ef	update data files Former-commit-id: `b952676679`	2015-08-27 03:58:54 -04:00
Robyn Speer	2c688b8238	copyedit regex comments Former-commit-id: `d5fcf4407e`	2015-08-26 17:04:56 -04:00
Robyn Speer	0b5d2cdca9	fix typo in docstring Former-commit-id: `34375958ef`	2015-08-26 16:24:35 -04:00
Robyn Speer	af29fc4f88	fix URL expression Former-commit-id: `c4a2594217`	2015-08-26 15:00:46 -04:00
Robyn Speer	e463397edf	correct the simple_tokenize docstring Former-commit-id: `f7babea352`	2015-08-26 13:54:50 -04:00
Robyn Speer	7fa449729b	refactor the token expression Former-commit-id: `01b6403ef4`	2015-08-26 13:40:47 -04:00
Robyn Speer	3a140ee02f	un-flake wordfreq_builder.tokenizers, and edit docstrings Former-commit-id: `a893823d6e`	2015-08-26 13:03:23 -04:00
Robyn Speer	769d8c627c	remove regex files that are no longer needed Former-commit-id: `94467a6563`	2015-08-26 11:48:11 -04:00
Robyn Speer	6f10e71d29	bump to version 1.1 Former-commit-id: `694c28d5e4`	2015-08-25 17:44:52 -04:00
Robyn Speer	a3a3180bb9	update the README Former-commit-id: `573dd1ec79`	2015-08-25 17:44:34 -04:00
Robyn Speer	e3658e0e42	updated data Former-commit-id: `353b8045da`	2015-08-25 17:16:03 -04:00
Robyn Speer	b22a4b0f02	Strip apostrophes from edges of tokens. The issue here is that if you had French text with an apostrophe, such as "d'un", it would split it into "d'" and "un", but if "d'" were re-tokenized it would come out as "d". Stripping apostrophes makes the process more idempotent. Former-commit-id: `5a1fc00aaa`	2015-08-25 12:41:48 -04:00
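
The idempotence problem described in commit b22a4b0f02 can be illustrated with a small sketch. This is not wordfreq's actual tokenizer: the pattern `TOY_TOKEN_RE` and the functions `toy_tokenize`, `toy_tokenize_stripped`, and `retokenize` are hypothetical names made up for this example, and the regex is only an assumption chosen to reproduce the behavior the commit message describes (an apostrophe stays attached to a token only when a letter follows it).

```python
import re

# Hypothetical toy pattern (NOT wordfreq's real token expression):
# keep a trailing apostrophe only when another word character follows,
# otherwise match a plain run of word characters.
TOY_TOKEN_RE = re.compile(r"\w+'(?=\w)|\w+")


def toy_tokenize(text):
    """Tokenize without stripping apostrophes from token edges."""
    return TOY_TOKEN_RE.findall(text.lower())


def toy_tokenize_stripped(text):
    """Same tokenizer, but strip apostrophes from the edges of each token."""
    return [token.strip("'") for token in toy_tokenize(text)]


def retokenize(tokens, tokenizer):
    """Run the tokenizer again over each token it already produced."""
    return [piece for token in tokens for piece in tokenizer(token)]


first_pass = toy_tokenize("d'un")
second_pass = retokenize(first_pass, toy_tokenize)
# Without stripping, the second pass disagrees with the first.
print(first_pass, second_pass)

stripped_first = toy_tokenize_stripped("d'un")
stripped_second = retokenize(stripped_first, toy_tokenize_stripped)
# With stripping, tokenizing the tokens again changes nothing.
print(stripped_first, stripped_second)
```

In this sketch the unstripped tokenizer yields "d'" on the first pass and "d" on the second, while the stripped variant yields the same tokens on both passes, which is the property the commit message calls "more idempotent".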

363 Commits