wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-27 10:58:52 +00:00

Author	SHA1	Message	Date
Rob Speer	30237cf73d	Merge branch 'apostrophe-fix' into chinese-scripts. Conflicts: wordfreq_builder/wordfreq_builder/word_counts.py Former-commit-id: `20f2828d0a`	2015-09-08 12:29:00 -04:00
Rob Speer	854247bf8b	WIP: fix apostrophe trimming Former-commit-id: `e39d345c4b`	2015-09-08 12:28:28 -04:00
Rob Speer	b4100b5bfb	update the README for Chinese Former-commit-id: `d576e3294b`	2015-09-05 03:42:54 -04:00
Rob Speer	91cc82f76d	tokenize Chinese using jieba and our own frequencies Former-commit-id: `2327f2e4d6`	2015-09-05 03:16:56 -04:00
Rob Speer	e2a3758832	WIP: Traditional Chinese Former-commit-id: `7906a671ea`	2015-09-04 18:52:37 -04:00
Rob Speer	62f5a8eb1e	add Polish and Swedish to README Former-commit-id: `3c3371a9ff`	2015-09-04 17:10:40 -04:00
Rob Speer	a555e5dc13	add Polish and Swedish, which have sufficient data Former-commit-id: `447d7e5134`	2015-09-04 17:10:40 -04:00
Rob Speer	1d4a18ead2	update data files Former-commit-id: `25edaad962`	2015-09-04 17:00:55 -04:00
Rob Speer	63295fc397	add tests for Turkish Former-commit-id: `fc93c8dc9c`	2015-09-04 17:00:05 -04:00
Rob Speer	0441a81bbe	We can put the cutoff back now. I took it out when a step in the English SUBTLEX process was outputting frequencies instead of counts, but I've fixed that now. Former-commit-id: `5c7a7ea83e`	2015-09-04 16:16:52 -04:00
Rob Speer	917ce398a2	remove subtlex-gr from README Former-commit-id: `56318a3ca3`	2015-09-04 16:11:46 -04:00
Rob Speer	138e8aaa3f	add more citations Former-commit-id: `8196643509`	2015-09-04 15:57:40 -04:00
Rob Speer	c08e593234	Use SUBTLEX for German, but OpenSubtitles for Greek. In German and Greek, SUBTLEX and Hermit Dave turn out to have been working from the same source data. I looked at the quality of how they processed the data, and chose SUBTLEX for German, and Dave's wordlist for Greek. Former-commit-id: `77c60c29b0`	2015-09-04 15:52:21 -04:00
Rob Speer	a8161b1067	update data files (without the CLD2 fix yet) Former-commit-id: `a47497c908`	2015-09-04 14:58:20 -04:00
Rob Speer	3a8b2c2c81	Exclude angle brackets from CLD2 detection Former-commit-id: `0d3ee869c1`	2015-09-04 14:56:06 -04:00
Rob Speer	a0997a79a4	update README with additional SUBTLEX support Former-commit-id: `81bbe663fb`	2015-09-04 13:23:33 -04:00
Rob Speer	b1d158ab41	add more SUBTLEX and fix its build rules Former-commit-id: `34474939f2`	2015-09-04 12:37:35 -04:00
Rob Speer	f993ffcdf2	update the data Former-commit-id: `c11e3b7a9d`	2015-09-04 02:07:50 -04:00
Rob Speer	25e24f9c32	Note on next languages to support Former-commit-id: `531db64288`	2015-09-04 01:50:15 -04:00
Rob Speer	bf88f97744	expand list of sources and supported languages Former-commit-id: `d9a1c34d00`	2015-09-04 01:03:36 -04:00
Rob Speer	a6ef3224a6	support Turkish and more Greek; document more Former-commit-id: `d94428d454`	2015-09-04 00:57:04 -04:00
Rob Speer	89763679de	Merge branch 'add-subtlex' into greek-and-turkish Former-commit-id: `45d871a815`	2015-09-03 23:26:14 -04:00
Rob Speer	ad4b12bee9	refer to merge_freqs command correctly Former-commit-id: `40d82541ba`	2015-09-03 23:25:46 -04:00
Rob Speer	7a2f2035ab	expand Greek and enable Turkish in config Former-commit-id: `a3daba81eb`	2015-09-03 23:23:31 -04:00
Rob Speer	a92c398258	add SUBTLEX to the readme Former-commit-id: `e6a2886a66`	2015-09-03 18:56:56 -04:00
Rob Speer	cb5b696ffa	Add SUBTLEX as a source of English and Chinese data. Meanwhile, fix up the dependency graph thingy. It's actually kind of legible now. Former-commit-id: `2d58ba94f2`	2015-09-03 18:13:13 -04:00
Rob Speer	531cedbf55	Merge pull request #24 from LuminosoInsight/manifest: Remove the no-longer-existent .txt files from the MANIFEST. Former-commit-id: `07228fdf1d`	2015-09-02 14:34:17 -04:00
Andrew Lin	65d6645e81	Remove the no-longer-existent .txt files from the MANIFEST. Former-commit-id: `db41bc7902`	2015-09-02 14:27:15 -04:00
Andrew Lin	b693715663	Merge pull request #23 from LuminosoInsight/readme: Put documentation and examples in the README. Former-commit-id: `e43b5ebf7b`	2015-08-28 17:59:17 -04:00
Rob Speer	d883eaeca5	fix heading Former-commit-id: `00a2812907`	2015-08-28 17:49:38 -04:00
Rob Speer	390a431181	fix list formatting Former-commit-id: `93f44683c5`	2015-08-28 17:49:07 -04:00
Rob Speer	4aac7bdd65	update the build diagram and its script Former-commit-id: `5def3a7897`	2015-08-28 17:47:04 -04:00
Rob Speer	43fd15c938	improve README with function documentation and examples Former-commit-id: `2370287539`	2015-08-28 17:45:50 -04:00
Andrew Lin	5a47427f6e	Merge pull request #22 from LuminosoInsight/standard-tokenizer: Use a more standard Unicode tokenizer. Former-commit-id: `e6d9b36203`	2015-08-27 11:56:19 -04:00
Rob Speer	db5a4502b8	update data files Former-commit-id: `b952676679`	2015-08-27 03:58:54 -04:00
Rob Speer	001180ca86	copyedit regex comments Former-commit-id: `d5fcf4407e`	2015-08-26 17:04:56 -04:00
Rob Speer	dae953525e	fix typo in docstring Former-commit-id: `34375958ef`	2015-08-26 16:24:35 -04:00
Rob Speer	49bd631632	fix URL expression Former-commit-id: `c4a2594217`	2015-08-26 15:00:46 -04:00
Rob Speer	6286946cc3	correct the simple_tokenize docstring Former-commit-id: `f7babea352`	2015-08-26 13:54:50 -04:00
Rob Speer	232aee9c66	refactor the token expression Former-commit-id: `01b6403ef4`	2015-08-26 13:40:47 -04:00
Rob Speer	40d6b85d67	un-flake wordfreq_builder.tokenizers, and edit docstrings Former-commit-id: `a893823d6e`	2015-08-26 13:03:23 -04:00
Rob Speer	7a757d9ec9	remove regex files that are no longer needed Former-commit-id: `94467a6563`	2015-08-26 11:48:11 -04:00
Rob Speer	1f5c828642	bump to version 1.1 Former-commit-id: `694c28d5e4`	2015-08-25 17:44:52 -04:00
Rob Speer	d064fbec7d	update the README Former-commit-id: `573dd1ec79`	2015-08-25 17:44:34 -04:00
Rob Speer	244735ce4d	updated data Former-commit-id: `353b8045da`	2015-08-25 17:16:03 -04:00
Rob Speer	a3b37f6619	Strip apostrophes from edges of tokens. The issue here is that if you had French text with an apostrophe, such as "d'un", it would split it into "d'" and "un", but if "d'" were re-tokenized it would come out as "d". Stripping apostrophes makes the process more idempotent. Former-commit-id: `5a1fc00aaa`	2015-08-25 12:41:48 -04:00
Rob Speer	1042f87efe	exclude 'extenders' from the start of the token Former-commit-id: `a8e7c29068`	2015-08-25 12:33:12 -04:00
Rob Speer	a5b8c5a745	update frequency lists Former-commit-id: `0d600bdf27`	2015-08-25 11:43:59 -04:00
Rob Speer	99a312ce06	Exclude math and modifier symbols as tokens Former-commit-id: `8f3c9f576c`	2015-08-25 11:43:22 -04:00
Rob Speer	6647cf9035	use better regexes in wordfreq_builder tokenizer Former-commit-id: `de73888a76`	2015-08-24 19:05:46 -04:00
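
Note on commit a3b37f6619 ("Strip apostrophes from edges of tokens"): the idempotence problem described in that message can be reproduced with a small sketch. This is not wordfreq's actual tokenizer. The pattern `TOKEN_RE` and the helpers `tokenize_raw`, `tokenize_stripped`, and `retokenize` are hypothetical names chosen for illustration, and the regex is an assumption that merely mimics the behavior the commit message describes (an apostrophe is kept only when a word character follows it).

```python
import re

# Hypothetical pattern (an assumption, not wordfreq's real expression):
# keep a trailing apostrophe only when another word character follows it,
# otherwise match a plain run of word characters.
TOKEN_RE = re.compile(r"\w+'(?=\w)|\w+")


def tokenize_raw(text):
    """Tokenize without trimming, so "d'un" yields "d'" and "un"."""
    return TOKEN_RE.findall(text.lower())


def tokenize_stripped(text):
    """Same tokenizer, but strip apostrophes from the edges of each token."""
    return [tok.strip("'") for tok in tokenize_raw(text)]


def retokenize(tokens, tokenizer):
    """Run the tokenizer again on each token it already produced."""
    return [piece for tok in tokens for piece in tokenizer(tok)]


raw = tokenize_raw("d'un")
stripped = tokenize_stripped("d'un")

# Without trimming, a second pass changes "d'" into "d": not idempotent.
print(raw, retokenize(raw, tokenize_raw))
# With trimming, a second pass returns the same tokens.
print(stripped, retokenize(stripped, tokenize_stripped))
```

In this sketch the trimmed tokenizer returns the same tokens on a second pass, while the untrimmed one does not, which is the property the commit message refers to as "more idempotent".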


359 Commits