wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Robyn Speer	c5d5b0b1fe	In ninja deps, remove 'startrow' as a variable Former-commit-id: `a4f8d11427`	2015-09-10 13:46:19 -04:00
Robyn Speer	acddc3ca05	fix spelling of Marc Former-commit-id: `2277ad3116`	2015-09-09 13:35:02 -04:00
Robyn Speer	872556f7bb	fixes based on code review notes Former-commit-id: `354555514f`	2015-09-09 13:10:18 -04:00
Robyn Speer	3dd70ed1c2	fix SUBTLEX citations Former-commit-id: `6502f15e9b`	2015-09-08 17:45:25 -04:00
Robyn Speer	1d3521dfda	take out OpenSubtitles for Chinese Former-commit-id: `d9c44d5fcc`	2015-09-08 17:25:05 -04:00
Robyn Speer	59363c8c44	update comments in wordfreq_builder.config; remove unused 'version' Former-commit-id: `bc323eccaf`	2015-09-08 16:15:29 -04:00
Robyn Speer	48f9d4520c	sort Jieba wordlists consistently; update data files Former-commit-id: `0ab23f8a28`	2015-09-08 16:09:53 -04:00
Robyn Speer	4aef1dc338	don't do language-specific tokenization in freqs_to_cBpack. Tokenizing in the 'merge' step is sufficient. Former-commit-id: `bc8ebd23e9`	2015-09-08 14:46:04 -04:00
Robyn Speer	64b0b76ee1	actually fix logic of apostrophe-fixing Former-commit-id: `715361ca0d`	2015-09-08 13:50:34 -04:00
Robyn Speer	d6d2eac920	fix logic of apostrophe-fixing Former-commit-id: `c4c1af8213`	2015-09-08 13:47:58 -04:00
Robyn Speer	523806d6db	fix '--language' option definition Former-commit-id: `912171f8e7`	2015-09-08 13:27:20 -04:00
Robyn Speer	099d90b700	Avoid Chinese tokenizer when building Former-commit-id: `77a9b5c55b`	2015-09-08 12:59:03 -04:00
Robyn Speer	3fa14ded28	language-specific frequency reading; fix 't in English Former-commit-id: `9071defb33`	2015-09-08 12:49:21 -04:00
Robyn Speer	1b35ff6b4c	Merge branch 'apostrophe-fix' into chinese-scripts. Conflicts: wordfreq_builder/wordfreq_builder/word_counts.py Former-commit-id: `20f2828d0a`	2015-09-08 12:29:00 -04:00
Robyn Speer	319c3abaab	WIP: fix apostrophe trimming Former-commit-id: `e39d345c4b`	2015-09-08 12:28:28 -04:00
Robyn Speer	c1f27d3095	update the README for Chinese Former-commit-id: `d576e3294b`	2015-09-05 03:42:54 -04:00
Robyn Speer	a4554fb87c	tokenize Chinese using jieba and our own frequencies Former-commit-id: `2327f2e4d6`	2015-09-05 03:16:56 -04:00
Robyn Speer	7d1c2e72e4	WIP: Traditional Chinese Former-commit-id: `7906a671ea`	2015-09-04 18:52:37 -04:00
Robyn Speer	e77c2dbca8	add Polish and Swedish to README Former-commit-id: `3c3371a9ff`	2015-09-04 17:10:40 -04:00
Robyn Speer	5b9b2d2d02	add Polish and Swedish, which have sufficient data Former-commit-id: `447d7e5134`	2015-09-04 17:10:40 -04:00
Robyn Speer	f7a4e2c444	update data files Former-commit-id: `25edaad962`	2015-09-04 17:00:55 -04:00
Robyn Speer	4704131e13	add tests for Turkish Former-commit-id: `fc93c8dc9c`	2015-09-04 17:00:05 -04:00
Robyn Speer	a75a95658b	We can put the cutoff back now. I took it out when a step in the English SUBTLEX process was outputting frequencies instead of counts, but I've fixed that now. Former-commit-id: `5c7a7ea83e`	2015-09-04 16:16:52 -04:00
Robyn Speer	f330d6d130	remove subtlex-gr from README Former-commit-id: `56318a3ca3`	2015-09-04 16:11:46 -04:00
Robyn Speer	032fea27c3	add more citations Former-commit-id: `8196643509`	2015-09-04 15:57:40 -04:00
Robyn Speer	8277b34571	Use SUBTLEX for German, but OpenSubtitles for Greek. In German and Greek, SUBTLEX and Hermit Dave turn out to have been working from the same source data. I looked at the quality of how they processed the data, and chose SUBTLEX for German, and Dave's wordlist for Greek. Former-commit-id: `77c60c29b0`	2015-09-04 15:52:21 -04:00
Robyn Speer	69d65dfda3	update data files (without the CLD2 fix yet) Former-commit-id: `a47497c908`	2015-09-04 14:58:20 -04:00
Robyn Speer	a69b66b210	Exclude angle brackets from CLD2 detection Former-commit-id: `0d3ee869c1`	2015-09-04 14:56:06 -04:00
Robyn Speer	37e510345d	update README with additional SUBTLEX support Former-commit-id: `81bbe663fb`	2015-09-04 13:23:33 -04:00
Robyn Speer	d0ada70355	add more SUBTLEX and fix its build rules Former-commit-id: `34474939f2`	2015-09-04 12:37:35 -04:00
Robyn Speer	8035df998a	update the data Former-commit-id: `c11e3b7a9d`	2015-09-04 02:07:50 -04:00
Robyn Speer	14136d2a01	Note on next languages to support Former-commit-id: `531db64288`	2015-09-04 01:50:15 -04:00
Robyn Speer	3cb4dd777e	expand list of sources and supported languages Former-commit-id: `d9a1c34d00`	2015-09-04 01:03:36 -04:00
Robyn Speer	574c383202	support Turkish and more Greek; document more Former-commit-id: `d94428d454`	2015-09-04 00:57:04 -04:00
Robyn Speer	f168c37417	Merge branch 'add-subtlex' into greek-and-turkish Former-commit-id: `45d871a815`	2015-09-03 23:26:14 -04:00
Robyn Speer	76c751652e	refer to merge_freqs command correctly Former-commit-id: `40d82541ba`	2015-09-03 23:25:46 -04:00
Robyn Speer	3446a393c5	expand Greek and enable Turkish in config Former-commit-id: `a3daba81eb`	2015-09-03 23:23:31 -04:00
Robyn Speer	d267e0967c	add SUBTLEX to the readme Former-commit-id: `e6a2886a66`	2015-09-03 18:56:56 -04:00
Robyn Speer	f66d03b1b9	Add SUBTLEX as a source of English and Chinese data. Meanwhile, fix up the dependency graph thingy. It's actually kind of legible now. Former-commit-id: `2d58ba94f2`	2015-09-03 18:13:13 -04:00
Robyn Speer	42a7d5a439	Merge pull request #24 from LuminosoInsight/manifest: Remove the no-longer-existent .txt files from the MANIFEST. Former-commit-id: `07228fdf1d`	2015-09-02 14:34:17 -04:00
Andrew Lin	2089090151	Remove the no-longer-existent .txt files from the MANIFEST. Former-commit-id: `db41bc7902`	2015-09-02 14:27:15 -04:00
Andrew Lin	4e8c15cb71	Merge pull request #23 from LuminosoInsight/readme: Put documentation and examples in the README Former-commit-id: `e43b5ebf7b`	2015-08-28 17:59:17 -04:00
Robyn Speer	942761d2f6	fix heading Former-commit-id: `00a2812907`	2015-08-28 17:49:38 -04:00
Robyn Speer	7bdffaae5c	fix list formatting Former-commit-id: `93f44683c5`	2015-08-28 17:49:07 -04:00
Robyn Speer	247d7c6579	update the build diagram and its script Former-commit-id: `5def3a7897`	2015-08-28 17:47:04 -04:00
Robyn Speer	44c655d9a6	improve README with function documentation and examples Former-commit-id: `2370287539`	2015-08-28 17:45:50 -04:00
Andrew Lin	9fedede771	Merge pull request #22 from LuminosoInsight/standard-tokenizer: Use a more standard Unicode tokenizer Former-commit-id: `e6d9b36203`	2015-08-27 11:56:19 -04:00
Robyn Speer	4edfab23ef	update data files Former-commit-id: `b952676679`	2015-08-27 03:58:54 -04:00
Robyn Speer	2c688b8238	copyedit regex comments Former-commit-id: `d5fcf4407e`	2015-08-26 17:04:56 -04:00
Robyn Speer	0b5d2cdca9	fix typo in docstring Former-commit-id: `34375958ef`	2015-08-26 16:24:35 -04:00


472 Commits