wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-26 10:28:52 +00:00

Author	SHA1	Message	Date
Rob Speer	dc94222d7d	no Thai because we can't tokenize it Former-commit-id: `95f53e295b`	2015-12-02 12:38:03 -05:00
Rob Speer	237fabb4c5	forgot about Italian Former-commit-id: `8f6cd0e57b`	2015-11-30 18:18:24 -05:00
Rob Speer	6caa9ca443	add tokenizer for Reddit Former-commit-id: `5ef807117d`	2015-11-30 18:16:54 -05:00
Rob Speer	d1b667909d	add word frequencies from the Reddit 2007-2015 corpus Former-commit-id: `b2d7546d2d`	2015-11-30 16:38:11 -05:00
Rob Speer	7435c8f57a	fix missing word in rules.ninja comment Former-commit-id: `9b1c4d66cd`	2015-09-24 17:56:06 -04:00
Rob Speer	88deef24f6	describe the use of `lang` in `read_values` Former-commit-id: `f224b8dbba`	2015-09-22 17:22:38 -04:00
Rob Speer	7cb310b28e	Make the jieba_deps comment make sense Former-commit-id: `7c12f2aca1`	2015-09-22 17:19:00 -04:00
Rob Speer	7f92557a58	Merge branch 'greek-and-turkish' into chinese-and-more. Conflicts: README.md, wordfreq_builder/wordfreq_builder/ninja.py Former-commit-id: `3cb3061e06`	2015-09-10 15:27:33 -04:00
Rob Speer	e3cc8eaea9	In ninja deps, remove 'startrow' as a variable Former-commit-id: `a4f8d11427`	2015-09-10 13:46:19 -04:00
Rob Speer	5701c1165d	fix spelling of Marc Former-commit-id: `2277ad3116`	2015-09-09 13:35:02 -04:00
Rob Speer	9c08442dc5	fixes based on code review notes Former-commit-id: `354555514f`	2015-09-09 13:10:18 -04:00
Rob Speer	0f9497d864	take out OpenSubtitles for Chinese Former-commit-id: `d9c44d5fcc`	2015-09-08 17:25:05 -04:00
Rob Speer	5e86394c4c	update comments in wordfreq_builder.config; remove unused 'version' Former-commit-id: `bc323eccaf`	2015-09-08 16:15:29 -04:00
Rob Speer	2dfaf7798d	sort Jieba wordlists consistently; update data files Former-commit-id: `0ab23f8a28`	2015-09-08 16:09:53 -04:00
Rob Speer	01332f1ed5	don't do language-specific tokenization in freqs_to_cBpack. Tokenizing in the 'merge' step is sufficient. Former-commit-id: `bc8ebd23e9`	2015-09-08 14:46:04 -04:00
Rob Speer	86475d6b5f	actually fix logic of apostrophe-fixing Former-commit-id: `715361ca0d`	2015-09-08 13:50:34 -04:00
Rob Speer	6bd0979ad2	fix logic of apostrophe-fixing Former-commit-id: `c4c1af8213`	2015-09-08 13:47:58 -04:00
Rob Speer	8c3fb9f716	fix '--language' option definition Former-commit-id: `912171f8e7`	2015-09-08 13:27:20 -04:00
Rob Speer	67bb55988e	Avoid Chinese tokenizer when building Former-commit-id: `77a9b5c55b`	2015-09-08 12:59:03 -04:00
Rob Speer	11202ad7f5	language-specific frequency reading; fix 't in English Former-commit-id: `9071defb33`	2015-09-08 12:49:21 -04:00
Rob Speer	30237cf73d	Merge branch 'apostrophe-fix' into chinese-scripts. Conflicts: wordfreq_builder/wordfreq_builder/word_counts.py Former-commit-id: `20f2828d0a`	2015-09-08 12:29:00 -04:00
Rob Speer	854247bf8b	WIP: fix apostrophe trimming Former-commit-id: `e39d345c4b`	2015-09-08 12:28:28 -04:00
Rob Speer	91cc82f76d	tokenize Chinese using jieba and our own frequencies Former-commit-id: `2327f2e4d6`	2015-09-05 03:16:56 -04:00
Rob Speer	e2a3758832	WIP: Traditional Chinese Former-commit-id: `7906a671ea`	2015-09-04 18:52:37 -04:00
Rob Speer	a555e5dc13	add Polish and Swedish, which have sufficient data Former-commit-id: `447d7e5134`	2015-09-04 17:10:40 -04:00
Rob Speer	0441a81bbe	We can put the cutoff back now. I took it out when a step in the English SUBTLEX process was outputting frequencies instead of counts, but I've fixed that now. Former-commit-id: `5c7a7ea83e`	2015-09-04 16:16:52 -04:00
Rob Speer	917ce398a2	remove subtlex-gr from README Former-commit-id: `56318a3ca3`	2015-09-04 16:11:46 -04:00
Rob Speer	c08e593234	Use SUBTLEX for German, but OpenSubtitles for Greek. In German and Greek, SUBTLEX and Hermit Dave turn out to have been working from the same source data. I looked at the quality of how they processed the data, and chose SUBTLEX for German, and Dave's wordlist for Greek. Former-commit-id: `77c60c29b0`	2015-09-04 15:52:21 -04:00
Rob Speer	3a8b2c2c81	Exclude angle brackets from CLD2 detection Former-commit-id: `0d3ee869c1`	2015-09-04 14:56:06 -04:00
Rob Speer	b1d158ab41	add more SUBTLEX and fix its build rules Former-commit-id: `34474939f2`	2015-09-04 12:37:35 -04:00
Rob Speer	25e24f9c32	Note on next languages to support Former-commit-id: `531db64288`	2015-09-04 01:50:15 -04:00
Rob Speer	a6ef3224a6	support Turkish and more Greek; document more Former-commit-id: `d94428d454`	2015-09-04 00:57:04 -04:00
Rob Speer	89763679de	Merge branch 'add-subtlex' into greek-and-turkish Former-commit-id: `45d871a815`	2015-09-03 23:26:14 -04:00
Rob Speer	ad4b12bee9	refer to merge_freqs command correctly Former-commit-id: `40d82541ba`	2015-09-03 23:25:46 -04:00
Rob Speer	7a2f2035ab	expand Greek and enable Turkish in config Former-commit-id: `a3daba81eb`	2015-09-03 23:23:31 -04:00
Rob Speer	cb5b696ffa	Add SUBTLEX as a source of English and Chinese data. Meanwhile, fix up the dependency graph thingy. It's actually kind of legible now. Former-commit-id: `2d58ba94f2`	2015-09-03 18:13:13 -04:00
Rob Speer	4aac7bdd65	update the build diagram and its script Former-commit-id: `5def3a7897`	2015-08-28 17:47:04 -04:00
Rob Speer	49bd631632	fix URL expression Former-commit-id: `c4a2594217`	2015-08-26 15:00:46 -04:00
Rob Speer	40d6b85d67	un-flake wordfreq_builder.tokenizers, and edit docstrings Former-commit-id: `a893823d6e`	2015-08-26 13:03:23 -04:00
Rob Speer	a3b37f6619	Strip apostrophes from edges of tokens. The issue here is that if you had French text with an apostrophe, such as "d'un", it would split it into "d'" and "un", but if "d'" were re-tokenized it would come out as "d". Stripping apostrophes makes the process more idempotent. Former-commit-id: `5a1fc00aaa`	2015-08-25 12:41:48 -04:00
Rob Speer	6647cf9035	use better regexes in wordfreq_builder tokenizer Former-commit-id: `de73888a76`	2015-08-24 19:05:46 -04:00
Rob Speer	6a33b46cfd	remove Hangul fillers that confuse cld2 Former-commit-id: `140ca6c050`	2015-08-24 17:11:18 -04:00
Andrew Lin	581dcbcae5	Stylistic cleanups to word_counts.py. Former-commit-id: `6d40912ef9`	2015-07-31 19:26:18 -04:00
Andrew Lin	f393086253	Remove redundant reference to wikipedia in builder README. Former-commit-id: `53621c34df`	2015-07-31 19:12:59 -04:00
Rob Speer	0f0aca8320	Don't use the file-reading cutoff when writing centibels Former-commit-id: `e9f9c94e36`	2015-07-28 18:45:26 -04:00
Rob Speer	4350bc3ed7	put back the freqs_to_cBpack cutoff; prepare for 1.0 Former-commit-id: `c5708b24e4`	2015-07-28 18:01:12 -04:00
Rob Speer	b537f4ecfb	Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17: Code review fixes 2015 07 17 Former-commit-id: `32102ba3c2`	2015-07-22 15:09:00 -04:00
Joshua Chin	8004ecb790	updated read_freqs docs Former-commit-id: `93cd902899`	2015-07-22 10:06:16 -04:00
Joshua Chin	0d8bf35fab	fixed style Former-commit-id: `4fe9d110e1`	2015-07-22 10:05:11 -04:00
Joshua Chin	78324e74eb	reordered command line args Former-commit-id: `6453d864c4`	2015-07-22 10:04:14 -04:00
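
The idempotence problem described in commit a3b37f6619 can be illustrated with a small sketch. This is not the wordfreq_builder tokenizer: `TOKEN_RE`, `naive_tokenize`, `stripping_tokenize`, and `retokenize` are hypothetical names, and the regex is a toy assumption chosen only to reproduce the behaviour the commit message describes (an apostrophe is kept on a token only when a letter follows it).

```python
import re

# Toy assumption, not wordfreq's real regex: a token is a run of letters,
# plus one trailing apostrophe only if another letter follows it.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'(?=[^\W\d_]))?")


def naive_tokenize(text):
    """Split text into tokens, leaving edge apostrophes in place."""
    return TOKEN_RE.findall(text.lower())


def stripping_tokenize(text):
    """Same split, but strip apostrophes from the edges of each token."""
    stripped = (token.strip("'") for token in naive_tokenize(text))
    return [token for token in stripped if token]


def retokenize(tokens, tokenizer):
    """Run the tokenizer again over each token it already produced."""
    return [piece for token in tokens for piece in tokenizer(token)]


first_pass = naive_tokenize("d'un")
second_pass = retokenize(first_pass, naive_tokenize)
print(first_pass, second_pass)   # the two passes disagree

stable_first = stripping_tokenize("d'un")
stable_second = retokenize(stable_first, stripping_tokenize)
print(stable_first, stable_second)  # the two passes agree
```

In this sketch the naive tokenizer changes its own output on a second pass, while the stripping version reaches a fixed point after one pass, which is the sense in which the commit calls the process "more idempotent".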

150 Commits