wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Robyn Speer	0c7527140c	Discard text detected as an uncommon language; add large German list Former-commit-id: `abbc295538`	2016-03-28 12:26:02 -04:00
Robyn Speer	aa7802b552	oh look, more spam Former-commit-id: `08130908c7`	2016-03-24 18:42:47 -04:00
Robyn Speer	2840ca55aa	filter out downvoted Reddit posts Former-commit-id: `5b98794b86`	2016-03-24 18:05:13 -04:00
Robyn Speer	16841d4b0c	disregard Arabic Reddit spam Former-commit-id: `cfe68893fa`	2016-03-24 17:44:30 -04:00
Robyn Speer	034d8f540b	fix extraneous dot in intermediate filenames Former-commit-id: `6feae99381`	2016-03-24 16:52:44 -04:00
Robyn Speer	969a024dea	actually use the results of language-detection on Reddit Former-commit-id: `75a4a92110`	2016-03-24 16:27:24 -04:00
Robyn Speer	fbc19995ab	Merge remote-tracking branch 'origin/master' into big-list. Conflicts: wordfreq_builder/wordfreq_builder/cli/merge_counts.py Former-commit-id: `164a5b1a05`	2016-03-24 14:11:44 -04:00
Robyn Speer	f493d0eec4	make max-words a real, documented parameter Former-commit-id: `178a8b1494`	2016-03-24 14:10:02 -04:00
Andrew Lin	1942bc690f	Restore a missing comma. Former-commit-id: `38016cf62b`	2016-03-24 13:57:18 -04:00
Robyn Speer	6344b38194	Add and document large wordlists Former-commit-id: `d79ee37da9`	2016-01-22 16:23:43 -05:00
Robyn Speer	12e779fc79	configuration that builds some larger lists Former-commit-id: `c1a12cebec`	2016-01-22 14:20:12 -05:00
Robyn Speer	6eca3cff5a	fix documentation in wordfreq_builder.tokenizers Former-commit-id: `8ddc19a5ca`	2016-01-13 15:18:12 -05:00
Robyn Speer	95cdf41fe8	reformat some argparse argument definitions Former-commit-id: `511fcb6f91`	2016-01-13 12:05:07 -05:00
Robyn Speer	738243e244	build a bigger wordlist that we can optionally use Former-commit-id: `df8caaff7d`	2016-01-12 14:05:57 -05:00
Robyn Speer	2069e30c89	fix usage text: one comment, not one tweet Former-commit-id: `8d9668d8ab`	2016-01-12 13:05:38 -05:00
Robyn Speer	883aa5baeb	Separate tokens with spaces, not line breaks, in intermediate files Former-commit-id: `115c74583e`	2016-01-12 12:59:18 -05:00
Robyn Speer	7d1719cfb4	builder: Use an optional cutoff when merging counts. This allows the Reddit-merging step to not use such a ludicrous amount of memory. Former-commit-id: `973caca253`	2015-12-15 14:44:34 -05:00
Robyn Speer	f5e09f3f3d	gzip the intermediate step of Reddit word counting Former-commit-id: `9a5d9d66bb`	2015-12-09 13:30:08 -05:00
Robyn Speer	682e08fee2	no Thai because we can't tokenize it Former-commit-id: `95f53e295b`	2015-12-02 12:38:03 -05:00
Robyn Speer	064ee22a33	forgot about Italian Former-commit-id: `8f6cd0e57b`	2015-11-30 18:18:24 -05:00
Robyn Speer	ab8c2e2331	add tokenizer for Reddit Former-commit-id: `5ef807117d`	2015-11-30 18:16:54 -05:00
Robyn Speer	6d2709f064	add word frequencies from the Reddit 2007-2015 corpus Former-commit-id: `b2d7546d2d`	2015-11-30 16:38:11 -05:00
Robyn Speer	7494ae27a7	fix missing word in rules.ninja comment Former-commit-id: `9b1c4d66cd`	2015-09-24 17:56:06 -04:00
Robyn Speer	d215f79ea3	describe the use of `lang` in `read_values` Former-commit-id: `f224b8dbba`	2015-09-22 17:22:38 -04:00
Robyn Speer	e6e29a1c03	Make the jieba_deps comment make sense Former-commit-id: `7c12f2aca1`	2015-09-22 17:19:00 -04:00
Robyn Speer	f2be213933	Merge branch 'greek-and-turkish' into chinese-and-more. Conflicts: README.md, wordfreq_builder/wordfreq_builder/ninja.py Former-commit-id: `3cb3061e06`	2015-09-10 15:27:33 -04:00
Robyn Speer	c5d5b0b1fe	In ninja deps, remove 'startrow' as a variable Former-commit-id: `a4f8d11427`	2015-09-10 13:46:19 -04:00
Robyn Speer	acddc3ca05	fix spelling of Marc Former-commit-id: `2277ad3116`	2015-09-09 13:35:02 -04:00
Robyn Speer	872556f7bb	fixes based on code review notes Former-commit-id: `354555514f`	2015-09-09 13:10:18 -04:00
Robyn Speer	1d3521dfda	take out OpenSubtitles for Chinese Former-commit-id: `d9c44d5fcc`	2015-09-08 17:25:05 -04:00
Robyn Speer	59363c8c44	update comments in wordfreq_builder.config; remove unused 'version' Former-commit-id: `bc323eccaf`	2015-09-08 16:15:29 -04:00
Robyn Speer	48f9d4520c	sort Jieba wordlists consistently; update data files Former-commit-id: `0ab23f8a28`	2015-09-08 16:09:53 -04:00
Robyn Speer	4aef1dc338	don't do language-specific tokenization in freqs_to_cBpack. Tokenizing in the 'merge' step is sufficient. Former-commit-id: `bc8ebd23e9`	2015-09-08 14:46:04 -04:00
Robyn Speer	64b0b76ee1	actually fix logic of apostrophe-fixing Former-commit-id: `715361ca0d`	2015-09-08 13:50:34 -04:00
Robyn Speer	d6d2eac920	fix logic of apostrophe-fixing Former-commit-id: `c4c1af8213`	2015-09-08 13:47:58 -04:00
Robyn Speer	523806d6db	fix '--language' option definition Former-commit-id: `912171f8e7`	2015-09-08 13:27:20 -04:00
Robyn Speer	099d90b700	Avoid Chinese tokenizer when building Former-commit-id: `77a9b5c55b`	2015-09-08 12:59:03 -04:00
Robyn Speer	3fa14ded28	language-specific frequency reading; fix 't in English Former-commit-id: `9071defb33`	2015-09-08 12:49:21 -04:00
Robyn Speer	1b35ff6b4c	Merge branch 'apostrophe-fix' into chinese-scripts. Conflicts: wordfreq_builder/wordfreq_builder/word_counts.py Former-commit-id: `20f2828d0a`	2015-09-08 12:29:00 -04:00
Robyn Speer	319c3abaab	WIP: fix apostrophe trimming Former-commit-id: `e39d345c4b`	2015-09-08 12:28:28 -04:00
Robyn Speer	a4554fb87c	tokenize Chinese using jieba and our own frequencies Former-commit-id: `2327f2e4d6`	2015-09-05 03:16:56 -04:00
Robyn Speer	7d1c2e72e4	WIP: Traditional Chinese Former-commit-id: `7906a671ea`	2015-09-04 18:52:37 -04:00
Robyn Speer	5b9b2d2d02	add Polish and Swedish, which have sufficient data Former-commit-id: `447d7e5134`	2015-09-04 17:10:40 -04:00
Robyn Speer	a75a95658b	We can put the cutoff back now. I took it out when a step in the English SUBTLEX process was outputting frequencies instead of counts, but I've fixed that now. Former-commit-id: `5c7a7ea83e`	2015-09-04 16:16:52 -04:00
Robyn Speer	f330d6d130	remove subtlex-gr from README Former-commit-id: `56318a3ca3`	2015-09-04 16:11:46 -04:00
Robyn Speer	8277b34571	Use SUBTLEX for German, but OpenSubtitles for Greek. In German and Greek, SUBTLEX and Hermit Dave turn out to have been working from the same source data. I looked at the quality of how they processed the data, and chose SUBTLEX for German, and Dave's wordlist for Greek. Former-commit-id: `77c60c29b0`	2015-09-04 15:52:21 -04:00
Robyn Speer	a69b66b210	Exclude angle brackets from CLD2 detection Former-commit-id: `0d3ee869c1`	2015-09-04 14:56:06 -04:00
Robyn Speer	d0ada70355	add more SUBTLEX and fix its build rules Former-commit-id: `34474939f2`	2015-09-04 12:37:35 -04:00
Robyn Speer	14136d2a01	Note on next languages to support Former-commit-id: `531db64288`	2015-09-04 01:50:15 -04:00
Robyn Speer	574c383202	support Turkish and more Greek; document more Former-commit-id: `d94428d454`	2015-09-04 00:57:04 -04:00

168 Commits