Commit Graph

169 Commits

Author      SHA1        Message  Date

Rob Speer   16059d3b9a  rename max_size to max_words consistently  2016-03-31 12:55:18 -04:00
Rob Speer   abbc295538  Discard text detected as an uncommon language; add large German list  2016-03-28 12:26:02 -04:00
Rob Speer   08130908c7  oh look, more spam  2016-03-24 18:42:47 -04:00
Rob Speer   5b98794b86  filter out downvoted Reddit posts  2016-03-24 18:05:13 -04:00
Rob Speer   cfe68893fa  disregard Arabic Reddit spam  2016-03-24 17:44:30 -04:00
Rob Speer   6feae99381  fix extraneous dot in intermediate filenames  2016-03-24 16:52:44 -04:00
Rob Speer   75a4a92110  actually use the results of language-detection on Reddit  2016-03-24 16:27:24 -04:00
Rob Speer   164a5b1a05  Merge remote-tracking branch 'origin/master' into big-list  2016-03-24 14:11:44 -04:00
                        Conflicts:
                            wordfreq_builder/wordfreq_builder/cli/merge_counts.py
Rob Speer   178a8b1494  make max-words a real, documented parameter  2016-03-24 14:10:02 -04:00
Andrew Lin  38016cf62b  Restore a missing comma.  2016-03-24 13:57:18 -04:00
Rob Speer   d79ee37da9  Add and document large wordlists  2016-01-22 16:23:43 -05:00
Rob Speer   c1a12cebec  configuration that builds some larger lists  2016-01-22 14:20:12 -05:00
Rob Speer   8ddc19a5ca  fix documentation in wordfreq_builder.tokenizers  2016-01-13 15:18:12 -05:00
Rob Speer   511fcb6f91  reformat some argparse argument definitions  2016-01-13 12:05:07 -05:00
Rob Speer   df8caaff7d  build a bigger wordlist that we can optionally use  2016-01-12 14:05:57 -05:00
Rob Speer   8d9668d8ab  fix usage text: one comment, not one tweet  2016-01-12 13:05:38 -05:00
Rob Speer   115c74583e  Separate tokens with spaces, not line breaks, in intermediate files  2016-01-12 12:59:18 -05:00
Rob Speer   973caca253  builder: Use an optional cutoff when merging counts  2015-12-15 14:44:34 -05:00
                        This allows the Reddit-merging step to not use such a ludicrous amount
                        of memory.
Rob Speer   9a5d9d66bb  gzip the intermediate step of Reddit word counting  2015-12-09 13:30:08 -05:00
Rob Speer   95f53e295b  no Thai because we can't tokenize it  2015-12-02 12:38:03 -05:00
Rob Speer   8f6cd0e57b  forgot about Italian  2015-11-30 18:18:24 -05:00
Rob Speer   5ef807117d  add tokenizer for Reddit  2015-11-30 18:16:54 -05:00
Rob Speer   b2d7546d2d  add word frequencies from the Reddit 2007-2015 corpus  2015-11-30 16:38:11 -05:00
Rob Speer   9b1c4d66cd  fix missing word in rules.ninja comment  2015-09-24 17:56:06 -04:00
Rob Speer   f224b8dbba  describe the use of lang in read_values  2015-09-22 17:22:38 -04:00
Rob Speer   7c12f2aca1  Make the jieba_deps comment make sense  2015-09-22 17:19:00 -04:00
Rob Speer   3cb3061e06  Merge branch 'greek-and-turkish' into chinese-and-more  2015-09-10 15:27:33 -04:00
                        Conflicts:
                            README.md
                            wordfreq_builder/wordfreq_builder/ninja.py
Rob Speer   a4f8d11427  In ninja deps, remove 'startrow' as a variable  2015-09-10 13:46:19 -04:00
Rob Speer   2277ad3116  fix spelling of Marc  2015-09-09 13:35:02 -04:00
Rob Speer   354555514f  fixes based on code review notes  2015-09-09 13:10:18 -04:00
Rob Speer   d9c44d5fcc  take out OpenSubtitles for Chinese  2015-09-08 17:25:05 -04:00
Rob Speer   bc323eccaf  update comments in wordfreq_builder.config; remove unused 'version'  2015-09-08 16:15:29 -04:00
Rob Speer   0ab23f8a28  sort Jieba wordlists consistently; update data files  2015-09-08 16:09:53 -04:00
Rob Speer   bc8ebd23e9  don't do language-specific tokenization in freqs_to_cBpack  2015-09-08 14:46:04 -04:00
                        Tokenizing in the 'merge' step is sufficient.
Rob Speer   715361ca0d  actually fix logic of apostrophe-fixing  2015-09-08 13:50:34 -04:00
Rob Speer   c4c1af8213  fix logic of apostrophe-fixing  2015-09-08 13:47:58 -04:00
Rob Speer   912171f8e7  fix '--language' option definition  2015-09-08 13:27:20 -04:00
Rob Speer   77a9b5c55b  Avoid Chinese tokenizer when building  2015-09-08 12:59:03 -04:00
Rob Speer   9071defb33  language-specific frequency reading; fix 't in English  2015-09-08 12:49:21 -04:00
Rob Speer   20f2828d0a  Merge branch 'apostrophe-fix' into chinese-scripts  2015-09-08 12:29:00 -04:00
                        Conflicts:
                            wordfreq_builder/wordfreq_builder/word_counts.py
Rob Speer   e39d345c4b  WIP: fix apostrophe trimming  2015-09-08 12:28:28 -04:00
Rob Speer   2327f2e4d6  tokenize Chinese using jieba and our own frequencies  2015-09-05 03:16:56 -04:00
Rob Speer   7906a671ea  WIP: Traditional Chinese  2015-09-04 18:52:37 -04:00
Rob Speer   447d7e5134  add Polish and Swedish, which have sufficient data  2015-09-04 17:10:40 -04:00
Rob Speer   5c7a7ea83e  We can put the cutoff back now  2015-09-04 16:16:52 -04:00
                        I took it out when a step in the English SUBTLEX process was outputting
                        frequencies instead of counts, but I've fixed that now.
Rob Speer   56318a3ca3  remove subtlex-gr from README  2015-09-04 16:11:46 -04:00
Rob Speer   77c60c29b0  Use SUBTLEX for German, but OpenSubtitles for Greek  2015-09-04 15:52:21 -04:00
                        In German and Greek, SUBTLEX and Hermit Dave turn out to have been
                        working from the same source data. I looked at the quality of how they
                        processed the data, and chose SUBTLEX for German, and Dave's wordlist
                        for Greek.
Rob Speer   0d3ee869c1  Exclude angle brackets from CLD2 detection  2015-09-04 14:56:06 -04:00
Rob Speer   34474939f2  add more SUBTLEX and fix its build rules  2015-09-04 12:37:35 -04:00
Rob Speer   531db64288  Note on next languages to support  2015-09-04 01:50:15 -04:00