wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-26 02:28:50 +00:00

Author	SHA1	Message	Date
Rob Speer	1aa63bca6c	Make the almost-median deterministic when it rounds down to 0 Former-commit-id: `74892a0ac9`	2016-07-29 12:34:56 -04:00
Rob Speer	9758c69ff0	Add Common Crawl data and more languages (#39) This changes the version from 1.4.2 to 1.5. Things done in this update include: * include Common Crawl; support 11 more languages * new frequency-merging strategy * New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list * Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all. * Add Korean tokenization, and include MeCab files in data * Remove marks from more languages * Deal with commas and cedillas in Turkish and Romanian Former-commit-id: `e6a8f028e3`	2016-07-28 19:23:17 -04:00
Rob Speer	f539eecdd6	wordfreq_builder: Document the extract_reddit pipeline Former-commit-id: `88626aafee`	2016-06-02 15:19:25 -04:00
Rob Speer	c5bdc3c6bd	limit Reddit data to just English Former-commit-id: `2276d97368`	2016-04-15 17:01:21 -04:00
Rob Speer	6f11256ed1	remove reddit_base_filename function Former-commit-id: `ced15d6eff`	2016-03-31 13:39:13 -04:00
Rob Speer	d924c8e2a5	use `path.stem` to make the Reddit filename prefix Former-commit-id: `ff1f0e4678`	2016-03-31 13:13:52 -04:00
Rob Speer	9adc5b92f8	rename max_size to max_words consistently Former-commit-id: `16059d3b9a`	2016-03-31 12:55:18 -04:00
Rob Speer	3e34dbdd38	Discard text detected as an uncommon language; add large German list Former-commit-id: `abbc295538`	2016-03-28 12:26:02 -04:00
Rob Speer	1c4a2077a4	oh look, more spam Former-commit-id: `08130908c7`	2016-03-24 18:42:47 -04:00
Rob Speer	cebf99f7ba	filter out downvoted Reddit posts Former-commit-id: `5b98794b86`	2016-03-24 18:05:13 -04:00
Rob Speer	fe6d8fea85	disregard Arabic Reddit spam Former-commit-id: `cfe68893fa`	2016-03-24 17:44:30 -04:00
Rob Speer	d2cc42936f	fix extraneous dot in intermediate filenames Former-commit-id: `6feae99381`	2016-03-24 16:52:44 -04:00
Rob Speer	c3364ef821	actually use the results of language-detection on Reddit Former-commit-id: `75a4a92110`	2016-03-24 16:27:24 -04:00
Rob Speer	a5fcfd100d	Merge remote-tracking branch 'origin/master' into big-list Conflicts: wordfreq_builder/wordfreq_builder/cli/merge_counts.py Former-commit-id: `164a5b1a05`	2016-03-24 14:11:44 -04:00
Rob Speer	670ab12f54	make max-words a real, documented parameter Former-commit-id: `178a8b1494`	2016-03-24 14:10:02 -04:00
Andrew Lin	c85146e156	Restore a missing comma. Former-commit-id: `38016cf62b`	2016-03-24 13:57:18 -04:00
Rob Speer	23c5c4adca	Add and document large wordlists Former-commit-id: `d79ee37da9`	2016-01-22 16:23:43 -05:00
Rob Speer	3b95d349e0	configuration that builds some larger lists Former-commit-id: `c1a12cebec`	2016-01-22 14:20:12 -05:00
Rob Speer	ee8cfb5a50	fix documentation in wordfreq_builder.tokenizers Former-commit-id: `8ddc19a5ca`	2016-01-13 15:18:12 -05:00
Rob Speer	56f830d678	reformat some argparse argument definitions Former-commit-id: `511fcb6f91`	2016-01-13 12:05:07 -05:00
Rob Speer	f4761029d0	build a bigger wordlist that we can optionally use Former-commit-id: `df8caaff7d`	2016-01-12 14:05:57 -05:00
Rob Speer	83bd019efe	fix usage text: one comment, not one tweet Former-commit-id: `8d9668d8ab`	2016-01-12 13:05:38 -05:00
Rob Speer	1d3485c855	Separate tokens with spaces, not line breaks, in intermediate files Former-commit-id: `115c74583e`	2016-01-12 12:59:18 -05:00
Rob Speer	6d62a8ff51	builder: Use an optional cutoff when merging counts This allows the Reddit-merging step to not use such a ludicrous amount of memory. Former-commit-id: `973caca253`	2015-12-15 14:44:34 -05:00
Rob Speer	4e985e3bca	gzip the intermediate step of Reddit word counting Former-commit-id: `9a5d9d66bb`	2015-12-09 13:30:08 -05:00
Rob Speer	dc94222d7d	no Thai because we can't tokenize it Former-commit-id: `95f53e295b`	2015-12-02 12:38:03 -05:00
Rob Speer	237fabb4c5	forgot about Italian Former-commit-id: `8f6cd0e57b`	2015-11-30 18:18:24 -05:00
Rob Speer	6caa9ca443	add tokenizer for Reddit Former-commit-id: `5ef807117d`	2015-11-30 18:16:54 -05:00
Rob Speer	d1b667909d	add word frequencies from the Reddit 2007-2015 corpus Former-commit-id: `b2d7546d2d`	2015-11-30 16:38:11 -05:00
Rob Speer	7435c8f57a	fix missing word in rules.ninja comment Former-commit-id: `9b1c4d66cd`	2015-09-24 17:56:06 -04:00
Rob Speer	88deef24f6	describe the use of `lang` in `read_values` Former-commit-id: `f224b8dbba`	2015-09-22 17:22:38 -04:00
Rob Speer	7cb310b28e	Make the jieba_deps comment make sense Former-commit-id: `7c12f2aca1`	2015-09-22 17:19:00 -04:00
Rob Speer	7f92557a58	Merge branch 'greek-and-turkish' into chinese-and-more Conflicts: README.md wordfreq_builder/wordfreq_builder/ninja.py Former-commit-id: `3cb3061e06`	2015-09-10 15:27:33 -04:00
Rob Speer	e3cc8eaea9	In ninja deps, remove 'startrow' as a variable Former-commit-id: `a4f8d11427`	2015-09-10 13:46:19 -04:00
Rob Speer	5701c1165d	fix spelling of Marc Former-commit-id: `2277ad3116`	2015-09-09 13:35:02 -04:00
Rob Speer	9c08442dc5	fixes based on code review notes Former-commit-id: `354555514f`	2015-09-09 13:10:18 -04:00
Rob Speer	0f9497d864	take out OpenSubtitles for Chinese Former-commit-id: `d9c44d5fcc`	2015-09-08 17:25:05 -04:00
Rob Speer	5e86394c4c	update comments in wordfreq_builder.config; remove unused 'version' Former-commit-id: `bc323eccaf`	2015-09-08 16:15:29 -04:00
Rob Speer	2dfaf7798d	sort Jieba wordlists consistently; update data files Former-commit-id: `0ab23f8a28`	2015-09-08 16:09:53 -04:00
Rob Speer	01332f1ed5	don't do language-specific tokenization in freqs_to_cBpack Tokenizing in the 'merge' step is sufficient. Former-commit-id: `bc8ebd23e9`	2015-09-08 14:46:04 -04:00
Rob Speer	86475d6b5f	actually fix logic of apostrophe-fixing Former-commit-id: `715361ca0d`	2015-09-08 13:50:34 -04:00
Rob Speer	6bd0979ad2	fix logic of apostrophe-fixing Former-commit-id: `c4c1af8213`	2015-09-08 13:47:58 -04:00
Rob Speer	8c3fb9f716	fix '--language' option definition Former-commit-id: `912171f8e7`	2015-09-08 13:27:20 -04:00
Rob Speer	67bb55988e	Avoid Chinese tokenizer when building Former-commit-id: `77a9b5c55b`	2015-09-08 12:59:03 -04:00
Rob Speer	11202ad7f5	language-specific frequency reading; fix 't in English Former-commit-id: `9071defb33`	2015-09-08 12:49:21 -04:00
Rob Speer	30237cf73d	Merge branch 'apostrophe-fix' into chinese-scripts Conflicts: wordfreq_builder/wordfreq_builder/word_counts.py Former-commit-id: `20f2828d0a`	2015-09-08 12:29:00 -04:00
Rob Speer	854247bf8b	WIP: fix apostrophe trimming Former-commit-id: `e39d345c4b`	2015-09-08 12:28:28 -04:00
Rob Speer	91cc82f76d	tokenize Chinese using jieba and our own frequencies Former-commit-id: `2327f2e4d6`	2015-09-05 03:16:56 -04:00
Rob Speer	e2a3758832	WIP: Traditional Chinese Former-commit-id: `7906a671ea`	2015-09-04 18:52:37 -04:00
Rob Speer	a555e5dc13	add Polish and Swedish, which have sufficient data Former-commit-id: `447d7e5134`	2015-09-04 17:10:40 -04:00


175 Commits