Commit Graph

174 Commits

Robyn Speer
2a41d4dc5e Add Common Crawl data and more languages (#39)
This bumps the version from 1.4.2 to 1.5. This update includes:

* Include Common Crawl; support 11 more languages

* New frequency-merging strategy

* New sources: Chinese from Wikipedia (mostly Traditional), a large Dutch list

* Remove lower-quality sources, namely Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. This drops Ukrainian as an available language, and means Greek is no longer a 'large' language.

* Add Korean tokenization, and include MeCab files in data

* Remove marks from more languages

* Deal with commas and cedillas in Turkish and Romanian



Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
Robyn Speer
8d09b68d37 wordfreq_builder: Document the extract_reddit pipeline
Former-commit-id: 88626aafee
2016-06-02 15:19:25 -04:00
Robyn Speer
a0d93e0ce8 limit Reddit data to just English
Former-commit-id: 2276d97368
2016-04-15 17:01:21 -04:00
Robyn Speer
5a37cc22c7 remove reddit_base_filename function
Former-commit-id: ced15d6eff
2016-03-31 13:39:13 -04:00
Robyn Speer
797895047a use path.stem to make the Reddit filename prefix
Former-commit-id: ff1f0e4678
2016-03-31 13:13:52 -04:00
Robyn Speer
a2bc90e430 rename max_size to max_words consistently
Former-commit-id: 16059d3b9a
2016-03-31 12:55:18 -04:00
Robyn Speer
0c7527140c Discard text detected as an uncommon language; add large German list
Former-commit-id: abbc295538
2016-03-28 12:26:02 -04:00
Robyn Speer
aa7802b552 oh look, more spam
Former-commit-id: 08130908c7
2016-03-24 18:42:47 -04:00
Robyn Speer
2840ca55aa filter out downvoted Reddit posts
Former-commit-id: 5b98794b86
2016-03-24 18:05:13 -04:00
Robyn Speer
16841d4b0c disregard Arabic Reddit spam
Former-commit-id: cfe68893fa
2016-03-24 17:44:30 -04:00
Robyn Speer
034d8f540b fix extraneous dot in intermediate filenames
Former-commit-id: 6feae99381
2016-03-24 16:52:44 -04:00
Robyn Speer
969a024dea actually use the results of language-detection on Reddit
Former-commit-id: 75a4a92110
2016-03-24 16:27:24 -04:00
Robyn Speer
fbc19995ab Merge remote-tracking branch 'origin/master' into big-list
Conflicts:
	wordfreq_builder/wordfreq_builder/cli/merge_counts.py

Former-commit-id: 164a5b1a05
2016-03-24 14:11:44 -04:00
Robyn Speer
f493d0eec4 make max-words a real, documented parameter
Former-commit-id: 178a8b1494
2016-03-24 14:10:02 -04:00
Andrew Lin
1942bc690f Restore a missing comma.
Former-commit-id: 38016cf62b
2016-03-24 13:57:18 -04:00
Robyn Speer
6344b38194 Add and document large wordlists
Former-commit-id: d79ee37da9
2016-01-22 16:23:43 -05:00
Robyn Speer
12e779fc79 configuration that builds some larger lists
Former-commit-id: c1a12cebec
2016-01-22 14:20:12 -05:00
Robyn Speer
6eca3cff5a fix documentation in wordfreq_builder.tokenizers
Former-commit-id: 8ddc19a5ca
2016-01-13 15:18:12 -05:00
Robyn Speer
95cdf41fe8 reformat some argparse argument definitions
Former-commit-id: 511fcb6f91
2016-01-13 12:05:07 -05:00
Robyn Speer
738243e244 build a bigger wordlist that we can optionally use
Former-commit-id: df8caaff7d
2016-01-12 14:05:57 -05:00
Robyn Speer
2069e30c89 fix usage text: one comment, not one tweet
Former-commit-id: 8d9668d8ab
2016-01-12 13:05:38 -05:00
Robyn Speer
883aa5baeb Separate tokens with spaces, not line breaks, in intermediate files
Former-commit-id: 115c74583e
2016-01-12 12:59:18 -05:00
Robyn Speer
7d1719cfb4 builder: Use an optional cutoff when merging counts
This keeps the Reddit-merging step from using a ludicrous amount
of memory.


Former-commit-id: 973caca253
2015-12-15 14:44:34 -05:00
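The cutoff described in the commit above can be sketched as follows: merge per-file word counts, then drop entries below the cutoff so that huge corpora like Reddit don't accumulate millions of rare entries in memory. The function name and signature are illustrative, not wordfreq_builder's actual API:

```python
from collections import Counter

def merge_counts(count_dicts, cutoff=0):
    """Merge word-count mappings; optionally discard counts below `cutoff`."""
    merged = Counter()
    for counts in count_dicts:
        merged.update(counts)  # Counter.update adds counts, not replaces
    if cutoff > 0:
        merged = Counter({word: n for word, n in merged.items() if n >= cutoff})
    return merged
```

For example, `merge_counts([{'the': 5, 'rare': 1}, {'the': 3}], cutoff=2)` yields `Counter({'the': 8})`, with the rare entry discarded.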
Robyn Speer
f5e09f3f3d gzip the intermediate step of Reddit word counting
Former-commit-id: 9a5d9d66bb
2015-12-09 13:30:08 -05:00
Robyn Speer
682e08fee2 no Thai because we can't tokenize it
Former-commit-id: 95f53e295b
2015-12-02 12:38:03 -05:00
Robyn Speer
064ee22a33 forgot about Italian
Former-commit-id: 8f6cd0e57b
2015-11-30 18:18:24 -05:00
Robyn Speer
ab8c2e2331 add tokenizer for Reddit
Former-commit-id: 5ef807117d
2015-11-30 18:16:54 -05:00
Robyn Speer
6d2709f064 add word frequencies from the Reddit 2007-2015 corpus
Former-commit-id: b2d7546d2d
2015-11-30 16:38:11 -05:00
Robyn Speer
7494ae27a7 fix missing word in rules.ninja comment
Former-commit-id: 9b1c4d66cd
2015-09-24 17:56:06 -04:00
Robyn Speer
d215f79ea3 describe the use of lang in read_values
Former-commit-id: f224b8dbba
2015-09-22 17:22:38 -04:00
Robyn Speer
e6e29a1c03 Make the jieba_deps comment make sense
Former-commit-id: 7c12f2aca1
2015-09-22 17:19:00 -04:00
Robyn Speer
f2be213933 Merge branch 'greek-and-turkish' into chinese-and-more
Conflicts:
	README.md
	wordfreq_builder/wordfreq_builder/ninja.py

Former-commit-id: 3cb3061e06
2015-09-10 15:27:33 -04:00
Robyn Speer
c5d5b0b1fe In ninja deps, remove 'startrow' as a variable
Former-commit-id: a4f8d11427
2015-09-10 13:46:19 -04:00
Robyn Speer
acddc3ca05 fix spelling of Marc
Former-commit-id: 2277ad3116
2015-09-09 13:35:02 -04:00
Robyn Speer
872556f7bb fixes based on code review notes
Former-commit-id: 354555514f
2015-09-09 13:10:18 -04:00
Robyn Speer
1d3521dfda take out OpenSubtitles for Chinese
Former-commit-id: d9c44d5fcc
2015-09-08 17:25:05 -04:00
Robyn Speer
59363c8c44 update comments in wordfreq_builder.config; remove unused 'version'
Former-commit-id: bc323eccaf
2015-09-08 16:15:29 -04:00
Robyn Speer
48f9d4520c sort Jieba wordlists consistently; update data files
Former-commit-id: 0ab23f8a28
2015-09-08 16:09:53 -04:00
Robyn Speer
4aef1dc338 don't do language-specific tokenization in freqs_to_cBpack
Tokenizing in the 'merge' step is sufficient.


Former-commit-id: bc8ebd23e9
2015-09-08 14:46:04 -04:00
Robyn Speer
64b0b76ee1 actually fix logic of apostrophe-fixing
Former-commit-id: 715361ca0d
2015-09-08 13:50:34 -04:00
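The apostrophe-fixing in the commits above concerns tokens that carry quote-like apostrophes on their edges while words such as "don't" keep theirs internally. A minimal sketch of edge-trimming under that assumption (an illustration, not wordfreq's exact rule; the repeated fix commits suggest the real logic handles trickier English cases):

```python
# Strip straight and curly apostrophes only from the edges of a token,
# preserving word-internal apostrophes as in "don't" or "o'clock".
APOSTROPHES = "'\u2019"

def trim_apostrophes(token: str) -> str:
    return token.strip(APOSTROPHES)
```

For example, `trim_apostrophes("'hello'")` gives `"hello"`, while `"don't"` is left untouched.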
Robyn Speer
d6d2eac920 fix logic of apostrophe-fixing
Former-commit-id: c4c1af8213
2015-09-08 13:47:58 -04:00
Robyn Speer
523806d6db fix '--language' option definition
Former-commit-id: 912171f8e7
2015-09-08 13:27:20 -04:00
Robyn Speer
099d90b700 Avoid Chinese tokenizer when building
Former-commit-id: 77a9b5c55b
2015-09-08 12:59:03 -04:00
Robyn Speer
3fa14ded28 language-specific frequency reading; fix 't in English
Former-commit-id: 9071defb33
2015-09-08 12:49:21 -04:00
Robyn Speer
1b35ff6b4c Merge branch 'apostrophe-fix' into chinese-scripts
Conflicts:
	wordfreq_builder/wordfreq_builder/word_counts.py

Former-commit-id: 20f2828d0a
2015-09-08 12:29:00 -04:00
Robyn Speer
319c3abaab WIP: fix apostrophe trimming
Former-commit-id: e39d345c4b
2015-09-08 12:28:28 -04:00
Robyn Speer
a4554fb87c tokenize Chinese using jieba and our own frequencies
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Robyn Speer
7d1c2e72e4 WIP: Traditional Chinese
Former-commit-id: 7906a671ea
2015-09-04 18:52:37 -04:00
Robyn Speer
5b9b2d2d02 add Polish and Swedish, which have sufficient data
Former-commit-id: 447d7e5134
2015-09-04 17:10:40 -04:00
Robyn Speer
a75a95658b We can put the cutoff back now
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.


Former-commit-id: 5c7a7ea83e
2015-09-04 16:16:52 -04:00