Commit Graph

36 Commits

Rob Speer
9758c69ff0 Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5. Things done in this update include:

* include Common Crawl; support 11 more languages

* new frequency-merging strategy

* New sources: Chinese from Wikipedia (mostly Traditional), a large Dutch wordlist

* Remove some bad sources: Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is dropped as an available language, and Greek is no longer a 'large' language after all.

* Add Korean tokenization, and include MeCab files in data

* Remove marks from more languages

* Deal with commas and cedillas in Turkish and Romanian



Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
Rob Speer
f539eecdd6 wordfreq_builder: Document the extract_reddit pipeline
Former-commit-id: 88626aafee
2016-06-02 15:19:25 -04:00
Rob Speer
cebf99f7ba filter out downvoted Reddit posts
Former-commit-id: 5b98794b86
2016-03-24 18:05:13 -04:00
Rob Speer
c3364ef821 actually use the results of language-detection on Reddit
Former-commit-id: 75a4a92110
2016-03-24 16:27:24 -04:00
Rob Speer
f4761029d0 build a bigger wordlist that we can optionally use
Former-commit-id: df8caaff7d
2016-01-12 14:05:57 -05:00
Rob Speer
6d62a8ff51 builder: Use an optional cutoff when merging counts
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.


Former-commit-id: 973caca253
2015-12-15 14:44:34 -05:00
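The cutoff idea in the commit above can be sketched in a few lines: counts below the cutoff are discarded as each input is read, so rare junk tokens never enter the merged dictionary. This is a minimal illustration, not wordfreq_builder's actual code; the tab-separated input format and function names are assumptions.

```python
from collections import Counter

def read_counts(lines, cutoff=0):
    """Parse 'word<TAB>count' lines, skipping counts below the cutoff."""
    counts = Counter()
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        count = int(count)
        if count >= cutoff:
            counts[word] += count
    return counts

def merge_counts(sources, cutoff=0):
    """Merge several count files, applying the cutoff to each as it is read,
    so the merged dictionary (and peak memory use) stays bounded."""
    merged = Counter()
    for lines in sources:
        merged.update(read_counts(lines, cutoff))
    return merged
```

With `cutoff=0` this is a plain merge; a positive cutoff trades a small amount of accuracy on rare words for a much smaller working set.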
Rob Speer
4e985e3bca gzip the intermediate step of Reddit word counting
Former-commit-id: 9a5d9d66bb
2015-12-09 13:30:08 -05:00
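Gzipping an intermediate word-count file is straightforward with Python's standard gzip module; this sketch uses an illustrative filename and tab-separated layout, not necessarily the pipeline's actual format.

```python
import gzip
import os
import tempfile

# Hypothetical counts; the real pipeline streams counts from Reddit comments.
counts = {"the": 412, "cat": 7, "sat": 3}
path = os.path.join(tempfile.gettempdir(), "counts.txt.gz")

# Write the intermediate counts as a gzipped, tab-separated text file.
with gzip.open(path, "wt", encoding="utf-8") as f:
    for word, count in counts.items():
        f.write(f"{word}\t{count}\n")

# Reading it back is symmetric: gzip.open decompresses transparently.
with gzip.open(path, "rt", encoding="utf-8") as f:
    restored = {word: int(count) for word, count in
                (line.rstrip("\n").split("\t") for line in f)}
```

Word-count files compress very well, since they are repetitive text, so gzipping the intermediate step saves a large amount of disk space at little cost.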
Rob Speer
d1b667909d add word frequencies from the Reddit 2007-2015 corpus
Former-commit-id: b2d7546d2d
2015-11-30 16:38:11 -05:00
Rob Speer
7435c8f57a fix missing word in rules.ninja comment
Former-commit-id: 9b1c4d66cd
2015-09-24 17:56:06 -04:00
Rob Speer
01332f1ed5 don't do language-specific tokenization in freqs_to_cBpack
Tokenizing in the 'merge' step is sufficient.


Former-commit-id: bc8ebd23e9
2015-09-08 14:46:04 -04:00
Rob Speer
11202ad7f5 language-specific frequency reading; fix 't in English
Former-commit-id: 9071defb33
2015-09-08 12:49:21 -04:00
Rob Speer
91cc82f76d tokenize Chinese using jieba and our own frequencies
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Rob Speer
e2a3758832 WIP: Traditional Chinese
Former-commit-id: 7906a671ea
2015-09-04 18:52:37 -04:00
Rob Speer
b1d158ab41 add more SUBTLEX and fix its build rules
Former-commit-id: 34474939f2
2015-09-04 12:37:35 -04:00
Rob Speer
ad4b12bee9 refer to merge_freqs command correctly
Former-commit-id: 40d82541ba
2015-09-03 23:25:46 -04:00
Rob Speer
cb5b696ffa Add SUBTLEX as a source of English and Chinese data
Meanwhile, fix up the dependency graph thingy. It's actually kind of
legible now.


Former-commit-id: 2d58ba94f2
2015-09-03 18:13:13 -04:00
Joshua Chin
78324e74eb reordered command line args
Former-commit-id: 6453d864c4
2015-07-22 10:04:14 -04:00
Joshua Chin
0a2f2877af fixed rules.ninja
Former-commit-id: c5f82ecac1
2015-07-20 17:20:29 -04:00
Joshua Chin
5c7e0dd0dd fix arabic tokens
Former-commit-id: 11a1c51321
2015-07-17 15:52:12 -04:00
Joshua Chin
631a5f1b71 removed mkdir -p for many cases
Former-commit-id: 98a7a8093b
2015-07-17 14:45:22 -04:00
Rob Speer
4771c12814 remove wiki2tokens and tokenize_wikipedia
These components are no longer necessary. Wikipedia output can and
should be tokenized with the standard tokenizer, instead of the
almost-equivalent one in the Nim code.
2015-06-30 15:28:01 -04:00
Rob Speer
9a2855394d fix comment and whitespace involving tokenize_twitter 2015-06-30 15:18:37 -04:00
Rob Speer
f305679caf Switch to a centibel scale, add a header to the data 2015-06-22 17:38:13 -04:00
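A centibel is one hundredth of a log10 (bel) unit, so the conversion behind a centibel scale like the one in freqs_to_cBpack can be sketched as follows; the function names are hypothetical.

```python
import math

def freq_to_cB(freq):
    """Convert a word frequency (0 < freq <= 1) to centibels:
    100 times the log10 of the frequency, rounded to an integer."""
    return round(math.log10(freq) * 100)

def cB_to_freq(cB):
    """Invert the conversion: a centibel value back to a frequency."""
    return 10 ** (cB / 100)
```

Storing frequencies as small negative integers on a log scale, instead of floats, is what makes a compact packed format possible.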
Joshua Chin
da93bc89c2 removed intermediate twitter file rules 2015-06-16 17:28:09 -04:00
Rob Speer
536c15fbdb give mecab a larger buffer 2015-05-26 19:34:46 -04:00
Rob Speer
ffd352f148 correct a Leeds bug; add some comments to rules.ninja 2015-05-26 18:08:04 -04:00
Rob Speer
50ff85ce19 add Google Books data for English 2015-05-11 18:44:28 -04:00
Rob Speer
d6cc90792f Makefile should only be needed for bootstrapping Ninja 2015-05-08 12:39:31 -04:00
Rob Speer
abb0e059c8 a reasonably complete build process 2015-05-07 19:38:33 -04:00
Rob Speer
d2f9c60776 WIP on more build steps 2015-05-07 16:49:53 -04:00
Rob Speer
16928ed182 add rules to count wikipedia tokens 2015-05-05 15:21:24 -04:00
Rob Speer
bd579e2319 fix the 'count' ninja rule 2015-05-05 14:06:13 -04:00
Rob Speer
5787b6bb73 add and adjust some build steps
- more build steps for Wikipedia
- rename 'tokenize_twitter' to 'pretokenize_twitter' to indicate that
  the results are preliminary
2015-05-05 13:59:21 -04:00
Rob Speer
5437bb4e85 WIP on new build system 2015-04-30 16:24:28 -04:00
Rob Speer
4dae2f8caf define some ninja rules 2015-04-29 17:13:58 -04:00
Rob Speer
14e445a937 WIP on Ninja build automation 2015-04-29 15:59:06 -04:00