wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-24 01:41:39 +00:00

Author	SHA1	Message	Date
Robyn Speer	2a41d4dc5e	Add Common Crawl data and more languages (#39) This changes the version from 1.4.2 to 1.5. Things done in this update include: * include Common Crawl; support 11 more languages * new frequency-merging strategy * New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list * Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all. * Add Korean tokenization, and include MeCab files in data * Remove marks from more languages * Deal with commas and cedillas in Turkish and Romanian Former-commit-id: `e6a8f028e3`	2016-07-28 19:23:17 -04:00
Robyn Speer	8d09b68d37	wordfreq_builder: Document the extract_reddit pipeline Former-commit-id: `88626aafee`	2016-06-02 15:19:25 -04:00
Robyn Speer	2840ca55aa	filter out downvoted Reddit posts Former-commit-id: `5b98794b86`	2016-03-24 18:05:13 -04:00
Robyn Speer	969a024dea	actually use the results of language-detection on Reddit Former-commit-id: `75a4a92110`	2016-03-24 16:27:24 -04:00
Robyn Speer	738243e244	build a bigger wordlist that we can optionally use Former-commit-id: `df8caaff7d`	2016-01-12 14:05:57 -05:00
Robyn Speer	7d1719cfb4	builder: Use an optional cutoff when merging counts This allows the Reddit-merging step to not use such a ludicrous amount of memory. Former-commit-id: `973caca253`	2015-12-15 14:44:34 -05:00
Robyn Speer	f5e09f3f3d	gzip the intermediate step of Reddit word counting Former-commit-id: `9a5d9d66bb`	2015-12-09 13:30:08 -05:00
Robyn Speer	6d2709f064	add word frequencies from the Reddit 2007-2015 corpus Former-commit-id: `b2d7546d2d`	2015-11-30 16:38:11 -05:00
Robyn Speer	7494ae27a7	fix missing word in rules.ninja comment Former-commit-id: `9b1c4d66cd`	2015-09-24 17:56:06 -04:00
Robyn Speer	4aef1dc338	don't do language-specific tokenization in freqs_to_cBpack Tokenizing in the 'merge' step is sufficient. Former-commit-id: `bc8ebd23e9`	2015-09-08 14:46:04 -04:00
Robyn Speer	3fa14ded28	language-specific frequency reading; fix 't in English Former-commit-id: `9071defb33`	2015-09-08 12:49:21 -04:00
Robyn Speer	a4554fb87c	tokenize Chinese using jieba and our own frequencies Former-commit-id: `2327f2e4d6`	2015-09-05 03:16:56 -04:00
Robyn Speer	7d1c2e72e4	WIP: Traditional Chinese Former-commit-id: `7906a671ea`	2015-09-04 18:52:37 -04:00
Robyn Speer	d0ada70355	add more SUBTLEX and fix its build rules Former-commit-id: `34474939f2`	2015-09-04 12:37:35 -04:00
Robyn Speer	76c751652e	refer to merge_freqs command correctly Former-commit-id: `40d82541ba`	2015-09-03 23:25:46 -04:00
Robyn Speer	f66d03b1b9	Add SUBTLEX as a source of English and Chinese data Meanwhile, fix up the dependency graph thingy. It's actually kind of legible now. Former-commit-id: `2d58ba94f2`	2015-09-03 18:13:13 -04:00
Joshua Chin	f9742c94ca	reordered command line args Former-commit-id: `6453d864c4`	2015-07-22 10:04:14 -04:00
Joshua Chin	34504eed80	fixed rules.ninja Former-commit-id: `c5f82ecac1`	2015-07-20 17:20:29 -04:00
Joshua Chin	c2f3928433	fix arabic tokens Former-commit-id: `11a1c51321`	2015-07-17 15:52:12 -04:00
Joshua Chin	a340a15870	removed mkdir -p for many cases Former-commit-id: `98a7a8093b`	2015-07-17 14:45:22 -04:00
Robyn Speer	deed2f767c	remove wiki2tokens and tokenize_wikipedia These components are no longer necessary. Wikipedia output can and should be tokenized with the standard tokenizer, instead of the almost-equivalent one in the Nim code.	2015-06-30 15:28:01 -04:00
Robyn Speer	f17a04aa84	fix comment and whitespace involving tokenize_twitter	2015-06-30 15:18:37 -04:00
Robyn Speer	91d6edd55b	Switch to a centibel scale, add a header to the data	2015-06-22 17:38:13 -04:00
Joshua Chin	6f0a082007	removed intermediate twitter file rules	2015-06-16 17:28:09 -04:00
Robyn Speer	a5954d14df	give mecab a larger buffer	2015-05-26 19:34:46 -04:00
Robyn Speer	4f738ad78c	correct a Leeds bug; add some comments to rules.ninja	2015-05-26 18:08:04 -04:00
Robyn Speer	4513fed60c	add Google Books data for English	2015-05-11 18:44:28 -04:00
Robyn Speer	aa55e32450	Makefile should only be needed for bootstrapping Ninja	2015-05-08 12:39:31 -04:00
Robyn Speer	a5f6113824	a reasonably complete build process	2015-05-07 19:38:33 -04:00
Robyn Speer	04bde8d617	WIP on more build steps	2015-05-07 16:49:53 -04:00
Robyn Speer	7c09fec692	add rules to count wikipedia tokens	2015-05-05 15:21:24 -04:00
Robyn Speer	c55e44e486	fix the 'count' ninja rule	2015-05-05 14:06:13 -04:00
Robyn Speer	59409266ca	add and adjust some build steps - more build steps for Wikipedia - rename 'tokenize_twitter' to 'pretokenize_twitter' to indicate that the results are preliminary	2015-05-05 13:59:21 -04:00
Robyn Speer	efcf436112	WIP on new build system	2015-04-30 16:24:28 -04:00
Robyn Speer	76ea7f1bd5	define some ninja rules	2015-04-29 17:13:58 -04:00
Robyn Speer	524f7c760b	WIP on Ninja build automation	2015-04-29 15:59:06 -04:00

36 Commits