wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-27 02:48:51 +00:00

Author	SHA1	Message	Date
Rob Speer	74892a0ac9	Make the almost-median deterministic when it rounds down to 0	2016-07-29 12:34:56 -04:00
Rob Speer	1a16b0f84c	Code review fixes: avoid repeatedly constructing sets	2016-07-29 12:32:26 -04:00
Rob Speer	21246f881f	Revise multilingual tests	2016-07-29 12:19:12 -04:00
Rob Speer	e6a8f028e3	Add Common Crawl data and more languages (#39). This changes the version from 1.4.2 to 1.5. Things done in this update include: * include Common Crawl; support 11 more languages * new frequency-merging strategy * New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list * Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all. * Add Korean tokenization, and include MeCab files in data * Remove marks from more languages * Deal with commas and cedillas in Turkish and Romanian	2016-07-28 19:23:17 -04:00
Rob Speer	fec6eddcc3	Tokenization in Korean, plus abjad languages (#38) * Remove marks from more languages * Add Korean tokenization, and include MeCab files in data * add a Hebrew tokenization test * fix terminology in docstrings about abjad scripts * combine Japanese and Korean tokenization into the same function	2016-07-15 15:10:25 -04:00
Rob Speer	270f6c7ca6	Fix tokenization of SE Asian and South Asian scripts (#37)	2016-07-01 18:00:57 -04:00
Rob Speer	88626aafee	wordfreq_builder: Document the extract_reddit pipeline	2016-06-02 15:19:25 -04:00
Andrew Lin	3a6d985203	Merge pull request #35 from LuminosoInsight/big-list-test-fix: fix Arabic test, where 'lol' is no longer common	2016-05-11 17:20:01 -04:00
Rob Speer	da79dfb247	fix Arabic test, where 'lol' is no longer common	2016-05-11 17:01:47 -04:00
Andrew Lin	e7b34fb655	Merge pull request #34 from LuminosoInsight/big-list: wordfreq 1.4: some bigger wordlists, better use of language detection	2016-05-11 16:27:51 -04:00
Rob Speer	dcb77a552b	fix to README: we're only using Reddit in English	2016-05-11 15:38:29 -04:00
Rob Speer	2276d97368	limit Reddit data to just English	2016-04-15 17:01:21 -04:00
Rob Speer	ced15d6eff	remove reddit_base_filename function	2016-03-31 13:39:13 -04:00
Rob Speer	ff1f0e4678	use `path.stem` to make the Reddit filename prefix	2016-03-31 13:13:52 -04:00
Rob Speer	16059d3b9a	rename max_size to max_words consistently	2016-03-31 12:55:18 -04:00
Rob Speer	697842b3f9	fix table showing marginal Korean support	2016-03-30 15:11:13 -04:00
Rob Speer	ed32b278cc	make an example clearer with wordlist='large'	2016-03-30 15:08:32 -04:00
Rob Speer	a10c1d7ac0	update wordlists for new builder settings	2016-03-28 12:26:47 -04:00
Rob Speer	abbc295538	Discard text detected as an uncommon language; add large German list	2016-03-28 12:26:02 -04:00
Rob Speer	08130908c7	oh look, more spam	2016-03-24 18:42:47 -04:00
Rob Speer	5b98794b86	filter out downvoted Reddit posts	2016-03-24 18:05:13 -04:00
Rob Speer	cfe68893fa	disregard Arabic Reddit spam	2016-03-24 17:44:30 -04:00
Rob Speer	6feae99381	fix extraneous dot in intermediate filenames	2016-03-24 16:52:44 -04:00
Rob Speer	1df97a579e	bump version to 1.4	2016-03-24 16:29:29 -04:00
Rob Speer	75a4a92110	actually use the results of language-detection on Reddit	2016-03-24 16:27:24 -04:00
Rob Speer	164a5b1a05	Merge remote-tracking branch 'origin/master' into big-list. Conflicts: wordfreq_builder/wordfreq_builder/cli/merge_counts.py	2016-03-24 14:11:44 -04:00
Rob Speer	178a8b1494	make max-words a real, documented parameter	2016-03-24 14:10:02 -04:00
Rob Speer	7b539f9057	Merge pull request #33 from LuminosoInsight/bugfix: Restore a missing comma.	2016-03-24 13:59:50 -04:00
Andrew Lin	38016cf62b	Restore a missing comma.	2016-03-24 13:57:18 -04:00
Andrew Lin	84497429e1	Merge pull request #32 from LuminosoInsight/thai-fix: Leave Thai segments alone in the default regex	2016-03-10 11:57:44 -05:00
Rob Speer	4ec6b56faa	move Thai test to where it makes more sense	2016-03-10 11:56:15 -05:00
Rob Speer	07f16e6f03	Leave Thai segments alone in the default regex. Our regex already has a special case to leave Chinese and Japanese alone when an appropriate tokenizer for the language isn't being used, as Unicode's default segmentation would make every character into its own token. The same thing happens in Thai, and we don't even have an appropriate tokenizer for Thai, so I've added a similar fallback.	2016-02-22 14:32:59 -05:00
Rob Speer	d79ee37da9	Add and document large wordlists	2016-01-22 16:23:43 -05:00
Rob Speer	c1a12cebec	configuration that builds some larger lists	2016-01-22 14:20:12 -05:00
Rob Speer	9907948d11	add Zipf scale	2016-01-21 14:07:01 -05:00
slibs63	d18fee3d78	Merge pull request #30 from LuminosoInsight/add-reddit: Add English data from Reddit corpus	2016-01-14 15:52:39 -05:00
Rob Speer	8ddc19a5ca	fix documentation in wordfreq_builder.tokenizers	2016-01-13 15:18:12 -05:00
Rob Speer	511fcb6f91	reformat some argparse argument definitions	2016-01-13 12:05:07 -05:00
Rob Speer	df8caaff7d	build a bigger wordlist that we can optionally use	2016-01-12 14:05:57 -05:00
Rob Speer	8d9668d8ab	fix usage text: one comment, not one tweet	2016-01-12 13:05:38 -05:00
Rob Speer	115c74583e	Separate tokens with spaces, not line breaks, in intermediate files	2016-01-12 12:59:18 -05:00
Andrew Lin	f30efebba0	Merge pull request #31 from LuminosoInsight/use_encoding: Specify encoding when dealing with files	2015-12-23 16:13:47 -05:00
Sara Jewett	37f9e12b93	Specify encoding when dealing with files	2015-12-23 15:49:13 -05:00
Rob Speer	973caca253	builder: Use an optional cutoff when merging counts. This allows the Reddit-merging step to not use such a ludicrous amount of memory.	2015-12-15 14:44:34 -05:00
Rob Speer	9a5d9d66bb	gzip the intermediate step of Reddit word counting	2015-12-09 13:30:08 -05:00
Rob Speer	95f53e295b	no Thai because we can't tokenize it	2015-12-02 12:38:03 -05:00
Rob Speer	8f6cd0e57b	forgot about Italian	2015-11-30 18:18:24 -05:00
Rob Speer	5ef807117d	add tokenizer for Reddit	2015-11-30 18:16:54 -05:00
Rob Speer	2dcf368481	rebuild data files	2015-11-30 17:06:39 -05:00
Rob Speer	b2d7546d2d	add word frequencies from the Reddit 2007-2015 corpus	2015-11-30 16:38:11 -05:00

452 Commits