wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Robyn Speer	94712c8312	Look for MeCab dictionaries in various places besides this package Former-commit-id: `afe6537994`	2016-07-29 17:27:15 -04:00
Robyn Speer	ce5a91d732	Make the almost-median deterministic when it rounds down to 0 Former-commit-id: `74892a0ac9`	2016-07-29 12:34:56 -04:00
Robyn Speer	15667ea023	Code review fixes: avoid repeatedly constructing sets Former-commit-id: `1a16b0f84c`	2016-07-29 12:32:26 -04:00
Robyn Speer	68c6d95131	Revise multilingual tests Former-commit-id: `21246f881f`	2016-07-29 12:19:12 -04:00
Robyn Speer	2a41d4dc5e	Add Common Crawl data and more languages (#39) This changes the version from 1.4.2 to 1.5. Things done in this update include: * include Common Crawl; support 11 more languages * new frequency-merging strategy * New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list * Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all. * Add Korean tokenization, and include MeCab files in data * Remove marks from more languages * Deal with commas and cedillas in Turkish and Romanian Former-commit-id: `e6a8f028e3`	2016-07-28 19:23:17 -04:00
Robyn Speer	0a2bfb2710	Tokenization in Korean, plus abjad languages (#38) * Remove marks from more languages * Add Korean tokenization, and include MeCab files in data * add a Hebrew tokenization test * fix terminology in docstrings about abjad scripts * combine Japanese and Korean tokenization into the same function Former-commit-id: `fec6eddcc3`	2016-07-15 15:10:25 -04:00
Robyn Speer	3155cf27e6	Fix tokenization of SE Asian and South Asian scripts (#37) Former-commit-id: `270f6c7ca6`	2016-07-01 18:00:57 -04:00
Robyn Speer	8d09b68d37	wordfreq_builder: Document the extract_reddit pipeline Former-commit-id: `88626aafee`	2016-06-02 15:19:25 -04:00
Andrew Lin	046ca4cda3	Merge pull request #35 from LuminosoInsight/big-list-test-fix fix Arabic test, where 'lol' is no longer common Former-commit-id: `3a6d985203`	2016-05-11 17:20:01 -04:00
Robyn Speer	c72326e4c0	fix Arabic test, where 'lol' is no longer common Former-commit-id: `da79dfb247`	2016-05-11 17:01:47 -04:00
Andrew Lin	7a55e0ed86	Merge pull request #34 from LuminosoInsight/big-list wordfreq 1.4: some bigger wordlists, better use of language detection Former-commit-id: `e7b34fb655`	2016-05-11 16:27:51 -04:00
Robyn Speer	1ac6795709	fix to README: we're only using Reddit in English Former-commit-id: `dcb77a552b`	2016-05-11 15:38:29 -04:00
Robyn Speer	a0d93e0ce8	limit Reddit data to just English Former-commit-id: `2276d97368`	2016-04-15 17:01:21 -04:00
Robyn Speer	5a37cc22c7	remove reddit_base_filename function Former-commit-id: `ced15d6eff`	2016-03-31 13:39:13 -04:00
Robyn Speer	797895047a	use `path.stem` to make the Reddit filename prefix Former-commit-id: `ff1f0e4678`	2016-03-31 13:13:52 -04:00
Robyn Speer	a2bc90e430	rename max_size to max_words consistently Former-commit-id: `16059d3b9a`	2016-03-31 12:55:18 -04:00
Robyn Speer	a9a4483ca3	fix table showing marginal Korean support Former-commit-id: `697842b3f9`	2016-03-30 15:11:13 -04:00
Robyn Speer	36885b5479	make an example clearer with wordlist='large' Former-commit-id: `ed32b278cc`	2016-03-30 15:08:32 -04:00
Robyn Speer	cecf852040	update wordlists for new builder settings Former-commit-id: `a10c1d7ac0`	2016-03-28 12:26:47 -04:00
Robyn Speer	0c7527140c	Discard text detected as an uncommon language; add large German list Former-commit-id: `abbc295538`	2016-03-28 12:26:02 -04:00
Robyn Speer	aa7802b552	oh look, more spam Former-commit-id: `08130908c7`	2016-03-24 18:42:47 -04:00
Robyn Speer	2840ca55aa	filter out downvoted Reddit posts Former-commit-id: `5b98794b86`	2016-03-24 18:05:13 -04:00
Robyn Speer	16841d4b0c	disregard Arabic Reddit spam Former-commit-id: `cfe68893fa`	2016-03-24 17:44:30 -04:00
Robyn Speer	034d8f540b	fix extraneous dot in intermediate filenames Former-commit-id: `6feae99381`	2016-03-24 16:52:44 -04:00
Robyn Speer	460fbb84fd	bump version to 1.4 Former-commit-id: `1df97a579e`	2016-03-24 16:29:29 -04:00
Robyn Speer	969a024dea	actually use the results of language-detection on Reddit Former-commit-id: `75a4a92110`	2016-03-24 16:27:24 -04:00
Robyn Speer	fbc19995ab	Merge remote-tracking branch 'origin/master' into big-list Conflicts: wordfreq_builder/wordfreq_builder/cli/merge_counts.py Former-commit-id: `164a5b1a05`	2016-03-24 14:11:44 -04:00
Robyn Speer	f493d0eec4	make max-words a real, documented parameter Former-commit-id: `178a8b1494`	2016-03-24 14:10:02 -04:00
Robyn Speer	298cb69353	Merge pull request #33 from LuminosoInsight/bugfix Restore a missing comma. Former-commit-id: `7b539f9057`	2016-03-24 13:59:50 -04:00
Andrew Lin	1942bc690f	Restore a missing comma. Former-commit-id: `38016cf62b`	2016-03-24 13:57:18 -04:00
Andrew Lin	68e7846d50	Merge pull request #32 from LuminosoInsight/thai-fix Leave Thai segments alone in the default regex Former-commit-id: `84497429e1`	2016-03-10 11:57:44 -05:00
Robyn Speer	f25985379c	move Thai test to where it makes more sense Former-commit-id: `4ec6b56faa`	2016-03-10 11:56:15 -05:00
Robyn Speer	51e260b713	Leave Thai segments alone in the default regex Our regex already has a special case to leave Chinese and Japanese alone when an appropriate tokenizer for the language isn't being used, as Unicode's default segmentation would make every character into its own token. The same thing happens in Thai, and we don't even have an appropriate tokenizer for Thai, so I've added a similar fallback. Former-commit-id: `07f16e6f03`	2016-02-22 14:32:59 -05:00
Robyn Speer	6344b38194	Add and document large wordlists Former-commit-id: `d79ee37da9`	2016-01-22 16:23:43 -05:00
Robyn Speer	12e779fc79	configuration that builds some larger lists Former-commit-id: `c1a12cebec`	2016-01-22 14:20:12 -05:00
Robyn Speer	83559a53d4	add Zipf scale Former-commit-id: `9907948d11`	2016-01-21 14:07:01 -05:00
slibs63	927d4f45a4	Merge pull request #30 from LuminosoInsight/add-reddit Add English data from Reddit corpus Former-commit-id: `d18fee3d78`	2016-01-14 15:52:39 -05:00
Robyn Speer	6eca3cff5a	fix documentation in wordfreq_builder.tokenizers Former-commit-id: `8ddc19a5ca`	2016-01-13 15:18:12 -05:00
Robyn Speer	95cdf41fe8	reformat some argparse argument definitions Former-commit-id: `511fcb6f91`	2016-01-13 12:05:07 -05:00
Robyn Speer	738243e244	build a bigger wordlist that we can optionally use Former-commit-id: `df8caaff7d`	2016-01-12 14:05:57 -05:00
Robyn Speer	2069e30c89	fix usage text: one comment, not one tweet Former-commit-id: `8d9668d8ab`	2016-01-12 13:05:38 -05:00
Robyn Speer	883aa5baeb	Separate tokens with spaces, not line breaks, in intermediate files Former-commit-id: `115c74583e`	2016-01-12 12:59:18 -05:00
Andrew Lin	eae7b2752e	Merge pull request #31 from LuminosoInsight/use_encoding Specify encoding when dealing with files Former-commit-id: `f30efebba0`	2015-12-23 16:13:47 -05:00
Sara Jewett	42d209cbe2	Specify encoding when dealing with files Former-commit-id: `37f9e12b93`	2015-12-23 15:49:13 -05:00
Robyn Speer	7d1719cfb4	builder: Use an optional cutoff when merging counts This allows the Reddit-merging step to not use such a ludicrous amount of memory. Former-commit-id: `973caca253`	2015-12-15 14:44:34 -05:00
Robyn Speer	f5e09f3f3d	gzip the intermediate step of Reddit word counting Former-commit-id: `9a5d9d66bb`	2015-12-09 13:30:08 -05:00
Robyn Speer	682e08fee2	no Thai because we can't tokenize it Former-commit-id: `95f53e295b`	2015-12-02 12:38:03 -05:00
Robyn Speer	064ee22a33	forgot about Italian Former-commit-id: `8f6cd0e57b`	2015-11-30 18:18:24 -05:00
Robyn Speer	ab8c2e2331	add tokenizer for Reddit Former-commit-id: `5ef807117d`	2015-11-30 18:16:54 -05:00
Robyn Speer	23949a4512	rebuild data files Former-commit-id: `2dcf368481`	2015-11-30 17:06:39 -05:00


503 Commits