wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Robyn Speer	ff5a8f2a65	add tests for French apostrophe tokenization	2016-12-05 18:54:51 -05:00
Robyn Speer	596368ac6e	fix tokenization of words like "l'heure"	2016-12-05 18:54:51 -05:00
Lance Nathan	7f26270644	Merge pull request #45 from LuminosoInsight/citation Describe how to cite wordfreq	2016-09-12 18:34:55 -04:00
Robyn Speer	7fabbfef31	Describe how to cite wordfreq This citation was generated from our GitHub repository by Zenodo. Their defaults indicate that anyone who's ever accepted a PR for the code should go on the author line, and that sounds fine to me.	2016-09-12 18:24:55 -04:00
Robyn Speer	c0fbd844f6	Add a changelog	2016-08-22 12:41:39 -04:00
Andrew Lin	976c8df0fd	Merge pull request #44 from LuminosoInsight/mecab-loading-fix Allow MeCab to work in Japanese or Korean without the other	2016-08-19 11:59:44 -04:00
Robyn Speer	aa880bcd84	bump version to 1.5.1	2016-08-19 11:42:29 -04:00
Robyn Speer	e1d6e7d96f	Allow MeCab to work in Japanese or Korean without the other	2016-08-19 11:41:35 -04:00
Andrew Lin	e4b32afa18	Merge pull request #42 from LuminosoInsight/mecab-finder Look for MeCab dictionaries in various places besides this package Former-commit-id: 6f97d5ac099e9bb1a13ccfe114e0f4a79fb9b628	2016-08-08 16:00:39 -04:00
Robyn Speer	88c93f6204	Remove unnecessary variable from make_mecab_analyzer Former-commit-id: `548162c563`	2016-08-04 15:17:02 -04:00
Robyn Speer	6440d81676	consolidate logic about MeCab path length Former-commit-id: `2b984937be`	2016-08-04 15:16:20 -04:00
Robyn Speer	c11998e506	Getting a newer mecab-ko-dic changed the Korean frequencies Former-commit-id: `894a96ba7e`	2016-08-02 16:10:41 -04:00
Robyn Speer	bc1cfc35c8	update find_mecab_dictionary docstring Former-commit-id: `8a5d1b298d`	2016-08-02 12:53:46 -04:00
Robyn Speer	9e55f8fed1	remove my ad-hoc names for dictionary packages Former-commit-id: `3dffb18557`	2016-08-01 17:39:35 -04:00
Robyn Speer	2787bfd647	stop including MeCab dictionaries in the package Former-commit-id: `b3dd8479ab`	2016-08-01 17:37:41 -04:00
Robyn Speer	875dd5669f	fix MeCab error message Former-commit-id: `fcf2445c3e`	2016-07-29 17:30:02 -04:00
Robyn Speer	94712c8312	Look for MeCab dictionaries in various places besides this package Former-commit-id: `afe6537994`	2016-07-29 17:27:15 -04:00
Robyn Speer	ce5a91d732	Make the almost-median deterministic when it rounds down to 0 Former-commit-id: `74892a0ac9`	2016-07-29 12:34:56 -04:00
Robyn Speer	15667ea023	Code review fixes: avoid repeatedly constructing sets Former-commit-id: `1a16b0f84c`	2016-07-29 12:32:26 -04:00
Robyn Speer	68c6d95131	Revise multilingual tests Former-commit-id: `21246f881f`	2016-07-29 12:19:12 -04:00
Robyn Speer	2a41d4dc5e	Add Common Crawl data and more languages (#39) This changes the version from 1.4.2 to 1.5. Things done in this update include: * include Common Crawl; support 11 more languages * new frequency-merging strategy * New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list * Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all. * Add Korean tokenization, and include MeCab files in data * Remove marks from more languages * Deal with commas and cedillas in Turkish and Romanian Former-commit-id: `e6a8f028e3`	2016-07-28 19:23:17 -04:00
Robyn Speer	0a2bfb2710	Tokenization in Korean, plus abjad languages (#38) * Remove marks from more languages * Add Korean tokenization, and include MeCab files in data * add a Hebrew tokenization test * fix terminology in docstrings about abjad scripts * combine Japanese and Korean tokenization into the same function Former-commit-id: `fec6eddcc3`	2016-07-15 15:10:25 -04:00
Robyn Speer	3155cf27e6	Fix tokenization of SE Asian and South Asian scripts (#37) Former-commit-id: `270f6c7ca6`	2016-07-01 18:00:57 -04:00
Robyn Speer	8d09b68d37	wordfreq_builder: Document the extract_reddit pipeline Former-commit-id: `88626aafee`	2016-06-02 15:19:25 -04:00
Andrew Lin	046ca4cda3	Merge pull request #35 from LuminosoInsight/big-list-test-fix fix Arabic test, where 'lol' is no longer common Former-commit-id: `3a6d985203`	2016-05-11 17:20:01 -04:00
Robyn Speer	c72326e4c0	fix Arabic test, where 'lol' is no longer common Former-commit-id: `da79dfb247`	2016-05-11 17:01:47 -04:00
Andrew Lin	7a55e0ed86	Merge pull request #34 from LuminosoInsight/big-list wordfreq 1.4: some bigger wordlists, better use of language detection Former-commit-id: `e7b34fb655`	2016-05-11 16:27:51 -04:00
Robyn Speer	1ac6795709	fix to README: we're only using Reddit in English Former-commit-id: `dcb77a552b`	2016-05-11 15:38:29 -04:00
Robyn Speer	a0d93e0ce8	limit Reddit data to just English Former-commit-id: `2276d97368`	2016-04-15 17:01:21 -04:00
Robyn Speer	5a37cc22c7	remove reddit_base_filename function Former-commit-id: `ced15d6eff`	2016-03-31 13:39:13 -04:00
Robyn Speer	797895047a	use `path.stem` to make the Reddit filename prefix Former-commit-id: `ff1f0e4678`	2016-03-31 13:13:52 -04:00
Robyn Speer	a2bc90e430	rename max_size to max_words consistently Former-commit-id: `16059d3b9a`	2016-03-31 12:55:18 -04:00
Robyn Speer	a9a4483ca3	fix table showing marginal Korean support Former-commit-id: `697842b3f9`	2016-03-30 15:11:13 -04:00
Robyn Speer	36885b5479	make an example clearer with wordlist='large' Former-commit-id: `ed32b278cc`	2016-03-30 15:08:32 -04:00
Robyn Speer	cecf852040	update wordlists for new builder settings Former-commit-id: `a10c1d7ac0`	2016-03-28 12:26:47 -04:00
Robyn Speer	0c7527140c	Discard text detected as an uncommon language; add large German list Former-commit-id: `abbc295538`	2016-03-28 12:26:02 -04:00
Robyn Speer	aa7802b552	oh look, more spam Former-commit-id: `08130908c7`	2016-03-24 18:42:47 -04:00
Robyn Speer	2840ca55aa	filter out downvoted Reddit posts Former-commit-id: `5b98794b86`	2016-03-24 18:05:13 -04:00
Robyn Speer	16841d4b0c	disregard Arabic Reddit spam Former-commit-id: `cfe68893fa`	2016-03-24 17:44:30 -04:00
Robyn Speer	034d8f540b	fix extraneous dot in intermediate filenames Former-commit-id: `6feae99381`	2016-03-24 16:52:44 -04:00
Robyn Speer	460fbb84fd	bump version to 1.4 Former-commit-id: `1df97a579e`	2016-03-24 16:29:29 -04:00
Robyn Speer	969a024dea	actually use the results of language-detection on Reddit Former-commit-id: `75a4a92110`	2016-03-24 16:27:24 -04:00
Robyn Speer	fbc19995ab	Merge remote-tracking branch 'origin/master' into big-list Conflicts: wordfreq_builder/wordfreq_builder/cli/merge_counts.py Former-commit-id: `164a5b1a05`	2016-03-24 14:11:44 -04:00
Robyn Speer	f493d0eec4	make max-words a real, documented parameter Former-commit-id: `178a8b1494`	2016-03-24 14:10:02 -04:00
Robyn Speer	298cb69353	Merge pull request #33 from LuminosoInsight/bugfix Restore a missing comma. Former-commit-id: `7b539f9057`	2016-03-24 13:59:50 -04:00
Andrew Lin	1942bc690f	Restore a missing comma. Former-commit-id: `38016cf62b`	2016-03-24 13:57:18 -04:00
Andrew Lin	68e7846d50	Merge pull request #32 from LuminosoInsight/thai-fix Leave Thai segments alone in the default regex Former-commit-id: `84497429e1`	2016-03-10 11:57:44 -05:00
Robyn Speer	f25985379c	move Thai test to where it makes more sense Former-commit-id: `4ec6b56faa`	2016-03-10 11:56:15 -05:00
Robyn Speer	51e260b713	Leave Thai segments alone in the default regex Our regex already has a special case to leave Chinese and Japanese alone when an appropriate tokenizer for the language isn't being used, as Unicode's default segmentation would make every character into its own token. The same thing happens in Thai, and we don't even have an appropriate tokenizer for Thai, so I've added a similar fallback. Former-commit-id: `07f16e6f03`	2016-02-22 14:32:59 -05:00
Robyn Speer	6344b38194	Add and document large wordlists Former-commit-id: `d79ee37da9`	2016-01-22 16:23:43 -05:00


519 Commits