wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 09:21:37 +00:00

Author	SHA1	Message	Date
Robyn Speer	573ecc53d0	Don't smash numbers in all tokenization, just when looking up freqs. I forgot momentarily that the output of the tokenizer is used by other code.	2017-01-06 19:18:52 -05:00
Robyn Speer	3cb3c38f47	update the README, citing OpenSubtitles 2016	2017-01-06 19:04:40 -05:00
Robyn Speer	86f22e8523	Mention that multi-digit numbers are combined together	2017-01-05 19:24:28 -05:00
Robyn Speer	48a5967e9a	mention tokenization change in changelog	2017-01-05 19:19:31 -05:00
Robyn Speer	39e459ac71	Update documentation and bump version to 1.6	2017-01-05 19:18:06 -05:00
Robyn Speer	23c7c8e936	update data from Exquisite Corpus in English and Swedish	2017-01-05 19:17:51 -05:00
Robyn Speer	7dc3f03ebd	import new wordlists from Exquisite Corpus	2017-01-05 17:59:26 -05:00
Robyn Speer	de32a15b4f	Merge branch 'transliterate-serbian' into all-1.6-changes	2017-01-05 17:57:52 -05:00
Robyn Speer	d66d04210f	transliterate: organize the 'borrowed letters' better	2017-01-05 13:23:20 -05:00
Robyn Speer	87b03325db	transliterate: Handle unexpected Russian invasions	2017-01-04 18:51:00 -05:00
Robyn Speer	c27e7f9b76	remove wordfreq_builder (obsoleted by exquisite-corpus)	2017-01-04 17:45:53 -05:00
Robyn Speer	6211b35fb3	Add transliteration of Cyrillic Serbian	2016-12-29 18:27:17 -05:00
Robyn Speer	0aa7ad46ae	fixes to tokenization	2016-12-13 14:43:29 -05:00
Robyn Speer	d6d528de74	Replace multi-digit sequences with zeroes	2016-12-09 15:55:08 -05:00
Robyn Speer	a8e2fa5acf	add a test for "aujourd'hui"	2016-12-06 17:39:40 -05:00
Robyn Speer	21a78f5eb9	Bake the 'h special case into the regex. This lets me remove the French-specific code I just put in.	2016-12-06 17:37:35 -05:00
Robyn Speer	82eba05f2d	eh, this is still version 1.5.2, not 1.6	2016-12-05 18:58:33 -05:00
Robyn Speer	4376636316	add a specific test in Catalan	2016-12-05 18:54:51 -05:00
Robyn Speer	ff5a8f2a65	add tests for French apostrophe tokenization	2016-12-05 18:54:51 -05:00
Robyn Speer	596368ac6e	fix tokenization of words like "l'heure"	2016-12-05 18:54:51 -05:00
Lance Nathan	7f26270644	Merge pull request #45 from LuminosoInsight/citation: Describe how to cite wordfreq	2016-09-12 18:34:55 -04:00
Robyn Speer	7fabbfef31	Describe how to cite wordfreq. This citation was generated from our GitHub repository by Zenodo. Their defaults indicate that anyone who's ever accepted a PR for the code should go on the author line, and that sounds fine to me.	2016-09-12 18:24:55 -04:00
Robyn Speer	c0fbd844f6	Add a changelog	2016-08-22 12:41:39 -04:00
Andrew Lin	976c8df0fd	Merge pull request #44 from LuminosoInsight/mecab-loading-fix: Allow MeCab to work in Japanese or Korean without the other	2016-08-19 11:59:44 -04:00
Robyn Speer	aa880bcd84	bump version to 1.5.1	2016-08-19 11:42:29 -04:00
Robyn Speer	e1d6e7d96f	Allow MeCab to work in Japanese or Korean without the other	2016-08-19 11:41:35 -04:00
Andrew Lin	e4b32afa18	Merge pull request #42 from LuminosoInsight/mecab-finder: Look for MeCab dictionaries in various places besides this package Former-commit-id: 6f97d5ac099e9bb1a13ccfe114e0f4a79fb9b628	2016-08-08 16:00:39 -04:00
Robyn Speer	88c93f6204	Remove unnecessary variable from make_mecab_analyzer Former-commit-id: `548162c563`	2016-08-04 15:17:02 -04:00
Robyn Speer	6440d81676	consolidate logic about MeCab path length Former-commit-id: `2b984937be`	2016-08-04 15:16:20 -04:00
Robyn Speer	c11998e506	Getting a newer mecab-ko-dic changed the Korean frequencies Former-commit-id: `894a96ba7e`	2016-08-02 16:10:41 -04:00
Robyn Speer	bc1cfc35c8	update find_mecab_dictionary docstring Former-commit-id: `8a5d1b298d`	2016-08-02 12:53:46 -04:00
Robyn Speer	9e55f8fed1	remove my ad-hoc names for dictionary packages Former-commit-id: `3dffb18557`	2016-08-01 17:39:35 -04:00
Robyn Speer	2787bfd647	stop including MeCab dictionaries in the package Former-commit-id: `b3dd8479ab`	2016-08-01 17:37:41 -04:00
Robyn Speer	875dd5669f	fix MeCab error message Former-commit-id: `fcf2445c3e`	2016-07-29 17:30:02 -04:00
Robyn Speer	94712c8312	Look for MeCab dictionaries in various places besides this package Former-commit-id: `afe6537994`	2016-07-29 17:27:15 -04:00
Robyn Speer	ce5a91d732	Make the almost-median deterministic when it rounds down to 0 Former-commit-id: `74892a0ac9`	2016-07-29 12:34:56 -04:00
Robyn Speer	15667ea023	Code review fixes: avoid repeatedly constructing sets Former-commit-id: `1a16b0f84c`	2016-07-29 12:32:26 -04:00
Robyn Speer	68c6d95131	Revise multilingual tests Former-commit-id: `21246f881f`	2016-07-29 12:19:12 -04:00
Robyn Speer	2a41d4dc5e	Add Common Crawl data and more languages (#39). This changes the version from 1.4.2 to 1.5. Things done in this update include: * include Common Crawl; support 11 more languages * new frequency-merging strategy * New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list * Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all. * Add Korean tokenization, and include MeCab files in data * Remove marks from more languages * Deal with commas and cedillas in Turkish and Romanian Former-commit-id: `e6a8f028e3`	2016-07-28 19:23:17 -04:00
Robyn Speer	0a2bfb2710	Tokenization in Korean, plus abjad languages (#38): * Remove marks from more languages * Add Korean tokenization, and include MeCab files in data * add a Hebrew tokenization test * fix terminology in docstrings about abjad scripts * combine Japanese and Korean tokenization into the same function Former-commit-id: `fec6eddcc3`	2016-07-15 15:10:25 -04:00
Robyn Speer	3155cf27e6	Fix tokenization of SE Asian and South Asian scripts (#37) Former-commit-id: `270f6c7ca6`	2016-07-01 18:00:57 -04:00
Robyn Speer	8d09b68d37	wordfreq_builder: Document the extract_reddit pipeline Former-commit-id: `88626aafee`	2016-06-02 15:19:25 -04:00
Andrew Lin	046ca4cda3	Merge pull request #35 from LuminosoInsight/big-list-test-fix: fix Arabic test, where 'lol' is no longer common Former-commit-id: `3a6d985203`	2016-05-11 17:20:01 -04:00
Robyn Speer	c72326e4c0	fix Arabic test, where 'lol' is no longer common Former-commit-id: `da79dfb247`	2016-05-11 17:01:47 -04:00
Andrew Lin	7a55e0ed86	Merge pull request #34 from LuminosoInsight/big-list: wordfreq 1.4: some bigger wordlists, better use of language detection Former-commit-id: `e7b34fb655`	2016-05-11 16:27:51 -04:00
Robyn Speer	1ac6795709	fix to README: we're only using Reddit in English Former-commit-id: `dcb77a552b`	2016-05-11 15:38:29 -04:00
Robyn Speer	a0d93e0ce8	limit Reddit data to just English Former-commit-id: `2276d97368`	2016-04-15 17:01:21 -04:00
Robyn Speer	5a37cc22c7	remove reddit_base_filename function Former-commit-id: `ced15d6eff`	2016-03-31 13:39:13 -04:00
Robyn Speer	797895047a	use `path.stem` to make the Reddit filename prefix Former-commit-id: `ff1f0e4678`	2016-03-31 13:13:52 -04:00
Robyn Speer	a2bc90e430	rename max_size to max_words consistently Former-commit-id: `16059d3b9a`	2016-03-31 12:55:18 -04:00

637 Commits