wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-26 10:28:52 +00:00

Author	SHA1	Message	Date
Rob Speer	4dfa800cd8	Don't smash numbers in all tokenization, just when looking up freqs. I forgot momentarily that the output of the tokenizer is used by other code.	2017-01-06 19:18:52 -05:00
Rob Speer	d2bb5b78f3	update the README, citing OpenSubtitles 2016	2017-01-06 19:04:40 -05:00
Rob Speer	3f9c8449ff	Mention that multi-digit numbers are combined together	2017-01-05 19:24:28 -05:00
Rob Speer	a05a1c8d5c	mention tokenization change in changelog	2017-01-05 19:19:31 -05:00
Rob Speer	803ebc25bb	Update documentation and bump version to 1.6	2017-01-05 19:18:06 -05:00
Rob Speer	f9238ac30f	update data from Exquisite Corpus in English and Swedish	2017-01-05 19:17:51 -05:00
Rob Speer	f671a1db7f	import new wordlists from Exquisite Corpus	2017-01-05 17:59:26 -05:00
Rob Speer	847b85c5b8	Merge branch 'transliterate-serbian' into all-1.6-changes	2017-01-05 17:57:52 -05:00
Rob Speer	e4f40a0ce9	transliterate: organize the 'borrowed letters' better	2017-01-05 13:23:20 -05:00
Rob Speer	99eac54b31	transliterate: Handle unexpected Russian invasions	2017-01-04 18:51:00 -05:00
Rob Speer	6171b3d066	remove wordfreq_builder (obsoleted by exquisite-corpus)	2017-01-04 17:45:53 -05:00
Rob Speer	b3e5d1c9e9	Add transliteration of Cyrillic Serbian	2016-12-29 18:27:17 -05:00
Rob Speer	d376f4e2e2	fixes to tokenization	2016-12-13 14:43:29 -05:00
Rob Speer	bb5df3b074	Replace multi-digit sequences with zeroes	2016-12-09 15:55:08 -05:00
Rob Speer	24e26c4c1d	add a test for "aujourd'hui"	2016-12-06 17:39:40 -05:00
Rob Speer	d18b149262	Bake the 'h special case into the regex. This lets me remove the French-specific code I just put in.	2016-12-06 17:37:35 -05:00
Rob Speer	752c90c8a5	eh, this is still version 1.5.2, not 1.6	2016-12-05 18:58:33 -05:00
Rob Speer	f285430c84	add a specific test in Catalan	2016-12-05 18:54:51 -05:00
Rob Speer	02e2430dfb	add tests for French apostrophe tokenization	2016-12-05 18:54:51 -05:00
Rob Speer	a92c805a82	fix tokenization of words like "l'heure"	2016-12-05 18:54:51 -05:00
Lance Nathan	f6f0914e81	Merge pull request #45 from LuminosoInsight/citation Describe how to cite wordfreq	2016-09-12 18:34:55 -04:00
Rob Speer	872eeb8848	Describe how to cite wordfreq. This citation was generated from our GitHub repository by Zenodo. Their defaults indicate that anyone who's ever accepted a PR for the code should go on the author line, and that sounds fine to me.	2016-09-12 18:24:55 -04:00
Rob Speer	0ba563c99c	Add a changelog	2016-08-22 12:41:39 -04:00
Andrew Lin	91f7ef37eb	Merge pull request #44 from LuminosoInsight/mecab-loading-fix Allow MeCab to work in Japanese or Korean without the other	2016-08-19 11:59:44 -04:00
Rob Speer	fb5a55de7e	bump version to 1.5.1	2016-08-19 11:42:29 -04:00
Rob Speer	31be4fd309	Allow MeCab to work in Japanese or Korean without the other	2016-08-19 11:41:35 -04:00
Andrew Lin	0250547c7a	Merge pull request #42 from LuminosoInsight/mecab-finder Look for MeCab dictionaries in various places besides this package Former-commit-id: 6f97d5ac099e9bb1a13ccfe114e0f4a79fb9b628	2016-08-08 16:00:39 -04:00
Rob Speer	8c79465d28	Remove unnecessary variable from make_mecab_analyzer Former-commit-id: `548162c563`	2016-08-04 15:17:02 -04:00
Rob Speer	0a5e6bd87a	consolidate logic about MeCab path length Former-commit-id: `2b984937be`	2016-08-04 15:16:20 -04:00
Rob Speer	09a904c0fe	Getting a newer mecab-ko-dic changed the Korean frequencies Former-commit-id: `894a96ba7e`	2016-08-02 16:10:41 -04:00
Rob Speer	c6c44939e6	update find_mecab_dictionary docstring Former-commit-id: `8a5d1b298d`	2016-08-02 12:53:46 -04:00
Rob Speer	188654396a	remove my ad-hoc names for dictionary packages Former-commit-id: `3dffb18557`	2016-08-01 17:39:35 -04:00
Rob Speer	1519df503c	stop including MeCab dictionaries in the package Former-commit-id: `b3dd8479ab`	2016-08-01 17:37:41 -04:00
Rob Speer	410e8c255b	fix MeCab error message Former-commit-id: `fcf2445c3e`	2016-07-29 17:30:02 -04:00
Rob Speer	c1927732d3	Look for MeCab dictionaries in various places besides this package Former-commit-id: `afe6537994`	2016-07-29 17:27:15 -04:00
Rob Speer	1aa63bca6c	Make the almost-median deterministic when it rounds down to 0 Former-commit-id: `74892a0ac9`	2016-07-29 12:34:56 -04:00
Rob Speer	fcbdf560c2	Code review fixes: avoid repeatedly constructing sets Former-commit-id: `1a16b0f84c`	2016-07-29 12:32:26 -04:00
Rob Speer	99b627a300	Revise multilingual tests Former-commit-id: `21246f881f`	2016-07-29 12:19:12 -04:00
Rob Speer	9758c69ff0	Add Common Crawl data and more languages (#39) This changes the version from 1.4.2 to 1.5. Things done in this update include: * include Common Crawl; support 11 more languages * new frequency-merging strategy * New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list * Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all. * Add Korean tokenization, and include MeCab files in data * Remove marks from more languages * Deal with commas and cedillas in Turkish and Romanian Former-commit-id: `e6a8f028e3`	2016-07-28 19:23:17 -04:00
Rob Speer	a0893af82e	Tokenization in Korean, plus abjad languages (#38) * Remove marks from more languages * Add Korean tokenization, and include MeCab files in data * add a Hebrew tokenization test * fix terminology in docstrings about abjad scripts * combine Japanese and Korean tokenization into the same function Former-commit-id: `fec6eddcc3`	2016-07-15 15:10:25 -04:00
Rob Speer	ac24b8eab4	Fix tokenization of SE Asian and South Asian scripts (#37) Former-commit-id: `270f6c7ca6`	2016-07-01 18:00:57 -04:00
Rob Speer	f539eecdd6	wordfreq_builder: Document the extract_reddit pipeline Former-commit-id: `88626aafee`	2016-06-02 15:19:25 -04:00
Andrew Lin	6eaae696fe	Merge pull request #35 from LuminosoInsight/big-list-test-fix fix Arabic test, where 'lol' is no longer common Former-commit-id: `3a6d985203`	2016-05-11 17:20:01 -04:00
Rob Speer	c3fd3bd734	fix Arabic test, where 'lol' is no longer common Former-commit-id: `da79dfb247`	2016-05-11 17:01:47 -04:00
Andrew Lin	3c2a621743	Merge pull request #34 from LuminosoInsight/big-list wordfreq 1.4: some bigger wordlists, better use of language detection Former-commit-id: `e7b34fb655`	2016-05-11 16:27:51 -04:00
Rob Speer	4e4c77e7d7	fix to README: we're only using Reddit in English Former-commit-id: `dcb77a552b`	2016-05-11 15:38:29 -04:00
Rob Speer	c5bdc3c6bd	limit Reddit data to just English Former-commit-id: `2276d97368`	2016-04-15 17:01:21 -04:00
Rob Speer	6f11256ed1	remove reddit_base_filename function Former-commit-id: `ced15d6eff`	2016-03-31 13:39:13 -04:00
Rob Speer	d924c8e2a5	use `path.stem` to make the Reddit filename prefix Former-commit-id: `ff1f0e4678`	2016-03-31 13:13:52 -04:00
Rob Speer	9adc5b92f8	rename max_size to max_words consistently Former-commit-id: `16059d3b9a`	2016-03-31 12:55:18 -04:00


487 Commits