wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-26 18:38:51 +00:00

Author	SHA1	Message	Date
Rob Speer	d6cdef6039	Use langcodes when tokenizing again (it no longer connects to a DB)	2017-04-27 15:09:59 -04:00
Rob Speer	97042e6f60	Merge pull request #48 from LuminosoInsight/code-review-notes Code review notes	2017-02-15 12:29:25 -08:00
Andrew Lin	f28a193015	Clarify the changelog.	2017-02-14 13:09:12 -05:00
Andrew Lin	e21bcc2a58	Correct a case in transliterate.py.	2017-02-14 13:08:23 -05:00
Andrew Lin	21b331e898	Merge pull request #47 from LuminosoInsight/all-1.6-changes All 1.6 changes	2017-02-01 15:36:38 -05:00
Rob Speer	b5b653f0a1	Remove ninja2dot script, which is no longer used	2017-02-01 14:49:44 -05:00
Rob Speer	391a723662	describe the current problem with 'cyrtranslit' as a dependency	2017-01-31 18:25:52 -05:00
Rob Speer	7fa5e7fc22	Fix some outdated numbers in English examples	2017-01-31 18:25:41 -05:00
Rob Speer	68e4ce16cf	Handle smashing numbers only at the end of tokenize(). This does make the code a lot clearer.	2017-01-11 19:04:19 -05:00
Rob Speer	e6114bf0fa	Update README with new examples and URL	2017-01-09 15:13:19 -05:00
Rob Speer	f03a37e19c	test that number-smashing still happens in freq lookups	2017-01-06 19:20:41 -05:00
Rob Speer	4dfa800cd8	Don't smash numbers in all tokenization, just when looking up freqs. I forgot momentarily that the output of the tokenizer is used by other code.	2017-01-06 19:18:52 -05:00
Rob Speer	d2bb5b78f3	update the README, citing OpenSubtitles 2016	2017-01-06 19:04:40 -05:00
Rob Speer	3f9c8449ff	Mention that multi-digit numbers are combined together	2017-01-05 19:24:28 -05:00
Rob Speer	a05a1c8d5c	mention tokenization change in changelog	2017-01-05 19:19:31 -05:00
Rob Speer	803ebc25bb	Update documentation and bump version to 1.6	2017-01-05 19:18:06 -05:00
Rob Speer	f9238ac30f	update data from Exquisite Corpus in English and Swedish	2017-01-05 19:17:51 -05:00
Rob Speer	f671a1db7f	import new wordlists from Exquisite Corpus	2017-01-05 17:59:26 -05:00
Rob Speer	847b85c5b8	Merge branch 'transliterate-serbian' into all-1.6-changes	2017-01-05 17:57:52 -05:00
Rob Speer	e4f40a0ce9	transliterate: organize the 'borrowed letters' better	2017-01-05 13:23:20 -05:00
Rob Speer	99eac54b31	transliterate: Handle unexpected Russian invasions	2017-01-04 18:51:00 -05:00
Rob Speer	6171b3d066	remove wordfreq_builder (obsoleted by exquisite-corpus)	2017-01-04 17:45:53 -05:00
Rob Speer	b3e5d1c9e9	Add transliteration of Cyrillic Serbian	2016-12-29 18:27:17 -05:00
Rob Speer	d376f4e2e2	fixes to tokenization	2016-12-13 14:43:29 -05:00
Rob Speer	bb5df3b074	Replace multi-digit sequences with zeroes	2016-12-09 15:55:08 -05:00
Rob Speer	24e26c4c1d	add a test for "aujourd'hui"	2016-12-06 17:39:40 -05:00
Rob Speer	d18b149262	Bake the 'h special case into the regex. This lets me remove the French-specific code I just put in.	2016-12-06 17:37:35 -05:00
Rob Speer	752c90c8a5	eh, this is still version 1.5.2, not 1.6	2016-12-05 18:58:33 -05:00
Rob Speer	f285430c84	add a specific test in Catalan	2016-12-05 18:54:51 -05:00
Rob Speer	02e2430dfb	add tests for French apostrophe tokenization	2016-12-05 18:54:51 -05:00
Rob Speer	a92c805a82	fix tokenization of words like "l'heure"	2016-12-05 18:54:51 -05:00
Lance Nathan	f6f0914e81	Merge pull request #45 from LuminosoInsight/citation Describe how to cite wordfreq	2016-09-12 18:34:55 -04:00
Rob Speer	872eeb8848	Describe how to cite wordfreq. This citation was generated from our GitHub repository by Zenodo. Their defaults indicate that anyone who's ever accepted a PR for the code should go on the author line, and that sounds fine to me.	2016-09-12 18:24:55 -04:00
Rob Speer	0ba563c99c	Add a changelog	2016-08-22 12:41:39 -04:00
Andrew Lin	91f7ef37eb	Merge pull request #44 from LuminosoInsight/mecab-loading-fix Allow MeCab to work in Japanese or Korean without the other	2016-08-19 11:59:44 -04:00
Rob Speer	fb5a55de7e	bump version to 1.5.1	2016-08-19 11:42:29 -04:00
Rob Speer	31be4fd309	Allow MeCab to work in Japanese or Korean without the other	2016-08-19 11:41:35 -04:00
Andrew Lin	0250547c7a	Merge pull request #42 from LuminosoInsight/mecab-finder Look for MeCab dictionaries in various places besides this package Former-commit-id: 6f97d5ac099e9bb1a13ccfe114e0f4a79fb9b628	2016-08-08 16:00:39 -04:00
Rob Speer	8c79465d28	Remove unnecessary variable from make_mecab_analyzer Former-commit-id: `548162c563`	2016-08-04 15:17:02 -04:00
Rob Speer	0a5e6bd87a	consolidate logic about MeCab path length Former-commit-id: `2b984937be`	2016-08-04 15:16:20 -04:00
Rob Speer	09a904c0fe	Getting a newer mecab-ko-dic changed the Korean frequencies Former-commit-id: `894a96ba7e`	2016-08-02 16:10:41 -04:00
Rob Speer	c6c44939e6	update find_mecab_dictionary docstring Former-commit-id: `8a5d1b298d`	2016-08-02 12:53:46 -04:00
Rob Speer	188654396a	remove my ad-hoc names for dictionary packages Former-commit-id: `3dffb18557`	2016-08-01 17:39:35 -04:00
Rob Speer	1519df503c	stop including MeCab dictionaries in the package Former-commit-id: `b3dd8479ab`	2016-08-01 17:37:41 -04:00
Rob Speer	410e8c255b	fix MeCab error message Former-commit-id: `fcf2445c3e`	2016-07-29 17:30:02 -04:00
Rob Speer	c1927732d3	Look for MeCab dictionaries in various places besides this package Former-commit-id: `afe6537994`	2016-07-29 17:27:15 -04:00
Rob Speer	1aa63bca6c	Make the almost-median deterministic when it rounds down to 0 Former-commit-id: `74892a0ac9`	2016-07-29 12:34:56 -04:00
Rob Speer	fcbdf560c2	Code review fixes: avoid repeatedly constructing sets Former-commit-id: `1a16b0f84c`	2016-07-29 12:32:26 -04:00
Rob Speer	99b627a300	Revise multilingual tests Former-commit-id: `21246f881f`	2016-07-29 12:19:12 -04:00
Rob Speer	9758c69ff0	Add Common Crawl data and more languages (#39). This changes the version from 1.4.2 to 1.5. Things done in this update include: * include Common Crawl; support 11 more languages * new frequency-merging strategy * New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list * Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all. * Add Korean tokenization, and include MeCab files in data * Remove marks from more languages * Deal with commas and cedillas in Turkish and Romanian Former-commit-id: `e6a8f028e3`	2016-07-28 19:23:17 -04:00


548 Commits