wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-24 09:51:38 +00:00

Author	SHA1	Message	Date
Rob Speer	45b9bcdbcb	Separate preprocessing from tokenization	2018-03-08 16:26:17 -05:00
Rob Speer	846606d892	minor fixes to README	2018-02-28 16:14:50 -05:00
Rob Speer	ad677e12fd	Merge pull request #54 from LuminosoInsight/fix-deps Fix setup.py (version number and msgpack dependency)	2018-02-28 12:46:46 -08:00
Rob Speer	aadb19c9a3	bump version to 1.7.0, belatedly	2018-02-28 15:15:47 -05:00
Rob Speer	db56528fb6	update msgpack-python dependency to msgpack	2018-02-28 15:14:51 -05:00
Rob Speer	843ed92223	update citation to v1.7	2017-09-27 13:36:30 -04:00
Andrew Lin	721a1e9fd9	Merge pull request #51 from LuminosoInsight/version1.7 Version 1.7: update tokenization, update Wikipedia data, add languages	2017-09-08 17:02:05 -04:00
Rob Speer	61b2e4062d	remove unnecessary enumeration from top_n.py	2017-09-08 16:52:06 -04:00
Rob Speer	396b0f78df	update README for 1.7; sort language list in English order	2017-08-25 17:38:31 -04:00
Rob Speer	e3352392cc	v1.7: update tokenization, update data, add `bn` and `mk`	2017-08-25 17:37:48 -04:00
Rob Speer	dcef5813b3	Tokenize by graphemes, not codepoints (#50): Tokenize by graphemes, not codepoints; Add more documentation to TOKEN_RE; Remove extra line break; Update docstring - Brahmic scripts are no longer an exception; approve using version 2017.07.28 of regex	2017-08-08 11:35:28 -04:00
Andrew Lin	baf6771e97	Merge pull request #49 from LuminosoInsight/restore-langcodes Use langcodes when tokenizing again	2017-05-10 16:20:06 -04:00
Rob Speer	37b4914970	v1.6.1: depend on langcodes 1.4	2017-05-10 13:26:23 -04:00
Rob Speer	d6cdef6039	Use langcodes when tokenizing again (it no longer connects to a DB)	2017-04-27 15:09:59 -04:00
Rob Speer	97042e6f60	Merge pull request #48 from LuminosoInsight/code-review-notes Code review notes	2017-02-15 12:29:25 -08:00
Andrew Lin	f28a193015	Clarify the changelog.	2017-02-14 13:09:12 -05:00
Andrew Lin	e21bcc2a58	Correct a case in transliterate.py.	2017-02-14 13:08:23 -05:00
Andrew Lin	21b331e898	Merge pull request #47 from LuminosoInsight/all-1.6-changes All 1.6 changes	2017-02-01 15:36:38 -05:00
Rob Speer	b5b653f0a1	Remove ninja2dot script, which is no longer used	2017-02-01 14:49:44 -05:00
Rob Speer	391a723662	describe the current problem with 'cyrtranslit' as a dependency	2017-01-31 18:25:52 -05:00
Rob Speer	7fa5e7fc22	Fix some outdated numbers in English examples	2017-01-31 18:25:41 -05:00
Rob Speer	68e4ce16cf	Handle smashing numbers only at the end of tokenize(). This does make the code a lot clearer.	2017-01-11 19:04:19 -05:00
Rob Speer	e6114bf0fa	Update README with new examples and URL	2017-01-09 15:13:19 -05:00
Rob Speer	f03a37e19c	test that number-smashing still happens in freq lookups	2017-01-06 19:20:41 -05:00
Rob Speer	4dfa800cd8	Don't smash numbers in all tokenization, just when looking up freqs. I forgot momentarily that the output of the tokenizer is used by other code.	2017-01-06 19:18:52 -05:00
Rob Speer	d2bb5b78f3	update the README, citing OpenSubtitles 2016	2017-01-06 19:04:40 -05:00
Rob Speer	3f9c8449ff	Mention that multi-digit numbers are combined together	2017-01-05 19:24:28 -05:00
Rob Speer	a05a1c8d5c	mention tokenization change in changelog	2017-01-05 19:19:31 -05:00
Rob Speer	803ebc25bb	Update documentation and bump version to 1.6	2017-01-05 19:18:06 -05:00
Rob Speer	f9238ac30f	update data from Exquisite Corpus in English and Swedish	2017-01-05 19:17:51 -05:00
Rob Speer	f671a1db7f	import new wordlists from Exquisite Corpus	2017-01-05 17:59:26 -05:00
Rob Speer	847b85c5b8	Merge branch 'transliterate-serbian' into all-1.6-changes	2017-01-05 17:57:52 -05:00
Rob Speer	e4f40a0ce9	transliterate: organize the 'borrowed letters' better	2017-01-05 13:23:20 -05:00
Rob Speer	99eac54b31	transliterate: Handle unexpected Russian invasions	2017-01-04 18:51:00 -05:00
Rob Speer	6171b3d066	remove wordfreq_builder (obsoleted by exquisite-corpus)	2017-01-04 17:45:53 -05:00
Rob Speer	b3e5d1c9e9	Add transliteration of Cyrillic Serbian	2016-12-29 18:27:17 -05:00
Rob Speer	d376f4e2e2	fixes to tokenization	2016-12-13 14:43:29 -05:00
Rob Speer	bb5df3b074	Replace multi-digit sequences with zeroes	2016-12-09 15:55:08 -05:00
Rob Speer	24e26c4c1d	add a test for "aujourd'hui"	2016-12-06 17:39:40 -05:00
Rob Speer	d18b149262	Bake the 'h special case into the regex. This lets me remove the French-specific code I just put in.	2016-12-06 17:37:35 -05:00
Rob Speer	752c90c8a5	eh, this is still version 1.5.2, not 1.6	2016-12-05 18:58:33 -05:00
Rob Speer	f285430c84	add a specific test in Catalan	2016-12-05 18:54:51 -05:00
Rob Speer	02e2430dfb	add tests for French apostrophe tokenization	2016-12-05 18:54:51 -05:00
Rob Speer	a92c805a82	fix tokenization of words like "l'heure"	2016-12-05 18:54:51 -05:00
Lance Nathan	f6f0914e81	Merge pull request #45 from LuminosoInsight/citation Describe how to cite wordfreq	2016-09-12 18:34:55 -04:00
Rob Speer	872eeb8848	Describe how to cite wordfreq. This citation was generated from our GitHub repository by Zenodo. Their defaults indicate that anyone who's ever accepted a PR for the code should go on the author line, and that sounds fine to me.	2016-09-12 18:24:55 -04:00
Rob Speer	0ba563c99c	Add a changelog	2016-08-22 12:41:39 -04:00
Andrew Lin	91f7ef37eb	Merge pull request #44 from LuminosoInsight/mecab-loading-fix Allow MeCab to work in Japanese or Korean without the other	2016-08-19 11:59:44 -04:00
Rob Speer	fb5a55de7e	bump version to 1.5.1	2016-08-19 11:42:29 -04:00
Rob Speer	31be4fd309	Allow MeCab to work in Japanese or Korean without the other	2016-08-19 11:41:35 -04:00

561 Commits