wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-26 10:28:52 +00:00

Author	SHA1	Message	Date
Rob Speer	a6bb267f89	fix mention of dependencies in README	2018-03-14 15:01:08 -04:00
Rob Speer	bac3dcb620	Subtle changes to CJK frequencies. This is the result of re-running exquisite-corpus via wordfreq 2. The frequencies for most languages were identical. Small changes that move words by a few places in the list appeared in Chinese, Japanese, and Korean. There are also even smaller changes in Bengali and Hindi. The source of the CJK change is that Roman letters are case-folded _before_ Jieba or MeCab tokenization, which changes their output in a few cases. In Hindi, one word changed frequency in the top 500. In Bengali, none of those words changed frequency, but the data file is still different. I'm not sure I have such a solid explanation here, except that these languages use the regex tokenizer, and we just updated the regex dependency, which could affect some edge cases of these languages.	2018-03-14 11:36:02 -04:00
Rob Speer	e64f409c55	cache the language info (avoids 10x slowdown)	2018-03-09 14:54:03 -05:00
Rob Speer	11e758672e	avoid log spam: only warn about an unsupported language once	2018-03-09 11:50:15 -05:00
Rob Speer	49a603ea63	update the README	2018-03-08 18:16:15 -05:00
Rob Speer	92784d1768	wordlist updates from new exquisite-corpus	2018-03-08 18:16:00 -05:00
Rob Speer	1594ba3ad6	Test that we can leave the wordlist unspecified and get 'large' freqs	2018-03-08 18:09:57 -05:00
Rob Speer	47dac3b0b8	Traditional Chinese should be preserved through tokenization	2018-03-08 18:08:55 -05:00
Rob Speer	5a5acec9ff	reorganize wordlists into 'small', 'large', and 'best'	2018-03-08 17:52:44 -05:00
Rob Speer	67e4475763	fix az-Latn transliteration, and test	2018-03-08 16:47:36 -05:00
Rob Speer	a42cf312ef	setup: update version number and dependencies	2018-03-08 16:26:24 -05:00
Rob Speer	45b9bcdbcb	Separate preprocessing from tokenization	2018-03-08 16:26:17 -05:00
Rob Speer	846606d892	minor fixes to README	2018-02-28 16:14:50 -05:00
Rob Speer	ad677e12fd	Merge pull request #54 from LuminosoInsight/fix-deps: Fix setup.py (version number and msgpack dependency)	2018-02-28 12:46:46 -08:00
Rob Speer	aadb19c9a3	bump version to 1.7.0, belatedly	2018-02-28 15:15:47 -05:00
Rob Speer	db56528fb6	update msgpack-python dependency to msgpack	2018-02-28 15:14:51 -05:00
Rob Speer	843ed92223	update citation to v1.7	2017-09-27 13:36:30 -04:00
Andrew Lin	721a1e9fd9	Merge pull request #51 from LuminosoInsight/version1.7: Version 1.7: update tokenization, update Wikipedia data, add languages	2017-09-08 17:02:05 -04:00
Rob Speer	61b2e4062d	remove unnecessary enumeration from top_n.py	2017-09-08 16:52:06 -04:00
Rob Speer	396b0f78df	update README for 1.7; sort language list in English order	2017-08-25 17:38:31 -04:00
Rob Speer	e3352392cc	v1.7: update tokenization, update data, add `bn` and `mk`	2017-08-25 17:37:48 -04:00
Rob Speer	dcef5813b3	Tokenize by graphemes, not codepoints (#50): Tokenize by graphemes, not codepoints; Add more documentation to TOKEN_RE; Remove extra line break; Update docstring (Brahmic scripts are no longer an exception); approve using version 2017.07.28 of regex	2017-08-08 11:35:28 -04:00
Andrew Lin	baf6771e97	Merge pull request #49 from LuminosoInsight/restore-langcodes: Use langcodes when tokenizing again	2017-05-10 16:20:06 -04:00
Rob Speer	37b4914970	v1.6.1: depend on langcodes 1.4	2017-05-10 13:26:23 -04:00
Rob Speer	d6cdef6039	Use langcodes when tokenizing again (it no longer connects to a DB)	2017-04-27 15:09:59 -04:00
Rob Speer	97042e6f60	Merge pull request #48 from LuminosoInsight/code-review-notes: Code review notes	2017-02-15 12:29:25 -08:00
Andrew Lin	f28a193015	Clarify the changelog.	2017-02-14 13:09:12 -05:00
Andrew Lin	e21bcc2a58	Correct a case in transliterate.py.	2017-02-14 13:08:23 -05:00
Andrew Lin	21b331e898	Merge pull request #47 from LuminosoInsight/all-1.6-changes: All 1.6 changes	2017-02-01 15:36:38 -05:00
Rob Speer	b5b653f0a1	Remove ninja2dot script, which is no longer used	2017-02-01 14:49:44 -05:00
Rob Speer	391a723662	describe the current problem with 'cyrtranslit' as a dependency	2017-01-31 18:25:52 -05:00
Rob Speer	7fa5e7fc22	Fix some outdated numbers in English examples	2017-01-31 18:25:41 -05:00
Rob Speer	68e4ce16cf	Handle smashing numbers only at the end of tokenize(). This does make the code a lot clearer.	2017-01-11 19:04:19 -05:00
Rob Speer	e6114bf0fa	Update README with new examples and URL	2017-01-09 15:13:19 -05:00
Rob Speer	f03a37e19c	test that number-smashing still happens in freq lookups	2017-01-06 19:20:41 -05:00
Rob Speer	4dfa800cd8	Don't smash numbers in all tokenization, just when looking up freqs. I forgot momentarily that the output of the tokenizer is used by other code.	2017-01-06 19:18:52 -05:00
Rob Speer	d2bb5b78f3	update the README, citing OpenSubtitles 2016	2017-01-06 19:04:40 -05:00
Rob Speer	3f9c8449ff	Mention that multi-digit numbers are combined together	2017-01-05 19:24:28 -05:00
Rob Speer	a05a1c8d5c	mention tokenization change in changelog	2017-01-05 19:19:31 -05:00
Rob Speer	803ebc25bb	Update documentation and bump version to 1.6	2017-01-05 19:18:06 -05:00
Rob Speer	f9238ac30f	update data from Exquisite Corpus in English and Swedish	2017-01-05 19:17:51 -05:00
Rob Speer	f671a1db7f	import new wordlists from Exquisite Corpus	2017-01-05 17:59:26 -05:00
Rob Speer	847b85c5b8	Merge branch 'transliterate-serbian' into all-1.6-changes	2017-01-05 17:57:52 -05:00
Rob Speer	e4f40a0ce9	transliterate: organize the 'borrowed letters' better	2017-01-05 13:23:20 -05:00
Rob Speer	99eac54b31	transliterate: Handle unexpected Russian invasions	2017-01-04 18:51:00 -05:00
Rob Speer	6171b3d066	remove wordfreq_builder (obsoleted by exquisite-corpus)	2017-01-04 17:45:53 -05:00
Rob Speer	b3e5d1c9e9	Add transliteration of Cyrillic Serbian	2016-12-29 18:27:17 -05:00
Rob Speer	d376f4e2e2	fixes to tokenization	2016-12-13 14:43:29 -05:00
Rob Speer	bb5df3b074	Replace multi-digit sequences with zeroes	2016-12-09 15:55:08 -05:00
Rob Speer	24e26c4c1d	add a test for "aujourd'hui"	2016-12-06 17:39:40 -05:00

622 Commits