wordfreq

Mirror of https://github.com/rspeer/wordfreq.git, synced 2024-12-24 09:51:38 +00:00

Author	SHA1	Message	Date
Rob Speer	4ec6b56faa	move Thai test to where it makes more sense	2016-03-10 11:56:15 -05:00
Rob Speer	07f16e6f03	Leave Thai segments alone in the default regex. Our regex already has a special case to leave Chinese and Japanese alone when an appropriate tokenizer for the language isn't being used, as Unicode's default segmentation would make every character into its own token. The same thing happens in Thai, and we don't even have an appropriate tokenizer for Thai, so I've added a similar fallback.	2016-02-22 14:32:59 -05:00
Rob Speer	2a84a926f5	test_chinese: fix typo in comment	2015-09-24 13:41:11 -04:00
Rob Speer	cea2a61444	Merge branch 'master' into chinese-external-wordlist. Conflicts: wordfreq/chinese.py	2015-09-24 13:40:08 -04:00
Andrew Lin	09597b7cf3	Revert a small syntax change introduced by a circular series of changes.	2015-09-24 13:24:11 -04:00
Rob Speer	db5eda6051	don't apply the inferred-space penalty to Japanese	2015-09-24 12:50:06 -04:00
Rob Speer	e8e6e0a231	refactor the tokenizer, add `include_punctuation` option	2015-09-15 13:26:09 -04:00
Rob Speer	669bd16c13	add `external_wordlist` option to tokenize	2015-09-10 18:09:41 -04:00
Rob Speer	5c8c36f4e3	Lower the frequency of phrases with inferred token boundaries	2015-09-10 14:16:22 -04:00
Rob Speer	2327f2e4d6	tokenize Chinese using jieba and our own frequencies	2015-09-05 03:16:56 -04:00
Rob Speer	fc93c8dc9c	add tests for Turkish	2015-09-04 17:00:05 -04:00
Rob Speer	95998205ad	Use the regex implementation of Unicode segmentation	2015-08-24 17:11:08 -04:00
Andrew Lin	41e1dd41d8	Document the NFKC-normalized ligature in the Arabic test.	2015-08-03 11:09:44 -04:00
Andrew Lin	66c69e6fac	Switch to more explanatory Unicode escapes when testing NFKC normalization.	2015-07-31 19:23:42 -04:00
Joshua Chin	173278fdd3	ensure removal of tatweels (hopefully)	2015-07-20 16:48:36 -04:00
Joshua Chin	131b916c57	updated comments	2015-07-17 14:50:12 -04:00
Andrew Lin	32b4033d63	Express the combining of word frequencies in an explicitly associative and commutative way.	2015-07-09 15:29:05 -04:00
Joshua Chin	b9578ae21e	removed unused imports	2015-07-07 16:21:22 -04:00
Joshua Chin	59c03e2411	updated minimum	2015-07-07 15:46:33 -04:00
Joshua Chin	f83d31a357	added arabic tests	2015-07-07 15:10:59 -04:00
Joshua Chin	9aa773aa2b	changed default to minimum for word_frequency	2015-07-07 15:03:26 -04:00
Joshua Chin	ca66a5f883	updated tests	2015-07-07 14:13:28 -04:00
Rob Speer	14cb408100	test and document new twitter wordlists	2015-07-01 17:53:38 -04:00
Rob Speer	f9a9ee7a82	update data using new build	2015-07-01 11:18:39 -04:00
Rob Speer	638467f600	case-fold instead of just lowercasing tokens	2015-06-30 15:14:02 -04:00
Joshua Chin	bbf7b9de34	revert changes to test_not_really_random	2015-06-30 11:29:14 -04:00
Joshua Chin	a49b66880e	changed english test to take random ascii words	2015-06-29 11:05:01 -04:00
Joshua Chin	5ed03b006c	changed japanese test because the most common japanese ascii word keeps changing	2015-06-29 11:04:19 -04:00
Joshua Chin	17f11ebd26	Japanese people do not 'lol', they 'w'	2015-06-29 11:01:13 -04:00
Joshua Chin	3bcb3e84a1	updated tests for emoji splitting	2015-06-25 11:25:51 -04:00
Rob Speer	7862a4d2b6	Switch to a more precise centibel scale.	2015-06-22 17:36:30 -04:00
Joshua Chin	35f472fcf9	updated test because the new tokenizer removes URLs	2015-06-18 11:38:28 -04:00
Rob Speer	611a6a35de	update Japanese data; test Japanese and token combining	2015-05-28 14:01:56 -04:00
Rob Speer	410912d8f0	remove old tests	2015-05-21 20:36:09 -04:00
Rob Speer	df863a5169	tests for new wordfreq with full coverage	2015-05-21 20:34:17 -04:00
Rob Speer	44ccf40742	A different plan for the top-level word_frequency function. When, before, I was importing wordfreq.query at the top level, this created a dependency loop when installing wordfreq. The new top-level __init__.py provides just a `word_frequency` function, which imports the real function as needed and calls it. This should avoid the dependency loop, at the cost of making `wordfreq.word_frequency` slightly less efficient than `wordfreq.query.word_frequency`.	2014-02-24 18:03:31 -05:00
Andrew Lin	68d262791c	Remove the tests for metanl_word_frequency too. Doh.	2013-11-11 13:21:25 -05:00
Rob Speer	823b3828cd	Clear wordlists before inserting them; yell at Python 2	2013-11-01 19:29:37 -04:00
Rob Speer	2b2bd943d2	make the tests less picky about numerical exactness	2013-10-31 15:43:19 -04:00
Rob Speer	0d2fb21726	The metanl scale is not what I thought it was.	2013-10-31 14:38:01 -04:00
Rob Speer	2cf812a64e	When strings are inconsistent between py2 and 3, don't test them on py2.	2013-10-31 13:11:13 -04:00
Rob Speer	3063b3915a	Revise the build test to compare lengths of wordlists. The test currently fails on Python 3, for some strange reason.	2013-10-30 13:22:56 -04:00
Rob Speer	be183b2564	Change default values to offsets.	2013-10-29 18:06:47 -04:00
Rob Speer	2907f7f077	now this package has tests	2013-10-29 17:21:55 -04:00

44 Commits