wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-24 01:41:39 +00:00

Author	SHA1	Message	Date
Rob Speer	c3fd3bd734	fix Arabic test, where 'lol' is no longer common Former-commit-id: `da79dfb247`	2016-05-11 17:01:47 -04:00
Rob Speer	c2eab6881e	move Thai test to where it makes more sense Former-commit-id: `4ec6b56faa`	2016-03-10 11:56:15 -05:00
Rob Speer	a32162c04f	Leave Thai segments alone in the default regex Our regex already has a special case to leave Chinese and Japanese alone when an appropriate tokenizer for the language isn't being used, as Unicode's default segmentation would make every character into its own token. The same thing happens in Thai, and we don't even have an appropriate tokenizer for Thai, so I've added a similar fallback. Former-commit-id: `07f16e6f03`	2016-02-22 14:32:59 -05:00
Rob Speer	f89ac5e400	test_chinese: fix typo in comment Former-commit-id: `2a84a926f5`	2015-09-24 13:41:11 -04:00
Rob Speer	faf66e9b08	Merge branch 'master' into chinese-external-wordlist Conflicts: wordfreq/chinese.py Former-commit-id: `cea2a61444`	2015-09-24 13:40:08 -04:00
Andrew Lin	ee6df56514	Revert a small syntax change introduced by a circular series of changes. Former-commit-id: `09597b7cf3`	2015-09-24 13:24:11 -04:00
Rob Speer	1b7117952b	don't apply the inferred-space penalty to Japanese Former-commit-id: `db5eda6051`	2015-09-24 12:50:06 -04:00
Rob Speer	963e0ff785	refactor the tokenizer, add `include_punctuation` option Former-commit-id: `e8e6e0a231`	2015-09-15 13:26:09 -04:00
Rob Speer	e3a79ab8c9	add `external_wordlist` option to tokenize Former-commit-id: `669bd16c13`	2015-09-10 18:09:41 -04:00
Rob Speer	a13f459f88	Lower the frequency of phrases with inferred token boundaries Former-commit-id: `5c8c36f4e3`	2015-09-10 14:16:22 -04:00
Rob Speer	91cc82f76d	tokenize Chinese using jieba and our own frequencies Former-commit-id: `2327f2e4d6`	2015-09-05 03:16:56 -04:00
Rob Speer	63295fc397	add tests for Turkish Former-commit-id: `fc93c8dc9c`	2015-09-04 17:00:05 -04:00
Rob Speer	f4cf46ab9c	Use the regex implementation of Unicode segmentation Former-commit-id: `95998205ad`	2015-08-24 17:11:08 -04:00
Andrew Lin	10bddfe09f	Document the NFKC-normalized ligature in the Arabic test. Former-commit-id: `41e1dd41d8`	2015-08-03 11:09:44 -04:00
Andrew Lin	a5553676e4	Switch to more explanatory Unicode escapes when testing NFKC normalization. Former-commit-id: `66c69e6fac`	2015-07-31 19:23:42 -04:00
Joshua Chin	423b2d8443	ensure removal of tatweels (hopefully) Former-commit-id: `173278fdd3`	2015-07-20 16:48:36 -04:00
Joshua Chin	d0e0287d71	updated comments Former-commit-id: `131b916c57`	2015-07-17 14:50:12 -04:00
Andrew Lin	081fde93e3	Express the combining of word frequencies in an explicitly associative and commutative way. Former-commit-id: `32b4033d63`	2015-07-09 15:29:05 -04:00
Joshua Chin	b145e02ce4	removed unused imports Former-commit-id: `b9578ae21e`	2015-07-07 16:21:22 -04:00
Joshua Chin	927aaae920	updated minimum Former-commit-id: `59c03e2411`	2015-07-07 15:46:33 -04:00
Joshua Chin	53323f8ea7	added arabic tests Former-commit-id: `f83d31a357`	2015-07-07 15:10:59 -04:00
Joshua Chin	d88470df4e	changed default to minimum for word_frequency Former-commit-id: `9aa773aa2b`	2015-07-07 15:03:26 -04:00
Joshua Chin	54f66d49ee	updated tests Former-commit-id: `ca66a5f883`	2015-07-07 14:13:28 -04:00
Rob Speer	3bf59fec57	test and document new twitter wordlists Former-commit-id: `14cb408100`	2015-07-01 17:53:38 -04:00
Rob Speer	b84ba2bc2e	update data using new build Former-commit-id: `f9a9ee7a82`	2015-07-01 11:18:39 -04:00
Rob Speer	8cac81666a	case-fold instead of just lowercasing tokens Former-commit-id: `638467f600`	2015-06-30 15:14:02 -04:00
Joshua Chin	5cc3dce834	revert changes to test_not_really_random Former-commit-id: `bbf7b9de34`	2015-06-30 11:29:14 -04:00
Joshua Chin	53c558ca90	changed english test to take random ascii words Former-commit-id: `a49b66880e`	2015-06-29 11:05:01 -04:00
Joshua Chin	ea5470a85a	changed japanese test because the most common japanese ascii word keeps changing Former-commit-id: `5ed03b006c`	2015-06-29 11:04:19 -04:00
Joshua Chin	000491c7cc	Japanese people do not 'lol', they 'w' Former-commit-id: `17f11ebd26`	2015-06-29 11:01:13 -04:00
Joshua Chin	09966989fb	updated tests for emoji splitting Former-commit-id: `3bcb3e84a1`	2015-06-25 11:25:51 -04:00
Rob Speer	b4600c9bd1	Switch to a more precise centibel scale. Former-commit-id: `7862a4d2b6`	2015-06-22 17:36:30 -04:00
Joshua Chin	529aa9afde	updated test because the new tokenizer removes URLs Former-commit-id: `35f472fcf9`	2015-06-18 11:38:28 -04:00
Rob Speer	1f41cb083c	update Japanese data; test Japanese and token combining Former-commit-id: `611a6a35de`	2015-05-28 14:01:56 -04:00
Rob Speer	a1c31d3390	remove old tests Former-commit-id: `410912d8f0`	2015-05-21 20:36:09 -04:00
Rob Speer	5b4107bd1d	tests for new wordfreq with full coverage Former-commit-id: `df863a5169`	2015-05-21 20:34:17 -04:00
Rob Speer	c7c8078883	A different plan for the top-level word_frequency function. When, before, I was importing wordfreq.query at the top level, this created a dependency loop when installing wordfreq. The new top-level __init__.py provides just a `word_frequency` function, which imports the real function as needed and calls it. This should avoid the dependency loop, at the cost of making `wordfreq.word_frequency` slightly less efficient than `wordfreq.query.word_frequency`. Former-commit-id: `44ccf40742`	2014-02-24 18:03:31 -05:00
Andrew Lin	3340367519	Remove the tests for metanl_word_frequency too. Doh. Former-commit-id: `68d262791c`	2013-11-11 13:21:25 -05:00
Rob Speer	1edee91b05	Clear wordlists before inserting them; yell at Python 2 Former-commit-id: `823b3828cd`	2013-11-01 19:29:37 -04:00
Rob Speer	280eca22ce	make the tests less picky about numerical exactness Former-commit-id: `2b2bd943d2`	2013-10-31 15:43:19 -04:00
Rob Speer	def8a71b44	The metanl scale is not what I thought it was. Former-commit-id: `0d2fb21726`	2013-10-31 14:38:01 -04:00
Rob Speer	2cf812a64e	When strings are inconsistent between py2 and 3, don't test them on py2.	2013-10-31 13:11:13 -04:00
Rob Speer	3063b3915a	Revise the build test to compare lengths of wordlists. The test currently fails on Python 3, for some strange reason.	2013-10-30 13:22:56 -04:00
Rob Speer	be183b2564	Change default values to offsets.	2013-10-29 18:06:47 -04:00
Rob Speer	2907f7f077	now this package has tests	2013-10-29 17:21:55 -04:00

45 Commits
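
Commit c7c8078883 describes a top-level `word_frequency` that imports the real function only when it is called, which avoids a dependency loop at install time at the cost of a slightly slower call. The following is a minimal sketch of that deferred-import pattern, not wordfreq's actual code: `lazy_function` is a hypothetical helper, and the standard-library `statistics.mean` stands in for `wordfreq.query.word_frequency` so the example runs without wordfreq installed.

```python
import importlib


def lazy_function(module_name, func_name):
    """Return a wrapper that defers importing `module_name` until call time.

    Hypothetical helper for illustration; not part of wordfreq.
    """
    def wrapper(*args, **kwargs):
        # The import is resolved here, on each call, rather than when the
        # wrapper is defined. After the first call the module comes from
        # the sys.modules cache, so the extra cost is a lookup.
        module = importlib.import_module(module_name)
        return getattr(module, func_name)(*args, **kwargs)
    return wrapper


# Stand-in for a top-level `word_frequency`: defining this name does not
# import `statistics`; only calling it does.
lazy_mean = lazy_function("statistics", "mean")

print(lazy_mean([1, 2, 3]))
```

Because the wrapper resolves the module on every call, it is somewhat slower than calling the underlying function directly, which matches the trade-off the commit message mentions.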