Commit Graph

89 Commits

Robyn Speer
4d00f17477 don't apply the inferred-space penalty to Japanese
Former-commit-id: db5eda6051
2015-09-24 12:50:06 -04:00
Robyn Speer
9a007b9948 refactor the tokenizer, add include_punctuation option
Former-commit-id: e8e6e0a231
2015-09-15 13:26:09 -04:00
Robyn Speer
1adbb1aaf1 add external_wordlist option to tokenize
Former-commit-id: 669bd16c13
2015-09-10 18:09:41 -04:00
Robyn Speer
f0c7c3a02c Lower the frequency of phrases with inferred token boundaries
Former-commit-id: 5c8c36f4e3
2015-09-10 14:16:22 -04:00
Robyn Speer
a4554fb87c tokenize Chinese using jieba and our own frequencies
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Robyn Speer
4704131e13 add tests for Turkish
Former-commit-id: fc93c8dc9c
2015-09-04 17:00:05 -04:00
Robyn Speer
8795525372 Use the regex implementation of Unicode segmentation
Former-commit-id: 95998205ad
2015-08-24 17:11:08 -04:00
Andrew Lin
e88cf3fdaf Document the NFKC-normalized ligature in the Arabic test.
Former-commit-id: 41e1dd41d8
2015-08-03 11:09:44 -04:00
Andrew Lin
b0fac15f98 Switch to more explanatory Unicode escapes when testing NFKC normalization.
Former-commit-id: 66c69e6fac
2015-07-31 19:23:42 -04:00
Joshua Chin
af8050f1b8 ensure removal of tatweels (hopefully)
Former-commit-id: 173278fdd3
2015-07-20 16:48:36 -04:00
Joshua Chin
e8fa25cb73 updated comments
Former-commit-id: 131b916c57
2015-07-17 14:50:12 -04:00
Andrew Lin
5c72e68b7e Express the combining of word frequencies in an explicitly associative and commutative way.
Former-commit-id: 32b4033d63
2015-07-09 15:29:05 -04:00
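The commit above combines word frequencies with an explicitly associative and commutative operation. A minimal sketch of what such a merge can look like (illustrative only; the names and weighting here are assumptions, not wordfreq's actual code):

```python
from collections import Counter

def combine(freqs_a, freqs_b):
    """Merge two word-frequency mappings.

    Counter addition is associative and commutative, so the result
    does not depend on the order in which wordlists are merged.
    """
    return Counter(freqs_a) + Counter(freqs_b)

a = {"the": 5, "cat": 1}
b = {"the": 3, "dog": 2}
print(combine(a, b))  # Counter({'the': 8, 'dog': 2, 'cat': 1})
```

Because the operation is order-independent, merging many wordlists gives the same totals no matter how the merge is grouped or sequenced.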
Joshua Chin
d4409a2214 removed unused imports
Former-commit-id: b9578ae21e
2015-07-07 16:21:22 -04:00
Joshua Chin
4b398fac65 updated minimum
Former-commit-id: 59c03e2411
2015-07-07 15:46:33 -04:00
Joshua Chin
b3a008f992 added arabic tests
Former-commit-id: f83d31a357
2015-07-07 15:10:59 -04:00
Joshua Chin
21c809416d changed default to minimum for word_frequency
Former-commit-id: 9aa773aa2b
2015-07-07 15:03:26 -04:00
Joshua Chin
9c741bb341 updated tests
Former-commit-id: ca66a5f883
2015-07-07 14:13:28 -04:00
Robyn Speer
9615b9f843 test and document new twitter wordlists
Former-commit-id: 14cb408100
2015-07-01 17:53:38 -04:00
Robyn Speer
a9b9b2f080 update data using new build
Former-commit-id: f9a9ee7a82
2015-07-01 11:18:39 -04:00
Robyn Speer
4997d776b9 case-fold instead of just lowercasing tokens
Former-commit-id: 638467f600
2015-06-30 15:14:02 -04:00
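The case-folding commit above swaps `str.lower()` for full Unicode case folding. The difference is visible with characters like the German sharp s (a standalone illustration, not wordfreq's actual tokenizer code):

```python
# lower() leaves the sharp s alone; casefold() expands it, so "Straße"
# and "STRASSE" normalize to the same token.
print("Straße".lower())     # "straße"
print("Straße".casefold())  # "strasse"
print("STRASSE".casefold()) # "strasse"
```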
Joshua Chin
fbd15947bb revert changes to test_not_really_random
Former-commit-id: bbf7b9de34
2015-06-30 11:29:14 -04:00
Joshua Chin
9b02abb5ea changed english test to take random ascii words
Former-commit-id: a49b66880e
2015-06-29 11:05:01 -04:00
Joshua Chin
d10109bb38 changed japanese test because the most common japanese ascii word keeps changing
Former-commit-id: 5ed03b006c
2015-06-29 11:04:19 -04:00
Joshua Chin
fa89956df3 Japanese people do not 'lol', they 'w'
Former-commit-id: 17f11ebd26
2015-06-29 11:01:13 -04:00
Joshua Chin
a0b7211451 updated tests for emoji splitting
Former-commit-id: 3bcb3e84a1
2015-06-25 11:25:51 -04:00
Robyn Speer
f3958d63ae Switch to a more precise centibel scale.
Former-commit-id: 7862a4d2b6
2015-06-22 17:36:30 -04:00
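The centibel-scale commit above stores frequencies on a logarithmic scale. A sketch of the conversion, assuming the common definition cB = 100 · log₁₀(frequency) (so rarer words are more negative); the function names are hypothetical:

```python
import math

def freq_to_cB(freq):
    """Convert a proportion (0 < freq <= 1) to centibels."""
    return 100 * math.log10(freq)

def cB_to_freq(cB):
    """Invert the centibel scale back to a proportion."""
    return 10 ** (cB / 100)

print(freq_to_cB(0.01))   # -200.0: a word occurring 1 time in 100
print(cB_to_freq(-300))   # 0.001: a word occurring 1 time in 1000
```

Storing integers on a centibel scale keeps the data compact while being ten times finer-grained than whole decibels.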
Joshua Chin
4706a38c7a updated test because the new tokenizer removes URLs
Former-commit-id: 35f472fcf9
2015-06-18 11:38:28 -04:00
Robyn Speer
860e929bf8 update Japanese data; test Japanese and token combining
Former-commit-id: 611a6a35de
2015-05-28 14:01:56 -04:00
Robyn Speer
4a865bfaec remove old tests
Former-commit-id: 410912d8f0
2015-05-21 20:36:09 -04:00
Robyn Speer
26517c1b86 tests for new wordfreq with full coverage
Former-commit-id: df863a5169
2015-05-21 20:34:17 -04:00
Robyn Speer
a06c3fc648 A different plan for the top-level word_frequency function.
When, before, I was importing wordfreq.query at the top level, this
created a dependency loop when installing wordfreq.

The new top-level __init__.py provides just a `word_frequency` function,
which imports the real function as needed and calls it. This should
avoid the dependency loop, at the cost of making
`wordfreq.word_frequency` slightly less efficient than
`wordfreq.query.word_frequency`.

Former-commit-id: 44ccf40742
2014-02-24 18:03:31 -05:00
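The commit message above describes breaking a circular dependency by deferring the import until call time. A self-contained sketch of that pattern (the `wordfreq_demo` stand-in modules below are illustrative; the real package imports `wordfreq.query`):

```python
import sys
import types

# Stand-in modules simulating the package layout described in the
# commit message, so the pattern can be demonstrated in isolation.
pkg = types.ModuleType("wordfreq_demo")
query = types.ModuleType("wordfreq_demo.query")
query.word_frequency = lambda word, lang: 0.0025  # placeholder frequency
sys.modules["wordfreq_demo"] = pkg
sys.modules["wordfreq_demo.query"] = query

def word_frequency(word, lang):
    """Top-level wrapper: import the real function only when called,
    so importing the package itself never pulls in the query module
    (avoiding the dependency loop at install time)."""
    from wordfreq_demo.query import word_frequency as real
    return real(word, lang)

print(word_frequency("the", "en"))  # delegates to the real function
```

The cost the commit mentions is the extra import lookup on each call of the wrapper, which is why the wrapped function is described as slightly less efficient than calling the inner one directly.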
Andrew Lin
181e8e08fa Remove the tests for metanl_word_frequency too. Doh.
Former-commit-id: 68d262791c
2013-11-11 13:21:25 -05:00
Robyn Speer
5f7c7e032c Clear wordlists before inserting them; yell at Python 2
Former-commit-id: 823b3828cd
2013-11-01 19:29:37 -04:00
Robyn Speer
5168da105a make the tests less picky about numerical exactness
Former-commit-id: 2b2bd943d2
2013-10-31 15:43:19 -04:00
Robyn Speer
773f6b9843 The metanl scale is not what I thought it was.
Former-commit-id: 0d2fb21726
2013-10-31 14:38:01 -04:00
Robyn Speer
101e767ad9 When strings are inconsistent between py2 and 3, don't test them on py2.
2013-10-31 13:11:13 -04:00
Robyn Speer
ea5de7cb2a Revise the build test to compare lengths of wordlists.
The test currently fails on Python 3, for some strange reason.
2013-10-30 13:22:56 -04:00
Robyn Speer
68f7b25cf7 Change default values to offsets.
2013-10-29 18:06:47 -04:00
Robyn Speer
8a48e57749 now this package has tests
2013-10-29 17:21:55 -04:00