wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Joshua Chin	d875aa8842	updated gen_regex to be run as script Former-commit-id: `22fbea4248`	2015-07-07 14:50:56 -04:00
Joshua Chin	d4685759e3	removed unused imports Former-commit-id: `f3f9a654ea`	2015-07-07 14:48:11 -04:00
Joshua Chin	3d221f0605	updated imports Former-commit-id: `f2b615b0f0`	2015-07-07 14:46:42 -04:00
Joshua Chin	c5135edd88	imports are already cached Former-commit-id: `b1cd2e01d3`	2015-07-07 14:44:50 -04:00
Joshua Chin	93681e43b3	factored out regex generation Former-commit-id: `476a909e4d`	2015-07-07 14:38:21 -04:00
Joshua Chin	9a513a2224	factored out emoji regex Former-commit-id: `781a072713`	2015-07-07 14:37:31 -04:00
Joshua Chin	9c741bb341	updated tests Former-commit-id: `ca66a5f883`	2015-07-07 14:13:28 -04:00
Joshua Chin	1c365e6a50	Removes mention of Rosette from README	2015-07-07 10:32:16 -04:00
Robyn Speer	090cfa7088	declare 'mecab' as an extra Former-commit-id: `a69ea5ad52`	2015-07-02 17:11:51 -04:00
Robyn Speer	83939020d0	declare that tests require mecab-python3 Former-commit-id: `7b4ebd1805`	2015-07-02 11:29:11 -04:00
Joshua Chin	427d5e7fc7	Merge pull request #5 from LuminosoInsight/add-twitter-build add 'twitter' as a final build, and a new build dir	2015-07-01 18:01:32 -04:00
Joshua Chin	98788628e9	Merge pull request #14 from LuminosoInsight/add-twitter-wordlists Add twitter wordlists Former-commit-id: `2acca8a27a`	2015-07-01 18:00:30 -04:00
Robyn Speer	9615b9f843	test and document new twitter wordlists Former-commit-id: `14cb408100`	2015-07-01 17:53:38 -04:00
Robyn Speer	215eafc50b	add Twitter-specific wordlists Former-commit-id: `7e3066d3fc`	2015-07-01 17:49:33 -04:00
Robyn Speer	3eb3e7c388	add 'twitter' as a final build, and a new build dir The `data/dist` directory is now a convenient place to find the final built files that can be copied into wordfreq.	2015-07-01 17:45:39 -04:00
Andrew Lin	9c461ae70e	Merge pull request #7 from LuminosoInsight/newbuild wordfreq 1.0b1 Former-commit-id: `dbc42830b4`	2015-07-01 16:51:39 -04:00
Joshua Chin	34a886feaa	Merge pull request #4 from LuminosoInsight/tokenization-cleanup remove wiki2tokens and tokenize_wikipedia	2015-07-01 11:34:30 -04:00
Joshua Chin	d8e3cc5383	Merge pull request #13 from LuminosoInsight/casefold-tokens Case-fold instead of just lowercasing tokens Former-commit-id: `95fc0c8e9d`	2015-07-01 11:34:02 -04:00
Robyn Speer	a9b9b2f080	update data using new build Former-commit-id: `f9a9ee7a82`	2015-07-01 11:18:39 -04:00
Robyn Speer	58c8bda21b	cope with occasional Unicode errors in the input	2015-06-30 17:05:40 -04:00
Robyn Speer	deed2f767c	remove wiki2tokens and tokenize_wikipedia These components are no longer necessary. Wikipedia output can and should be tokenized with the standard tokenizer, instead of the almost-equivalent one in the Nim code.	2015-06-30 15:28:01 -04:00
Robyn Speer	f17a04aa84	fix comment and whitespace involving tokenize_twitter	2015-06-30 15:18:37 -04:00
Robyn Speer	4997d776b9	case-fold instead of just lowercasing tokens Former-commit-id: `638467f600`	2015-06-30 15:14:02 -04:00
Robyn Speer	4c2b766f46	bump version number Former-commit-id: `053f372ebc`	2015-06-30 14:54:13 -04:00
Robyn Speer	15865f43a7	Merge pull request #12 from LuminosoInsight/split-emoji Added the results of the new wordfreq_builder that splits emoji. Former-commit-id: `7d25627e43`	2015-06-30 11:32:40 -04:00
Joshua Chin	fbd15947bb	revert changes to test_not_really_random Former-commit-id: `bbf7b9de34`	2015-06-30 11:29:14 -04:00
Joshua Chin	9b02abb5ea	changed english test to take random ascii words Former-commit-id: `a49b66880e`	2015-06-29 11:05:01 -04:00
Joshua Chin	d10109bb38	changed japanese test because the most common japanese ascii word keeps changing Former-commit-id: `5ed03b006c`	2015-06-29 11:04:19 -04:00
Joshua Chin	fa89956df3	Japanese people do not 'lol', they 'w' Former-commit-id: `17f11ebd26`	2015-06-29 11:01:13 -04:00
Joshua Chin	b321c14b2c	updated wordlists Former-commit-id: `6f02dfc883`	2015-06-29 11:00:39 -04:00
Robyn Speer	96b75dcf2b	Merge pull request #11 from LuminosoInsight/split-emoji wordfreq now splits emoji from text Former-commit-id: `6c76942da2`	2015-06-26 12:12:51 -04:00
Joshua Chin	2eb1358c35	fixed tatweel comment Former-commit-id: `811c199e15`	2015-06-26 10:00:47 -04:00
Joshua Chin	da370511e3	optimized ranges and treats unassigned codepoints like their neighbors Former-commit-id: `91a14f6e6e`	2015-06-25 14:38:32 -04:00
Joshua Chin	59818f524f	changed mecab_tokenize to a global variable Former-commit-id: `5fc448bc60`	2015-06-25 13:58:30 -04:00
Joshua Chin	e4ba652556	added uncategorized unicodes as not punctuation Former-commit-id: `5cdac0c54e`	2015-06-25 13:53:54 -04:00
Joshua Chin	7c8266aeb7	removes combining marks from arabic words instead of treating them as punctuation Former-commit-id: `cebca52ea3`	2015-06-25 12:36:41 -04:00
Joshua Chin	60782d3796	removes arabic commas Former-commit-id: `83797bd276`	2015-06-25 12:02:59 -04:00
Joshua Chin	78bff813e3	only import mecab once Former-commit-id: `6e1f7e30c6`	2015-06-25 11:41:19 -04:00
Joshua Chin	a0b7211451	updated tests for emoji splitting Former-commit-id: `3bcb3e84a1`	2015-06-25 11:25:51 -04:00
Joshua Chin	99562d04f8	uses DATA_PATH instead of explicit path Former-commit-id: `35a80e5f50`	2015-06-25 10:42:59 -04:00
Joshua Chin	d4b5530d0e	now uses ranges Former-commit-id: `f3a365fda9`	2015-06-25 10:39:04 -04:00
Joshua Chin	44e7fb5b70	added docstrings Former-commit-id: `d10737bb51`	2015-06-24 17:45:29 -04:00
Joshua Chin	9349b53f40	removed duplicate of non_punct.txt Former-commit-id: `d9ebeca734`	2015-06-24 17:36:57 -04:00
Joshua Chin	0ddf0220fa	added non_punct to MANIFEST.in and moved it into data Former-commit-id: `b198f4b0c2`	2015-06-24 17:30:01 -04:00
Joshua Chin	7bbcffd848	removed old FIXME Former-commit-id: `d372b5618c`	2015-06-24 17:15:50 -04:00
Joshua Chin	fd2b0fc015	caches non_punct regex in non_punct.txt Former-commit-id: `f576ca58ae`	2015-06-24 17:11:50 -04:00
Robyn Speer	f8ac142bcf	Merge pull request #10 from LuminosoInsight/returns-none-bugfix word_frequency no longer returns None if it does not detect tokens Former-commit-id: `97bbb97f63`	2015-06-24 17:04:36 -04:00
Joshua Chin	90c41de48a	splits emoji from text Former-commit-id: `78c5b589c5`	2015-06-24 16:50:28 -04:00
Joshua Chin	af4b4e56c9	word_frequency no longer returns None if it does not detect tokens Former-commit-id: `2346a0535a`	2015-06-24 14:47:26 -04:00
Joshua Chin	57579f0e56	Merge pull request #3 from LuminosoInsight/centibels Switch to a centibel scale, add a header to the data	2015-06-23 12:59:20 -04:00

603 Commits