wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Joshua Chin	7e9338f87e	cleaned up gen regex Former-commit-id: `27ea107e6f`	2015-07-07 16:00:24 -04:00
Joshua Chin	a72b4abb48	revert to using global mecab_tokenize variable Former-commit-id: `189a5b9cd6`	2015-07-07 15:47:37 -04:00
Joshua Chin	4b398fac65	updated minimum Former-commit-id: `59c03e2411`	2015-07-07 15:46:33 -04:00
Joshua Chin	4389422958	updated emoji parser Former-commit-id: `f04ca8fc9e`	2015-07-07 15:43:34 -04:00
Joshua Chin	94ba6e650f	updated docstring Former-commit-id: `9b851f3afe`	2015-07-07 15:33:51 -04:00
Joshua Chin	16a785a5e4	factored out range loading Former-commit-id: `32803b235b`	2015-07-07 15:33:36 -04:00
Joshua Chin	a87d84b796	fixed spacing Former-commit-id: `ae4699029d`	2015-07-07 15:23:15 -04:00
Joshua Chin	cb4e444723	fixed gen_regex Former-commit-id: `5510fce675`	2015-07-07 15:22:04 -04:00
Joshua Chin	b3a008f992	added arabic tests Former-commit-id: `f83d31a357`	2015-07-07 15:10:59 -04:00
Joshua Chin	21c809416d	changed default to minimum for word_frequency Former-commit-id: `9aa773aa2b`	2015-07-07 15:03:26 -04:00
Joshua Chin	7279ab1e80	added docstring to top_n_list Former-commit-id: `0b25caaf24`	2015-07-07 15:01:39 -04:00
Joshua Chin	a408e6f96a	fix grammar Former-commit-id: `bd172594d3`	2015-07-07 14:59:28 -04:00
Joshua Chin	02526f658c	updated _emoji_char_class docstring Former-commit-id: `10b5727725`	2015-07-07 14:58:50 -04:00
Joshua Chin	3e7f812e0c	removed intermediate lists Former-commit-id: `5342ea3033`	2015-07-07 14:57:45 -04:00
Joshua Chin	25336c45b4	updated number of words to 5 Former-commit-id: `4b49b1a547`	2015-07-07 14:56:40 -04:00
Joshua Chin	e1435136e3	updated word_frequency docstring Former-commit-id: `4304a400f7`	2015-07-07 14:56:12 -04:00
Joshua Chin	a9a48229e0	run cB_to_freq only once per bucket Former-commit-id: `5e8ef19321`	2015-07-07 14:55:13 -04:00
Joshua Chin	8ab615dde9	use itertools.chain Former-commit-id: `6a40e63060`	2015-07-07 14:54:19 -04:00
Joshua Chin	0308246e72	fixed Error string Former-commit-id: `bbdc064528`	2015-07-07 14:51:46 -04:00
Joshua Chin	d875aa8842	updated gen_regex to be run as script Former-commit-id: `22fbea4248`	2015-07-07 14:50:56 -04:00
Joshua Chin	d4685759e3	removed unused imports Former-commit-id: `f3f9a654ea`	2015-07-07 14:48:11 -04:00
Joshua Chin	3d221f0605	updated imports Former-commit-id: `f2b615b0f0`	2015-07-07 14:46:42 -04:00
Joshua Chin	c5135edd88	imports are already cached Former-commit-id: `b1cd2e01d3`	2015-07-07 14:44:50 -04:00
Joshua Chin	93681e43b3	factored out regex generation Former-commit-id: `476a909e4d`	2015-07-07 14:38:21 -04:00
Joshua Chin	9a513a2224	factored out emoji regex Former-commit-id: `781a072713`	2015-07-07 14:37:31 -04:00
Joshua Chin	9c741bb341	updated tests Former-commit-id: `ca66a5f883`	2015-07-07 14:13:28 -04:00
Joshua Chin	1c365e6a50	Removes mention of Rosette from README	2015-07-07 10:32:16 -04:00
Robyn Speer	090cfa7088	declare 'mecab' as an extra Former-commit-id: `a69ea5ad52`	2015-07-02 17:11:51 -04:00
Robyn Speer	83939020d0	declare that tests require mecab-python3 Former-commit-id: `7b4ebd1805`	2015-07-02 11:29:11 -04:00
Joshua Chin	427d5e7fc7	Merge pull request #5 from LuminosoInsight/add-twitter-build add 'twitter' as a final build, and a new build dir	2015-07-01 18:01:32 -04:00
Joshua Chin	98788628e9	Merge pull request #14 from LuminosoInsight/add-twitter-wordlists Add twitter wordlists Former-commit-id: `2acca8a27a`	2015-07-01 18:00:30 -04:00
Robyn Speer	9615b9f843	test and document new twitter wordlists Former-commit-id: `14cb408100`	2015-07-01 17:53:38 -04:00
Robyn Speer	215eafc50b	add Twitter-specific wordlists Former-commit-id: `7e3066d3fc`	2015-07-01 17:49:33 -04:00
Robyn Speer	3eb3e7c388	add 'twitter' as a final build, and a new build dir The `data/dist` directory is now a convenient place to find the final built files that can be copied into wordfreq.	2015-07-01 17:45:39 -04:00
Andrew Lin	9c461ae70e	Merge pull request #7 from LuminosoInsight/newbuild wordfreq 1.0b1 Former-commit-id: `dbc42830b4`	2015-07-01 16:51:39 -04:00
Joshua Chin	34a886feaa	Merge pull request #4 from LuminosoInsight/tokenization-cleanup remove wiki2tokens and tokenize_wikipedia	2015-07-01 11:34:30 -04:00
Joshua Chin	d8e3cc5383	Merge pull request #13 from LuminosoInsight/casefold-tokens Case-fold instead of just lowercasing tokens Former-commit-id: `95fc0c8e9d`	2015-07-01 11:34:02 -04:00
Robyn Speer	a9b9b2f080	update data using new build Former-commit-id: `f9a9ee7a82`	2015-07-01 11:18:39 -04:00
Robyn Speer	58c8bda21b	cope with occasional Unicode errors in the input	2015-06-30 17:05:40 -04:00
Robyn Speer	deed2f767c	remove wiki2tokens and tokenize_wikipedia These components are no longer necessary. Wikipedia output can and should be tokenized with the standard tokenizer, instead of the almost-equivalent one in the Nim code.	2015-06-30 15:28:01 -04:00
Robyn Speer	f17a04aa84	fix comment and whitespace involving tokenize_twitter	2015-06-30 15:18:37 -04:00
Robyn Speer	4997d776b9	case-fold instead of just lowercasing tokens Former-commit-id: `638467f600`	2015-06-30 15:14:02 -04:00
Robyn Speer	4c2b766f46	bump version number Former-commit-id: `053f372ebc`	2015-06-30 14:54:13 -04:00
Robyn Speer	15865f43a7	Merge pull request #12 from LuminosoInsight/split-emoji Added the results of the new wordfreq_builder that splits emoji. Former-commit-id: `7d25627e43`	2015-06-30 11:32:40 -04:00
Joshua Chin	fbd15947bb	revert changes to test_not_really_random Former-commit-id: `bbf7b9de34`	2015-06-30 11:29:14 -04:00
Joshua Chin	9b02abb5ea	changed english test to take random ascii words Former-commit-id: `a49b66880e`	2015-06-29 11:05:01 -04:00
Joshua Chin	d10109bb38	changed japanese test because the most common japanese ascii word keeps changing Former-commit-id: `5ed03b006c`	2015-06-29 11:04:19 -04:00
Joshua Chin	fa89956df3	Japanese people do not 'lol', they 'w' Former-commit-id: `17f11ebd26`	2015-06-29 11:01:13 -04:00
Joshua Chin	b321c14b2c	updated wordlists Former-commit-id: `6f02dfc883`	2015-06-29 11:00:39 -04:00
Robyn Speer	96b75dcf2b	Merge pull request #11 from LuminosoInsight/split-emoji wordfreq now splits emoji from text Former-commit-id: `6c76942da2`	2015-06-26 12:12:51 -04:00

622 Commits