wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Rob Speer	1df97a579e	bump version to 1.4	2016-03-24 16:29:29 -04:00
Rob Speer	75a4a92110	actually use the results of language-detection on Reddit	2016-03-24 16:27:24 -04:00
Rob Speer	164a5b1a05	Merge remote-tracking branch 'origin/master' into big-list. Conflicts: wordfreq_builder/wordfreq_builder/cli/merge_counts.py	2016-03-24 14:11:44 -04:00
Rob Speer	178a8b1494	make max-words a real, documented parameter	2016-03-24 14:10:02 -04:00
Rob Speer	7b539f9057	Merge pull request #33 from LuminosoInsight/bugfix: Restore a missing comma.	2016-03-24 13:59:50 -04:00
Andrew Lin	38016cf62b	Restore a missing comma.	2016-03-24 13:57:18 -04:00
Andrew Lin	84497429e1	Merge pull request #32 from LuminosoInsight/thai-fix: Leave Thai segments alone in the default regex	2016-03-10 11:57:44 -05:00
Rob Speer	4ec6b56faa	move Thai test to where it makes more sense	2016-03-10 11:56:15 -05:00
Rob Speer	07f16e6f03	Leave Thai segments alone in the default regex. Our regex already has a special case to leave Chinese and Japanese alone when an appropriate tokenizer for the language isn't being used, as Unicode's default segmentation would make every character into its own token. The same thing happens in Thai, and we don't even have an appropriate tokenizer for Thai, so I've added a similar fallback.	2016-02-22 14:32:59 -05:00
Rob Speer	d79ee37da9	Add and document large wordlists	2016-01-22 16:23:43 -05:00
Rob Speer	c1a12cebec	configuration that builds some larger lists	2016-01-22 14:20:12 -05:00
Rob Speer	9907948d11	add Zipf scale	2016-01-21 14:07:01 -05:00
slibs63	d18fee3d78	Merge pull request #30 from LuminosoInsight/add-reddit: Add English data from Reddit corpus	2016-01-14 15:52:39 -05:00
Rob Speer	8ddc19a5ca	fix documentation in wordfreq_builder.tokenizers	2016-01-13 15:18:12 -05:00
Rob Speer	511fcb6f91	reformat some argparse argument definitions	2016-01-13 12:05:07 -05:00
Rob Speer	df8caaff7d	build a bigger wordlist that we can optionally use	2016-01-12 14:05:57 -05:00
Rob Speer	8d9668d8ab	fix usage text: one comment, not one tweet	2016-01-12 13:05:38 -05:00
Rob Speer	115c74583e	Separate tokens with spaces, not line breaks, in intermediate files	2016-01-12 12:59:18 -05:00
Andrew Lin	f30efebba0	Merge pull request #31 from LuminosoInsight/use_encoding: Specify encoding when dealing with files	2015-12-23 16:13:47 -05:00
Sara Jewett	37f9e12b93	Specify encoding when dealing with files	2015-12-23 15:49:13 -05:00
Rob Speer	973caca253	builder: Use an optional cutoff when merging counts. This allows the Reddit-merging step to not use such a ludicrous amount of memory.	2015-12-15 14:44:34 -05:00
Rob Speer	9a5d9d66bb	gzip the intermediate step of Reddit word counting	2015-12-09 13:30:08 -05:00
Rob Speer	95f53e295b	no Thai because we can't tokenize it	2015-12-02 12:38:03 -05:00
Rob Speer	8f6cd0e57b	forgot about Italian	2015-11-30 18:18:24 -05:00
Rob Speer	5ef807117d	add tokenizer for Reddit	2015-11-30 18:16:54 -05:00
Rob Speer	2dcf368481	rebuild data files	2015-11-30 17:06:39 -05:00
Rob Speer	b2d7546d2d	add word frequencies from the Reddit 2007-2015 corpus	2015-11-30 16:38:11 -05:00
Rob Speer	e1f7a1ccf3	add docstrings to chinese_ and japanese_tokenize	2015-10-27 13:23:56 -04:00
Lance Nathan	ca00dfa1d9	Merge pull request #28 from LuminosoInsight/chinese-external-wordlist: Add some tokenizer options	2015-10-19 18:21:52 -04:00
Rob Speer	a6b6aa07e7	Define globals in relevant places	2015-10-19 18:15:54 -04:00
Rob Speer	bfc17fea9f	clarify the tokenize docstring	2015-10-19 12:18:12 -04:00
Rob Speer	1793c1bb2e	Merge branch 'master' into chinese-external-wordlist. Conflicts: wordfreq/chinese.py	2015-09-28 14:34:59 -04:00
Andrew Lin	15d99be21b	Merge pull request #29 from LuminosoInsight/code-review-notes-20150925: Fix documentation and clean up, based on Sep 25 code review	2015-09-28 13:53:50 -04:00
Rob Speer	44b0c4f9ba	Fix documentation and clean up, based on Sep 25 code review	2015-09-28 12:58:46 -04:00
Rob Speer	9b1c4d66cd	fix missing word in rules.ninja comment	2015-09-24 17:56:06 -04:00
Rob Speer	b460eef444	describe optional dependencies better in the README	2015-09-24 17:54:52 -04:00
Rob Speer	24b16d8a5d	update and clean up the tokenize() docstring	2015-09-24 17:47:16 -04:00
Rob Speer	2a84a926f5	test_chinese: fix typo in comment	2015-09-24 13:41:11 -04:00
Rob Speer	cea2a61444	Merge branch 'master' into chinese-external-wordlist. Conflicts: wordfreq/chinese.py	2015-09-24 13:40:08 -04:00
Andrew Lin	cd0797e1c8	Revert "Remove the no-longer-existent .txt files from the MANIFEST." This reverts commit `db41bc7902`.	2015-09-24 13:31:34 -04:00
Andrew Lin	710eaabbe1	Merge pull request #27 from LuminosoInsight/chinese-and-more: Improve Chinese, Greek, English; add Turkish, Polish, Swedish	2015-09-24 13:25:21 -04:00
Andrew Lin	09597b7cf3	Revert a small syntax change introduced by a circular series of changes.	2015-09-24 13:24:11 -04:00
Rob Speer	db5eda6051	don't apply the inferred-space penalty to Japanese	2015-09-24 12:50:06 -04:00
Andrew Lin	bb70bdba58	Revert "Remove the no-longer-existent .txt files from the MANIFEST." This reverts commit `db41bc7902`.	2015-09-23 13:02:40 -04:00
Rob Speer	f224b8dbba	describe the use of `lang` in `read_values`	2015-09-22 17:22:38 -04:00
Rob Speer	7c12f2aca1	Make the jieba_deps comment make sense	2015-09-22 17:19:00 -04:00
Rob Speer	48734d1a60	actually, still delay loading the Jieba tokenizer	2015-09-22 16:54:39 -04:00
Rob Speer	7a3ea2bf79	replace the literal 10 with the constant INFERRED_SPACE_FACTOR	2015-09-22 16:46:07 -04:00
Rob Speer	4a87890afd	remove unnecessary delayed loads in wordfreq.chinese	2015-09-22 16:42:13 -04:00
Rob Speer	6cf4210187	load the Chinese character mapping from a .msgpack.gz file	2015-09-22 16:32:33 -04:00

429 Commits