wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Robyn Speer	969a024dea	actually use the results of language-detection on Reddit Former-commit-id: `75a4a92110`	2016-03-24 16:27:24 -04:00
Robyn Speer	fbc19995ab	Merge remote-tracking branch 'origin/master' into big-list Conflicts: wordfreq_builder/wordfreq_builder/cli/merge_counts.py Former-commit-id: `164a5b1a05`	2016-03-24 14:11:44 -04:00
Robyn Speer	f493d0eec4	make max-words a real, documented parameter Former-commit-id: `178a8b1494`	2016-03-24 14:10:02 -04:00
Robyn Speer	298cb69353	Merge pull request #33 from LuminosoInsight/bugfix Restore a missing comma. Former-commit-id: `7b539f9057`	2016-03-24 13:59:50 -04:00
Andrew Lin	1942bc690f	Restore a missing comma. Former-commit-id: `38016cf62b`	2016-03-24 13:57:18 -04:00
Andrew Lin	68e7846d50	Merge pull request #32 from LuminosoInsight/thai-fix Leave Thai segments alone in the default regex Former-commit-id: `84497429e1`	2016-03-10 11:57:44 -05:00
Robyn Speer	f25985379c	move Thai test to where it makes more sense Former-commit-id: `4ec6b56faa`	2016-03-10 11:56:15 -05:00
Robyn Speer	51e260b713	Leave Thai segments alone in the default regex. Our regex already has a special case to leave Chinese and Japanese alone when an appropriate tokenizer for the language isn't being used, as Unicode's default segmentation would make every character into its own token. The same thing happens in Thai, and we don't even have an appropriate tokenizer for Thai, so I've added a similar fallback. Former-commit-id: `07f16e6f03`	2016-02-22 14:32:59 -05:00
Robyn Speer	6344b38194	Add and document large wordlists Former-commit-id: `d79ee37da9`	2016-01-22 16:23:43 -05:00
Robyn Speer	12e779fc79	configuration that builds some larger lists Former-commit-id: `c1a12cebec`	2016-01-22 14:20:12 -05:00
Robyn Speer	83559a53d4	add Zipf scale Former-commit-id: `9907948d11`	2016-01-21 14:07:01 -05:00
slibs63	927d4f45a4	Merge pull request #30 from LuminosoInsight/add-reddit Add English data from Reddit corpus Former-commit-id: `d18fee3d78`	2016-01-14 15:52:39 -05:00
Robyn Speer	6eca3cff5a	fix documentation in wordfreq_builder.tokenizers Former-commit-id: `8ddc19a5ca`	2016-01-13 15:18:12 -05:00
Robyn Speer	95cdf41fe8	reformat some argparse argument definitions Former-commit-id: `511fcb6f91`	2016-01-13 12:05:07 -05:00
Robyn Speer	738243e244	build a bigger wordlist that we can optionally use Former-commit-id: `df8caaff7d`	2016-01-12 14:05:57 -05:00
Robyn Speer	2069e30c89	fix usage text: one comment, not one tweet Former-commit-id: `8d9668d8ab`	2016-01-12 13:05:38 -05:00
Robyn Speer	883aa5baeb	Separate tokens with spaces, not line breaks, in intermediate files Former-commit-id: `115c74583e`	2016-01-12 12:59:18 -05:00
Andrew Lin	eae7b2752e	Merge pull request #31 from LuminosoInsight/use_encoding Specify encoding when dealing with files Former-commit-id: `f30efebba0`	2015-12-23 16:13:47 -05:00
Sara Jewett	42d209cbe2	Specify encoding when dealing with files Former-commit-id: `37f9e12b93`	2015-12-23 15:49:13 -05:00
Robyn Speer	7d1719cfb4	builder: Use an optional cutoff when merging counts. This allows the Reddit-merging step to not use such a ludicrous amount of memory. Former-commit-id: `973caca253`	2015-12-15 14:44:34 -05:00
Robyn Speer	f5e09f3f3d	gzip the intermediate step of Reddit word counting Former-commit-id: `9a5d9d66bb`	2015-12-09 13:30:08 -05:00
Robyn Speer	682e08fee2	no Thai because we can't tokenize it Former-commit-id: `95f53e295b`	2015-12-02 12:38:03 -05:00
Robyn Speer	064ee22a33	forgot about Italian Former-commit-id: `8f6cd0e57b`	2015-11-30 18:18:24 -05:00
Robyn Speer	ab8c2e2331	add tokenizer for Reddit Former-commit-id: `5ef807117d`	2015-11-30 18:16:54 -05:00
Robyn Speer	23949a4512	rebuild data files Former-commit-id: `2dcf368481`	2015-11-30 17:06:39 -05:00
Robyn Speer	6d2709f064	add word frequencies from the Reddit 2007-2015 corpus Former-commit-id: `b2d7546d2d`	2015-11-30 16:38:11 -05:00
Robyn Speer	eb08c0a951	add docstrings to chinese_ and japanese_tokenize Former-commit-id: `e1f7a1ccf3`	2015-10-27 13:23:56 -04:00
Lance Nathan	f4d865c0be	Merge pull request #28 from LuminosoInsight/chinese-external-wordlist Add some tokenizer options Former-commit-id: `ca00dfa1d9`	2015-10-19 18:21:52 -04:00
Robyn Speer	5fedd71a66	Define globals in relevant places Former-commit-id: `a6b6aa07e7`	2015-10-19 18:15:54 -04:00
Robyn Speer	91a81c1bde	clarify the tokenize docstring Former-commit-id: `bfc17fea9f`	2015-10-19 12:18:12 -04:00
Robyn Speer	c9693c9502	Merge branch 'master' into chinese-external-wordlist Conflicts: wordfreq/chinese.py Former-commit-id: `1793c1bb2e`	2015-09-28 14:34:59 -04:00
Andrew Lin	6d5ead0b47	Merge pull request #29 from LuminosoInsight/code-review-notes-20150925 Fix documentation and clean up, based on Sep 25 code review Former-commit-id: `15d99be21b`	2015-09-28 13:53:50 -04:00
Robyn Speer	f3f66508bd	Fix documentation and clean up, based on Sep 25 code review Former-commit-id: `44b0c4f9ba`	2015-09-28 12:58:46 -04:00
Robyn Speer	7494ae27a7	fix missing word in rules.ninja comment Former-commit-id: `9b1c4d66cd`	2015-09-24 17:56:06 -04:00
Robyn Speer	8e963dc312	describe optional dependencies better in the README Former-commit-id: `b460eef444`	2015-09-24 17:54:52 -04:00
Robyn Speer	960dc437a2	update and clean up the tokenize() docstring Former-commit-id: `24b16d8a5d`	2015-09-24 17:47:16 -04:00
Robyn Speer	4a4534c466	test_chinese: fix typo in comment Former-commit-id: `2a84a926f5`	2015-09-24 13:41:11 -04:00
Robyn Speer	e15a231401	Merge branch 'master' into chinese-external-wordlist Conflicts: wordfreq/chinese.py Former-commit-id: `cea2a61444`	2015-09-24 13:40:08 -04:00
Andrew Lin	e27a75029d	Revert "Remove the no-longer-existent .txt files from the MANIFEST." This reverts commit `2089090151` [formerly `db41bc7902`]. Former-commit-id: `cd0797e1c8`	2015-09-24 13:31:34 -04:00
Andrew Lin	bb4653f16f	Merge pull request #27 from LuminosoInsight/chinese-and-more Improve Chinese, Greek, English; add Turkish, Polish, Swedish Former-commit-id: `710eaabbe1`	2015-09-24 13:25:21 -04:00
Andrew Lin	e7d46fb104	Revert a small syntax change introduced by a circular series of changes. Former-commit-id: `09597b7cf3`	2015-09-24 13:24:11 -04:00
Robyn Speer	4d00f17477	don't apply the inferred-space penalty to Japanese Former-commit-id: `db5eda6051`	2015-09-24 12:50:06 -04:00
Andrew Lin	6b163e5772	Revert "Remove the no-longer-existent .txt files from the MANIFEST." This reverts commit `2089090151` [formerly `db41bc7902`]. Former-commit-id: `bb70bdba58`	2015-09-23 13:02:40 -04:00
Robyn Speer	d215f79ea3	describe the use of `lang` in `read_values` Former-commit-id: `f224b8dbba`	2015-09-22 17:22:38 -04:00
Robyn Speer	e6e29a1c03	Make the jieba_deps comment make sense Former-commit-id: `7c12f2aca1`	2015-09-22 17:19:00 -04:00
Robyn Speer	b4628abb38	actually, still delay loading the Jieba tokenizer Former-commit-id: `48734d1a60`	2015-09-22 16:54:39 -04:00
Robyn Speer	13642d6a4d	replace the literal 10 with the constant INFERRED_SPACE_FACTOR Former-commit-id: `7a3ea2bf79`	2015-09-22 16:46:07 -04:00
Robyn Speer	01f9c07c33	remove unnecessary delayed loads in wordfreq.chinese Former-commit-id: `4a87890afd`	2015-09-22 16:42:13 -04:00
Robyn Speer	db30d09947	load the Chinese character mapping from a .msgpack.gz file Former-commit-id: `6cf4210187`	2015-09-22 16:32:33 -04:00
Robyn Speer	fe8a6b51e7	document what this file is for Former-commit-id: `06f8b29971`	2015-09-22 15:31:27 -04:00


628 Commits