Commit Graph

417 Commits

Author SHA1 Message Date
Rob Speer
07f16e6f03 Leave Thai segments alone in the default regex
Our regex already has a special case to leave Chinese and Japanese alone
when an appropriate tokenizer for the language isn't being used, as
Unicode's default segmentation would make every character into its own
token.

The same thing happens in Thai, and we don't even *have* an appropriate
tokenizer for Thai, so I've added a similar fallback.
2016-02-22 14:32:59 -05:00
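For context, the fallback this commit describes can be sketched as follows. This is illustrative, not wordfreq's actual pattern (the real code uses the third-party `regex` module with script properties); explicit Unicode block ranges are used here so the sketch runs on the stdlib `re` module alone.

```python
import re

# Hedged sketch of the fallback: keep a run of Thai characters together as a
# single token, mirroring the special case already in place for Chinese and
# Japanese. Without this, Unicode's default segmentation would make every
# Thai character its own token.
TOKEN_RE = re.compile(
    r"[\u0e00-\u0e7f]+"   # Thai block: one token per run, not per character
    r"|[\u4e00-\u9fff]+"  # CJK ideographs: the existing fallback
    r"|\w+"               # ordinary words elsewhere
)

def simple_tokenize(text):
    return TOKEN_RE.findall(text.lower())

print(simple_tokenize("hello สวัสดี world"))  # -> ['hello', 'สวัสดี', 'world']
```

The Thai and Han branches come before `\w+` in the alternation, so a run of Thai text is claimed whole before the generic word rule can see it.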
slibs63
d18fee3d78 Merge pull request #30 from LuminosoInsight/add-reddit
Add English data from Reddit corpus
2016-01-14 15:52:39 -05:00
Rob Speer
8ddc19a5ca fix documentation in wordfreq_builder.tokenizers 2016-01-13 15:18:12 -05:00
Rob Speer
511fcb6f91 reformat some argparse argument definitions 2016-01-13 12:05:07 -05:00
Rob Speer
8d9668d8ab fix usage text: one comment, not one tweet 2016-01-12 13:05:38 -05:00
Rob Speer
115c74583e Separate tokens with spaces, not line breaks, in intermediate files 2016-01-12 12:59:18 -05:00
Andrew Lin
f30efebba0 Merge pull request #31 from LuminosoInsight/use_encoding
Specify encoding when dealing with files
2015-12-23 16:13:47 -05:00
Sara Jewett
37f9e12b93 Specify encoding when dealing with files 2015-12-23 15:49:13 -05:00
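The change is small but easy to get wrong: the illustration below shows the pattern the commit enforces, passing `encoding=` explicitly on every `open()` call instead of relying on the platform default (which can be cp1252 on Windows and mangle non-ASCII text). The path and file contents here are throwaway examples.

```python
import os
import tempfile

# Always name the encoding explicitly when reading or writing text files,
# so the same bytes round-trip identically on every platform.
path = os.path.join(tempfile.mkdtemp(), "counts.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write("café 12\n")

with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

print(round_tripped)  # -> café 12
```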
Rob Speer
973caca253 builder: Use an optional cutoff when merging counts
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.
2015-12-15 14:44:34 -05:00
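The idea behind this commit can be sketched as below. The function name and signature are illustrative, not wordfreq_builder's actual API: when merging word counts from many input files, an optional cutoff discards words whose merged count is too low, so intermediate merge results for a huge corpus like Reddit don't hold millions of one-off tokens in memory.

```python
from collections import defaultdict

def merge_counts(count_dicts, cutoff=0):
    """Merge several {word: count} dicts; optionally drop rare words.

    In a staged merge pipeline, pruning each intermediate result with a
    cutoff keeps peak memory bounded. (Illustrative sketch only.)
    """
    merged = defaultdict(int)
    for counts in count_dicts:
        for word, count in counts.items():
            merged[word] += count
    if cutoff > 0:
        merged = {w: c for w, c in merged.items() if c >= cutoff}
    return dict(merged)

parts = [{"the": 5, "narwhal": 1}, {"the": 7, "bacons": 1}]
print(merge_counts(parts, cutoff=2))  # -> {'the': 12}
```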
Rob Speer
9a5d9d66bb gzip the intermediate step of Reddit word counting 2015-12-09 13:30:08 -05:00
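A minimal sketch of this change, with an illustrative file name and line format: the intermediate word-count output is written through `gzip` instead of as plain text, shrinking the pipeline's on-disk footprint while staying line-oriented.

```python
import gzip
import os
import tempfile

# Write and read the intermediate counts through gzip in text mode,
# with an explicit encoding.
path = os.path.join(tempfile.mkdtemp(), "counts.txt.gz")

with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("the 12\nnarwhal 1\n")

with gzip.open(path, "rt", encoding="utf-8") as f:
    lines = f.read().splitlines()

print(lines)  # -> ['the 12', 'narwhal 1']
```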
Rob Speer
95f53e295b no Thai because we can't tokenize it 2015-12-02 12:38:03 -05:00
Rob Speer
8f6cd0e57b forgot about Italian 2015-11-30 18:18:24 -05:00
Rob Speer
5ef807117d add tokenizer for Reddit 2015-11-30 18:16:54 -05:00
Rob Speer
2dcf368481 rebuild data files 2015-11-30 17:06:39 -05:00
Rob Speer
b2d7546d2d add word frequencies from the Reddit 2007-2015 corpus 2015-11-30 16:38:11 -05:00
Rob Speer
e1f7a1ccf3 add docstrings to chinese_ and japanese_tokenize 2015-10-27 13:23:56 -04:00
Lance Nathan
ca00dfa1d9 Merge pull request #28 from LuminosoInsight/chinese-external-wordlist
Add some tokenizer options
2015-10-19 18:21:52 -04:00
Rob Speer
a6b6aa07e7 Define globals in relevant places 2015-10-19 18:15:54 -04:00
Rob Speer
bfc17fea9f clarify the tokenize docstring 2015-10-19 12:18:12 -04:00
Rob Speer
1793c1bb2e Merge branch 'master' into chinese-external-wordlist
Conflicts:
	wordfreq/chinese.py
2015-09-28 14:34:59 -04:00
Andrew Lin
15d99be21b Merge pull request #29 from LuminosoInsight/code-review-notes-20150925
Fix documentation and clean up, based on Sep 25 code review
2015-09-28 13:53:50 -04:00
Rob Speer
44b0c4f9ba Fix documentation and clean up, based on Sep 25 code review 2015-09-28 12:58:46 -04:00
Rob Speer
9b1c4d66cd fix missing word in rules.ninja comment 2015-09-24 17:56:06 -04:00
Rob Speer
b460eef444 describe optional dependencies better in the README 2015-09-24 17:54:52 -04:00
Rob Speer
24b16d8a5d update and clean up the tokenize() docstring 2015-09-24 17:47:16 -04:00
Rob Speer
2a84a926f5 test_chinese: fix typo in comment 2015-09-24 13:41:11 -04:00
Rob Speer
cea2a61444 Merge branch 'master' into chinese-external-wordlist
Conflicts:
	wordfreq/chinese.py
2015-09-24 13:40:08 -04:00
Andrew Lin
cd0797e1c8 Revert "Remove the no-longer-existent .txt files from the MANIFEST."
This reverts commit db41bc7902.
2015-09-24 13:31:34 -04:00
Andrew Lin
710eaabbe1 Merge pull request #27 from LuminosoInsight/chinese-and-more
Improve Chinese, Greek, English; add Turkish, Polish, Swedish
2015-09-24 13:25:21 -04:00
Andrew Lin
09597b7cf3 Revert a small syntax change introduced by a circular series of changes. 2015-09-24 13:24:11 -04:00
Rob Speer
db5eda6051 don't apply the inferred-space penalty to Japanese 2015-09-24 12:50:06 -04:00
Andrew Lin
bb70bdba58 Revert "Remove the no-longer-existent .txt files from the MANIFEST."
This reverts commit db41bc7902.
2015-09-23 13:02:40 -04:00
Rob Speer
f224b8dbba describe the use of lang in read_values 2015-09-22 17:22:38 -04:00
Rob Speer
7c12f2aca1 Make the jieba_deps comment make sense 2015-09-22 17:19:00 -04:00
Rob Speer
48734d1a60 actually, still delay loading the Jieba tokenizer 2015-09-22 16:54:39 -04:00
Rob Speer
7a3ea2bf79 replace the literal 10 with the constant INFERRED_SPACE_FACTOR 2015-09-22 16:46:07 -04:00
Rob Speer
4a87890afd remove unnecessary delayed loads in wordfreq.chinese 2015-09-22 16:42:13 -04:00
Rob Speer
6cf4210187 load the Chinese character mapping from a .msgpack.gz file 2015-09-22 16:32:33 -04:00
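The pattern here is loading a precomputed mapping from a compressed, serialized data file rather than building it at runtime. wordfreq stores it as `.msgpack.gz`; the sketch below substitutes `json` for msgpack so it runs with the standard library only, and the two-entry mapping is made up for illustration.

```python
import gzip
import json
import os
import tempfile

# A tiny stand-in for the simplified-Chinese character mapping.
# (Illustrative data; wordfreq's real file uses msgpack, not json.)
mapping = {"車": "车", "語": "语"}
path = os.path.join(tempfile.mkdtemp(), "char_mapping.json.gz")

# Write the mapping compressed...
with gzip.open(path, "wt", encoding="utf-8") as f:
    json.dump(mapping, f)

# ...and load it back the way the library would at import time.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded["車"])  # -> 车
```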
Rob Speer
06f8b29971 document what this file is for 2015-09-22 15:31:27 -04:00
Rob Speer
5b918e7bb0 fix README conflict 2015-09-22 14:23:55 -04:00
Rob Speer
e8e6e0a231 refactor the tokenizer, add include_punctuation option 2015-09-15 13:26:09 -04:00
Rob Speer
669bd16c13 add external_wordlist option to tokenize 2015-09-10 18:09:41 -04:00
Rob Speer
3cb3061e06 Merge branch 'greek-and-turkish' into chinese-and-more
Conflicts:
	README.md
	wordfreq_builder/wordfreq_builder/ninja.py
2015-09-10 15:27:33 -04:00
Rob Speer
5c8c36f4e3 Lower the frequency of phrases with inferred token boundaries 2015-09-10 14:16:22 -04:00
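A hedged sketch of the idea in this commit: when token boundaries are inferred by a tokenizer (as in Chinese text segmented by Jieba) rather than observed as real spaces, the phrase's frequency is divided by a constant factor per inferred boundary, so such phrases aren't scored as confidently as genuinely space-delimited ones. The factor 10 matches the `INFERRED_SPACE_FACTOR` constant named in a later commit; the function name is illustrative, and the per-token frequencies are combined here as a naive product, which is not wordfreq's actual combination rule.

```python
INFERRED_SPACE_FACTOR = 10  # penalty per inferred token boundary

def penalized_frequency(token_freqs, spaces_inferred=True):
    # Naively combine per-token frequencies as if independent
    # (wordfreq's real combination rule differs).
    freq = 1.0
    for f in token_freqs:
        freq *= f
    # Penalize each boundary the tokenizer had to infer.
    if spaces_inferred:
        freq /= INFERRED_SPACE_FACTOR ** (len(token_freqs) - 1)
    return freq

# Two tokens at frequency 1e-3 each, one inferred boundary between them.
print(penalized_frequency([1e-3, 1e-3]))
```

A later commit exempts Japanese from this penalty, which in this sketch would simply mean calling with `spaces_inferred=False`.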
Andrew Lin
acbb25e6f6 Merge pull request #26 from LuminosoInsight/greek-and-turkish
Add SUBTLEX, support Turkish, expand Greek
2015-09-10 13:48:33 -04:00
Rob Speer
a4f8d11427 In ninja deps, remove 'startrow' as a variable 2015-09-10 13:46:19 -04:00
Rob Speer
2277ad3116 fix spelling of Marc 2015-09-09 13:35:02 -04:00
Rob Speer
354555514f fixes based on code review notes 2015-09-09 13:10:18 -04:00
Rob Speer
6502f15e9b fix SUBTLEX citations 2015-09-08 17:45:25 -04:00
Rob Speer
d9c44d5fcc take out OpenSubtitles for Chinese 2015-09-08 17:25:05 -04:00