wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-26 10:28:52 +00:00

Author	SHA1	Message	Date
Rob Speer	c2eab6881e	move Thai test to where it makes more sense Former-commit-id: `4ec6b56faa`	2016-03-10 11:56:15 -05:00
Rob Speer	a32162c04f	Leave Thai segments alone in the default regex. Our regex already has a special case to leave Chinese and Japanese alone when an appropriate tokenizer for the language isn't being used, as Unicode's default segmentation would make every character into its own token. The same thing happens in Thai, and we don't even have an appropriate tokenizer for Thai, so I've added a similar fallback. Former-commit-id: `07f16e6f03`	2016-02-22 14:32:59 -05:00
Rob Speer	23c5c4adca	Add and document large wordlists Former-commit-id: `d79ee37da9`	2016-01-22 16:23:43 -05:00
Rob Speer	3b95d349e0	configuration that builds some larger lists Former-commit-id: `c1a12cebec`	2016-01-22 14:20:12 -05:00
Rob Speer	35ee23591e	add Zipf scale Former-commit-id: `9907948d11`	2016-01-21 14:07:01 -05:00
slibs63	258f5088e9	Merge pull request #30 from LuminosoInsight/add-reddit Add English data from Reddit corpus Former-commit-id: `d18fee3d78`	2016-01-14 15:52:39 -05:00
Rob Speer	ee8cfb5a50	fix documentation in wordfreq_builder.tokenizers Former-commit-id: `8ddc19a5ca`	2016-01-13 15:18:12 -05:00
Rob Speer	56f830d678	reformat some argparse argument definitions Former-commit-id: `511fcb6f91`	2016-01-13 12:05:07 -05:00
Rob Speer	f4761029d0	build a bigger wordlist that we can optionally use Former-commit-id: `df8caaff7d`	2016-01-12 14:05:57 -05:00
Rob Speer	83bd019efe	fix usage text: one comment, not one tweet Former-commit-id: `8d9668d8ab`	2016-01-12 13:05:38 -05:00
Rob Speer	1d3485c855	Separate tokens with spaces, not line breaks, in intermediate files Former-commit-id: `115c74583e`	2016-01-12 12:59:18 -05:00
Andrew Lin	c9f679a7a3	Merge pull request #31 from LuminosoInsight/use_encoding Specify encoding when dealing with files Former-commit-id: `f30efebba0`	2015-12-23 16:13:47 -05:00
Sara Jewett	7b6f88b059	Specify encoding when dealing with files Former-commit-id: `37f9e12b93`	2015-12-23 15:49:13 -05:00
Rob Speer	6d62a8ff51	builder: Use an optional cutoff when merging counts. This allows the Reddit-merging step to not use such a ludicrous amount of memory. Former-commit-id: `973caca253`	2015-12-15 14:44:34 -05:00
Rob Speer	4e985e3bca	gzip the intermediate step of Reddit word counting Former-commit-id: `9a5d9d66bb`	2015-12-09 13:30:08 -05:00
Rob Speer	dc94222d7d	no Thai because we can't tokenize it Former-commit-id: `95f53e295b`	2015-12-02 12:38:03 -05:00
Rob Speer	237fabb4c5	forgot about Italian Former-commit-id: `8f6cd0e57b`	2015-11-30 18:18:24 -05:00
Rob Speer	6caa9ca443	add tokenizer for Reddit Former-commit-id: `5ef807117d`	2015-11-30 18:16:54 -05:00
Rob Speer	9a1b00ba0c	rebuild data files Former-commit-id: `2dcf368481`	2015-11-30 17:06:39 -05:00
Rob Speer	d1b667909d	add word frequencies from the Reddit 2007-2015 corpus Former-commit-id: `b2d7546d2d`	2015-11-30 16:38:11 -05:00
Rob Speer	49b8ba4be9	add docstrings to chinese_ and japanese_tokenize Former-commit-id: `e1f7a1ccf3`	2015-10-27 13:23:56 -04:00
Lance Nathan	f47249064f	Merge pull request #28 from LuminosoInsight/chinese-external-wordlist Add some tokenizer options Former-commit-id: `ca00dfa1d9`	2015-10-19 18:21:52 -04:00
Rob Speer	668a985969	Define globals in relevant places Former-commit-id: `a6b6aa07e7`	2015-10-19 18:15:54 -04:00
Rob Speer	f255eb5bd8	clarify the tokenize docstring Former-commit-id: `bfc17fea9f`	2015-10-19 12:18:12 -04:00
Rob Speer	8fea2ca181	Merge branch 'master' into chinese-external-wordlist Conflicts: wordfreq/chinese.py Former-commit-id: `1793c1bb2e`	2015-09-28 14:34:59 -04:00
Andrew Lin	d8422852f4	Merge pull request #29 from LuminosoInsight/code-review-notes-20150925 Fix documentation and clean up, based on Sep 25 code review Former-commit-id: `15d99be21b`	2015-09-28 13:53:50 -04:00
Rob Speer	3bd1fe2fe6	Fix documentation and clean up, based on Sep 25 code review Former-commit-id: `44b0c4f9ba`	2015-09-28 12:58:46 -04:00
Rob Speer	7435c8f57a	fix missing word in rules.ninja comment Former-commit-id: `9b1c4d66cd`	2015-09-24 17:56:06 -04:00
Rob Speer	7c596de98a	describe optional dependencies better in the README Former-commit-id: `b460eef444`	2015-09-24 17:54:52 -04:00
Rob Speer	28381d5a51	update and clean up the tokenize() docstring Former-commit-id: `24b16d8a5d`	2015-09-24 17:47:16 -04:00
Rob Speer	f89ac5e400	test_chinese: fix typo in comment Former-commit-id: `2a84a926f5`	2015-09-24 13:41:11 -04:00
Rob Speer	faf66e9b08	Merge branch 'master' into chinese-external-wordlist Conflicts: wordfreq/chinese.py Former-commit-id: `cea2a61444`	2015-09-24 13:40:08 -04:00
Andrew Lin	c53bb06988	Revert "Remove the no-longer-existent .txt files from the MANIFEST." This reverts commit `65d6645e81` [formerly `db41bc7902`]. Former-commit-id: `cd0797e1c8`	2015-09-24 13:31:34 -04:00
Andrew Lin	566a62abd5	Merge pull request #27 from LuminosoInsight/chinese-and-more Improve Chinese, Greek, English; add Turkish, Polish, Swedish Former-commit-id: `710eaabbe1`	2015-09-24 13:25:21 -04:00
Andrew Lin	ee6df56514	Revert a small syntax change introduced by a circular series of changes. Former-commit-id: `09597b7cf3`	2015-09-24 13:24:11 -04:00
Rob Speer	1b7117952b	don't apply the inferred-space penalty to Japanese Former-commit-id: `db5eda6051`	2015-09-24 12:50:06 -04:00
Andrew Lin	4ccfcdc1bd	Revert "Remove the no-longer-existent .txt files from the MANIFEST." This reverts commit `65d6645e81` [formerly `db41bc7902`]. Former-commit-id: `bb70bdba58`	2015-09-23 13:02:40 -04:00
Rob Speer	88deef24f6	describe the use of `lang` in `read_values` Former-commit-id: `f224b8dbba`	2015-09-22 17:22:38 -04:00
Rob Speer	7cb310b28e	Make the jieba_deps comment make sense Former-commit-id: `7c12f2aca1`	2015-09-22 17:19:00 -04:00
Rob Speer	d68dd9f568	actually, still delay loading the Jieba tokenizer Former-commit-id: `48734d1a60`	2015-09-22 16:54:39 -04:00
Rob Speer	0e4daa8472	replace the literal 10 with the constant INFERRED_SPACE_FACTOR Former-commit-id: `7a3ea2bf79`	2015-09-22 16:46:07 -04:00
Rob Speer	5929975338	remove unnecessary delayed loads in wordfreq.chinese Former-commit-id: `4a87890afd`	2015-09-22 16:42:13 -04:00
Rob Speer	42ccba4fa6	load the Chinese character mapping from a .msgpack.gz file Former-commit-id: `6cf4210187`	2015-09-22 16:32:33 -04:00
Rob Speer	e12a42f38a	document what this file is for Former-commit-id: `06f8b29971`	2015-09-22 15:31:27 -04:00
Rob Speer	76c4a8975a	fix README conflict Former-commit-id: `5b918e7bb0`	2015-09-22 14:23:55 -04:00
Rob Speer	963e0ff785	refactor the tokenizer, add `include_punctuation` option Former-commit-id: `e8e6e0a231`	2015-09-15 13:26:09 -04:00
Rob Speer	e3a79ab8c9	add `external_wordlist` option to tokenize Former-commit-id: `669bd16c13`	2015-09-10 18:09:41 -04:00
Rob Speer	7f92557a58	Merge branch 'greek-and-turkish' into chinese-and-more Conflicts: README.md wordfreq_builder/wordfreq_builder/ninja.py Former-commit-id: `3cb3061e06`	2015-09-10 15:27:33 -04:00
Rob Speer	a13f459f88	Lower the frequency of phrases with inferred token boundaries Former-commit-id: `5c8c36f4e3`	2015-09-10 14:16:22 -04:00
Andrew Lin	800039f0f8	Merge pull request #26 from LuminosoInsight/greek-and-turkish Add SUBTLEX, support Turkish, expand Greek Former-commit-id: `acbb25e6f6`	2015-09-10 13:48:33 -04:00
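
Several commits in the table above concern an inferred-space penalty: `a13f459f88` lowers the frequency of phrases with inferred token boundaries, `0e4daa8472` replaces a literal 10 with the constant `INFERRED_SPACE_FACTOR`, and `1b7117952b` stops applying the penalty to Japanese. The log does not show the code itself, so the following is only a minimal sketch of how such a penalty might work. The function name `phrase_frequency`, the reciprocal-sum rule for combining token frequencies, and the choice to divide once per inferred boundary are all assumptions made for illustration; only the constant's name and the value 10 come from the commit messages.

```python
# Value taken from the "literal 10" mentioned in commit 0e4daa8472.
INFERRED_SPACE_FACTOR = 10.0


def phrase_frequency(token_freqs, inferred_boundaries):
    """Estimate a phrase frequency from per-token frequencies (hypothetical helper).

    token_freqs: frequencies of the tokens in the phrase, each a proportion in (0, 1].
    inferred_boundaries: True when the token boundaries were guessed by a
        tokenizer rather than marked by spaces in the text.
    """
    if not token_freqs or any(f <= 0 for f in token_freqs):
        # An empty phrase or an unknown token makes the whole phrase unknown.
        return 0.0

    # Assumption: combine tokens as the reciprocal of the summed reciprocals,
    # so a phrase is never more frequent than its rarest token.
    combined = 1.0 / sum(1.0 / f for f in token_freqs)

    if inferred_boundaries:
        # Assumption: one penalty factor per inferred boundary between tokens.
        combined /= INFERRED_SPACE_FACTOR ** (len(token_freqs) - 1)
    return combined
```

Under these assumptions a single-token phrase is never penalized, because it contains no boundaries. A caller following commit `1b7117952b` would pass `inferred_boundaries=False` for Japanese text.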


622 Commits
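
Commit `35ee23591e` in the log above adds a Zipf scale without further detail. As commonly defined, the Zipf value of a word is the base-10 logarithm of its number of occurrences per billion words. The sketch below shows that conversion under this definition. The helper names are illustrative and are not confirmed by this log as the library's API.

```python
import math


def freq_to_zipf(freq: float) -> float:
    """Convert a frequency given as a proportion of tokens to the Zipf scale."""
    if freq <= 0:
        raise ValueError("frequency must be positive")
    # log10(freq * 1e9) == log10(freq) + 9: occurrences per billion words.
    return math.log10(freq) + 9


def zipf_to_freq(zipf: float) -> float:
    """Convert a Zipf-scale value back to a proportion of tokens."""
    return 10 ** (zipf - 9)
```

On this scale a word that appears once per million words has a Zipf value of about 3, and one that appears once per thousand words has a value of about 6.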