wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 09:21:37 +00:00

Author	SHA1	Message	Date
Robyn Speer	6344b38194	Add and document large wordlists Former-commit-id: `d79ee37da9`	2016-01-22 16:23:43 -05:00
Robyn Speer	12e779fc79	configuration that builds some larger lists Former-commit-id: `c1a12cebec`	2016-01-22 14:20:12 -05:00
Robyn Speer	83559a53d4	add Zipf scale Former-commit-id: `9907948d11`	2016-01-21 14:07:01 -05:00
slibs63	927d4f45a4	Merge pull request #30 from LuminosoInsight/add-reddit Add English data from Reddit corpus Former-commit-id: `d18fee3d78`	2016-01-14 15:52:39 -05:00
Robyn Speer	6eca3cff5a	fix documentation in wordfreq_builder.tokenizers Former-commit-id: `8ddc19a5ca`	2016-01-13 15:18:12 -05:00
Robyn Speer	95cdf41fe8	reformat some argparse argument definitions Former-commit-id: `511fcb6f91`	2016-01-13 12:05:07 -05:00
Robyn Speer	738243e244	build a bigger wordlist that we can optionally use Former-commit-id: `df8caaff7d`	2016-01-12 14:05:57 -05:00
Robyn Speer	2069e30c89	fix usage text: one comment, not one tweet Former-commit-id: `8d9668d8ab`	2016-01-12 13:05:38 -05:00
Robyn Speer	883aa5baeb	Separate tokens with spaces, not line breaks, in intermediate files Former-commit-id: `115c74583e`	2016-01-12 12:59:18 -05:00
Andrew Lin	eae7b2752e	Merge pull request #31 from LuminosoInsight/use_encoding Specify encoding when dealing with files Former-commit-id: `f30efebba0`	2015-12-23 16:13:47 -05:00
Sara Jewett	42d209cbe2	Specify encoding when dealing with files Former-commit-id: `37f9e12b93`	2015-12-23 15:49:13 -05:00
Robyn Speer	7d1719cfb4	builder: Use an optional cutoff when merging counts This allows the Reddit-merging step to not use such a ludicrous amount of memory. Former-commit-id: `973caca253`	2015-12-15 14:44:34 -05:00
Robyn Speer	f5e09f3f3d	gzip the intermediate step of Reddit word counting Former-commit-id: `9a5d9d66bb`	2015-12-09 13:30:08 -05:00
Robyn Speer	682e08fee2	no Thai because we can't tokenize it Former-commit-id: `95f53e295b`	2015-12-02 12:38:03 -05:00
Robyn Speer	064ee22a33	forgot about Italian Former-commit-id: `8f6cd0e57b`	2015-11-30 18:18:24 -05:00
Robyn Speer	ab8c2e2331	add tokenizer for Reddit Former-commit-id: `5ef807117d`	2015-11-30 18:16:54 -05:00
Robyn Speer	23949a4512	rebuild data files Former-commit-id: `2dcf368481`	2015-11-30 17:06:39 -05:00
Robyn Speer	6d2709f064	add word frequencies from the Reddit 2007-2015 corpus Former-commit-id: `b2d7546d2d`	2015-11-30 16:38:11 -05:00
Robyn Speer	eb08c0a951	add docstrings to chinese_ and japanese_tokenize Former-commit-id: `e1f7a1ccf3`	2015-10-27 13:23:56 -04:00
Lance Nathan	f4d865c0be	Merge pull request #28 from LuminosoInsight/chinese-external-wordlist Add some tokenizer options Former-commit-id: `ca00dfa1d9`	2015-10-19 18:21:52 -04:00
Robyn Speer	5fedd71a66	Define globals in relevant places Former-commit-id: `a6b6aa07e7`	2015-10-19 18:15:54 -04:00
Robyn Speer	91a81c1bde	clarify the tokenize docstring Former-commit-id: `bfc17fea9f`	2015-10-19 12:18:12 -04:00
Robyn Speer	c9693c9502	Merge branch 'master' into chinese-external-wordlist Conflicts: wordfreq/chinese.py Former-commit-id: `1793c1bb2e`	2015-09-28 14:34:59 -04:00
Andrew Lin	6d5ead0b47	Merge pull request #29 from LuminosoInsight/code-review-notes-20150925 Fix documentation and clean up, based on Sep 25 code review Former-commit-id: `15d99be21b`	2015-09-28 13:53:50 -04:00
Robyn Speer	f3f66508bd	Fix documentation and clean up, based on Sep 25 code review Former-commit-id: `44b0c4f9ba`	2015-09-28 12:58:46 -04:00
Robyn Speer	7494ae27a7	fix missing word in rules.ninja comment Former-commit-id: `9b1c4d66cd`	2015-09-24 17:56:06 -04:00
Robyn Speer	8e963dc312	describe optional dependencies better in the README Former-commit-id: `b460eef444`	2015-09-24 17:54:52 -04:00
Robyn Speer	960dc437a2	update and clean up the tokenize() docstring Former-commit-id: `24b16d8a5d`	2015-09-24 17:47:16 -04:00
Robyn Speer	4a4534c466	test_chinese: fix typo in comment Former-commit-id: `2a84a926f5`	2015-09-24 13:41:11 -04:00
Robyn Speer	e15a231401	Merge branch 'master' into chinese-external-wordlist Conflicts: wordfreq/chinese.py Former-commit-id: `cea2a61444`	2015-09-24 13:40:08 -04:00
Andrew Lin	e27a75029d	Revert "Remove the no-longer-existent .txt files from the MANIFEST." This reverts commit `2089090151` [formerly `db41bc7902`]. Former-commit-id: `cd0797e1c8`	2015-09-24 13:31:34 -04:00
Andrew Lin	bb4653f16f	Merge pull request #27 from LuminosoInsight/chinese-and-more Improve Chinese, Greek, English; add Turkish, Polish, Swedish Former-commit-id: `710eaabbe1`	2015-09-24 13:25:21 -04:00
Andrew Lin	e7d46fb104	Revert a small syntax change introduced by a circular series of changes. Former-commit-id: `09597b7cf3`	2015-09-24 13:24:11 -04:00
Robyn Speer	4d00f17477	don't apply the inferred-space penalty to Japanese Former-commit-id: `db5eda6051`	2015-09-24 12:50:06 -04:00
Andrew Lin	6b163e5772	Revert "Remove the no-longer-existent .txt files from the MANIFEST." This reverts commit `2089090151` [formerly `db41bc7902`]. Former-commit-id: `bb70bdba58`	2015-09-23 13:02:40 -04:00
Robyn Speer	d215f79ea3	describe the use of `lang` in `read_values` Former-commit-id: `f224b8dbba`	2015-09-22 17:22:38 -04:00
Robyn Speer	e6e29a1c03	Make the jieba_deps comment make sense Former-commit-id: `7c12f2aca1`	2015-09-22 17:19:00 -04:00
Robyn Speer	b4628abb38	actually, still delay loading the Jieba tokenizer Former-commit-id: `48734d1a60`	2015-09-22 16:54:39 -04:00
Robyn Speer	13642d6a4d	replace the literal 10 with the constant INFERRED_SPACE_FACTOR Former-commit-id: `7a3ea2bf79`	2015-09-22 16:46:07 -04:00
Robyn Speer	01f9c07c33	remove unnecessary delayed loads in wordfreq.chinese Former-commit-id: `4a87890afd`	2015-09-22 16:42:13 -04:00
Robyn Speer	db30d09947	load the Chinese character mapping from a .msgpack.gz file Former-commit-id: `6cf4210187`	2015-09-22 16:32:33 -04:00
Robyn Speer	fe8a6b51e7	document what this file is for Former-commit-id: `06f8b29971`	2015-09-22 15:31:27 -04:00
Robyn Speer	6802a4f89d	fix README conflict Former-commit-id: `5b918e7bb0`	2015-09-22 14:23:55 -04:00
Robyn Speer	9a007b9948	refactor the tokenizer, add `include_punctuation` option Former-commit-id: `e8e6e0a231`	2015-09-15 13:26:09 -04:00
Robyn Speer	1adbb1aaf1	add `external_wordlist` option to tokenize Former-commit-id: `669bd16c13`	2015-09-10 18:09:41 -04:00
Robyn Speer	f2be213933	Merge branch 'greek-and-turkish' into chinese-and-more Conflicts: README.md wordfreq_builder/wordfreq_builder/ninja.py Former-commit-id: `3cb3061e06`	2015-09-10 15:27:33 -04:00
Robyn Speer	f0c7c3a02c	Lower the frequency of phrases with inferred token boundaries Former-commit-id: `5c8c36f4e3`	2015-09-10 14:16:22 -04:00
Andrew Lin	66f1afe4d7	Merge pull request #26 from LuminosoInsight/greek-and-turkish Add SUBTLEX, support Turkish, expand Greek Former-commit-id: `acbb25e6f6`	2015-09-10 13:48:33 -04:00
Robyn Speer	c5d5b0b1fe	In ninja deps, remove 'startrow' as a variable Former-commit-id: `a4f8d11427`	2015-09-10 13:46:19 -04:00
Robyn Speer	acddc3ca05	fix spelling of Marc Former-commit-id: `2277ad3116`	2015-09-09 13:35:02 -04:00

620 Commits