wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Robyn Speer	8e963dc312	describe optional dependencies better in the README Former-commit-id: `b460eef444`	2015-09-24 17:54:52 -04:00
Robyn Speer	960dc437a2	update and clean up the tokenize() docstring Former-commit-id: `24b16d8a5d`	2015-09-24 17:47:16 -04:00
Robyn Speer	4a4534c466	test_chinese: fix typo in comment Former-commit-id: `2a84a926f5`	2015-09-24 13:41:11 -04:00
Robyn Speer	e15a231401	Merge branch 'master' into chinese-external-wordlist Conflicts: wordfreq/chinese.py Former-commit-id: `cea2a61444`	2015-09-24 13:40:08 -04:00
Andrew Lin	e27a75029d	Revert "Remove the no-longer-existent .txt files from the MANIFEST." This reverts commit `2089090151` [formerly `db41bc7902`]. Former-commit-id: `cd0797e1c8`	2015-09-24 13:31:34 -04:00
Andrew Lin	bb4653f16f	Merge pull request #27 from LuminosoInsight/chinese-and-more: Improve Chinese, Greek, English; add Turkish, Polish, Swedish. Former-commit-id: `710eaabbe1`	2015-09-24 13:25:21 -04:00
Andrew Lin	e7d46fb104	Revert a small syntax change introduced by a circular series of changes. Former-commit-id: `09597b7cf3`	2015-09-24 13:24:11 -04:00
Robyn Speer	4d00f17477	don't apply the inferred-space penalty to Japanese Former-commit-id: `db5eda6051`	2015-09-24 12:50:06 -04:00
Andrew Lin	6b163e5772	Revert "Remove the no-longer-existent .txt files from the MANIFEST." This reverts commit `2089090151` [formerly `db41bc7902`]. Former-commit-id: `bb70bdba58`	2015-09-23 13:02:40 -04:00
Robyn Speer	d215f79ea3	describe the use of `lang` in `read_values` Former-commit-id: `f224b8dbba`	2015-09-22 17:22:38 -04:00
Robyn Speer	e6e29a1c03	Make the jieba_deps comment make sense Former-commit-id: `7c12f2aca1`	2015-09-22 17:19:00 -04:00
Robyn Speer	b4628abb38	actually, still delay loading the Jieba tokenizer Former-commit-id: `48734d1a60`	2015-09-22 16:54:39 -04:00
Robyn Speer	13642d6a4d	replace the literal 10 with the constant INFERRED_SPACE_FACTOR Former-commit-id: `7a3ea2bf79`	2015-09-22 16:46:07 -04:00
Robyn Speer	01f9c07c33	remove unnecessary delayed loads in wordfreq.chinese Former-commit-id: `4a87890afd`	2015-09-22 16:42:13 -04:00
Robyn Speer	db30d09947	load the Chinese character mapping from a .msgpack.gz file Former-commit-id: `6cf4210187`	2015-09-22 16:32:33 -04:00
Robyn Speer	fe8a6b51e7	document what this file is for Former-commit-id: `06f8b29971`	2015-09-22 15:31:27 -04:00
Robyn Speer	6802a4f89d	fix README conflict Former-commit-id: `5b918e7bb0`	2015-09-22 14:23:55 -04:00
Robyn Speer	9a007b9948	refactor the tokenizer, add `include_punctuation` option Former-commit-id: `e8e6e0a231`	2015-09-15 13:26:09 -04:00
Robyn Speer	1adbb1aaf1	add `external_wordlist` option to tokenize Former-commit-id: `669bd16c13`	2015-09-10 18:09:41 -04:00
Robyn Speer	f2be213933	Merge branch 'greek-and-turkish' into chinese-and-more Conflicts: README.md wordfreq_builder/wordfreq_builder/ninja.py Former-commit-id: `3cb3061e06`	2015-09-10 15:27:33 -04:00
Robyn Speer	f0c7c3a02c	Lower the frequency of phrases with inferred token boundaries Former-commit-id: `5c8c36f4e3`	2015-09-10 14:16:22 -04:00
Andrew Lin	66f1afe4d7	Merge pull request #26 from LuminosoInsight/greek-and-turkish: Add SUBTLEX, support Turkish, expand Greek. Former-commit-id: `acbb25e6f6`	2015-09-10 13:48:33 -04:00
Robyn Speer	c5d5b0b1fe	In ninja deps, remove 'startrow' as a variable Former-commit-id: `a4f8d11427`	2015-09-10 13:46:19 -04:00
Robyn Speer	acddc3ca05	fix spelling of Marc Former-commit-id: `2277ad3116`	2015-09-09 13:35:02 -04:00
Robyn Speer	872556f7bb	fixes based on code review notes Former-commit-id: `354555514f`	2015-09-09 13:10:18 -04:00
Robyn Speer	3dd70ed1c2	fix SUBTLEX citations Former-commit-id: `6502f15e9b`	2015-09-08 17:45:25 -04:00
Robyn Speer	1d3521dfda	take out OpenSubtitles for Chinese Former-commit-id: `d9c44d5fcc`	2015-09-08 17:25:05 -04:00
Robyn Speer	59363c8c44	update comments in wordfreq_builder.config; remove unused 'version' Former-commit-id: `bc323eccaf`	2015-09-08 16:15:29 -04:00
Robyn Speer	48f9d4520c	sort Jieba wordlists consistently; update data files Former-commit-id: `0ab23f8a28`	2015-09-08 16:09:53 -04:00
Robyn Speer	4aef1dc338	don't do language-specific tokenization in freqs_to_cBpack. Tokenizing in the 'merge' step is sufficient. Former-commit-id: `bc8ebd23e9`	2015-09-08 14:46:04 -04:00
Robyn Speer	64b0b76ee1	actually fix logic of apostrophe-fixing Former-commit-id: `715361ca0d`	2015-09-08 13:50:34 -04:00
Robyn Speer	d6d2eac920	fix logic of apostrophe-fixing Former-commit-id: `c4c1af8213`	2015-09-08 13:47:58 -04:00
Robyn Speer	523806d6db	fix '--language' option definition Former-commit-id: `912171f8e7`	2015-09-08 13:27:20 -04:00
Robyn Speer	099d90b700	Avoid Chinese tokenizer when building Former-commit-id: `77a9b5c55b`	2015-09-08 12:59:03 -04:00
Robyn Speer	3fa14ded28	language-specific frequency reading; fix 't in English Former-commit-id: `9071defb33`	2015-09-08 12:49:21 -04:00
Robyn Speer	1b35ff6b4c	Merge branch 'apostrophe-fix' into chinese-scripts Conflicts: wordfreq_builder/wordfreq_builder/word_counts.py Former-commit-id: `20f2828d0a`	2015-09-08 12:29:00 -04:00
Robyn Speer	319c3abaab	WIP: fix apostrophe trimming Former-commit-id: `e39d345c4b`	2015-09-08 12:28:28 -04:00
Robyn Speer	c1f27d3095	update the README for Chinese Former-commit-id: `d576e3294b`	2015-09-05 03:42:54 -04:00
Robyn Speer	a4554fb87c	tokenize Chinese using jieba and our own frequencies Former-commit-id: `2327f2e4d6`	2015-09-05 03:16:56 -04:00
Robyn Speer	7d1c2e72e4	WIP: Traditional Chinese Former-commit-id: `7906a671ea`	2015-09-04 18:52:37 -04:00
Robyn Speer	e77c2dbca8	add Polish and Swedish to README Former-commit-id: `3c3371a9ff`	2015-09-04 17:10:40 -04:00
Robyn Speer	5b9b2d2d02	add Polish and Swedish, which have sufficient data Former-commit-id: `447d7e5134`	2015-09-04 17:10:40 -04:00
Robyn Speer	f7a4e2c444	update data files Former-commit-id: `25edaad962`	2015-09-04 17:00:55 -04:00
Robyn Speer	4704131e13	add tests for Turkish Former-commit-id: `fc93c8dc9c`	2015-09-04 17:00:05 -04:00
Robyn Speer	a75a95658b	We can put the cutoff back now. I took it out when a step in the English SUBTLEX process was outputting frequencies instead of counts, but I've fixed that now. Former-commit-id: `5c7a7ea83e`	2015-09-04 16:16:52 -04:00
Robyn Speer	f330d6d130	remove subtlex-gr from README Former-commit-id: `56318a3ca3`	2015-09-04 16:11:46 -04:00
Robyn Speer	032fea27c3	add more citations Former-commit-id: `8196643509`	2015-09-04 15:57:40 -04:00
Robyn Speer	8277b34571	Use SUBTLEX for German, but OpenSubtitles for Greek. In German and Greek, SUBTLEX and Hermit Dave turn out to have been working from the same source data. I looked at the quality of how they processed the data, and chose SUBTLEX for German, and Dave's wordlist for Greek. Former-commit-id: `77c60c29b0`	2015-09-04 15:52:21 -04:00
Robyn Speer	69d65dfda3	update data files (without the CLD2 fix yet) Former-commit-id: `a47497c908`	2015-09-04 14:58:20 -04:00
Robyn Speer	a69b66b210	Exclude angle brackets from CLD2 detection Former-commit-id: `0d3ee869c1`	2015-09-04 14:56:06 -04:00


494 Commits