wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-25 18:18:53 +00:00

Author	SHA1	Message	Date
Rob Speer	24b16d8a5d	update and clean up the tokenize() docstring	2015-09-24 17:47:16 -04:00
Rob Speer	2a84a926f5	test_chinese: fix typo in comment	2015-09-24 13:41:11 -04:00
Rob Speer	cea2a61444	Merge branch 'master' into chinese-external-wordlist. Conflicts: wordfreq/chinese.py	2015-09-24 13:40:08 -04:00
Andrew Lin	cd0797e1c8	Revert "Remove the no-longer-existent .txt files from the MANIFEST." This reverts commit `db41bc7902`.	2015-09-24 13:31:34 -04:00
Andrew Lin	710eaabbe1	Merge pull request #27 from LuminosoInsight/chinese-and-more Improve Chinese, Greek, English; add Turkish, Polish, Swedish	2015-09-24 13:25:21 -04:00
Andrew Lin	09597b7cf3	Revert a small syntax change introduced by a circular series of changes.	2015-09-24 13:24:11 -04:00
Rob Speer	db5eda6051	don't apply the inferred-space penalty to Japanese	2015-09-24 12:50:06 -04:00
Andrew Lin	bb70bdba58	Revert "Remove the no-longer-existent .txt files from the MANIFEST." This reverts commit `db41bc7902`.	2015-09-23 13:02:40 -04:00
Rob Speer	f224b8dbba	describe the use of `lang` in `read_values`	2015-09-22 17:22:38 -04:00
Rob Speer	7c12f2aca1	Make the jieba_deps comment make sense	2015-09-22 17:19:00 -04:00
Rob Speer	48734d1a60	actually, still delay loading the Jieba tokenizer	2015-09-22 16:54:39 -04:00
Rob Speer	7a3ea2bf79	replace the literal 10 with the constant INFERRED_SPACE_FACTOR	2015-09-22 16:46:07 -04:00
Rob Speer	4a87890afd	remove unnecessary delayed loads in wordfreq.chinese	2015-09-22 16:42:13 -04:00
Rob Speer	6cf4210187	load the Chinese character mapping from a .msgpack.gz file	2015-09-22 16:32:33 -04:00
Rob Speer	06f8b29971	document what this file is for	2015-09-22 15:31:27 -04:00
Rob Speer	5b918e7bb0	fix README conflict	2015-09-22 14:23:55 -04:00
Rob Speer	e8e6e0a231	refactor the tokenizer, add `include_punctuation` option	2015-09-15 13:26:09 -04:00
Rob Speer	669bd16c13	add `external_wordlist` option to tokenize	2015-09-10 18:09:41 -04:00
Rob Speer	3cb3061e06	Merge branch 'greek-and-turkish' into chinese-and-more. Conflicts: README.md, wordfreq_builder/wordfreq_builder/ninja.py	2015-09-10 15:27:33 -04:00
Rob Speer	5c8c36f4e3	Lower the frequency of phrases with inferred token boundaries	2015-09-10 14:16:22 -04:00
Andrew Lin	acbb25e6f6	Merge pull request #26 from LuminosoInsight/greek-and-turkish Add SUBTLEX, support Turkish, expand Greek	2015-09-10 13:48:33 -04:00
Rob Speer	a4f8d11427	In ninja deps, remove 'startrow' as a variable	2015-09-10 13:46:19 -04:00
Rob Speer	2277ad3116	fix spelling of Marc	2015-09-09 13:35:02 -04:00
Rob Speer	354555514f	fixes based on code review notes	2015-09-09 13:10:18 -04:00
Rob Speer	6502f15e9b	fix SUBTLEX citations	2015-09-08 17:45:25 -04:00
Rob Speer	d9c44d5fcc	take out OpenSubtitles for Chinese	2015-09-08 17:25:05 -04:00
Rob Speer	bc323eccaf	update comments in wordfreq_builder.config; remove unused 'version'	2015-09-08 16:15:29 -04:00
Rob Speer	0ab23f8a28	sort Jieba wordlists consistently; update data files	2015-09-08 16:09:53 -04:00
Rob Speer	bc8ebd23e9	don't do language-specific tokenization in freqs_to_cBpack. Tokenizing in the 'merge' step is sufficient.	2015-09-08 14:46:04 -04:00
Rob Speer	715361ca0d	actually fix logic of apostrophe-fixing	2015-09-08 13:50:34 -04:00
Rob Speer	c4c1af8213	fix logic of apostrophe-fixing	2015-09-08 13:47:58 -04:00
Rob Speer	912171f8e7	fix '--language' option definition	2015-09-08 13:27:20 -04:00
Rob Speer	77a9b5c55b	Avoid Chinese tokenizer when building	2015-09-08 12:59:03 -04:00
Rob Speer	9071defb33	language-specific frequency reading; fix 't in English	2015-09-08 12:49:21 -04:00
Rob Speer	20f2828d0a	Merge branch 'apostrophe-fix' into chinese-scripts. Conflicts: wordfreq_builder/wordfreq_builder/word_counts.py	2015-09-08 12:29:00 -04:00
Rob Speer	e39d345c4b	WIP: fix apostrophe trimming	2015-09-08 12:28:28 -04:00
Rob Speer	d576e3294b	update the README for Chinese	2015-09-05 03:42:54 -04:00
Rob Speer	2327f2e4d6	tokenize Chinese using jieba and our own frequencies	2015-09-05 03:16:56 -04:00
Rob Speer	7906a671ea	WIP: Traditional Chinese	2015-09-04 18:52:37 -04:00
Rob Speer	3c3371a9ff	add Polish and Swedish to README	2015-09-04 17:10:40 -04:00
Rob Speer	447d7e5134	add Polish and Swedish, which have sufficient data	2015-09-04 17:10:40 -04:00
Rob Speer	25edaad962	update data files	2015-09-04 17:00:55 -04:00
Rob Speer	fc93c8dc9c	add tests for Turkish	2015-09-04 17:00:05 -04:00
Rob Speer	5c7a7ea83e	We can put the cutoff back now. I took it out when a step in the English SUBTLEX process was outputting frequencies instead of counts, but I've fixed that now.	2015-09-04 16:16:52 -04:00
Rob Speer	56318a3ca3	remove subtlex-gr from README	2015-09-04 16:11:46 -04:00
Rob Speer	8196643509	add more citations	2015-09-04 15:57:40 -04:00
Rob Speer	77c60c29b0	Use SUBTLEX for German, but OpenSubtitles for Greek. In German and Greek, SUBTLEX and Hermit Dave turn out to have been working from the same source data. I looked at the quality of how they processed the data, and chose SUBTLEX for German, and Dave's wordlist for Greek.	2015-09-04 15:52:21 -04:00
Rob Speer	a47497c908	update data files (without the CLD2 fix yet)	2015-09-04 14:58:20 -04:00
Rob Speer	0d3ee869c1	Exclude angle brackets from CLD2 detection	2015-09-04 14:56:06 -04:00
Rob Speer	81bbe663fb	update README with additional SUBTLEX support	2015-09-04 13:23:33 -04:00

393 Commits