Commit Graph

393 Commits

Author SHA1 Message Date
Rob Speer
24b16d8a5d update and clean up the tokenize() docstring 2015-09-24 17:47:16 -04:00
Rob Speer
2a84a926f5 test_chinese: fix typo in comment 2015-09-24 13:41:11 -04:00
Rob Speer
cea2a61444 Merge branch 'master' into chinese-external-wordlist
Conflicts:
	wordfreq/chinese.py
2015-09-24 13:40:08 -04:00
Andrew Lin
cd0797e1c8 Revert "Remove the no-longer-existent .txt files from the MANIFEST."
This reverts commit db41bc7902.
2015-09-24 13:31:34 -04:00
Andrew Lin
710eaabbe1 Merge pull request #27 from LuminosoInsight/chinese-and-more
Improve Chinese, Greek, English; add Turkish, Polish, Swedish
2015-09-24 13:25:21 -04:00
Andrew Lin
09597b7cf3 Revert a small syntax change introduced by a circular series of changes. 2015-09-24 13:24:11 -04:00
Rob Speer
db5eda6051 don't apply the inferred-space penalty to Japanese 2015-09-24 12:50:06 -04:00
Andrew Lin
bb70bdba58 Revert "Remove the no-longer-existent .txt files from the MANIFEST."
This reverts commit db41bc7902.
2015-09-23 13:02:40 -04:00
Rob Speer
f224b8dbba describe the use of lang in read_values 2015-09-22 17:22:38 -04:00
Rob Speer
7c12f2aca1 Make the jieba_deps comment make sense 2015-09-22 17:19:00 -04:00
Rob Speer
48734d1a60 actually, still delay loading the Jieba tokenizer 2015-09-22 16:54:39 -04:00
Rob Speer
7a3ea2bf79 replace the literal 10 with the constant INFERRED_SPACE_FACTOR 2015-09-22 16:46:07 -04:00
Rob Speer
4a87890afd remove unnecessary delayed loads in wordfreq.chinese 2015-09-22 16:42:13 -04:00
Rob Speer
6cf4210187 load the Chinese character mapping from a .msgpack.gz file 2015-09-22 16:32:33 -04:00
Rob Speer
06f8b29971 document what this file is for 2015-09-22 15:31:27 -04:00
Rob Speer
5b918e7bb0 fix README conflict 2015-09-22 14:23:55 -04:00
Rob Speer
e8e6e0a231 refactor the tokenizer, add include_punctuation option 2015-09-15 13:26:09 -04:00
Rob Speer
669bd16c13 add external_wordlist option to tokenize 2015-09-10 18:09:41 -04:00
Rob Speer
3cb3061e06 Merge branch 'greek-and-turkish' into chinese-and-more
Conflicts:
	README.md
	wordfreq_builder/wordfreq_builder/ninja.py
2015-09-10 15:27:33 -04:00
Rob Speer
5c8c36f4e3 Lower the frequency of phrases with inferred token boundaries 2015-09-10 14:16:22 -04:00
Andrew Lin
acbb25e6f6 Merge pull request #26 from LuminosoInsight/greek-and-turkish
Add SUBTLEX, support Turkish, expand Greek
2015-09-10 13:48:33 -04:00
Rob Speer
a4f8d11427 In ninja deps, remove 'startrow' as a variable 2015-09-10 13:46:19 -04:00
Rob Speer
2277ad3116 fix spelling of Marc 2015-09-09 13:35:02 -04:00
Rob Speer
354555514f fixes based on code review notes 2015-09-09 13:10:18 -04:00
Rob Speer
6502f15e9b fix SUBTLEX citations 2015-09-08 17:45:25 -04:00
Rob Speer
d9c44d5fcc take out OpenSubtitles for Chinese 2015-09-08 17:25:05 -04:00
Rob Speer
bc323eccaf update comments in wordfreq_builder.config; remove unused 'version' 2015-09-08 16:15:29 -04:00
Rob Speer
0ab23f8a28 sort Jieba wordlists consistently; update data files 2015-09-08 16:09:53 -04:00
Rob Speer
bc8ebd23e9 don't do language-specific tokenization in freqs_to_cBpack
Tokenizing in the 'merge' step is sufficient.
2015-09-08 14:46:04 -04:00
Rob Speer
715361ca0d actually fix logic of apostrophe-fixing 2015-09-08 13:50:34 -04:00
Rob Speer
c4c1af8213 fix logic of apostrophe-fixing 2015-09-08 13:47:58 -04:00
Rob Speer
912171f8e7 fix '--language' option definition 2015-09-08 13:27:20 -04:00
Rob Speer
77a9b5c55b Avoid Chinese tokenizer when building 2015-09-08 12:59:03 -04:00
Rob Speer
9071defb33 language-specific frequency reading; fix 't in English 2015-09-08 12:49:21 -04:00
Rob Speer
20f2828d0a Merge branch 'apostrophe-fix' into chinese-scripts
Conflicts:
	wordfreq_builder/wordfreq_builder/word_counts.py
2015-09-08 12:29:00 -04:00
Rob Speer
e39d345c4b WIP: fix apostrophe trimming 2015-09-08 12:28:28 -04:00
Rob Speer
d576e3294b update the README for Chinese 2015-09-05 03:42:54 -04:00
Rob Speer
2327f2e4d6 tokenize Chinese using jieba and our own frequencies 2015-09-05 03:16:56 -04:00
Rob Speer
7906a671ea WIP: Traditional Chinese 2015-09-04 18:52:37 -04:00
Rob Speer
3c3371a9ff add Polish and Swedish to README 2015-09-04 17:10:40 -04:00
Rob Speer
447d7e5134 add Polish and Swedish, which have sufficient data 2015-09-04 17:10:40 -04:00
Rob Speer
25edaad962 update data files 2015-09-04 17:00:55 -04:00
Rob Speer
fc93c8dc9c add tests for Turkish 2015-09-04 17:00:05 -04:00
Rob Speer
5c7a7ea83e We can put the cutoff back now
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.
2015-09-04 16:16:52 -04:00
Rob Speer
56318a3ca3 remove subtlex-gr from README 2015-09-04 16:11:46 -04:00
Rob Speer
8196643509 add more citations 2015-09-04 15:57:40 -04:00
Rob Speer
77c60c29b0 Use SUBTLEX for German, but OpenSubtitles for Greek
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.
2015-09-04 15:52:21 -04:00
Rob Speer
a47497c908 update data files (without the CLD2 fix yet) 2015-09-04 14:58:20 -04:00
Rob Speer
0d3ee869c1 Exclude angle brackets from CLD2 detection 2015-09-04 14:56:06 -04:00
Rob Speer
81bbe663fb update README with additional SUBTLEX support 2015-09-04 13:23:33 -04:00