Rob Speer
24b16d8a5d
update and clean up the tokenize() docstring
2015-09-24 17:47:16 -04:00
Rob Speer
2a84a926f5
test_chinese: fix typo in comment
2015-09-24 13:41:11 -04:00
Rob Speer
cea2a61444
Merge branch 'master' into chinese-external-wordlist
Conflicts:
wordfreq/chinese.py
2015-09-24 13:40:08 -04:00
Andrew Lin
cd0797e1c8
Revert "Remove the no-longer-existent .txt files from the MANIFEST."
This reverts commit db41bc7902.
2015-09-24 13:31:34 -04:00
Andrew Lin
710eaabbe1
Merge pull request #27 from LuminosoInsight/chinese-and-more
Improve Chinese, Greek, English; add Turkish, Polish, Swedish
2015-09-24 13:25:21 -04:00
Andrew Lin
09597b7cf3
Revert a small syntax change introduced by a circular series of changes.
2015-09-24 13:24:11 -04:00
Rob Speer
db5eda6051
don't apply the inferred-space penalty to Japanese
2015-09-24 12:50:06 -04:00
Andrew Lin
bb70bdba58
Revert "Remove the no-longer-existent .txt files from the MANIFEST."
This reverts commit db41bc7902.
2015-09-23 13:02:40 -04:00
Rob Speer
f224b8dbba
describe the use of lang in read_values
2015-09-22 17:22:38 -04:00
Rob Speer
7c12f2aca1
Make the jieba_deps comment make sense
2015-09-22 17:19:00 -04:00
Rob Speer
48734d1a60
actually, still delay loading the Jieba tokenizer
2015-09-22 16:54:39 -04:00
Rob Speer
7a3ea2bf79
replace the literal 10 with the constant INFERRED_SPACE_FACTOR
2015-09-22 16:46:07 -04:00
Rob Speer
4a87890afd
remove unnecessary delayed loads in wordfreq.chinese
2015-09-22 16:42:13 -04:00
Rob Speer
6cf4210187
load the Chinese character mapping from a .msgpack.gz file
2015-09-22 16:32:33 -04:00
Rob Speer
06f8b29971
document what this file is for
2015-09-22 15:31:27 -04:00
Rob Speer
5b918e7bb0
fix README conflict
2015-09-22 14:23:55 -04:00
Rob Speer
e8e6e0a231
refactor the tokenizer, add include_punctuation option
2015-09-15 13:26:09 -04:00
Rob Speer
669bd16c13
add external_wordlist option to tokenize
2015-09-10 18:09:41 -04:00
Rob Speer
3cb3061e06
Merge branch 'greek-and-turkish' into chinese-and-more
Conflicts:
README.md
wordfreq_builder/wordfreq_builder/ninja.py
2015-09-10 15:27:33 -04:00
Rob Speer
5c8c36f4e3
Lower the frequency of phrases with inferred token boundaries
2015-09-10 14:16:22 -04:00
Andrew Lin
acbb25e6f6
Merge pull request #26 from LuminosoInsight/greek-and-turkish
Add SUBTLEX, support Turkish, expand Greek
2015-09-10 13:48:33 -04:00
Rob Speer
a4f8d11427
In ninja deps, remove 'startrow' as a variable
2015-09-10 13:46:19 -04:00
Rob Speer
2277ad3116
fix spelling of Marc
2015-09-09 13:35:02 -04:00
Rob Speer
354555514f
fixes based on code review notes
2015-09-09 13:10:18 -04:00
Rob Speer
6502f15e9b
fix SUBTLEX citations
2015-09-08 17:45:25 -04:00
Rob Speer
d9c44d5fcc
take out OpenSubtitles for Chinese
2015-09-08 17:25:05 -04:00
Rob Speer
bc323eccaf
update comments in wordfreq_builder.config; remove unused 'version'
2015-09-08 16:15:29 -04:00
Rob Speer
0ab23f8a28
sort Jieba wordlists consistently; update data files
2015-09-08 16:09:53 -04:00
Rob Speer
bc8ebd23e9
don't do language-specific tokenization in freqs_to_cBpack
Tokenizing in the 'merge' step is sufficient.
2015-09-08 14:46:04 -04:00
Rob Speer
715361ca0d
actually fix logic of apostrophe-fixing
2015-09-08 13:50:34 -04:00
Rob Speer
c4c1af8213
fix logic of apostrophe-fixing
2015-09-08 13:47:58 -04:00
Rob Speer
912171f8e7
fix '--language' option definition
2015-09-08 13:27:20 -04:00
Rob Speer
77a9b5c55b
Avoid Chinese tokenizer when building
2015-09-08 12:59:03 -04:00
Rob Speer
9071defb33
language-specific frequency reading; fix 't in English
2015-09-08 12:49:21 -04:00
Rob Speer
20f2828d0a
Merge branch 'apostrophe-fix' into chinese-scripts
Conflicts:
wordfreq_builder/wordfreq_builder/word_counts.py
2015-09-08 12:29:00 -04:00
Rob Speer
e39d345c4b
WIP: fix apostrophe trimming
2015-09-08 12:28:28 -04:00
Rob Speer
d576e3294b
update the README for Chinese
2015-09-05 03:42:54 -04:00
Rob Speer
2327f2e4d6
tokenize Chinese using jieba and our own frequencies
2015-09-05 03:16:56 -04:00
Rob Speer
7906a671ea
WIP: Traditional Chinese
2015-09-04 18:52:37 -04:00
Rob Speer
3c3371a9ff
add Polish and Swedish to README
2015-09-04 17:10:40 -04:00
Rob Speer
447d7e5134
add Polish and Swedish, which have sufficient data
2015-09-04 17:10:40 -04:00
Rob Speer
25edaad962
update data files
2015-09-04 17:00:55 -04:00
Rob Speer
fc93c8dc9c
add tests for Turkish
2015-09-04 17:00:05 -04:00
Rob Speer
5c7a7ea83e
We can put the cutoff back now
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.
2015-09-04 16:16:52 -04:00
Rob Speer
56318a3ca3
remove subtlex-gr from README
2015-09-04 16:11:46 -04:00
Rob Speer
8196643509
add more citations
2015-09-04 15:57:40 -04:00
Rob Speer
77c60c29b0
Use SUBTLEX for German, but OpenSubtitles for Greek
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.
2015-09-04 15:52:21 -04:00
Rob Speer
a47497c908
update data files (without the CLD2 fix yet)
2015-09-04 14:58:20 -04:00
Rob Speer
0d3ee869c1
Exclude angle brackets from CLD2 detection
2015-09-04 14:56:06 -04:00
Rob Speer
81bbe663fb
update README with additional SUBTLEX support
2015-09-04 13:23:33 -04:00