Robyn Speer
8e963dc312
describe optional dependencies better in the README
...
Former-commit-id: b460eef444
2015-09-24 17:54:52 -04:00
Robyn Speer
960dc437a2
update and clean up the tokenize() docstring
...
Former-commit-id: 24b16d8a5d
2015-09-24 17:47:16 -04:00
Robyn Speer
4a4534c466
test_chinese: fix typo in comment
...
Former-commit-id: 2a84a926f5
2015-09-24 13:41:11 -04:00
Robyn Speer
e15a231401
Merge branch 'master' into chinese-external-wordlist
...
Conflicts:
wordfreq/chinese.py
Former-commit-id: cea2a61444
2015-09-24 13:40:08 -04:00
Andrew Lin
e27a75029d
Revert "Remove the no-longer-existent .txt files from the MANIFEST."
...
This reverts commit 2089090151 [formerly db41bc7902].
Former-commit-id: cd0797e1c8
2015-09-24 13:31:34 -04:00
Andrew Lin
bb4653f16f
Merge pull request #27 from LuminosoInsight/chinese-and-more
...
Improve Chinese, Greek, English; add Turkish, Polish, Swedish
Former-commit-id: 710eaabbe1
2015-09-24 13:25:21 -04:00
Andrew Lin
e7d46fb104
Revert a small syntax change introduced by a circular series of changes.
...
Former-commit-id: 09597b7cf3
2015-09-24 13:24:11 -04:00
Robyn Speer
4d00f17477
don't apply the inferred-space penalty to Japanese
...
Former-commit-id: db5eda6051
2015-09-24 12:50:06 -04:00
Andrew Lin
6b163e5772
Revert "Remove the no-longer-existent .txt files from the MANIFEST."
...
This reverts commit 2089090151 [formerly db41bc7902].
Former-commit-id: bb70bdba58
2015-09-23 13:02:40 -04:00
Robyn Speer
d215f79ea3
describe the use of lang in read_values
...
Former-commit-id: f224b8dbba
2015-09-22 17:22:38 -04:00
Robyn Speer
e6e29a1c03
Make the jieba_deps comment make sense
...
Former-commit-id: 7c12f2aca1
2015-09-22 17:19:00 -04:00
Robyn Speer
b4628abb38
actually, still delay loading the Jieba tokenizer
...
Former-commit-id: 48734d1a60
2015-09-22 16:54:39 -04:00
Robyn Speer
13642d6a4d
replace the literal 10 with the constant INFERRED_SPACE_FACTOR
...
Former-commit-id: 7a3ea2bf79
2015-09-22 16:46:07 -04:00
Robyn Speer
01f9c07c33
remove unnecessary delayed loads in wordfreq.chinese
...
Former-commit-id: 4a87890afd
2015-09-22 16:42:13 -04:00
Robyn Speer
db30d09947
load the Chinese character mapping from a .msgpack.gz file
...
Former-commit-id: 6cf4210187
2015-09-22 16:32:33 -04:00
Robyn Speer
fe8a6b51e7
document what this file is for
...
Former-commit-id: 06f8b29971
2015-09-22 15:31:27 -04:00
Robyn Speer
6802a4f89d
fix README conflict
...
Former-commit-id: 5b918e7bb0
2015-09-22 14:23:55 -04:00
Robyn Speer
9a007b9948
refactor the tokenizer, add include_punctuation option
...
Former-commit-id: e8e6e0a231
2015-09-15 13:26:09 -04:00
Robyn Speer
1adbb1aaf1
add external_wordlist option to tokenize
...
Former-commit-id: 669bd16c13
2015-09-10 18:09:41 -04:00
Robyn Speer
f2be213933
Merge branch 'greek-and-turkish' into chinese-and-more
...
Conflicts:
README.md
wordfreq_builder/wordfreq_builder/ninja.py
Former-commit-id: 3cb3061e06
2015-09-10 15:27:33 -04:00
Robyn Speer
f0c7c3a02c
Lower the frequency of phrases with inferred token boundaries
...
Former-commit-id: 5c8c36f4e3
2015-09-10 14:16:22 -04:00
Andrew Lin
66f1afe4d7
Merge pull request #26 from LuminosoInsight/greek-and-turkish
...
Add SUBTLEX, support Turkish, expand Greek
Former-commit-id: acbb25e6f6
2015-09-10 13:48:33 -04:00
Robyn Speer
c5d5b0b1fe
In ninja deps, remove 'startrow' as a variable
...
Former-commit-id: a4f8d11427
2015-09-10 13:46:19 -04:00
Robyn Speer
acddc3ca05
fix spelling of Marc
...
Former-commit-id: 2277ad3116
2015-09-09 13:35:02 -04:00
Robyn Speer
872556f7bb
fixes based on code review notes
...
Former-commit-id: 354555514f
2015-09-09 13:10:18 -04:00
Robyn Speer
3dd70ed1c2
fix SUBTLEX citations
...
Former-commit-id: 6502f15e9b
2015-09-08 17:45:25 -04:00
Robyn Speer
1d3521dfda
take out OpenSubtitles for Chinese
...
Former-commit-id: d9c44d5fcc
2015-09-08 17:25:05 -04:00
Robyn Speer
59363c8c44
update comments in wordfreq_builder.config; remove unused 'version'
...
Former-commit-id: bc323eccaf
2015-09-08 16:15:29 -04:00
Robyn Speer
48f9d4520c
sort Jieba wordlists consistently; update data files
...
Former-commit-id: 0ab23f8a28
2015-09-08 16:09:53 -04:00
Robyn Speer
4aef1dc338
don't do language-specific tokenization in freqs_to_cBpack
...
Tokenizing in the 'merge' step is sufficient.
Former-commit-id: bc8ebd23e9
2015-09-08 14:46:04 -04:00
Robyn Speer
64b0b76ee1
actually fix logic of apostrophe-fixing
...
Former-commit-id: 715361ca0d
2015-09-08 13:50:34 -04:00
Robyn Speer
d6d2eac920
fix logic of apostrophe-fixing
...
Former-commit-id: c4c1af8213
2015-09-08 13:47:58 -04:00
Robyn Speer
523806d6db
fix '--language' option definition
...
Former-commit-id: 912171f8e7
2015-09-08 13:27:20 -04:00
Robyn Speer
099d90b700
Avoid Chinese tokenizer when building
...
Former-commit-id: 77a9b5c55b
2015-09-08 12:59:03 -04:00
Robyn Speer
3fa14ded28
language-specific frequency reading; fix 't in English
...
Former-commit-id: 9071defb33
2015-09-08 12:49:21 -04:00
Robyn Speer
1b35ff6b4c
Merge branch 'apostrophe-fix' into chinese-scripts
...
Conflicts:
wordfreq_builder/wordfreq_builder/word_counts.py
Former-commit-id: 20f2828d0a
2015-09-08 12:29:00 -04:00
Robyn Speer
319c3abaab
WIP: fix apostrophe trimming
...
Former-commit-id: e39d345c4b
2015-09-08 12:28:28 -04:00
Robyn Speer
c1f27d3095
update the README for Chinese
...
Former-commit-id: d576e3294b
2015-09-05 03:42:54 -04:00
Robyn Speer
a4554fb87c
tokenize Chinese using jieba and our own frequencies
...
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Robyn Speer
7d1c2e72e4
WIP: Traditional Chinese
...
Former-commit-id: 7906a671ea
2015-09-04 18:52:37 -04:00
Robyn Speer
e77c2dbca8
add Polish and Swedish to README
...
Former-commit-id: 3c3371a9ff
2015-09-04 17:10:40 -04:00
Robyn Speer
5b9b2d2d02
add Polish and Swedish, which have sufficient data
...
Former-commit-id: 447d7e5134
2015-09-04 17:10:40 -04:00
Robyn Speer
f7a4e2c444
update data files
...
Former-commit-id: 25edaad962
2015-09-04 17:00:55 -04:00
Robyn Speer
4704131e13
add tests for Turkish
...
Former-commit-id: fc93c8dc9c
2015-09-04 17:00:05 -04:00
Robyn Speer
a75a95658b
We can put the cutoff back now
...
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.
Former-commit-id: 5c7a7ea83e
2015-09-04 16:16:52 -04:00
Robyn Speer
f330d6d130
remove subtlex-gr from README
...
Former-commit-id: 56318a3ca3
2015-09-04 16:11:46 -04:00
Robyn Speer
032fea27c3
add more citations
...
Former-commit-id: 8196643509
2015-09-04 15:57:40 -04:00
Robyn Speer
8277b34571
Use SUBTLEX for German, but OpenSubtitles for Greek
...
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.
Former-commit-id: 77c60c29b0
2015-09-04 15:52:21 -04:00
Robyn Speer
69d65dfda3
update data files (without the CLD2 fix yet)
...
Former-commit-id: a47497c908
2015-09-04 14:58:20 -04:00
Robyn Speer
a69b66b210
Exclude angle brackets from CLD2 detection
...
Former-commit-id: 0d3ee869c1
2015-09-04 14:56:06 -04:00