Robyn Speer
c9693c9502
Merge branch 'master' into chinese-external-wordlist
...
Conflicts:
wordfreq/chinese.py
Former-commit-id: 1793c1bb2e
2015-09-28 14:34:59 -04:00
Andrew Lin
6d5ead0b47
Merge pull request #29 from LuminosoInsight/code-review-notes-20150925
...
Fix documentation and clean up, based on Sep 25 code review
Former-commit-id: 15d99be21b
2015-09-28 13:53:50 -04:00
Robyn Speer
f3f66508bd
Fix documentation and clean up, based on Sep 25 code review
...
Former-commit-id: 44b0c4f9ba
2015-09-28 12:58:46 -04:00
Robyn Speer
7494ae27a7
fix missing word in rules.ninja comment
...
Former-commit-id: 9b1c4d66cd
2015-09-24 17:56:06 -04:00
Robyn Speer
8e963dc312
describe optional dependencies better in the README
...
Former-commit-id: b460eef444
2015-09-24 17:54:52 -04:00
Robyn Speer
960dc437a2
update and clean up the tokenize() docstring
...
Former-commit-id: 24b16d8a5d
2015-09-24 17:47:16 -04:00
Robyn Speer
4a4534c466
test_chinese: fix typo in comment
...
Former-commit-id: 2a84a926f5
2015-09-24 13:41:11 -04:00
Robyn Speer
e15a231401
Merge branch 'master' into chinese-external-wordlist
...
Conflicts:
wordfreq/chinese.py
Former-commit-id: cea2a61444
2015-09-24 13:40:08 -04:00
Andrew Lin
e27a75029d
Revert "Remove the no-longer-existent .txt files from the MANIFEST."
...
This reverts commit 2089090151 [formerly db41bc7902].
Former-commit-id: cd0797e1c8
2015-09-24 13:31:34 -04:00
Andrew Lin
bb4653f16f
Merge pull request #27 from LuminosoInsight/chinese-and-more
...
Improve Chinese, Greek, English; add Turkish, Polish, Swedish
Former-commit-id: 710eaabbe1
2015-09-24 13:25:21 -04:00
Andrew Lin
e7d46fb104
Revert a small syntax change introduced by a circular series of changes.
...
Former-commit-id: 09597b7cf3
2015-09-24 13:24:11 -04:00
Robyn Speer
4d00f17477
don't apply the inferred-space penalty to Japanese
...
Former-commit-id: db5eda6051
2015-09-24 12:50:06 -04:00
Andrew Lin
6b163e5772
Revert "Remove the no-longer-existent .txt files from the MANIFEST."
...
This reverts commit 2089090151 [formerly db41bc7902].
Former-commit-id: bb70bdba58
2015-09-23 13:02:40 -04:00
Robyn Speer
d215f79ea3
describe the use of lang in read_values
...
Former-commit-id: f224b8dbba
2015-09-22 17:22:38 -04:00
Robyn Speer
e6e29a1c03
Make the jieba_deps comment make sense
...
Former-commit-id: 7c12f2aca1
2015-09-22 17:19:00 -04:00
Robyn Speer
b4628abb38
actually, still delay loading the Jieba tokenizer
...
Former-commit-id: 48734d1a60
2015-09-22 16:54:39 -04:00
Robyn Speer
13642d6a4d
replace the literal 10 with the constant INFERRED_SPACE_FACTOR
...
Former-commit-id: 7a3ea2bf79
2015-09-22 16:46:07 -04:00
Robyn Speer
01f9c07c33
remove unnecessary delayed loads in wordfreq.chinese
...
Former-commit-id: 4a87890afd
2015-09-22 16:42:13 -04:00
Robyn Speer
db30d09947
load the Chinese character mapping from a .msgpack.gz file
...
Former-commit-id: 6cf4210187
2015-09-22 16:32:33 -04:00
Robyn Speer
fe8a6b51e7
document what this file is for
...
Former-commit-id: 06f8b29971
2015-09-22 15:31:27 -04:00
Robyn Speer
6802a4f89d
fix README conflict
...
Former-commit-id: 5b918e7bb0
2015-09-22 14:23:55 -04:00
Robyn Speer
9a007b9948
refactor the tokenizer, add include_punctuation option
...
Former-commit-id: e8e6e0a231
2015-09-15 13:26:09 -04:00
Robyn Speer
1adbb1aaf1
add external_wordlist option to tokenize
...
Former-commit-id: 669bd16c13
2015-09-10 18:09:41 -04:00
Robyn Speer
f2be213933
Merge branch 'greek-and-turkish' into chinese-and-more
...
Conflicts:
README.md
wordfreq_builder/wordfreq_builder/ninja.py
Former-commit-id: 3cb3061e06
2015-09-10 15:27:33 -04:00
Robyn Speer
f0c7c3a02c
Lower the frequency of phrases with inferred token boundaries
...
Former-commit-id: 5c8c36f4e3
2015-09-10 14:16:22 -04:00
Andrew Lin
66f1afe4d7
Merge pull request #26 from LuminosoInsight/greek-and-turkish
...
Add SUBTLEX, support Turkish, expand Greek
Former-commit-id: acbb25e6f6
2015-09-10 13:48:33 -04:00
Robyn Speer
c5d5b0b1fe
In ninja deps, remove 'startrow' as a variable
...
Former-commit-id: a4f8d11427
2015-09-10 13:46:19 -04:00
Robyn Speer
acddc3ca05
fix spelling of Marc
...
Former-commit-id: 2277ad3116
2015-09-09 13:35:02 -04:00
Robyn Speer
872556f7bb
fixes based on code review notes
...
Former-commit-id: 354555514f
2015-09-09 13:10:18 -04:00
Robyn Speer
3dd70ed1c2
fix SUBTLEX citations
...
Former-commit-id: 6502f15e9b
2015-09-08 17:45:25 -04:00
Robyn Speer
1d3521dfda
take out OpenSubtitles for Chinese
...
Former-commit-id: d9c44d5fcc
2015-09-08 17:25:05 -04:00
Robyn Speer
59363c8c44
update comments in wordfreq_builder.config; remove unused 'version'
...
Former-commit-id: bc323eccaf
2015-09-08 16:15:29 -04:00
Robyn Speer
48f9d4520c
sort Jieba wordlists consistently; update data files
...
Former-commit-id: 0ab23f8a28
2015-09-08 16:09:53 -04:00
Robyn Speer
4aef1dc338
don't do language-specific tokenization in freqs_to_cBpack
...
Tokenizing in the 'merge' step is sufficient.
Former-commit-id: bc8ebd23e9
2015-09-08 14:46:04 -04:00
Robyn Speer
64b0b76ee1
actually fix logic of apostrophe-fixing
...
Former-commit-id: 715361ca0d
2015-09-08 13:50:34 -04:00
Robyn Speer
d6d2eac920
fix logic of apostrophe-fixing
...
Former-commit-id: c4c1af8213
2015-09-08 13:47:58 -04:00
Robyn Speer
523806d6db
fix '--language' option definition
...
Former-commit-id: 912171f8e7
2015-09-08 13:27:20 -04:00
Robyn Speer
099d90b700
Avoid Chinese tokenizer when building
...
Former-commit-id: 77a9b5c55b
2015-09-08 12:59:03 -04:00
Robyn Speer
3fa14ded28
language-specific frequency reading; fix 't in English
...
Former-commit-id: 9071defb33
2015-09-08 12:49:21 -04:00
Robyn Speer
1b35ff6b4c
Merge branch 'apostrophe-fix' into chinese-scripts
...
Conflicts:
wordfreq_builder/wordfreq_builder/word_counts.py
Former-commit-id: 20f2828d0a
2015-09-08 12:29:00 -04:00
Robyn Speer
319c3abaab
WIP: fix apostrophe trimming
...
Former-commit-id: e39d345c4b
2015-09-08 12:28:28 -04:00
Robyn Speer
c1f27d3095
update the README for Chinese
...
Former-commit-id: d576e3294b
2015-09-05 03:42:54 -04:00
Robyn Speer
a4554fb87c
tokenize Chinese using jieba and our own frequencies
...
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Robyn Speer
7d1c2e72e4
WIP: Traditional Chinese
...
Former-commit-id: 7906a671ea
2015-09-04 18:52:37 -04:00
Robyn Speer
e77c2dbca8
add Polish and Swedish to README
...
Former-commit-id: 3c3371a9ff
2015-09-04 17:10:40 -04:00
Robyn Speer
5b9b2d2d02
add Polish and Swedish, which have sufficient data
...
Former-commit-id: 447d7e5134
2015-09-04 17:10:40 -04:00
Robyn Speer
f7a4e2c444
update data files
...
Former-commit-id: 25edaad962
2015-09-04 17:00:55 -04:00
Robyn Speer
4704131e13
add tests for Turkish
...
Former-commit-id: fc93c8dc9c
2015-09-04 17:00:05 -04:00
Robyn Speer
a75a95658b
We can put the cutoff back now
...
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.
Former-commit-id: 5c7a7ea83e
2015-09-04 16:16:52 -04:00
Robyn Speer
f330d6d130
remove subtlex-gr from README
...
Former-commit-id: 56318a3ca3
2015-09-04 16:11:46 -04:00