Rob Speer
7c596de98a
describe optional dependencies better in the README
...
Former-commit-id: b460eef444
2015-09-24 17:54:52 -04:00
Rob Speer
28381d5a51
update and clean up the tokenize() docstring
...
Former-commit-id: 24b16d8a5d
2015-09-24 17:47:16 -04:00
Rob Speer
f89ac5e400
test_chinese: fix typo in comment
...
Former-commit-id: 2a84a926f5
2015-09-24 13:41:11 -04:00
Rob Speer
faf66e9b08
Merge branch 'master' into chinese-external-wordlist
...
Conflicts:
wordfreq/chinese.py
Former-commit-id: cea2a61444
2015-09-24 13:40:08 -04:00
Andrew Lin
c53bb06988
Revert "Remove the no-longer-existent .txt files from the MANIFEST."
...
This reverts commit 65d6645e81 [formerly db41bc7902].
Former-commit-id: cd0797e1c8
2015-09-24 13:31:34 -04:00
Andrew Lin
566a62abd5
Merge pull request #27 from LuminosoInsight/chinese-and-more
...
Improve Chinese, Greek, English; add Turkish, Polish, Swedish
Former-commit-id: 710eaabbe1
2015-09-24 13:25:21 -04:00
Andrew Lin
ee6df56514
Revert a small syntax change introduced by a circular series of changes.
...
Former-commit-id: 09597b7cf3
2015-09-24 13:24:11 -04:00
Rob Speer
1b7117952b
don't apply the inferred-space penalty to Japanese
...
Former-commit-id: db5eda6051
2015-09-24 12:50:06 -04:00
Andrew Lin
4ccfcdc1bd
Revert "Remove the no-longer-existent .txt files from the MANIFEST."
...
This reverts commit 65d6645e81 [formerly db41bc7902].
Former-commit-id: bb70bdba58
2015-09-23 13:02:40 -04:00
Rob Speer
88deef24f6
describe the use of lang in read_values
...
Former-commit-id: f224b8dbba
2015-09-22 17:22:38 -04:00
Rob Speer
7cb310b28e
Make the jieba_deps comment make sense
...
Former-commit-id: 7c12f2aca1
2015-09-22 17:19:00 -04:00
Rob Speer
d68dd9f568
actually, still delay loading the Jieba tokenizer
...
Former-commit-id: 48734d1a60
2015-09-22 16:54:39 -04:00
Rob Speer
0e4daa8472
replace the literal 10 with the constant INFERRED_SPACE_FACTOR
...
Former-commit-id: 7a3ea2bf79
2015-09-22 16:46:07 -04:00
Rob Speer
5929975338
remove unnecessary delayed loads in wordfreq.chinese
...
Former-commit-id: 4a87890afd
2015-09-22 16:42:13 -04:00
Rob Speer
42ccba4fa6
load the Chinese character mapping from a .msgpack.gz file
...
Former-commit-id: 6cf4210187
2015-09-22 16:32:33 -04:00
Rob Speer
e12a42f38a
document what this file is for
...
Former-commit-id: 06f8b29971
2015-09-22 15:31:27 -04:00
Rob Speer
76c4a8975a
fix README conflict
...
Former-commit-id: 5b918e7bb0
2015-09-22 14:23:55 -04:00
Rob Speer
963e0ff785
refactor the tokenizer, add include_punctuation option
...
Former-commit-id: e8e6e0a231
2015-09-15 13:26:09 -04:00
Rob Speer
e3a79ab8c9
add external_wordlist option to tokenize
...
Former-commit-id: 669bd16c13
2015-09-10 18:09:41 -04:00
Rob Speer
7f92557a58
Merge branch 'greek-and-turkish' into chinese-and-more
...
Conflicts:
README.md
wordfreq_builder/wordfreq_builder/ninja.py
Former-commit-id: 3cb3061e06
2015-09-10 15:27:33 -04:00
Rob Speer
a13f459f88
Lower the frequency of phrases with inferred token boundaries
...
Former-commit-id: 5c8c36f4e3
2015-09-10 14:16:22 -04:00
Andrew Lin
800039f0f8
Merge pull request #26 from LuminosoInsight/greek-and-turkish
...
Add SUBTLEX, support Turkish, expand Greek
Former-commit-id: acbb25e6f6
2015-09-10 13:48:33 -04:00
Rob Speer
e3cc8eaea9
In ninja deps, remove 'startrow' as a variable
...
Former-commit-id: a4f8d11427
2015-09-10 13:46:19 -04:00
Rob Speer
5701c1165d
fix spelling of Marc
...
Former-commit-id: 2277ad3116
2015-09-09 13:35:02 -04:00
Rob Speer
9c08442dc5
fixes based on code review notes
...
Former-commit-id: 354555514f
2015-09-09 13:10:18 -04:00
Rob Speer
37e5e1009f
fix SUBTLEX citations
...
Former-commit-id: 6502f15e9b
2015-09-08 17:45:25 -04:00
Rob Speer
0f9497d864
take out OpenSubtitles for Chinese
...
Former-commit-id: d9c44d5fcc
2015-09-08 17:25:05 -04:00
Rob Speer
5e86394c4c
update comments in wordfreq_builder.config; remove unused 'version'
...
Former-commit-id: bc323eccaf
2015-09-08 16:15:29 -04:00
Rob Speer
2dfaf7798d
sort Jieba wordlists consistently; update data files
...
Former-commit-id: 0ab23f8a28
2015-09-08 16:09:53 -04:00
Rob Speer
01332f1ed5
don't do language-specific tokenization in freqs_to_cBpack
...
Tokenizing in the 'merge' step is sufficient.
Former-commit-id: bc8ebd23e9
2015-09-08 14:46:04 -04:00
Rob Speer
86475d6b5f
actually fix logic of apostrophe-fixing
...
Former-commit-id: 715361ca0d
2015-09-08 13:50:34 -04:00
Rob Speer
6bd0979ad2
fix logic of apostrophe-fixing
...
Former-commit-id: c4c1af8213
2015-09-08 13:47:58 -04:00
Rob Speer
8c3fb9f716
fix '--language' option definition
...
Former-commit-id: 912171f8e7
2015-09-08 13:27:20 -04:00
Rob Speer
67bb55988e
Avoid Chinese tokenizer when building
...
Former-commit-id: 77a9b5c55b
2015-09-08 12:59:03 -04:00
Rob Speer
11202ad7f5
language-specific frequency reading; fix 't in English
...
Former-commit-id: 9071defb33
2015-09-08 12:49:21 -04:00
Rob Speer
30237cf73d
Merge branch 'apostrophe-fix' into chinese-scripts
...
Conflicts:
wordfreq_builder/wordfreq_builder/word_counts.py
Former-commit-id: 20f2828d0a
2015-09-08 12:29:00 -04:00
Rob Speer
854247bf8b
WIP: fix apostrophe trimming
...
Former-commit-id: e39d345c4b
2015-09-08 12:28:28 -04:00
Rob Speer
b4100b5bfb
update the README for Chinese
...
Former-commit-id: d576e3294b
2015-09-05 03:42:54 -04:00
Rob Speer
91cc82f76d
tokenize Chinese using jieba and our own frequencies
...
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Rob Speer
e2a3758832
WIP: Traditional Chinese
...
Former-commit-id: 7906a671ea
2015-09-04 18:52:37 -04:00
Rob Speer
62f5a8eb1e
add Polish and Swedish to README
...
Former-commit-id: 3c3371a9ff
2015-09-04 17:10:40 -04:00
Rob Speer
a555e5dc13
add Polish and Swedish, which have sufficient data
...
Former-commit-id: 447d7e5134
2015-09-04 17:10:40 -04:00
Rob Speer
1d4a18ead2
update data files
...
Former-commit-id: 25edaad962
2015-09-04 17:00:55 -04:00
Rob Speer
63295fc397
add tests for Turkish
...
Former-commit-id: fc93c8dc9c
2015-09-04 17:00:05 -04:00
Rob Speer
0441a81bbe
We can put the cutoff back now
...
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.
Former-commit-id: 5c7a7ea83e
2015-09-04 16:16:52 -04:00
Rob Speer
917ce398a2
remove subtlex-gr from README
...
Former-commit-id: 56318a3ca3
2015-09-04 16:11:46 -04:00
Rob Speer
138e8aaa3f
add more citations
...
Former-commit-id: 8196643509
2015-09-04 15:57:40 -04:00
Rob Speer
c08e593234
Use SUBTLEX for German, but OpenSubtitles for Greek
...
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.
Former-commit-id: 77c60c29b0
2015-09-04 15:52:21 -04:00
Rob Speer
a8161b1067
update data files (without the CLD2 fix yet)
...
Former-commit-id: a47497c908
2015-09-04 14:58:20 -04:00
Rob Speer
3a8b2c2c81
Exclude angle brackets from CLD2 detection
...
Former-commit-id: 0d3ee869c1
2015-09-04 14:56:06 -04:00