Rob Speer
d68dd9f568
actually, still delay loading the Jieba tokenizer
...
Former-commit-id: 48734d1a60
2015-09-22 16:54:39 -04:00
Rob Speer
0e4daa8472
replace the literal 10 with the constant INFERRED_SPACE_FACTOR
...
Former-commit-id: 7a3ea2bf79
2015-09-22 16:46:07 -04:00
Rob Speer
5929975338
remove unnecessary delayed loads in wordfreq.chinese
...
Former-commit-id: 4a87890afd
2015-09-22 16:42:13 -04:00
Rob Speer
42ccba4fa6
load the Chinese character mapping from a .msgpack.gz file
...
Former-commit-id: 6cf4210187
2015-09-22 16:32:33 -04:00
Rob Speer
e12a42f38a
document what this file is for
...
Former-commit-id: 06f8b29971
2015-09-22 15:31:27 -04:00
Rob Speer
76c4a8975a
fix README conflict
...
Former-commit-id: 5b918e7bb0
2015-09-22 14:23:55 -04:00
Rob Speer
7f92557a58
Merge branch 'greek-and-turkish' into chinese-and-more
...
Conflicts:
README.md
wordfreq_builder/wordfreq_builder/ninja.py
Former-commit-id: 3cb3061e06
2015-09-10 15:27:33 -04:00
Rob Speer
a13f459f88
Lower the frequency of phrases with inferred token boundaries
...
Former-commit-id: 5c8c36f4e3
2015-09-10 14:16:22 -04:00
Rob Speer
e3cc8eaea9
In ninja deps, remove 'startrow' as a variable
...
Former-commit-id: a4f8d11427
2015-09-10 13:46:19 -04:00
Rob Speer
5701c1165d
fix spelling of Marc
...
Former-commit-id: 2277ad3116
2015-09-09 13:35:02 -04:00
Rob Speer
9c08442dc5
fixes based on code review notes
...
Former-commit-id: 354555514f
2015-09-09 13:10:18 -04:00
Rob Speer
37e5e1009f
fix SUBTLEX citations
...
Former-commit-id: 6502f15e9b
2015-09-08 17:45:25 -04:00
Rob Speer
0f9497d864
take out OpenSubtitles for Chinese
...
Former-commit-id: d9c44d5fcc
2015-09-08 17:25:05 -04:00
Rob Speer
5e86394c4c
update comments in wordfreq_builder.config; remove unused 'version'
...
Former-commit-id: bc323eccaf
2015-09-08 16:15:29 -04:00
Rob Speer
2dfaf7798d
sort Jieba wordlists consistently; update data files
...
Former-commit-id: 0ab23f8a28
2015-09-08 16:09:53 -04:00
Rob Speer
01332f1ed5
don't do language-specific tokenization in freqs_to_cBpack
...
Tokenizing in the 'merge' step is sufficient.
Former-commit-id: bc8ebd23e9
2015-09-08 14:46:04 -04:00
Rob Speer
86475d6b5f
actually fix logic of apostrophe-fixing
...
Former-commit-id: 715361ca0d
2015-09-08 13:50:34 -04:00
Rob Speer
6bd0979ad2
fix logic of apostrophe-fixing
...
Former-commit-id: c4c1af8213
2015-09-08 13:47:58 -04:00
Rob Speer
8c3fb9f716
fix '--language' option definition
...
Former-commit-id: 912171f8e7
2015-09-08 13:27:20 -04:00
Rob Speer
67bb55988e
Avoid Chinese tokenizer when building
...
Former-commit-id: 77a9b5c55b
2015-09-08 12:59:03 -04:00
Rob Speer
11202ad7f5
language-specific frequency reading; fix 't in English
...
Former-commit-id: 9071defb33
2015-09-08 12:49:21 -04:00
Rob Speer
30237cf73d
Merge branch 'apostrophe-fix' into chinese-scripts
...
Conflicts:
wordfreq_builder/wordfreq_builder/word_counts.py
Former-commit-id: 20f2828d0a
2015-09-08 12:29:00 -04:00
Rob Speer
854247bf8b
WIP: fix apostrophe trimming
...
Former-commit-id: e39d345c4b
2015-09-08 12:28:28 -04:00
Rob Speer
b4100b5bfb
update the README for Chinese
...
Former-commit-id: d576e3294b
2015-09-05 03:42:54 -04:00
Rob Speer
91cc82f76d
tokenize Chinese using jieba and our own frequencies
...
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Rob Speer
e2a3758832
WIP: Traditional Chinese
...
Former-commit-id: 7906a671ea
2015-09-04 18:52:37 -04:00
Rob Speer
62f5a8eb1e
add Polish and Swedish to README
...
Former-commit-id: 3c3371a9ff
2015-09-04 17:10:40 -04:00
Rob Speer
a555e5dc13
add Polish and Swedish, which have sufficient data
...
Former-commit-id: 447d7e5134
2015-09-04 17:10:40 -04:00
Rob Speer
1d4a18ead2
update data files
...
Former-commit-id: 25edaad962
2015-09-04 17:00:55 -04:00
Rob Speer
63295fc397
add tests for Turkish
...
Former-commit-id: fc93c8dc9c
2015-09-04 17:00:05 -04:00
Rob Speer
0441a81bbe
We can put the cutoff back now
...
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.
Former-commit-id: 5c7a7ea83e
2015-09-04 16:16:52 -04:00
Rob Speer
917ce398a2
remove subtlex-gr from README
...
Former-commit-id: 56318a3ca3
2015-09-04 16:11:46 -04:00
Rob Speer
138e8aaa3f
add more citations
...
Former-commit-id: 8196643509
2015-09-04 15:57:40 -04:00
Rob Speer
c08e593234
Use SUBTLEX for German, but OpenSubtitles for Greek
...
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.
Former-commit-id: 77c60c29b0
2015-09-04 15:52:21 -04:00
Rob Speer
a8161b1067
update data files (without the CLD2 fix yet)
...
Former-commit-id: a47497c908
2015-09-04 14:58:20 -04:00
Rob Speer
3a8b2c2c81
Exclude angle brackets from CLD2 detection
...
Former-commit-id: 0d3ee869c1
2015-09-04 14:56:06 -04:00
Rob Speer
a0997a79a4
update README with additional SUBTLEX support
...
Former-commit-id: 81bbe663fb
2015-09-04 13:23:33 -04:00
Rob Speer
b1d158ab41
add more SUBTLEX and fix its build rules
...
Former-commit-id: 34474939f2
2015-09-04 12:37:35 -04:00
Rob Speer
f993ffcdf2
update the data
...
Former-commit-id: c11e3b7a9d
2015-09-04 02:07:50 -04:00
Rob Speer
25e24f9c32
Note on next languages to support
...
Former-commit-id: 531db64288
2015-09-04 01:50:15 -04:00
Rob Speer
bf88f97744
expand list of sources and supported languages
...
Former-commit-id: d9a1c34d00
2015-09-04 01:03:36 -04:00
Rob Speer
a6ef3224a6
support Turkish and more Greek; document more
...
Former-commit-id: d94428d454
2015-09-04 00:57:04 -04:00
Rob Speer
89763679de
Merge branch 'add-subtlex' into greek-and-turkish
...
Former-commit-id: 45d871a815
2015-09-03 23:26:14 -04:00
Rob Speer
ad4b12bee9
refer to merge_freqs command correctly
...
Former-commit-id: 40d82541ba
2015-09-03 23:25:46 -04:00
Rob Speer
7a2f2035ab
expand Greek and enable Turkish in config
...
Former-commit-id: a3daba81eb
2015-09-03 23:23:31 -04:00
Rob Speer
a92c398258
add SUBTLEX to the readme
...
Former-commit-id: e6a2886a66
2015-09-03 18:56:56 -04:00
Rob Speer
cb5b696ffa
Add SUBTLEX as a source of English and Chinese data
...
Meanwhile, fix up the dependency graph thingy. It's actually kind of
legible now.
Former-commit-id: 2d58ba94f2
2015-09-03 18:13:13 -04:00
Rob Speer
531cedbf55
Merge pull request #24 from LuminosoInsight/manifest
...
Remove the no-longer-existent .txt files from the MANIFEST.
Former-commit-id: 07228fdf1d
2015-09-02 14:34:17 -04:00
Andrew Lin
65d6645e81
Remove the no-longer-existent .txt files from the MANIFEST.
...
Former-commit-id: db41bc7902
2015-09-02 14:27:15 -04:00
Andrew Lin
b693715663
Merge pull request #23 from LuminosoInsight/readme
...
Put documentation and examples in the README
Former-commit-id: e43b5ebf7b
2015-08-28 17:59:17 -04:00