Rob Speer
|
df8caaff7d
|
build a bigger wordlist that we can optionally use
|
2016-01-12 14:05:57 -05:00 |
|
Rob Speer
|
8d9668d8ab
|
fix usage text: one comment, not one tweet
|
2016-01-12 13:05:38 -05:00 |
|
Rob Speer
|
115c74583e
|
Separate tokens with spaces, not line breaks, in intermediate files
|
2016-01-12 12:59:18 -05:00 |
|
Rob Speer
|
973caca253
|
builder: Use an optional cutoff when merging counts
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.
|
2015-12-15 14:44:34 -05:00 |
|
Rob Speer
|
9a5d9d66bb
|
gzip the intermediate step of Reddit word counting
|
2015-12-09 13:30:08 -05:00 |
|
Rob Speer
|
95f53e295b
|
no Thai because we can't tokenize it
|
2015-12-02 12:38:03 -05:00 |
|
Rob Speer
|
8f6cd0e57b
|
forgot about Italian
|
2015-11-30 18:18:24 -05:00 |
|
Rob Speer
|
5ef807117d
|
add tokenizer for Reddit
|
2015-11-30 18:16:54 -05:00 |
|
Rob Speer
|
2dcf368481
|
rebuild data files
|
2015-11-30 17:06:39 -05:00 |
|
Rob Speer
|
b2d7546d2d
|
add word frequencies from the Reddit 2007-2015 corpus
|
2015-11-30 16:38:11 -05:00 |
|
Rob Speer
|
e1f7a1ccf3
|
add docstrings to chinese_ and japanese_tokenize
|
2015-10-27 13:23:56 -04:00 |
|
Lance Nathan
|
ca00dfa1d9
|
Merge pull request #28 from LuminosoInsight/chinese-external-wordlist
Add some tokenizer options
|
2015-10-19 18:21:52 -04:00 |
|
Rob Speer
|
a6b6aa07e7
|
Define globals in relevant places
|
2015-10-19 18:15:54 -04:00 |
|
Rob Speer
|
bfc17fea9f
|
clarify the tokenize docstring
|
2015-10-19 12:18:12 -04:00 |
|
Rob Speer
|
1793c1bb2e
|
Merge branch 'master' into chinese-external-wordlist
Conflicts:
wordfreq/chinese.py
|
2015-09-28 14:34:59 -04:00 |
|
Andrew Lin
|
15d99be21b
|
Merge pull request #29 from LuminosoInsight/code-review-notes-20150925
Fix documentation and clean up, based on Sep 25 code review
|
2015-09-28 13:53:50 -04:00 |
|
Rob Speer
|
44b0c4f9ba
|
Fix documentation and clean up, based on Sep 25 code review
|
2015-09-28 12:58:46 -04:00 |
|
Rob Speer
|
9b1c4d66cd
|
fix missing word in rules.ninja comment
|
2015-09-24 17:56:06 -04:00 |
|
Rob Speer
|
b460eef444
|
describe optional dependencies better in the README
|
2015-09-24 17:54:52 -04:00 |
|
Rob Speer
|
24b16d8a5d
|
update and clean up the tokenize() docstring
|
2015-09-24 17:47:16 -04:00 |
|
Rob Speer
|
2a84a926f5
|
test_chinese: fix typo in comment
|
2015-09-24 13:41:11 -04:00 |
|
Rob Speer
|
cea2a61444
|
Merge branch 'master' into chinese-external-wordlist
Conflicts:
wordfreq/chinese.py
|
2015-09-24 13:40:08 -04:00 |
|
Andrew Lin
|
cd0797e1c8
|
Revert "Remove the no-longer-existent .txt files from the MANIFEST."
This reverts commit db41bc7902.
|
2015-09-24 13:31:34 -04:00 |
|
Andrew Lin
|
710eaabbe1
|
Merge pull request #27 from LuminosoInsight/chinese-and-more
Improve Chinese, Greek, English; add Turkish, Polish, Swedish
|
2015-09-24 13:25:21 -04:00 |
|
Andrew Lin
|
09597b7cf3
|
Revert a small syntax change introduced by a circular series of changes.
|
2015-09-24 13:24:11 -04:00 |
|
Rob Speer
|
db5eda6051
|
don't apply the inferred-space penalty to Japanese
|
2015-09-24 12:50:06 -04:00 |
|
Andrew Lin
|
bb70bdba58
|
Revert "Remove the no-longer-existent .txt files from the MANIFEST."
This reverts commit db41bc7902.
|
2015-09-23 13:02:40 -04:00 |
|
Rob Speer
|
f224b8dbba
|
describe the use of lang in read_values
|
2015-09-22 17:22:38 -04:00 |
|
Rob Speer
|
7c12f2aca1
|
Make the jieba_deps comment make sense
|
2015-09-22 17:19:00 -04:00 |
|
Rob Speer
|
48734d1a60
|
actually, still delay loading the Jieba tokenizer
|
2015-09-22 16:54:39 -04:00 |
|
Rob Speer
|
7a3ea2bf79
|
replace the literal 10 with the constant INFERRED_SPACE_FACTOR
|
2015-09-22 16:46:07 -04:00 |
|
Rob Speer
|
4a87890afd
|
remove unnecessary delayed loads in wordfreq.chinese
|
2015-09-22 16:42:13 -04:00 |
|
Rob Speer
|
6cf4210187
|
load the Chinese character mapping from a .msgpack.gz file
|
2015-09-22 16:32:33 -04:00 |
|
Rob Speer
|
06f8b29971
|
document what this file is for
|
2015-09-22 15:31:27 -04:00 |
|
Rob Speer
|
5b918e7bb0
|
fix README conflict
|
2015-09-22 14:23:55 -04:00 |
|
Rob Speer
|
e8e6e0a231
|
refactor the tokenizer, add include_punctuation option
|
2015-09-15 13:26:09 -04:00 |
|
Rob Speer
|
669bd16c13
|
add external_wordlist option to tokenize
|
2015-09-10 18:09:41 -04:00 |
|
Rob Speer
|
3cb3061e06
|
Merge branch 'greek-and-turkish' into chinese-and-more
Conflicts:
README.md
wordfreq_builder/wordfreq_builder/ninja.py
|
2015-09-10 15:27:33 -04:00 |
|
Rob Speer
|
5c8c36f4e3
|
Lower the frequency of phrases with inferred token boundaries
|
2015-09-10 14:16:22 -04:00 |
|
Andrew Lin
|
acbb25e6f6
|
Merge pull request #26 from LuminosoInsight/greek-and-turkish
Add SUBTLEX, support Turkish, expand Greek
|
2015-09-10 13:48:33 -04:00 |
|
Rob Speer
|
a4f8d11427
|
In ninja deps, remove 'startrow' as a variable
|
2015-09-10 13:46:19 -04:00 |
|
Rob Speer
|
2277ad3116
|
fix spelling of Marc
|
2015-09-09 13:35:02 -04:00 |
|
Rob Speer
|
354555514f
|
fixes based on code review notes
|
2015-09-09 13:10:18 -04:00 |
|
Rob Speer
|
6502f15e9b
|
fix SUBTLEX citations
|
2015-09-08 17:45:25 -04:00 |
|
Rob Speer
|
d9c44d5fcc
|
take out OpenSubtitles for Chinese
|
2015-09-08 17:25:05 -04:00 |
|
Rob Speer
|
bc323eccaf
|
update comments in wordfreq_builder.config; remove unused 'version'
|
2015-09-08 16:15:29 -04:00 |
|
Rob Speer
|
0ab23f8a28
|
sort Jieba wordlists consistently; update data files
|
2015-09-08 16:09:53 -04:00 |
|
Rob Speer
|
bc8ebd23e9
|
don't do language-specific tokenization in freqs_to_cBpack
Tokenizing in the 'merge' step is sufficient.
|
2015-09-08 14:46:04 -04:00 |
|
Rob Speer
|
715361ca0d
|
actually fix logic of apostrophe-fixing
|
2015-09-08 13:50:34 -04:00 |
|
Rob Speer
|
c4c1af8213
|
fix logic of apostrophe-fixing
|
2015-09-08 13:47:58 -04:00 |
|