Robyn Speer
|
6344b38194
|
Add and document large wordlists
Former-commit-id: d79ee37da9
|
2016-01-22 16:23:43 -05:00 |
|
Robyn Speer
|
12e779fc79
|
configuration that builds some larger lists
Former-commit-id: c1a12cebec
|
2016-01-22 14:20:12 -05:00 |
|
Robyn Speer
|
83559a53d4
|
add Zipf scale
Former-commit-id: 9907948d11
|
2016-01-21 14:07:01 -05:00 |
|
slibs63
|
927d4f45a4
|
Merge pull request #30 from LuminosoInsight/add-reddit
Add English data from Reddit corpus
Former-commit-id: d18fee3d78
|
2016-01-14 15:52:39 -05:00 |
|
Robyn Speer
|
6eca3cff5a
|
fix documentation in wordfreq_builder.tokenizers
Former-commit-id: 8ddc19a5ca
|
2016-01-13 15:18:12 -05:00 |
|
Robyn Speer
|
95cdf41fe8
|
reformat some argparse argument definitions
Former-commit-id: 511fcb6f91
|
2016-01-13 12:05:07 -05:00 |
|
Robyn Speer
|
738243e244
|
build a bigger wordlist that we can optionally use
Former-commit-id: df8caaff7d
|
2016-01-12 14:05:57 -05:00 |
|
Robyn Speer
|
2069e30c89
|
fix usage text: one comment, not one tweet
Former-commit-id: 8d9668d8ab
|
2016-01-12 13:05:38 -05:00 |
|
Robyn Speer
|
883aa5baeb
|
Separate tokens with spaces, not line breaks, in intermediate files
Former-commit-id: 115c74583e
|
2016-01-12 12:59:18 -05:00 |
|
Andrew Lin
|
eae7b2752e
|
Merge pull request #31 from LuminosoInsight/use_encoding
Specify encoding when dealing with files
Former-commit-id: f30efebba0
|
2015-12-23 16:13:47 -05:00 |
|
Sara Jewett
|
42d209cbe2
|
Specify encoding when dealing with files
Former-commit-id: 37f9e12b93
|
2015-12-23 15:49:13 -05:00 |
|
Robyn Speer
|
7d1719cfb4
|
builder: Use an optional cutoff when merging counts
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.
Former-commit-id: 973caca253
|
2015-12-15 14:44:34 -05:00 |
|
Robyn Speer
|
f5e09f3f3d
|
gzip the intermediate step of Reddit word counting
Former-commit-id: 9a5d9d66bb
|
2015-12-09 13:30:08 -05:00 |
|
Robyn Speer
|
682e08fee2
|
no Thai because we can't tokenize it
Former-commit-id: 95f53e295b
|
2015-12-02 12:38:03 -05:00 |
|
Robyn Speer
|
064ee22a33
|
forgot about Italian
Former-commit-id: 8f6cd0e57b
|
2015-11-30 18:18:24 -05:00 |
|
Robyn Speer
|
ab8c2e2331
|
add tokenizer for Reddit
Former-commit-id: 5ef807117d
|
2015-11-30 18:16:54 -05:00 |
|
Robyn Speer
|
23949a4512
|
rebuild data files
Former-commit-id: 2dcf368481
|
2015-11-30 17:06:39 -05:00 |
|
Robyn Speer
|
6d2709f064
|
add word frequencies from the Reddit 2007-2015 corpus
Former-commit-id: b2d7546d2d
|
2015-11-30 16:38:11 -05:00 |
|
Robyn Speer
|
eb08c0a951
|
add docstrings to chinese_ and japanese_tokenize
Former-commit-id: e1f7a1ccf3
|
2015-10-27 13:23:56 -04:00 |
|
Lance Nathan
|
f4d865c0be
|
Merge pull request #28 from LuminosoInsight/chinese-external-wordlist
Add some tokenizer options
Former-commit-id: ca00dfa1d9
|
2015-10-19 18:21:52 -04:00 |
|
Robyn Speer
|
5fedd71a66
|
Define globals in relevant places
Former-commit-id: a6b6aa07e7
|
2015-10-19 18:15:54 -04:00 |
|
Robyn Speer
|
91a81c1bde
|
clarify the tokenize docstring
Former-commit-id: bfc17fea9f
|
2015-10-19 12:18:12 -04:00 |
|
Robyn Speer
|
c9693c9502
|
Merge branch 'master' into chinese-external-wordlist
Conflicts:
wordfreq/chinese.py
Former-commit-id: 1793c1bb2e
|
2015-09-28 14:34:59 -04:00 |
|
Andrew Lin
|
6d5ead0b47
|
Merge pull request #29 from LuminosoInsight/code-review-notes-20150925
Fix documentation and clean up, based on Sep 25 code review
Former-commit-id: 15d99be21b
|
2015-09-28 13:53:50 -04:00 |
|
Robyn Speer
|
f3f66508bd
|
Fix documentation and clean up, based on Sep 25 code review
Former-commit-id: 44b0c4f9ba
|
2015-09-28 12:58:46 -04:00 |
|
Robyn Speer
|
7494ae27a7
|
fix missing word in rules.ninja comment
Former-commit-id: 9b1c4d66cd
|
2015-09-24 17:56:06 -04:00 |
|
Robyn Speer
|
8e963dc312
|
describe optional dependencies better in the README
Former-commit-id: b460eef444
|
2015-09-24 17:54:52 -04:00 |
|
Robyn Speer
|
960dc437a2
|
update and clean up the tokenize() docstring
Former-commit-id: 24b16d8a5d
|
2015-09-24 17:47:16 -04:00 |
|
Robyn Speer
|
4a4534c466
|
test_chinese: fix typo in comment
Former-commit-id: 2a84a926f5
|
2015-09-24 13:41:11 -04:00 |
|
Robyn Speer
|
e15a231401
|
Merge branch 'master' into chinese-external-wordlist
Conflicts:
wordfreq/chinese.py
Former-commit-id: cea2a61444
|
2015-09-24 13:40:08 -04:00 |
|
Andrew Lin
|
e27a75029d
|
Revert "Remove the no-longer-existent .txt files from the MANIFEST."
This reverts commit 2089090151 [formerly db41bc7902].
Former-commit-id: cd0797e1c8
|
2015-09-24 13:31:34 -04:00 |
|
Andrew Lin
|
bb4653f16f
|
Merge pull request #27 from LuminosoInsight/chinese-and-more
Improve Chinese, Greek, English; add Turkish, Polish, Swedish
Former-commit-id: 710eaabbe1
|
2015-09-24 13:25:21 -04:00 |
|
Andrew Lin
|
e7d46fb104
|
Revert a small syntax change introduced by a circular series of changes.
Former-commit-id: 09597b7cf3
|
2015-09-24 13:24:11 -04:00 |
|
Robyn Speer
|
4d00f17477
|
don't apply the inferred-space penalty to Japanese
Former-commit-id: db5eda6051
|
2015-09-24 12:50:06 -04:00 |
|
Andrew Lin
|
6b163e5772
|
Revert "Remove the no-longer-existent .txt files from the MANIFEST."
This reverts commit 2089090151 [formerly db41bc7902].
Former-commit-id: bb70bdba58
|
2015-09-23 13:02:40 -04:00 |
|
Robyn Speer
|
d215f79ea3
|
describe the use of lang in read_values
Former-commit-id: f224b8dbba
|
2015-09-22 17:22:38 -04:00 |
|
Robyn Speer
|
e6e29a1c03
|
Make the jieba_deps comment make sense
Former-commit-id: 7c12f2aca1
|
2015-09-22 17:19:00 -04:00 |
|
Robyn Speer
|
b4628abb38
|
actually, still delay loading the Jieba tokenizer
Former-commit-id: 48734d1a60
|
2015-09-22 16:54:39 -04:00 |
|
Robyn Speer
|
13642d6a4d
|
replace the literal 10 with the constant INFERRED_SPACE_FACTOR
Former-commit-id: 7a3ea2bf79
|
2015-09-22 16:46:07 -04:00 |
|
Robyn Speer
|
01f9c07c33
|
remove unnecessary delayed loads in wordfreq.chinese
Former-commit-id: 4a87890afd
|
2015-09-22 16:42:13 -04:00 |
|
Robyn Speer
|
db30d09947
|
load the Chinese character mapping from a .msgpack.gz file
Former-commit-id: 6cf4210187
|
2015-09-22 16:32:33 -04:00 |
|
Robyn Speer
|
fe8a6b51e7
|
document what this file is for
Former-commit-id: 06f8b29971
|
2015-09-22 15:31:27 -04:00 |
|
Robyn Speer
|
6802a4f89d
|
fix README conflict
Former-commit-id: 5b918e7bb0
|
2015-09-22 14:23:55 -04:00 |
|
Robyn Speer
|
9a007b9948
|
refactor the tokenizer, add include_punctuation option
Former-commit-id: e8e6e0a231
|
2015-09-15 13:26:09 -04:00 |
|
Robyn Speer
|
1adbb1aaf1
|
add external_wordlist option to tokenize
Former-commit-id: 669bd16c13
|
2015-09-10 18:09:41 -04:00 |
|
Robyn Speer
|
f2be213933
|
Merge branch 'greek-and-turkish' into chinese-and-more
Conflicts:
README.md
wordfreq_builder/wordfreq_builder/ninja.py
Former-commit-id: 3cb3061e06
|
2015-09-10 15:27:33 -04:00 |
|
Robyn Speer
|
f0c7c3a02c
|
Lower the frequency of phrases with inferred token boundaries
Former-commit-id: 5c8c36f4e3
|
2015-09-10 14:16:22 -04:00 |
|
Andrew Lin
|
66f1afe4d7
|
Merge pull request #26 from LuminosoInsight/greek-and-turkish
Add SUBTLEX, support Turkish, expand Greek
Former-commit-id: acbb25e6f6
|
2015-09-10 13:48:33 -04:00 |
|
Robyn Speer
|
c5d5b0b1fe
|
In ninja deps, remove 'startrow' as a variable
Former-commit-id: a4f8d11427
|
2015-09-10 13:46:19 -04:00 |
|
Robyn Speer
|
acddc3ca05
|
fix spelling of Marc
Former-commit-id: 2277ad3116
|
2015-09-09 13:35:02 -04:00 |
|