Rob Speer
07f16e6f03
Leave Thai segments alone in the default regex
...
Our regex already has a special case to leave Chinese and Japanese alone
when an appropriate tokenizer for the language isn't being used, as
Unicode's default segmentation would make every character into its own
token.
The same thing happens in Thai, and we don't even *have* an appropriate
tokenizer for Thai, so I've added a similar fallback.
2016-02-22 14:32:59 -05:00
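As a rough illustration of the fallback described in this commit, here is a minimal sketch that keeps each run of Thai characters together as one token instead of splitting it up. It is not the project's actual regex: the names THAI_RUN, OTHER_WORD, TOKEN_RE and simple_tokenize are hypothetical, it uses only the standard-library re module, and it treats the whole Thai Unicode block (U+0E00 to U+0E7F) as "Thai", which is an assumption.

```python
import re

# Hypothetical pattern for illustration only; not the regex used in the project.
# A run of characters from the Thai block is kept as a single token.
THAI_RUN = r'[\u0e00-\u0e7f]+'
# Any other run of word characters that are not in the Thai block.
OTHER_WORD = r'[^\W\u0e00-\u0e7f]+'
TOKEN_RE = re.compile(THAI_RUN + '|' + OTHER_WORD)


def simple_tokenize(text):
    """Return lowercased tokens, leaving each Thai segment unsplit."""
    return [match.group(0).lower() for match in TOKEN_RE.finditer(text)]


if __name__ == '__main__':
    print(simple_tokenize('Hello ภาษาไทย world'))
```

The Thai alternative is listed first so that a Thai segment, including its combining vowel and tone marks, is consumed whole before the generic word pattern gets a chance to match.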
slibs63
d18fee3d78
Merge pull request #30 from LuminosoInsight/add-reddit
...
Add English data from Reddit corpus
2016-01-14 15:52:39 -05:00
Rob Speer
8ddc19a5ca
fix documentation in wordfreq_builder.tokenizers
2016-01-13 15:18:12 -05:00
Rob Speer
511fcb6f91
reformat some argparse argument definitions
2016-01-13 12:05:07 -05:00
Rob Speer
8d9668d8ab
fix usage text: one comment, not one tweet
2016-01-12 13:05:38 -05:00
Rob Speer
115c74583e
Separate tokens with spaces, not line breaks, in intermediate files
2016-01-12 12:59:18 -05:00
Andrew Lin
f30efebba0
Merge pull request #31 from LuminosoInsight/use_encoding
...
Specify encoding when dealing with files
2015-12-23 16:13:47 -05:00
Sara Jewett
37f9e12b93
Specify encoding when dealing with files
2015-12-23 15:49:13 -05:00
Rob Speer
973caca253
builder: Use an optional cutoff when merging counts
...
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.
2015-12-15 14:44:34 -05:00
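The idea of an optional cutoff while merging counts can be sketched as follows. This is only a guess at the shape of such a step, not the builder's code: the function name merge_counts, the cutoff parameter, and the choice to apply the cutoff to each input's counts before summing are all assumptions made for illustration.

```python
from collections import defaultdict


def merge_counts(count_dicts, cutoff=0):
    """
    Sum word counts from several dictionaries.

    Entries whose count in an individual input is below `cutoff` are skipped,
    so rare words never occupy a slot in the merged table. This is the
    assumed source of the memory saving.
    """
    merged = defaultdict(int)
    for counts in count_dicts:
        for word, count in counts.items():
            if count >= cutoff:
                merged[word] += count
    return dict(merged)


if __name__ == '__main__':
    shards = [{'the': 5, 'zyzzyva': 1}, {'the': 2, 'zyzzyva': 1}]
    print(merge_counts(shards, cutoff=2))
```

One consequence of this design is that a word that is rare in every shard is dropped even if its total would have reached the cutoff, which trades a small loss of accuracy in the long tail for a much smaller merged table.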
Rob Speer
9a5d9d66bb
gzip the intermediate step of Reddit word counting
2015-12-09 13:30:08 -05:00
Rob Speer
95f53e295b
no Thai because we can't tokenize it
2015-12-02 12:38:03 -05:00
Rob Speer
8f6cd0e57b
forgot about Italian
2015-11-30 18:18:24 -05:00
Rob Speer
5ef807117d
add tokenizer for Reddit
2015-11-30 18:16:54 -05:00
Rob Speer
2dcf368481
rebuild data files
2015-11-30 17:06:39 -05:00
Rob Speer
b2d7546d2d
add word frequencies from the Reddit 2007-2015 corpus
2015-11-30 16:38:11 -05:00
Rob Speer
e1f7a1ccf3
add docstrings to chinese_ and japanese_tokenize
2015-10-27 13:23:56 -04:00
Lance Nathan
ca00dfa1d9
Merge pull request #28 from LuminosoInsight/chinese-external-wordlist
...
Add some tokenizer options
2015-10-19 18:21:52 -04:00
Rob Speer
a6b6aa07e7
Define globals in relevant places
2015-10-19 18:15:54 -04:00
Rob Speer
bfc17fea9f
clarify the tokenize docstring
2015-10-19 12:18:12 -04:00
Rob Speer
1793c1bb2e
Merge branch 'master' into chinese-external-wordlist
...
Conflicts:
wordfreq/chinese.py
2015-09-28 14:34:59 -04:00
Andrew Lin
15d99be21b
Merge pull request #29 from LuminosoInsight/code-review-notes-20150925
...
Fix documentation and clean up, based on Sep 25 code review
2015-09-28 13:53:50 -04:00
Rob Speer
44b0c4f9ba
Fix documentation and clean up, based on Sep 25 code review
2015-09-28 12:58:46 -04:00
Rob Speer
9b1c4d66cd
fix missing word in rules.ninja comment
2015-09-24 17:56:06 -04:00
Rob Speer
b460eef444
describe optional dependencies better in the README
2015-09-24 17:54:52 -04:00
Rob Speer
24b16d8a5d
update and clean up the tokenize() docstring
2015-09-24 17:47:16 -04:00
Rob Speer
2a84a926f5
test_chinese: fix typo in comment
2015-09-24 13:41:11 -04:00
Rob Speer
cea2a61444
Merge branch 'master' into chinese-external-wordlist
...
Conflicts:
wordfreq/chinese.py
2015-09-24 13:40:08 -04:00
Andrew Lin
cd0797e1c8
Revert "Remove the no-longer-existent .txt files from the MANIFEST."
...
This reverts commit db41bc7902.
2015-09-24 13:31:34 -04:00
Andrew Lin
710eaabbe1
Merge pull request #27 from LuminosoInsight/chinese-and-more
...
Improve Chinese, Greek, English; add Turkish, Polish, Swedish
2015-09-24 13:25:21 -04:00
Andrew Lin
09597b7cf3
Revert a small syntax change introduced by a circular series of changes.
2015-09-24 13:24:11 -04:00
Rob Speer
db5eda6051
don't apply the inferred-space penalty to Japanese
2015-09-24 12:50:06 -04:00
Andrew Lin
bb70bdba58
Revert "Remove the no-longer-existent .txt files from the MANIFEST."
...
This reverts commit db41bc7902.
2015-09-23 13:02:40 -04:00
Rob Speer
f224b8dbba
describe the use of lang in read_values
2015-09-22 17:22:38 -04:00
Rob Speer
7c12f2aca1
Make the jieba_deps comment make sense
2015-09-22 17:19:00 -04:00
Rob Speer
48734d1a60
actually, still delay loading the Jieba tokenizer
2015-09-22 16:54:39 -04:00
Rob Speer
7a3ea2bf79
replace the literal 10 with the constant INFERRED_SPACE_FACTOR
2015-09-22 16:46:07 -04:00
Rob Speer
4a87890afd
remove unnecessary delayed loads in wordfreq.chinese
2015-09-22 16:42:13 -04:00
Rob Speer
6cf4210187
load the Chinese character mapping from a .msgpack.gz file
2015-09-22 16:32:33 -04:00
Rob Speer
06f8b29971
document what this file is for
2015-09-22 15:31:27 -04:00
Rob Speer
5b918e7bb0
fix README conflict
2015-09-22 14:23:55 -04:00
Rob Speer
e8e6e0a231
refactor the tokenizer, add include_punctuation option
2015-09-15 13:26:09 -04:00
Rob Speer
669bd16c13
add external_wordlist option to tokenize
2015-09-10 18:09:41 -04:00
Rob Speer
3cb3061e06
Merge branch 'greek-and-turkish' into chinese-and-more
...
Conflicts:
README.md
wordfreq_builder/wordfreq_builder/ninja.py
2015-09-10 15:27:33 -04:00
Rob Speer
5c8c36f4e3
Lower the frequency of phrases with inferred token boundaries
2015-09-10 14:16:22 -04:00
Andrew Lin
acbb25e6f6
Merge pull request #26 from LuminosoInsight/greek-and-turkish
...
Add SUBTLEX, support Turkish, expand Greek
2015-09-10 13:48:33 -04:00
Rob Speer
a4f8d11427
In ninja deps, remove 'startrow' as a variable
2015-09-10 13:46:19 -04:00
Rob Speer
2277ad3116
fix spelling of Marc
2015-09-09 13:35:02 -04:00
Rob Speer
354555514f
fixes based on code review notes
2015-09-09 13:10:18 -04:00
Rob Speer
6502f15e9b
fix SUBTLEX citations
2015-09-08 17:45:25 -04:00
Rob Speer
d9c44d5fcc
take out OpenSubtitles for Chinese
2015-09-08 17:25:05 -04:00