Commit Graph

417 Commits

Author SHA1 Message Date
Rob Speer
07f16e6f03 Leave Thai segments alone in the default regex
Our regex already has a special case to leave Chinese and Japanese alone
when an appropriate tokenizer for the language isn't being used, as
Unicode's default segmentation would make every character into its own
token.

The same thing happens in Thai, and we don't even *have* an appropriate
tokenizer for Thai, so I've added a similar fallback.
2016-02-22 14:32:59 -05:00
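For context, the fallback this commit describes can be sketched as follows. This is illustrative, not wordfreq's actual pattern (the real code uses the third-party `regex` module with script properties); explicit Unicode block ranges are used here so the sketch runs on the stdlib `re` module alone.

```python
import re

# Hedged sketch of the fallback: keep a run of Thai characters together as a
# single token, mirroring the special case already in place for Chinese and
# Japanese. Without this, Unicode's default segmentation would make every
# Thai character its own token.
TOKEN_RE = re.compile(
    r"[\u0e00-\u0e7f]+"   # Thai block: one token per run, not per character
    r"|[\u4e00-\u9fff]+"  # CJK ideographs: the existing fallback
    r"|\w+"               # ordinary words elsewhere
)

def simple_tokenize(text):
    return TOKEN_RE.findall(text.lower())

print(simple_tokenize("hello สวัสดี world"))  # -> ['hello', 'สวัสดี', 'world']
```

The Thai and Han branches come before `\w+` in the alternation, so a run of Thai text is claimed whole before the generic word rule can see it.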
slibs63
d18fee3d78 Merge pull request #30 from LuminosoInsight/add-reddit
Add English data from Reddit corpus
2016-01-14 15:52:39 -05:00
Rob Speer
8ddc19a5ca fix documentation in wordfreq_builder.tokenizers 2016-01-13 15:18:12 -05:00
Rob Speer
511fcb6f91 reformat some argparse argument definitions 2016-01-13 12:05:07 -05:00
Rob Speer
8d9668d8ab fix usage text: one comment, not one tweet 2016-01-12 13:05:38 -05:00
Rob Speer
115c74583e Separate tokens with spaces, not line breaks, in intermediate files 2016-01-12 12:59:18 -05:00
Andrew Lin
f30efebba0 Merge pull request #31 from LuminosoInsight/use_encoding
Specify encoding when dealing with files
2015-12-23 16:13:47 -05:00
Sara Jewett
37f9e12b93 Specify encoding when dealing with files 2015-12-23 15:49:13 -05:00
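The change is small but easy to get wrong: the illustration below shows the pattern the commit enforces, passing `encoding=` explicitly on every `open()` call instead of relying on the platform default (which can be cp1252 on Windows and mangle non-ASCII text). The path and file contents here are throwaway examples.

```python
import os
import tempfile

# Always name the encoding explicitly when reading or writing text files,
# so the same bytes round-trip identically on every platform.
path = os.path.join(tempfile.mkdtemp(), "counts.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write("café 12\n")

with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

print(round_tripped)  # -> café 12
```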
Rob Speer
973caca253 builder: Use an optional cutoff when merging counts
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.
2015-12-15 14:44:34 -05:00
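The idea behind this commit can be sketched as below. The function name and signature are illustrative, not wordfreq_builder's actual API: when merging word counts from many input files, an optional cutoff discards words whose merged count is too low, so intermediate merge results for a huge corpus like Reddit don't hold millions of one-off tokens in memory.

```python
from collections import defaultdict

def merge_counts(count_dicts, cutoff=0):
    """Merge several {word: count} dicts; optionally drop rare words.

    In a staged merge pipeline, pruning each intermediate result with a
    cutoff keeps peak memory bounded. (Illustrative sketch only.)
    """
    merged = defaultdict(int)
    for counts in count_dicts:
        for word, count in counts.items():
            merged[word] += count
    if cutoff > 0:
        merged = {w: c for w, c in merged.items() if c >= cutoff}
    return dict(merged)

parts = [{"the": 5, "narwhal": 1}, {"the": 7, "bacons": 1}]
print(merge_counts(parts, cutoff=2))  # -> {'the': 12}
```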
Rob Speer
9a5d9d66bb gzip the intermediate step of Reddit word counting 2015-12-09 13:30:08 -05:00
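A minimal sketch of this change, with an illustrative file name and line format: the intermediate word-count output is written through `gzip` instead of as plain text, shrinking the pipeline's on-disk footprint while staying line-oriented.

```python
import gzip
import os
import tempfile

# Write and read the intermediate counts through gzip in text mode,
# with an explicit encoding.
path = os.path.join(tempfile.mkdtemp(), "counts.txt.gz")

with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("the 12\nnarwhal 1\n")

with gzip.open(path, "rt", encoding="utf-8") as f:
    lines = f.read().splitlines()

print(lines)  # -> ['the 12', 'narwhal 1']
```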
Rob Speer
95f53e295b no Thai because we can't tokenize it 2015-12-02 12:38:03 -05:00
Rob Speer
8f6cd0e57b forgot about Italian 2015-11-30 18:18:24 -05:00
Rob Speer
5ef807117d add tokenizer for Reddit 2015-11-30 18:16:54 -05:00
Rob Speer
2dcf368481 rebuild data files 2015-11-30 17:06:39 -05:00
Rob Speer
b2d7546d2d add word frequencies from the Reddit 2007-2015 corpus 2015-11-30 16:38:11 -05:00
Rob Speer
e1f7a1ccf3 add docstrings to chinese_ and japanese_tokenize 2015-10-27 13:23:56 -04:00
Lance Nathan
ca00dfa1d9 Merge pull request #28 from LuminosoInsight/chinese-external-wordlist
Add some tokenizer options
2015-10-19 18:21:52 -04:00
Rob Speer
a6b6aa07e7 Define globals in relevant places 2015-10-19 18:15:54 -04:00
Rob Speer
bfc17fea9f clarify the tokenize docstring 2015-10-19 12:18:12 -04:00
Rob Speer
1793c1bb2e Merge branch 'master' into chinese-external-wordlist
Conflicts:
	wordfreq/chinese.py
2015-09-28 14:34:59 -04:00
Andrew Lin
15d99be21b Merge pull request #29 from LuminosoInsight/code-review-notes-20150925
Fix documentation and clean up, based on Sep 25 code review
2015-09-28 13:53:50 -04:00
Rob Speer
44b0c4f9ba Fix documentation and clean up, based on Sep 25 code review 2015-09-28 12:58:46 -04:00
Rob Speer
9b1c4d66cd fix missing word in rules.ninja comment 2015-09-24 17:56:06 -04:00
Rob Speer
b460eef444 describe optional dependencies better in the README 2015-09-24 17:54:52 -04:00
Rob Speer
24b16d8a5d update and clean up the tokenize() docstring 2015-09-24 17:47:16 -04:00
Rob Speer
2a84a926f5 test_chinese: fix typo in comment 2015-09-24 13:41:11 -04:00
Rob Speer
cea2a61444 Merge branch 'master' into chinese-external-wordlist
Conflicts:
	wordfreq/chinese.py
2015-09-24 13:40:08 -04:00
Andrew Lin
cd0797e1c8 Revert "Remove the no-longer-existent .txt files from the MANIFEST."
This reverts commit db41bc7902.
2015-09-24 13:31:34 -04:00
Andrew Lin
710eaabbe1 Merge pull request #27 from LuminosoInsight/chinese-and-more
Improve Chinese, Greek, English; add Turkish, Polish, Swedish
2015-09-24 13:25:21 -04:00
Andrew Lin
09597b7cf3 Revert a small syntax change introduced by a circular series of changes. 2015-09-24 13:24:11 -04:00
Rob Speer
db5eda6051 don't apply the inferred-space penalty to Japanese 2015-09-24 12:50:06 -04:00
Andrew Lin
bb70bdba58 Revert "Remove the no-longer-existent .txt files from the MANIFEST."
This reverts commit db41bc7902.
2015-09-23 13:02:40 -04:00
Rob Speer
f224b8dbba describe the use of lang in read_values 2015-09-22 17:22:38 -04:00
Rob Speer
7c12f2aca1 Make the jieba_deps comment make sense 2015-09-22 17:19:00 -04:00
Rob Speer
48734d1a60 actually, still delay loading the Jieba tokenizer 2015-09-22 16:54:39 -04:00
Rob Speer
7a3ea2bf79 replace the literal 10 with the constant INFERRED_SPACE_FACTOR 2015-09-22 16:46:07 -04:00
Rob Speer
4a87890afd remove unnecessary delayed loads in wordfreq.chinese 2015-09-22 16:42:13 -04:00
Rob Speer
6cf4210187 load the Chinese character mapping from a .msgpack.gz file 2015-09-22 16:32:33 -04:00
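The pattern here is loading a precomputed mapping from a compressed, serialized data file rather than building it at runtime. wordfreq stores it as `.msgpack.gz`; the sketch below substitutes `json` for msgpack so it runs with the standard library only, and the two-entry mapping is made up for illustration.

```python
import gzip
import json
import os
import tempfile

# A tiny stand-in for the simplified-Chinese character mapping.
# (Illustrative data; wordfreq's real file uses msgpack, not json.)
mapping = {"車": "车", "語": "语"}
path = os.path.join(tempfile.mkdtemp(), "char_mapping.json.gz")

# Write the mapping compressed...
with gzip.open(path, "wt", encoding="utf-8") as f:
    json.dump(mapping, f)

# ...and load it back the way the library would at import time.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded["車"])  # -> 车
```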
Rob Speer
06f8b29971 document what this file is for 2015-09-22 15:31:27 -04:00
Rob Speer
5b918e7bb0 fix README conflict 2015-09-22 14:23:55 -04:00
Rob Speer
e8e6e0a231 refactor the tokenizer, add include_punctuation option 2015-09-15 13:26:09 -04:00
Rob Speer
669bd16c13 add external_wordlist option to tokenize 2015-09-10 18:09:41 -04:00
Rob Speer
3cb3061e06 Merge branch 'greek-and-turkish' into chinese-and-more
Conflicts:
	README.md
	wordfreq_builder/wordfreq_builder/ninja.py
2015-09-10 15:27:33 -04:00
Rob Speer
5c8c36f4e3 Lower the frequency of phrases with inferred token boundaries 2015-09-10 14:16:22 -04:00
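A hedged sketch of the idea in this commit: when token boundaries are inferred by a tokenizer (as in Chinese text segmented by Jieba) rather than observed as real spaces, the phrase's frequency is divided by a constant factor per inferred boundary, so such phrases aren't scored as confidently as genuinely space-delimited ones. The factor 10 matches the `INFERRED_SPACE_FACTOR` constant named in a later commit; the function name is illustrative, and the per-token frequencies are combined here as a naive product, which is not wordfreq's actual combination rule.

```python
INFERRED_SPACE_FACTOR = 10  # penalty per inferred token boundary

def penalized_frequency(token_freqs, spaces_inferred=True):
    # Naively combine per-token frequencies as if independent
    # (wordfreq's real combination rule differs).
    freq = 1.0
    for f in token_freqs:
        freq *= f
    # Penalize each boundary the tokenizer had to infer.
    if spaces_inferred:
        freq /= INFERRED_SPACE_FACTOR ** (len(token_freqs) - 1)
    return freq

# Two tokens at frequency 1e-3 each, one inferred boundary between them.
print(penalized_frequency([1e-3, 1e-3]))
```

A later commit exempts Japanese from this penalty, which in this sketch would simply mean calling with `spaces_inferred=False`.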
Andrew Lin
acbb25e6f6 Merge pull request #26 from LuminosoInsight/greek-and-turkish
Add SUBTLEX, support Turkish, expand Greek
2015-09-10 13:48:33 -04:00
Rob Speer
a4f8d11427 In ninja deps, remove 'startrow' as a variable 2015-09-10 13:46:19 -04:00
Rob Speer
2277ad3116 fix spelling of Marc 2015-09-09 13:35:02 -04:00
Rob Speer
354555514f fixes based on code review notes 2015-09-09 13:10:18 -04:00
Rob Speer
6502f15e9b fix SUBTLEX citations 2015-09-08 17:45:25 -04:00
Rob Speer
d9c44d5fcc take out OpenSubtitles for Chinese 2015-09-08 17:25:05 -04:00