Rob Speer
c2eab6881e
move Thai test to where it makes more sense
...
Former-commit-id: 4ec6b56faa
2016-03-10 11:56:15 -05:00
Rob Speer
a32162c04f
Leave Thai segments alone in the default regex
...
Our regex already has a special case to leave Chinese and Japanese alone
when an appropriate tokenizer for the language isn't being used, as
Unicode's default segmentation would make every character into its own
token.
The same thing happens in Thai, and we don't even *have* an appropriate
tokenizer for Thai, so I've added a similar fallback.
Former-commit-id: 07f16e6f03
2016-02-22 14:32:59 -05:00
Rob Speer
23c5c4adca
Add and document large wordlists
...
Former-commit-id: d79ee37da9
2016-01-22 16:23:43 -05:00
Rob Speer
3b95d349e0
configuration that builds some larger lists
...
Former-commit-id: c1a12cebec
2016-01-22 14:20:12 -05:00
Rob Speer
35ee23591e
add Zipf scale
...
Former-commit-id: 9907948d11
2016-01-21 14:07:01 -05:00
slibs63
258f5088e9
Merge pull request #30 from LuminosoInsight/add-reddit
...
Add English data from Reddit corpus
Former-commit-id: d18fee3d78
2016-01-14 15:52:39 -05:00
Rob Speer
ee8cfb5a50
fix documentation in wordfreq_builder.tokenizers
...
Former-commit-id: 8ddc19a5ca
2016-01-13 15:18:12 -05:00
Rob Speer
56f830d678
reformat some argparse argument definitions
...
Former-commit-id: 511fcb6f91
2016-01-13 12:05:07 -05:00
Rob Speer
f4761029d0
build a bigger wordlist that we can optionally use
...
Former-commit-id: df8caaff7d
2016-01-12 14:05:57 -05:00
Rob Speer
83bd019efe
fix usage text: one comment, not one tweet
...
Former-commit-id: 8d9668d8ab
2016-01-12 13:05:38 -05:00
Rob Speer
1d3485c855
Separate tokens with spaces, not line breaks, in intermediate files
...
Former-commit-id: 115c74583e
2016-01-12 12:59:18 -05:00
Andrew Lin
c9f679a7a3
Merge pull request #31 from LuminosoInsight/use_encoding
...
Specify encoding when dealing with files
Former-commit-id: f30efebba0
2015-12-23 16:13:47 -05:00
Sara Jewett
7b6f88b059
Specify encoding when dealing with files
...
Former-commit-id: 37f9e12b93
2015-12-23 15:49:13 -05:00
Rob Speer
6d62a8ff51
builder: Use an optional cutoff when merging counts
...
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.
Former-commit-id: 973caca253
2015-12-15 14:44:34 -05:00
Rob Speer
4e985e3bca
gzip the intermediate step of Reddit word counting
...
Former-commit-id: 9a5d9d66bb
2015-12-09 13:30:08 -05:00
Rob Speer
dc94222d7d
no Thai because we can't tokenize it
...
Former-commit-id: 95f53e295b
2015-12-02 12:38:03 -05:00
Rob Speer
237fabb4c5
forgot about Italian
...
Former-commit-id: 8f6cd0e57b
2015-11-30 18:18:24 -05:00
Rob Speer
6caa9ca443
add tokenizer for Reddit
...
Former-commit-id: 5ef807117d
2015-11-30 18:16:54 -05:00
Rob Speer
9a1b00ba0c
rebuild data files
...
Former-commit-id: 2dcf368481
2015-11-30 17:06:39 -05:00
Rob Speer
d1b667909d
add word frequencies from the Reddit 2007-2015 corpus
...
Former-commit-id: b2d7546d2d
2015-11-30 16:38:11 -05:00
Rob Speer
49b8ba4be9
add docstrings to chinese_ and japanese_tokenize
...
Former-commit-id: e1f7a1ccf3
2015-10-27 13:23:56 -04:00
Lance Nathan
f47249064f
Merge pull request #28 from LuminosoInsight/chinese-external-wordlist
...
Add some tokenizer options
Former-commit-id: ca00dfa1d9
2015-10-19 18:21:52 -04:00
Rob Speer
668a985969
Define globals in relevant places
...
Former-commit-id: a6b6aa07e7
2015-10-19 18:15:54 -04:00
Rob Speer
f255eb5bd8
clarify the tokenize docstring
...
Former-commit-id: bfc17fea9f
2015-10-19 12:18:12 -04:00
Rob Speer
8fea2ca181
Merge branch 'master' into chinese-external-wordlist
...
Conflicts:
wordfreq/chinese.py
Former-commit-id: 1793c1bb2e
2015-09-28 14:34:59 -04:00
Andrew Lin
d8422852f4
Merge pull request #29 from LuminosoInsight/code-review-notes-20150925
...
Fix documentation and clean up, based on Sep 25 code review
Former-commit-id: 15d99be21b
2015-09-28 13:53:50 -04:00
Rob Speer
3bd1fe2fe6
Fix documentation and clean up, based on Sep 25 code review
...
Former-commit-id: 44b0c4f9ba
2015-09-28 12:58:46 -04:00
Rob Speer
7435c8f57a
fix missing word in rules.ninja comment
...
Former-commit-id: 9b1c4d66cd
2015-09-24 17:56:06 -04:00
Rob Speer
7c596de98a
describe optional dependencies better in the README
...
Former-commit-id: b460eef444
2015-09-24 17:54:52 -04:00
Rob Speer
28381d5a51
update and clean up the tokenize() docstring
...
Former-commit-id: 24b16d8a5d
2015-09-24 17:47:16 -04:00
Rob Speer
f89ac5e400
test_chinese: fix typo in comment
...
Former-commit-id: 2a84a926f5
2015-09-24 13:41:11 -04:00
Rob Speer
faf66e9b08
Merge branch 'master' into chinese-external-wordlist
...
Conflicts:
wordfreq/chinese.py
Former-commit-id: cea2a61444
2015-09-24 13:40:08 -04:00
Andrew Lin
c53bb06988
Revert "Remove the no-longer-existent .txt files from the MANIFEST."
...
This reverts commit 65d6645e81 [formerly db41bc7902].
Former-commit-id: cd0797e1c8
2015-09-24 13:31:34 -04:00
Andrew Lin
566a62abd5
Merge pull request #27 from LuminosoInsight/chinese-and-more
...
Improve Chinese, Greek, English; add Turkish, Polish, Swedish
Former-commit-id: 710eaabbe1
2015-09-24 13:25:21 -04:00
Andrew Lin
ee6df56514
Revert a small syntax change introduced by a circular series of changes.
...
Former-commit-id: 09597b7cf3
2015-09-24 13:24:11 -04:00
Rob Speer
1b7117952b
don't apply the inferred-space penalty to Japanese
...
Former-commit-id: db5eda6051
2015-09-24 12:50:06 -04:00
Andrew Lin
4ccfcdc1bd
Revert "Remove the no-longer-existent .txt files from the MANIFEST."
...
This reverts commit 65d6645e81 [formerly db41bc7902].
Former-commit-id: bb70bdba58
2015-09-23 13:02:40 -04:00
Rob Speer
88deef24f6
describe the use of lang in read_values
...
Former-commit-id: f224b8dbba
2015-09-22 17:22:38 -04:00
Rob Speer
7cb310b28e
Make the jieba_deps comment make sense
...
Former-commit-id: 7c12f2aca1
2015-09-22 17:19:00 -04:00
Rob Speer
d68dd9f568
actually, still delay loading the Jieba tokenizer
...
Former-commit-id: 48734d1a60
2015-09-22 16:54:39 -04:00
Rob Speer
0e4daa8472
replace the literal 10 with the constant INFERRED_SPACE_FACTOR
...
Former-commit-id: 7a3ea2bf79
2015-09-22 16:46:07 -04:00
Rob Speer
5929975338
remove unnecessary delayed loads in wordfreq.chinese
...
Former-commit-id: 4a87890afd
2015-09-22 16:42:13 -04:00
Rob Speer
42ccba4fa6
load the Chinese character mapping from a .msgpack.gz file
...
Former-commit-id: 6cf4210187
2015-09-22 16:32:33 -04:00
Rob Speer
e12a42f38a
document what this file is for
...
Former-commit-id: 06f8b29971
2015-09-22 15:31:27 -04:00
Rob Speer
76c4a8975a
fix README conflict
...
Former-commit-id: 5b918e7bb0
2015-09-22 14:23:55 -04:00
Rob Speer
963e0ff785
refactor the tokenizer, add include_punctuation option
...
Former-commit-id: e8e6e0a231
2015-09-15 13:26:09 -04:00
Rob Speer
e3a79ab8c9
add external_wordlist option to tokenize
...
Former-commit-id: 669bd16c13
2015-09-10 18:09:41 -04:00
Rob Speer
7f92557a58
Merge branch 'greek-and-turkish' into chinese-and-more
...
Conflicts:
README.md
wordfreq_builder/wordfreq_builder/ninja.py
Former-commit-id: 3cb3061e06
2015-09-10 15:27:33 -04:00
Rob Speer
a13f459f88
Lower the frequency of phrases with inferred token boundaries
...
Former-commit-id: 5c8c36f4e3
2015-09-10 14:16:22 -04:00
Andrew Lin
800039f0f8
Merge pull request #26 from LuminosoInsight/greek-and-turkish
...
Add SUBTLEX, support Turkish, expand Greek
Former-commit-id: acbb25e6f6
2015-09-10 13:48:33 -04:00