Commit Graph

628 Commits

Author SHA1 Message Date
Robyn Speer
969a024dea actually use the results of language-detection on Reddit
Former-commit-id: 75a4a92110
2016-03-24 16:27:24 -04:00
Robyn Speer
fbc19995ab Merge remote-tracking branch 'origin/master' into big-list
Conflicts:
	wordfreq_builder/wordfreq_builder/cli/merge_counts.py

Former-commit-id: 164a5b1a05
2016-03-24 14:11:44 -04:00
Robyn Speer
f493d0eec4 make max-words a real, documented parameter
Former-commit-id: 178a8b1494
2016-03-24 14:10:02 -04:00
Robyn Speer
298cb69353 Merge pull request #33 from LuminosoInsight/bugfix
Restore a missing comma.

Former-commit-id: 7b539f9057
2016-03-24 13:59:50 -04:00
Andrew Lin
1942bc690f Restore a missing comma.
Former-commit-id: 38016cf62b
2016-03-24 13:57:18 -04:00
Andrew Lin
68e7846d50 Merge pull request #32 from LuminosoInsight/thai-fix
Leave Thai segments alone in the default regex

Former-commit-id: 84497429e1
2016-03-10 11:57:44 -05:00
Robyn Speer
f25985379c move Thai test to where it makes more sense
Former-commit-id: 4ec6b56faa
2016-03-10 11:56:15 -05:00
Robyn Speer
51e260b713 Leave Thai segments alone in the default regex
Our regex already has a special case to leave Chinese and Japanese alone
when an appropriate tokenizer for the language isn't being used, as
Unicode's default segmentation would make every character into its own
token.

The same thing happens in Thai, and we don't even *have* an appropriate
tokenizer for Thai, so I've added a similar fallback.


Former-commit-id: 07f16e6f03
2016-02-22 14:32:59 -05:00
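
Note on 51e260b713: the fallback this commit message describes can be illustrated with a small sketch. This is not the project's actual regex. It assumes the Unicode block U+0E00 to U+0E7F is a good enough stand-in for "Thai script", uses only the standard-library `re` module, and the names `THAI_RUN`, `OTHER_WORD`, and `sketch_tokenize` are hypothetical. The idea is simply that a run of Thai characters is kept as one token instead of being split into single characters.

```python
import re

# Assumption: U+0E00-U+0E7F (the Thai block) approximates "Thai script".
# The whole block is included so that combining vowel and tone marks stay
# attached to the run they belong to.
THAI_RUN = r"[\u0E00-\u0E7F]+"

# Everything else: runs of word characters that are not in the Thai block.
OTHER_WORD = r"[^\W\u0E00-\u0E7F]+"

TOKEN_RE = re.compile(f"{THAI_RUN}|{OTHER_WORD}")


def sketch_tokenize(text):
    """Return tokens, leaving each unbroken Thai segment as a single token."""
    return TOKEN_RE.findall(text)
```

With this sketch, a Thai segment surrounded by English words comes back as one token between the two English tokens. Actually splitting the segment into words would need a Thai-aware tokenizer, which the commit message says the project did not have.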
Robyn Speer
6344b38194 Add and document large wordlists
Former-commit-id: d79ee37da9
2016-01-22 16:23:43 -05:00
Robyn Speer
12e779fc79 configuration that builds some larger lists
Former-commit-id: c1a12cebec
2016-01-22 14:20:12 -05:00
Robyn Speer
83559a53d4 add Zipf scale
Former-commit-id: 9907948d11
2016-01-21 14:07:01 -05:00
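
Note on 83559a53d4: the Zipf scale is commonly defined as the base-10 logarithm of a word's frequency per billion words. A minimal sketch of that conversion follows, assuming the input is a frequency expressed as a proportion between 0 and 1. The function name `freq_to_zipf` is hypothetical and is not taken from this commit.

```python
import math


def freq_to_zipf(freq):
    """Convert a proportional frequency (0 < freq <= 1) to the Zipf scale.

    Zipf value = log10(occurrences per billion words)
               = log10(freq * 1e9)
               = log10(freq) + 9
    """
    if freq <= 0:
        raise ValueError("frequency must be positive")
    return math.log10(freq) + 9
```

Under this definition, a word that appears once per million words has a Zipf value of 3, and a word that appears once per thousand words has a Zipf value of 6.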
slibs63
927d4f45a4 Merge pull request #30 from LuminosoInsight/add-reddit
Add English data from Reddit corpus

Former-commit-id: d18fee3d78
2016-01-14 15:52:39 -05:00
Robyn Speer
6eca3cff5a fix documentation in wordfreq_builder.tokenizers
Former-commit-id: 8ddc19a5ca
2016-01-13 15:18:12 -05:00
Robyn Speer
95cdf41fe8 reformat some argparse argument definitions
Former-commit-id: 511fcb6f91
2016-01-13 12:05:07 -05:00
Robyn Speer
738243e244 build a bigger wordlist that we can optionally use
Former-commit-id: df8caaff7d
2016-01-12 14:05:57 -05:00
Robyn Speer
2069e30c89 fix usage text: one comment, not one tweet
Former-commit-id: 8d9668d8ab
2016-01-12 13:05:38 -05:00
Robyn Speer
883aa5baeb Separate tokens with spaces, not line breaks, in intermediate files
Former-commit-id: 115c74583e
2016-01-12 12:59:18 -05:00
Andrew Lin
eae7b2752e Merge pull request #31 from LuminosoInsight/use_encoding
Specify encoding when dealing with files

Former-commit-id: f30efebba0
2015-12-23 16:13:47 -05:00
Sara Jewett
42d209cbe2 Specify encoding when dealing with files
Former-commit-id: 37f9e12b93
2015-12-23 15:49:13 -05:00
Robyn Speer
7d1719cfb4 builder: Use an optional cutoff when merging counts
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.


Former-commit-id: 973caca253
2015-12-15 14:44:34 -05:00
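
Note on 7d1719cfb4: one way a cutoff can reduce memory use when merging counts is to skip rare entries before they enter the merged table. The sketch below assumes that the cutoff is applied to each input's counts separately and that the inputs are plain dictionaries. The commit message confirms neither detail, and the name and signature of `merge_counts_sketch` are hypothetical.

```python
from collections import defaultdict


def merge_counts_sketch(count_dicts, cutoff=0):
    """Sum word counts across inputs, ignoring entries below `cutoff`.

    Assumption: the cutoff is checked per input, so a word that is rare in
    every input never takes up space in the merged table.
    """
    merged = defaultdict(int)
    for counts in count_dicts:
        for word, count in counts.items():
            if count >= cutoff:
                merged[word] += count
    return dict(merged)
```

The trade-off is that a word falling below the cutoff in each input is dropped even if its combined count would have passed, which is usually acceptable for the long tail of a very large corpus.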
Robyn Speer
f5e09f3f3d gzip the intermediate step of Reddit word counting
Former-commit-id: 9a5d9d66bb
2015-12-09 13:30:08 -05:00
Robyn Speer
682e08fee2 no Thai because we can't tokenize it
Former-commit-id: 95f53e295b
2015-12-02 12:38:03 -05:00
Robyn Speer
064ee22a33 forgot about Italian
Former-commit-id: 8f6cd0e57b
2015-11-30 18:18:24 -05:00
Robyn Speer
ab8c2e2331 add tokenizer for Reddit
Former-commit-id: 5ef807117d
2015-11-30 18:16:54 -05:00
Robyn Speer
23949a4512 rebuild data files
Former-commit-id: 2dcf368481
2015-11-30 17:06:39 -05:00
Robyn Speer
6d2709f064 add word frequencies from the Reddit 2007-2015 corpus
Former-commit-id: b2d7546d2d
2015-11-30 16:38:11 -05:00
Robyn Speer
eb08c0a951 add docstrings to chinese_ and japanese_tokenize
Former-commit-id: e1f7a1ccf3
2015-10-27 13:23:56 -04:00
Lance Nathan
f4d865c0be Merge pull request #28 from LuminosoInsight/chinese-external-wordlist
Add some tokenizer options

Former-commit-id: ca00dfa1d9
2015-10-19 18:21:52 -04:00
Robyn Speer
5fedd71a66 Define globals in relevant places
Former-commit-id: a6b6aa07e7
2015-10-19 18:15:54 -04:00
Robyn Speer
91a81c1bde clarify the tokenize docstring
Former-commit-id: bfc17fea9f
2015-10-19 12:18:12 -04:00
Robyn Speer
c9693c9502 Merge branch 'master' into chinese-external-wordlist
Conflicts:
	wordfreq/chinese.py

Former-commit-id: 1793c1bb2e
2015-09-28 14:34:59 -04:00
Andrew Lin
6d5ead0b47 Merge pull request #29 from LuminosoInsight/code-review-notes-20150925
Fix documentation and clean up, based on Sep 25 code review

Former-commit-id: 15d99be21b
2015-09-28 13:53:50 -04:00
Robyn Speer
f3f66508bd Fix documentation and clean up, based on Sep 25 code review
Former-commit-id: 44b0c4f9ba
2015-09-28 12:58:46 -04:00
Robyn Speer
7494ae27a7 fix missing word in rules.ninja comment
Former-commit-id: 9b1c4d66cd
2015-09-24 17:56:06 -04:00
Robyn Speer
8e963dc312 describe optional dependencies better in the README
Former-commit-id: b460eef444
2015-09-24 17:54:52 -04:00
Robyn Speer
960dc437a2 update and clean up the tokenize() docstring
Former-commit-id: 24b16d8a5d
2015-09-24 17:47:16 -04:00
Robyn Speer
4a4534c466 test_chinese: fix typo in comment
Former-commit-id: 2a84a926f5
2015-09-24 13:41:11 -04:00
Robyn Speer
e15a231401 Merge branch 'master' into chinese-external-wordlist
Conflicts:
	wordfreq/chinese.py

Former-commit-id: cea2a61444
2015-09-24 13:40:08 -04:00
Andrew Lin
e27a75029d Revert "Remove the no-longer-existent .txt files from the MANIFEST."
This reverts commit 2089090151 [formerly db41bc7902].


Former-commit-id: cd0797e1c8
2015-09-24 13:31:34 -04:00
Andrew Lin
bb4653f16f Merge pull request #27 from LuminosoInsight/chinese-and-more
Improve Chinese, Greek, English; add Turkish, Polish, Swedish

Former-commit-id: 710eaabbe1
2015-09-24 13:25:21 -04:00
Andrew Lin
e7d46fb104 Revert a small syntax change introduced by a circular series of changes.
Former-commit-id: 09597b7cf3
2015-09-24 13:24:11 -04:00
Robyn Speer
4d00f17477 don't apply the inferred-space penalty to Japanese
Former-commit-id: db5eda6051
2015-09-24 12:50:06 -04:00
Andrew Lin
6b163e5772 Revert "Remove the no-longer-existent .txt files from the MANIFEST."
This reverts commit 2089090151 [formerly db41bc7902].


Former-commit-id: bb70bdba58
2015-09-23 13:02:40 -04:00
Robyn Speer
d215f79ea3 describe the use of lang in read_values
Former-commit-id: f224b8dbba
2015-09-22 17:22:38 -04:00
Robyn Speer
e6e29a1c03 Make the jieba_deps comment make sense
Former-commit-id: 7c12f2aca1
2015-09-22 17:19:00 -04:00
Robyn Speer
b4628abb38 actually, still delay loading the Jieba tokenizer
Former-commit-id: 48734d1a60
2015-09-22 16:54:39 -04:00
Robyn Speer
13642d6a4d replace the literal 10 with the constant INFERRED_SPACE_FACTOR
Former-commit-id: 7a3ea2bf79
2015-09-22 16:46:07 -04:00
Robyn Speer
01f9c07c33 remove unnecessary delayed loads in wordfreq.chinese
Former-commit-id: 4a87890afd
2015-09-22 16:42:13 -04:00
Robyn Speer
db30d09947 load the Chinese character mapping from a .msgpack.gz file
Former-commit-id: 6cf4210187
2015-09-22 16:32:33 -04:00
Robyn Speer
fe8a6b51e7 document what this file is for
Former-commit-id: 06f8b29971
2015-09-22 15:31:27 -04:00