Robyn Speer
cecf852040
update wordlists for new builder settings
...
Former-commit-id: a10c1d7ac0
2016-03-28 12:26:47 -04:00
Robyn Speer
0c7527140c
Discard text detected as an uncommon language; add large German list
...
Former-commit-id: abbc295538
2016-03-28 12:26:02 -04:00
Robyn Speer
aa7802b552
oh look, more spam
...
Former-commit-id: 08130908c7
2016-03-24 18:42:47 -04:00
Robyn Speer
2840ca55aa
filter out downvoted Reddit posts
...
Former-commit-id: 5b98794b86
2016-03-24 18:05:13 -04:00
Robyn Speer
16841d4b0c
disregard Arabic Reddit spam
...
Former-commit-id: cfe68893fa
2016-03-24 17:44:30 -04:00
Robyn Speer
034d8f540b
fix extraneous dot in intermediate filenames
...
Former-commit-id: 6feae99381
2016-03-24 16:52:44 -04:00
Robyn Speer
460fbb84fd
bump version to 1.4
...
Former-commit-id: 1df97a579e
2016-03-24 16:29:29 -04:00
Robyn Speer
969a024dea
actually use the results of language-detection on Reddit
...
Former-commit-id: 75a4a92110
2016-03-24 16:27:24 -04:00
Robyn Speer
fbc19995ab
Merge remote-tracking branch 'origin/master' into big-list
...
Conflicts:
wordfreq_builder/wordfreq_builder/cli/merge_counts.py
Former-commit-id: 164a5b1a05
2016-03-24 14:11:44 -04:00
Robyn Speer
f493d0eec4
make max-words a real, documented parameter
...
Former-commit-id: 178a8b1494
2016-03-24 14:10:02 -04:00
Robyn Speer
298cb69353
Merge pull request #33 from LuminosoInsight/bugfix
...
Restore a missing comma.
Former-commit-id: 7b539f9057
2016-03-24 13:59:50 -04:00
Andrew Lin
1942bc690f
Restore a missing comma.
...
Former-commit-id: 38016cf62b
2016-03-24 13:57:18 -04:00
Andrew Lin
68e7846d50
Merge pull request #32 from LuminosoInsight/thai-fix
...
Leave Thai segments alone in the default regex
Former-commit-id: 84497429e1
2016-03-10 11:57:44 -05:00
Robyn Speer
f25985379c
move Thai test to where it makes more sense
...
Former-commit-id: 4ec6b56faa
2016-03-10 11:56:15 -05:00
Robyn Speer
51e260b713
Leave Thai segments alone in the default regex
...
Our regex already has a special case to leave Chinese and Japanese alone
when an appropriate tokenizer for the language isn't being used, as
Unicode's default segmentation would make every character into its own
token.
The same thing happens in Thai, and we don't even *have* an appropriate
tokenizer for Thai, so I've added a similar fallback.
Former-commit-id: 07f16e6f03
2016-02-22 14:32:59 -05:00
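A rough sketch of the Thai fallback described in the commit above, using only Python's standard `re` module. This is not the project's actual regex: the names `NAIVE_RE`, `FALLBACK_RE`, `tokenize_naive`, and `tokenize_fallback` are hypothetical, and the "naive" pattern only imitates the one-token-per-character behaviour the commit message attributes to default segmentation.

```python
import re

# Thai Unicode block (U+0E00 to U+0E7F), written for use inside a character class.
THAI = "\u0E00-\u0E7F"

# Imitation of the problem: each Thai character becomes its own token,
# while other word characters are grouped into runs as usual.
NAIVE_RE = re.compile(rf"[{THAI}]|[^\W{THAI}]+")

# Fallback: a whole run of Thai characters is left alone as a single token.
FALLBACK_RE = re.compile(rf"[{THAI}]+|[^\W{THAI}]+")


def tokenize_naive(text):
    """Split text so that every Thai character is a separate token."""
    return NAIVE_RE.findall(text)


def tokenize_fallback(text):
    """Split text but keep each unbroken Thai segment in one piece."""
    return FALLBACK_RE.findall(text)


if __name__ == "__main__":
    sample = "ภาษาไทย is fun"
    print(tokenize_naive(sample))
    print(tokenize_fallback(sample))
```

Keeping the Thai run whole does not segment it into words; it only avoids producing a stream of single-character tokens when no Thai tokenizer is available.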
Robyn Speer
6344b38194
Add and document large wordlists
...
Former-commit-id: d79ee37da9
2016-01-22 16:23:43 -05:00
Robyn Speer
12e779fc79
configuration that builds some larger lists
...
Former-commit-id: c1a12cebec
2016-01-22 14:20:12 -05:00
Robyn Speer
83559a53d4
add Zipf scale
...
Former-commit-id: 9907948d11
2016-01-21 14:07:01 -05:00
slibs63
927d4f45a4
Merge pull request #30 from LuminosoInsight/add-reddit
...
Add English data from Reddit corpus
Former-commit-id: d18fee3d78
2016-01-14 15:52:39 -05:00
Robyn Speer
6eca3cff5a
fix documentation in wordfreq_builder.tokenizers
...
Former-commit-id: 8ddc19a5ca
2016-01-13 15:18:12 -05:00
Robyn Speer
95cdf41fe8
reformat some argparse argument definitions
...
Former-commit-id: 511fcb6f91
2016-01-13 12:05:07 -05:00
Robyn Speer
738243e244
build a bigger wordlist that we can optionally use
...
Former-commit-id: df8caaff7d
2016-01-12 14:05:57 -05:00
Robyn Speer
2069e30c89
fix usage text: one comment, not one tweet
...
Former-commit-id: 8d9668d8ab
2016-01-12 13:05:38 -05:00
Robyn Speer
883aa5baeb
Separate tokens with spaces, not line breaks, in intermediate files
...
Former-commit-id: 115c74583e
2016-01-12 12:59:18 -05:00
Andrew Lin
eae7b2752e
Merge pull request #31 from LuminosoInsight/use_encoding
...
Specify encoding when dealing with files
Former-commit-id: f30efebba0
2015-12-23 16:13:47 -05:00
Sara Jewett
42d209cbe2
Specify encoding when dealing with files
...
Former-commit-id: 37f9e12b93
2015-12-23 15:49:13 -05:00
Robyn Speer
7d1719cfb4
builder: Use an optional cutoff when merging counts
...
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.
Former-commit-id: 973caca253
2015-12-15 14:44:34 -05:00
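The idea behind the cutoff in the commit above can be sketched as follows. This is an illustration under assumptions, not the code in `merge_counts.py`: the function name, its signature, and the choice to apply the cutoff to each source before summing are all hypothetical.

```python
from collections import Counter


def merge_counts(count_dicts, cutoff=0):
    """Sum word counts from several sources, skipping rare entries.

    Assumption: an entry is skipped when its count within a single source
    is below `cutoff`, so the long tail of one-off tokens never enters the
    merged table and memory use stays bounded.
    """
    merged = Counter()
    for counts in count_dicts:
        for word, count in counts.items():
            if count >= cutoff:
                merged[word] += count
    return merged


if __name__ == "__main__":
    first = {"the": 10, "zzyzx": 1}
    second = {"the": 5, "cat": 3}
    print(merge_counts([first, second], cutoff=2))
```

With `cutoff=0` every entry is kept, which matches the behaviour of a merge without the optional parameter.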
Robyn Speer
f5e09f3f3d
gzip the intermediate step of Reddit word counting
...
Former-commit-id: 9a5d9d66bb
2015-12-09 13:30:08 -05:00
Robyn Speer
682e08fee2
no Thai because we can't tokenize it
...
Former-commit-id: 95f53e295b
2015-12-02 12:38:03 -05:00
Robyn Speer
064ee22a33
forgot about Italian
...
Former-commit-id: 8f6cd0e57b
2015-11-30 18:18:24 -05:00
Robyn Speer
ab8c2e2331
add tokenizer for Reddit
...
Former-commit-id: 5ef807117d
2015-11-30 18:16:54 -05:00
Robyn Speer
23949a4512
rebuild data files
...
Former-commit-id: 2dcf368481
2015-11-30 17:06:39 -05:00
Robyn Speer
6d2709f064
add word frequencies from the Reddit 2007-2015 corpus
...
Former-commit-id: b2d7546d2d
2015-11-30 16:38:11 -05:00
Robyn Speer
eb08c0a951
add docstrings to chinese_ and japanese_tokenize
...
Former-commit-id: e1f7a1ccf3
2015-10-27 13:23:56 -04:00
Lance Nathan
f4d865c0be
Merge pull request #28 from LuminosoInsight/chinese-external-wordlist
...
Add some tokenizer options
Former-commit-id: ca00dfa1d9
2015-10-19 18:21:52 -04:00
Robyn Speer
5fedd71a66
Define globals in relevant places
...
Former-commit-id: a6b6aa07e7
2015-10-19 18:15:54 -04:00
Robyn Speer
91a81c1bde
clarify the tokenize docstring
...
Former-commit-id: bfc17fea9f
2015-10-19 12:18:12 -04:00
Robyn Speer
c9693c9502
Merge branch 'master' into chinese-external-wordlist
...
Conflicts:
wordfreq/chinese.py
Former-commit-id: 1793c1bb2e
2015-09-28 14:34:59 -04:00
Andrew Lin
6d5ead0b47
Merge pull request #29 from LuminosoInsight/code-review-notes-20150925
...
Fix documentation and clean up, based on Sep 25 code review
Former-commit-id: 15d99be21b
2015-09-28 13:53:50 -04:00
Robyn Speer
f3f66508bd
Fix documentation and clean up, based on Sep 25 code review
...
Former-commit-id: 44b0c4f9ba
2015-09-28 12:58:46 -04:00
Robyn Speer
7494ae27a7
fix missing word in rules.ninja comment
...
Former-commit-id: 9b1c4d66cd
2015-09-24 17:56:06 -04:00
Robyn Speer
8e963dc312
describe optional dependencies better in the README
...
Former-commit-id: b460eef444
2015-09-24 17:54:52 -04:00
Robyn Speer
960dc437a2
update and clean up the tokenize() docstring
...
Former-commit-id: 24b16d8a5d
2015-09-24 17:47:16 -04:00
Robyn Speer
4a4534c466
test_chinese: fix typo in comment
...
Former-commit-id: 2a84a926f5
2015-09-24 13:41:11 -04:00
Robyn Speer
e15a231401
Merge branch 'master' into chinese-external-wordlist
...
Conflicts:
wordfreq/chinese.py
Former-commit-id: cea2a61444
2015-09-24 13:40:08 -04:00
Andrew Lin
e27a75029d
Revert "Remove the no-longer-existent .txt files from the MANIFEST."
...
This reverts commit 2089090151 [formerly db41bc7902].
Former-commit-id: cd0797e1c8
2015-09-24 13:31:34 -04:00
Andrew Lin
bb4653f16f
Merge pull request #27 from LuminosoInsight/chinese-and-more
...
Improve Chinese, Greek, English; add Turkish, Polish, Swedish
Former-commit-id: 710eaabbe1
2015-09-24 13:25:21 -04:00
Andrew Lin
e7d46fb104
Revert a small syntax change introduced by a circular series of changes.
...
Former-commit-id: 09597b7cf3
2015-09-24 13:24:11 -04:00
Robyn Speer
4d00f17477
don't apply the inferred-space penalty to Japanese
...
Former-commit-id: db5eda6051
2015-09-24 12:50:06 -04:00
Andrew Lin
6b163e5772
Revert "Remove the no-longer-existent .txt files from the MANIFEST."
...
This reverts commit 2089090151 [formerly db41bc7902].
Former-commit-id: bb70bdba58
2015-09-23 13:02:40 -04:00