Andrew Lin
e7b34fb655
Merge pull request #34 from LuminosoInsight/big-list
...
wordfreq 1.4: some bigger wordlists, better use of language detection
2016-05-11 16:27:51 -04:00
Rob Speer
dcb77a552b
fix to README: we're only using Reddit in English
2016-05-11 15:38:29 -04:00
Rob Speer
2276d97368
limit Reddit data to just English
2016-04-15 17:01:21 -04:00
Rob Speer
ced15d6eff
remove reddit_base_filename function
2016-03-31 13:39:13 -04:00
Rob Speer
ff1f0e4678
use path.stem to make the Reddit filename prefix
2016-03-31 13:13:52 -04:00
Rob Speer
16059d3b9a
rename max_size to max_words consistently
2016-03-31 12:55:18 -04:00
Rob Speer
697842b3f9
fix table showing marginal Korean support
2016-03-30 15:11:13 -04:00
Rob Speer
ed32b278cc
make an example clearer with wordlist='large'
2016-03-30 15:08:32 -04:00
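For context, 'large' is a value of the wordlist parameter accepted by wordfreq's lookup functions. A minimal sketch of the kind of call the clarified example illustrates (the word and language below are arbitrary, not taken from the commit):

    from wordfreq import word_frequency

    # Look up a frequency from the larger wordlist instead of the default one.
    print(word_frequency('frequency', 'en', wordlist='large'))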
Rob Speer
a10c1d7ac0
update wordlists for new builder settings
2016-03-28 12:26:47 -04:00
Rob Speer
abbc295538
Discard text detected as an uncommon language; add large German list
2016-03-28 12:26:02 -04:00
Rob Speer
08130908c7
oh look, more spam
2016-03-24 18:42:47 -04:00
Rob Speer
5b98794b86
filter out downvoted Reddit posts
2016-03-24 18:05:13 -04:00
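The Reddit-filtering commits around this point (English-only data, dropping spam, dropping downvoted posts) amount to a per-comment filter. A rough sketch of that idea, assuming the 'score' and 'body' fields of the public Reddit comment dumps and using pycld2 for language detection (the builder's actual detector may differ):

    import json
    import pycld2

    def keep_comment(line):
        comment = json.loads(line)
        if comment.get('score', 0) < 1:
            return False                                # skip downvoted posts
        reliable, _, details = pycld2.detect(comment.get('body', ''))
        return reliable and details[0][1] == 'en'       # keep only detected English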
Rob Speer
cfe68893fa
disregard Arabic Reddit spam
2016-03-24 17:44:30 -04:00
Rob Speer
6feae99381
fix extraneous dot in intermediate filenames
2016-03-24 16:52:44 -04:00
Rob Speer
1df97a579e
bump version to 1.4
2016-03-24 16:29:29 -04:00
Rob Speer
75a4a92110
actually use the results of language-detection on Reddit
2016-03-24 16:27:24 -04:00
Rob Speer
164a5b1a05
Merge remote-tracking branch 'origin/master' into big-list
...
Conflicts:
wordfreq_builder/wordfreq_builder/cli/merge_counts.py
2016-03-24 14:11:44 -04:00
Rob Speer
178a8b1494
make max-words a real, documented parameter
2016-03-24 14:10:02 -04:00
Rob Speer
7b539f9057
Merge pull request #33 from LuminosoInsight/bugfix
...
Restore a missing comma.
2016-03-24 13:59:50 -04:00
Andrew Lin
38016cf62b
Restore a missing comma.
2016-03-24 13:57:18 -04:00
Andrew Lin
84497429e1
Merge pull request #32 from LuminosoInsight/thai-fix
...
Leave Thai segments alone in the default regex
2016-03-10 11:57:44 -05:00
Rob Speer
4ec6b56faa
move Thai test to where it makes more sense
2016-03-10 11:56:15 -05:00
Rob Speer
07f16e6f03
Leave Thai segments alone in the default regex
...
Our regex already has a special case to leave Chinese and Japanese alone
when an appropriate tokenizer for the language isn't being used, as
Unicode's default segmentation would make every character into its own
token.
The same thing happens in Thai, and we don't even *have* an appropriate
tokenizer for Thai, so I've added a similar fallback.
2016-02-22 14:32:59 -05:00
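An illustration of the fallback described above, not the actual TOKEN_RE from wordfreq: scripts that Unicode's default word segmentation would split into single characters (Han, Hiragana, Katakana, and now Thai) are matched as one continuous span instead, using the third-party regex module:

    import regex

    FALLBACK_TOKEN_RE = regex.compile(r"""
        [\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Thai}]+ |
        [\p{L}\p{M}\p{N}']+
    """, regex.VERBOSE)

    print(FALLBACK_TOKEN_RE.findall('สวัสดีครับ hello'))
    # ['สวัสดีครับ', 'hello']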
Rob Speer
d79ee37da9
Add and document large wordlists
2016-01-22 16:23:43 -05:00
Rob Speer
c1a12cebec
configuration that builds some larger lists
2016-01-22 14:20:12 -05:00
Rob Speer
9907948d11
add Zipf scale
2016-01-21 14:07:01 -05:00
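The Zipf scale introduced here is a logarithmic frequency scale: a word's Zipf value is the base-10 log of its frequency per billion words, so a word used once per million words has Zipf 3.0. A minimal sketch of the conversion (wordfreq's zipf_frequency function is based on the same arithmetic):

    import math

    def zipf_from_frequency(freq):
        # freq is a proportion of all word occurrences, e.g. 1e-6
        return math.log10(freq) + 9.0

    print(zipf_from_frequency(1e-6))   # 3.0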
slibs63
d18fee3d78
Merge pull request #30 from LuminosoInsight/add-reddit
...
Add English data from Reddit corpus
2016-01-14 15:52:39 -05:00
Rob Speer
8ddc19a5ca
fix documentation in wordfreq_builder.tokenizers
2016-01-13 15:18:12 -05:00
Rob Speer
511fcb6f91
reformat some argparse argument definitions
2016-01-13 12:05:07 -05:00
Rob Speer
df8caaff7d
build a bigger wordlist that we can optionally use
2016-01-12 14:05:57 -05:00
Rob Speer
8d9668d8ab
fix usage text: one comment, not one tweet
2016-01-12 13:05:38 -05:00
Rob Speer
115c74583e
Separate tokens with spaces, not line breaks, in intermediate files
2016-01-12 12:59:18 -05:00
Andrew Lin
f30efebba0
Merge pull request #31 from LuminosoInsight/use_encoding
...
Specify encoding when dealing with files
2015-12-23 16:13:47 -05:00
Sara Jewett
37f9e12b93
Specify encoding when dealing with files
2015-12-23 15:49:13 -05:00
Rob Speer
973caca253
builder: Use an optional cutoff when merging counts
...
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.
2015-12-15 14:44:34 -05:00
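A rough sketch of what an optional cutoff during count merging looks like, not the actual merge_counts.py implementation; the word<TAB>count input format is assumed for illustration. Dropping the long tail of rare tokens is what keeps the merged Reddit counts at a manageable size:

    from collections import Counter

    def merge_count_files(paths, cutoff=0):
        merged = Counter()
        for path in paths:
            with open(path, encoding='utf-8') as infile:
                for line in infile:
                    word, count = line.rstrip('\n').split('\t')
                    merged[word] += int(count)
        if cutoff > 1:
            # discard anything seen fewer than `cutoff` times in total
            merged = Counter({w: c for w, c in merged.items() if c >= cutoff})
        return merged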
Rob Speer
9a5d9d66bb
gzip the intermediate step of Reddit word counting
2015-12-09 13:30:08 -05:00
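For the gzipping step, gzip.open in text mode is essentially a drop-in replacement for open when writing the intermediate count files; the filename and record below are made up for illustration:

    import gzip

    with gzip.open('reddit_counts_intermediate.txt.gz', 'wt', encoding='utf-8') as out:
        out.write('example\t42\n')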
Rob Speer
95f53e295b
no Thai because we can't tokenize it
2015-12-02 12:38:03 -05:00
Rob Speer
8f6cd0e57b
forgot about Italian
2015-11-30 18:18:24 -05:00
Rob Speer
5ef807117d
add tokenizer for Reddit
2015-11-30 18:16:54 -05:00
Rob Speer
2dcf368481
rebuild data files
2015-11-30 17:06:39 -05:00
Rob Speer
b2d7546d2d
add word frequencies from the Reddit 2007-2015 corpus
2015-11-30 16:38:11 -05:00
Rob Speer
e1f7a1ccf3
add docstrings to chinese_ and japanese_tokenize
2015-10-27 13:23:56 -04:00
Lance Nathan
ca00dfa1d9
Merge pull request #28 from LuminosoInsight/chinese-external-wordlist
...
Add some tokenizer options
2015-10-19 18:21:52 -04:00
Rob Speer
a6b6aa07e7
Define globals in relevant places
2015-10-19 18:15:54 -04:00
Rob Speer
bfc17fea9f
clarify the tokenize docstring
2015-10-19 12:18:12 -04:00
Rob Speer
1793c1bb2e
Merge branch 'master' into chinese-external-wordlist
...
Conflicts:
wordfreq/chinese.py
2015-09-28 14:34:59 -04:00
Andrew Lin
15d99be21b
Merge pull request #29 from LuminosoInsight/code-review-notes-20150925
...
Fix documentation and clean up, based on Sep 25 code review
2015-09-28 13:53:50 -04:00
Rob Speer
44b0c4f9ba
Fix documentation and clean up, based on Sep 25 code review
2015-09-28 12:58:46 -04:00
Rob Speer
9b1c4d66cd
fix missing word in rules.ninja comment
2015-09-24 17:56:06 -04:00
Rob Speer
b460eef444
describe optional dependencies better in the README
2015-09-24 17:54:52 -04:00