447 Commits

Author SHA1 Message Date
Rob Speer
270f6c7ca6 Fix tokenization of SE Asian and South Asian scripts (#37) 2016-07-01 18:00:57 -04:00
Rob Speer
88626aafee wordfreq_builder: Document the extract_reddit pipeline 2016-06-02 15:19:25 -04:00
Andrew Lin
3a6d985203 Merge pull request #35 from LuminosoInsight/big-list-test-fix
fix Arabic test, where 'lol' is no longer common
2016-05-11 17:20:01 -04:00
Rob Speer
da79dfb247 fix Arabic test, where 'lol' is no longer common 2016-05-11 17:01:47 -04:00
Andrew Lin
e7b34fb655 Merge pull request #34 from LuminosoInsight/big-list
wordfreq 1.4: some bigger wordlists, better use of language detection
2016-05-11 16:27:51 -04:00
Rob Speer
dcb77a552b fix to README: we're only using Reddit in English 2016-05-11 15:38:29 -04:00
Rob Speer
2276d97368 limit Reddit data to just English 2016-04-15 17:01:21 -04:00
Rob Speer
ced15d6eff remove reddit_base_filename function 2016-03-31 13:39:13 -04:00
Rob Speer
ff1f0e4678 use path.stem to make the Reddit filename prefix 2016-03-31 13:13:52 -04:00
Rob Speer
16059d3b9a rename max_size to max_words consistently 2016-03-31 12:55:18 -04:00
Rob Speer
697842b3f9 fix table showing marginal Korean support 2016-03-30 15:11:13 -04:00
Rob Speer
ed32b278cc make an example clearer with wordlist='large' 2016-03-30 15:08:32 -04:00
Rob Speer
a10c1d7ac0 update wordlists for new builder settings 2016-03-28 12:26:47 -04:00
Rob Speer
abbc295538 Discard text detected as an uncommon language; add large German list 2016-03-28 12:26:02 -04:00
Rob Speer
08130908c7 oh look, more spam 2016-03-24 18:42:47 -04:00
Rob Speer
5b98794b86 filter out downvoted Reddit posts 2016-03-24 18:05:13 -04:00
Rob Speer
cfe68893fa disregard Arabic Reddit spam 2016-03-24 17:44:30 -04:00
Rob Speer
6feae99381 fix extraneous dot in intermediate filenames 2016-03-24 16:52:44 -04:00
Rob Speer
1df97a579e bump version to 1.4 2016-03-24 16:29:29 -04:00
Rob Speer
75a4a92110 actually use the results of language-detection on Reddit 2016-03-24 16:27:24 -04:00
Rob Speer
164a5b1a05 Merge remote-tracking branch 'origin/master' into big-list
Conflicts:
	wordfreq_builder/wordfreq_builder/cli/merge_counts.py
2016-03-24 14:11:44 -04:00
Rob Speer
178a8b1494 make max-words a real, documented parameter 2016-03-24 14:10:02 -04:00
Rob Speer
7b539f9057 Merge pull request #33 from LuminosoInsight/bugfix
Restore a missing comma.
2016-03-24 13:59:50 -04:00
Andrew Lin
38016cf62b Restore a missing comma. 2016-03-24 13:57:18 -04:00
Andrew Lin
84497429e1 Merge pull request #32 from LuminosoInsight/thai-fix
Leave Thai segments alone in the default regex
2016-03-10 11:57:44 -05:00
Rob Speer
4ec6b56faa move Thai test to where it makes more sense 2016-03-10 11:56:15 -05:00
Rob Speer
07f16e6f03 Leave Thai segments alone in the default regex
Our regex already has a special case to leave Chinese and Japanese alone
when an appropriate tokenizer for the language isn't being used, as
Unicode's default segmentation would make every character into its own
token.

The same thing happens in Thai, and we don't even *have* an appropriate
tokenizer for Thai, so I've added a similar fallback.
2016-02-22 14:32:59 -05:00
Rob Speer
d79ee37da9 Add and document large wordlists 2016-01-22 16:23:43 -05:00
Rob Speer
c1a12cebec configuration that builds some larger lists 2016-01-22 14:20:12 -05:00
Rob Speer
9907948d11 add Zipf scale 2016-01-21 14:07:01 -05:00
slibs63
d18fee3d78 Merge pull request #30 from LuminosoInsight/add-reddit
Add English data from Reddit corpus
2016-01-14 15:52:39 -05:00
Rob Speer
8ddc19a5ca fix documentation in wordfreq_builder.tokenizers 2016-01-13 15:18:12 -05:00
Rob Speer
511fcb6f91 reformat some argparse argument definitions 2016-01-13 12:05:07 -05:00
Rob Speer
df8caaff7d build a bigger wordlist that we can optionally use 2016-01-12 14:05:57 -05:00
Rob Speer
8d9668d8ab fix usage text: one comment, not one tweet 2016-01-12 13:05:38 -05:00
Rob Speer
115c74583e Separate tokens with spaces, not line breaks, in intermediate files 2016-01-12 12:59:18 -05:00
Andrew Lin
f30efebba0 Merge pull request #31 from LuminosoInsight/use_encoding
Specify encoding when dealing with files
2015-12-23 16:13:47 -05:00
Sara Jewett
37f9e12b93 Specify encoding when dealing with files 2015-12-23 15:49:13 -05:00
Rob Speer
973caca253 builder: Use an optional cutoff when merging counts
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.
2015-12-15 14:44:34 -05:00
Rob Speer
9a5d9d66bb gzip the intermediate step of Reddit word counting 2015-12-09 13:30:08 -05:00
Rob Speer
95f53e295b no Thai because we can't tokenize it 2015-12-02 12:38:03 -05:00
Rob Speer
8f6cd0e57b forgot about Italian 2015-11-30 18:18:24 -05:00
Rob Speer
5ef807117d add tokenizer for Reddit 2015-11-30 18:16:54 -05:00
Rob Speer
2dcf368481 rebuild data files 2015-11-30 17:06:39 -05:00
Rob Speer
b2d7546d2d add word frequencies from the Reddit 2007-2015 corpus 2015-11-30 16:38:11 -05:00
Rob Speer
e1f7a1ccf3 add docstrings to chinese_ and japanese_tokenize 2015-10-27 13:23:56 -04:00
Lance Nathan
ca00dfa1d9 Merge pull request #28 from LuminosoInsight/chinese-external-wordlist
Add some tokenizer options
2015-10-19 18:21:52 -04:00
Rob Speer
a6b6aa07e7 Define globals in relevant places 2015-10-19 18:15:54 -04:00
Rob Speer
bfc17fea9f clarify the tokenize docstring 2015-10-19 12:18:12 -04:00
Rob Speer
1793c1bb2e Merge branch 'master' into chinese-external-wordlist
Conflicts:
	wordfreq/chinese.py
2015-09-28 14:34:59 -04:00