Commit Graph

169 Commits

Author      SHA1        Message  Date

Rob Speer   16059d3b9a  rename max_size to max_words consistently  2016-03-31 12:55:18 -04:00
Rob Speer   abbc295538  Discard text detected as an uncommon language; add large German list  2016-03-28 12:26:02 -04:00
Rob Speer   08130908c7  oh look, more spam  2016-03-24 18:42:47 -04:00
Rob Speer   5b98794b86  filter out downvoted Reddit posts  2016-03-24 18:05:13 -04:00
Rob Speer   cfe68893fa  disregard Arabic Reddit spam  2016-03-24 17:44:30 -04:00
Rob Speer   6feae99381  fix extraneous dot in intermediate filenames  2016-03-24 16:52:44 -04:00
Rob Speer   75a4a92110  actually use the results of language-detection on Reddit  2016-03-24 16:27:24 -04:00
Rob Speer   164a5b1a05  Merge remote-tracking branch 'origin/master' into big-list  2016-03-24 14:11:44 -04:00
                        Conflicts:
                            wordfreq_builder/wordfreq_builder/cli/merge_counts.py
Rob Speer   178a8b1494  make max-words a real, documented parameter  2016-03-24 14:10:02 -04:00
Andrew Lin  38016cf62b  Restore a missing comma.  2016-03-24 13:57:18 -04:00
Rob Speer   d79ee37da9  Add and document large wordlists  2016-01-22 16:23:43 -05:00
Rob Speer   c1a12cebec  configuration that builds some larger lists  2016-01-22 14:20:12 -05:00
Rob Speer   8ddc19a5ca  fix documentation in wordfreq_builder.tokenizers  2016-01-13 15:18:12 -05:00
Rob Speer   511fcb6f91  reformat some argparse argument definitions  2016-01-13 12:05:07 -05:00
Rob Speer   df8caaff7d  build a bigger wordlist that we can optionally use  2016-01-12 14:05:57 -05:00
Rob Speer   8d9668d8ab  fix usage text: one comment, not one tweet  2016-01-12 13:05:38 -05:00
Rob Speer   115c74583e  Separate tokens with spaces, not line breaks, in intermediate files  2016-01-12 12:59:18 -05:00
Rob Speer   973caca253  builder: Use an optional cutoff when merging counts  2015-12-15 14:44:34 -05:00
                        This allows the Reddit-merging step to not use such a ludicrous amount
                        of memory.
Rob Speer   9a5d9d66bb  gzip the intermediate step of Reddit word counting  2015-12-09 13:30:08 -05:00
Rob Speer   95f53e295b  no Thai because we can't tokenize it  2015-12-02 12:38:03 -05:00
Rob Speer   8f6cd0e57b  forgot about Italian  2015-11-30 18:18:24 -05:00
Rob Speer   5ef807117d  add tokenizer for Reddit  2015-11-30 18:16:54 -05:00
Rob Speer   b2d7546d2d  add word frequencies from the Reddit 2007-2015 corpus  2015-11-30 16:38:11 -05:00
Rob Speer   9b1c4d66cd  fix missing word in rules.ninja comment  2015-09-24 17:56:06 -04:00
Rob Speer   f224b8dbba  describe the use of lang in read_values  2015-09-22 17:22:38 -04:00
Rob Speer   7c12f2aca1  Make the jieba_deps comment make sense  2015-09-22 17:19:00 -04:00
Rob Speer   3cb3061e06  Merge branch 'greek-and-turkish' into chinese-and-more  2015-09-10 15:27:33 -04:00
                        Conflicts:
                            README.md
                            wordfreq_builder/wordfreq_builder/ninja.py
Rob Speer   a4f8d11427  In ninja deps, remove 'startrow' as a variable  2015-09-10 13:46:19 -04:00
Rob Speer   2277ad3116  fix spelling of Marc  2015-09-09 13:35:02 -04:00
Rob Speer   354555514f  fixes based on code review notes  2015-09-09 13:10:18 -04:00
Rob Speer   d9c44d5fcc  take out OpenSubtitles for Chinese  2015-09-08 17:25:05 -04:00
Rob Speer   bc323eccaf  update comments in wordfreq_builder.config; remove unused 'version'  2015-09-08 16:15:29 -04:00
Rob Speer   0ab23f8a28  sort Jieba wordlists consistently; update data files  2015-09-08 16:09:53 -04:00
Rob Speer   bc8ebd23e9  don't do language-specific tokenization in freqs_to_cBpack  2015-09-08 14:46:04 -04:00
                        Tokenizing in the 'merge' step is sufficient.
Rob Speer   715361ca0d  actually fix logic of apostrophe-fixing  2015-09-08 13:50:34 -04:00
Rob Speer   c4c1af8213  fix logic of apostrophe-fixing  2015-09-08 13:47:58 -04:00
Rob Speer   912171f8e7  fix '--language' option definition  2015-09-08 13:27:20 -04:00
Rob Speer   77a9b5c55b  Avoid Chinese tokenizer when building  2015-09-08 12:59:03 -04:00
Rob Speer   9071defb33  language-specific frequency reading; fix 't in English  2015-09-08 12:49:21 -04:00
Rob Speer   20f2828d0a  Merge branch 'apostrophe-fix' into chinese-scripts  2015-09-08 12:29:00 -04:00
                        Conflicts:
                            wordfreq_builder/wordfreq_builder/word_counts.py
Rob Speer   e39d345c4b  WIP: fix apostrophe trimming  2015-09-08 12:28:28 -04:00
Rob Speer   2327f2e4d6  tokenize Chinese using jieba and our own frequencies  2015-09-05 03:16:56 -04:00
Rob Speer   7906a671ea  WIP: Traditional Chinese  2015-09-04 18:52:37 -04:00
Rob Speer   447d7e5134  add Polish and Swedish, which have sufficient data  2015-09-04 17:10:40 -04:00
Rob Speer   5c7a7ea83e  We can put the cutoff back now  2015-09-04 16:16:52 -04:00
                        I took it out when a step in the English SUBTLEX process was outputting
                        frequencies instead of counts, but I've fixed that now.
Rob Speer   56318a3ca3  remove subtlex-gr from README  2015-09-04 16:11:46 -04:00
Rob Speer   77c60c29b0  Use SUBTLEX for German, but OpenSubtitles for Greek  2015-09-04 15:52:21 -04:00
                        In German and Greek, SUBTLEX and Hermit Dave turn out to have been
                        working from the same source data. I looked at the quality of how they
                        processed the data, and chose SUBTLEX for German, and Dave's wordlist
                        for Greek.
Rob Speer   0d3ee869c1  Exclude angle brackets from CLD2 detection  2015-09-04 14:56:06 -04:00
Rob Speer   34474939f2  add more SUBTLEX and fix its build rules  2015-09-04 12:37:35 -04:00
Rob Speer   531db64288  Note on next languages to support  2015-09-04 01:50:15 -04:00