168 Commits

Author SHA1 Message Date
Robyn Speer
0c7527140c Discard text detected as an uncommon language; add large German list
Former-commit-id: abbc295538
2016-03-28 12:26:02 -04:00
Robyn Speer
aa7802b552 oh look, more spam
Former-commit-id: 08130908c7
2016-03-24 18:42:47 -04:00
Robyn Speer
2840ca55aa filter out downvoted Reddit posts
Former-commit-id: 5b98794b86
2016-03-24 18:05:13 -04:00
Robyn Speer
16841d4b0c disregard Arabic Reddit spam
Former-commit-id: cfe68893fa
2016-03-24 17:44:30 -04:00
Robyn Speer
034d8f540b fix extraneous dot in intermediate filenames
Former-commit-id: 6feae99381
2016-03-24 16:52:44 -04:00
Robyn Speer
969a024dea actually use the results of language-detection on Reddit
Former-commit-id: 75a4a92110
2016-03-24 16:27:24 -04:00
Robyn Speer
fbc19995ab Merge remote-tracking branch 'origin/master' into big-list
Conflicts:
	wordfreq_builder/wordfreq_builder/cli/merge_counts.py

Former-commit-id: 164a5b1a05
2016-03-24 14:11:44 -04:00
Robyn Speer
f493d0eec4 make max-words a real, documented parameter
Former-commit-id: 178a8b1494
2016-03-24 14:10:02 -04:00
Andrew Lin
1942bc690f Restore a missing comma.
Former-commit-id: 38016cf62b
2016-03-24 13:57:18 -04:00
Robyn Speer
6344b38194 Add and document large wordlists
Former-commit-id: d79ee37da9
2016-01-22 16:23:43 -05:00
Robyn Speer
12e779fc79 configuration that builds some larger lists
Former-commit-id: c1a12cebec
2016-01-22 14:20:12 -05:00
Robyn Speer
6eca3cff5a fix documentation in wordfreq_builder.tokenizers
Former-commit-id: 8ddc19a5ca
2016-01-13 15:18:12 -05:00
Robyn Speer
95cdf41fe8 reformat some argparse argument definitions
Former-commit-id: 511fcb6f91
2016-01-13 12:05:07 -05:00
Robyn Speer
738243e244 build a bigger wordlist that we can optionally use
Former-commit-id: df8caaff7d
2016-01-12 14:05:57 -05:00
Robyn Speer
2069e30c89 fix usage text: one comment, not one tweet
Former-commit-id: 8d9668d8ab
2016-01-12 13:05:38 -05:00
Robyn Speer
883aa5baeb Separate tokens with spaces, not line breaks, in intermediate files
Former-commit-id: 115c74583e
2016-01-12 12:59:18 -05:00
Robyn Speer
7d1719cfb4 builder: Use an optional cutoff when merging counts
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.

Former-commit-id: 973caca253
2015-12-15 14:44:34 -05:00
Robyn Speer
f5e09f3f3d gzip the intermediate step of Reddit word counting
Former-commit-id: 9a5d9d66bb
2015-12-09 13:30:08 -05:00
Robyn Speer
682e08fee2 no Thai because we can't tokenize it
Former-commit-id: 95f53e295b
2015-12-02 12:38:03 -05:00
Robyn Speer
064ee22a33 forgot about Italian
Former-commit-id: 8f6cd0e57b
2015-11-30 18:18:24 -05:00
Robyn Speer
ab8c2e2331 add tokenizer for Reddit
Former-commit-id: 5ef807117d
2015-11-30 18:16:54 -05:00
Robyn Speer
6d2709f064 add word frequencies from the Reddit 2007-2015 corpus
Former-commit-id: b2d7546d2d
2015-11-30 16:38:11 -05:00
Robyn Speer
7494ae27a7 fix missing word in rules.ninja comment
Former-commit-id: 9b1c4d66cd
2015-09-24 17:56:06 -04:00
Robyn Speer
d215f79ea3 describe the use of lang in read_values
Former-commit-id: f224b8dbba
2015-09-22 17:22:38 -04:00
Robyn Speer
e6e29a1c03 Make the jieba_deps comment make sense
Former-commit-id: 7c12f2aca1
2015-09-22 17:19:00 -04:00
Robyn Speer
f2be213933 Merge branch 'greek-and-turkish' into chinese-and-more
Conflicts:
	README.md
	wordfreq_builder/wordfreq_builder/ninja.py

Former-commit-id: 3cb3061e06
2015-09-10 15:27:33 -04:00
Robyn Speer
c5d5b0b1fe In ninja deps, remove 'startrow' as a variable
Former-commit-id: a4f8d11427
2015-09-10 13:46:19 -04:00
Robyn Speer
acddc3ca05 fix spelling of Marc
Former-commit-id: 2277ad3116
2015-09-09 13:35:02 -04:00
Robyn Speer
872556f7bb fixes based on code review notes
Former-commit-id: 354555514f
2015-09-09 13:10:18 -04:00
Robyn Speer
1d3521dfda take out OpenSubtitles for Chinese
Former-commit-id: d9c44d5fcc
2015-09-08 17:25:05 -04:00
Robyn Speer
59363c8c44 update comments in wordfreq_builder.config; remove unused 'version'
Former-commit-id: bc323eccaf
2015-09-08 16:15:29 -04:00
Robyn Speer
48f9d4520c sort Jieba wordlists consistently; update data files
Former-commit-id: 0ab23f8a28
2015-09-08 16:09:53 -04:00
Robyn Speer
4aef1dc338 don't do language-specific tokenization in freqs_to_cBpack
Tokenizing in the 'merge' step is sufficient.

Former-commit-id: bc8ebd23e9
2015-09-08 14:46:04 -04:00
Robyn Speer
64b0b76ee1 actually fix logic of apostrophe-fixing
Former-commit-id: 715361ca0d
2015-09-08 13:50:34 -04:00
Robyn Speer
d6d2eac920 fix logic of apostrophe-fixing
Former-commit-id: c4c1af8213
2015-09-08 13:47:58 -04:00
Robyn Speer
523806d6db fix '--language' option definition
Former-commit-id: 912171f8e7
2015-09-08 13:27:20 -04:00
Robyn Speer
099d90b700 Avoid Chinese tokenizer when building
Former-commit-id: 77a9b5c55b
2015-09-08 12:59:03 -04:00
Robyn Speer
3fa14ded28 language-specific frequency reading; fix 't in English
Former-commit-id: 9071defb33
2015-09-08 12:49:21 -04:00
Robyn Speer
1b35ff6b4c Merge branch 'apostrophe-fix' into chinese-scripts
Conflicts:
	wordfreq_builder/wordfreq_builder/word_counts.py

Former-commit-id: 20f2828d0a
2015-09-08 12:29:00 -04:00
Robyn Speer
319c3abaab WIP: fix apostrophe trimming
Former-commit-id: e39d345c4b
2015-09-08 12:28:28 -04:00
Robyn Speer
a4554fb87c tokenize Chinese using jieba and our own frequencies
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Robyn Speer
7d1c2e72e4 WIP: Traditional Chinese
Former-commit-id: 7906a671ea
2015-09-04 18:52:37 -04:00
Robyn Speer
5b9b2d2d02 add Polish and Swedish, which have sufficient data
Former-commit-id: 447d7e5134
2015-09-04 17:10:40 -04:00
Robyn Speer
a75a95658b We can put the cutoff back now
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.

Former-commit-id: 5c7a7ea83e
2015-09-04 16:16:52 -04:00
Robyn Speer
f330d6d130 remove subtlex-gr from README
Former-commit-id: 56318a3ca3
2015-09-04 16:11:46 -04:00
Robyn Speer
8277b34571 Use SUBTLEX for German, but OpenSubtitles for Greek
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.

Former-commit-id: 77c60c29b0
2015-09-04 15:52:21 -04:00
Robyn Speer
a69b66b210 Exclude angle brackets from CLD2 detection
Former-commit-id: 0d3ee869c1
2015-09-04 14:56:06 -04:00
Robyn Speer
d0ada70355 add more SUBTLEX and fix its build rules
Former-commit-id: 34474939f2
2015-09-04 12:37:35 -04:00
Robyn Speer
14136d2a01 Note on next languages to support
Former-commit-id: 531db64288
2015-09-04 01:50:15 -04:00
Robyn Speer
574c383202 support Turkish and more Greek; document more
Former-commit-id: d94428d454
2015-09-04 00:57:04 -04:00