Rob Speer
cfe68893fa
disregard Arabic Reddit spam
2016-03-24 17:44:30 -04:00
Rob Speer
6feae99381
fix extraneous dot in intermediate filenames
2016-03-24 16:52:44 -04:00
Rob Speer
75a4a92110
actually use the results of language-detection on Reddit
2016-03-24 16:27:24 -04:00
Rob Speer
164a5b1a05
Merge remote-tracking branch 'origin/master' into big-list
...
Conflicts:
wordfreq_builder/wordfreq_builder/cli/merge_counts.py
2016-03-24 14:11:44 -04:00
Rob Speer
178a8b1494
make max-words a real, documented parameter
2016-03-24 14:10:02 -04:00
Andrew Lin
38016cf62b
Restore a missing comma.
2016-03-24 13:57:18 -04:00
Rob Speer
d79ee37da9
Add and document large wordlists
2016-01-22 16:23:43 -05:00
Rob Speer
c1a12cebec
configuration that builds some larger lists
2016-01-22 14:20:12 -05:00
Rob Speer
8ddc19a5ca
fix documentation in wordfreq_builder.tokenizers
2016-01-13 15:18:12 -05:00
Rob Speer
511fcb6f91
reformat some argparse argument definitions
2016-01-13 12:05:07 -05:00
Rob Speer
df8caaff7d
build a bigger wordlist that we can optionally use
2016-01-12 14:05:57 -05:00
Rob Speer
8d9668d8ab
fix usage text: one comment, not one tweet
2016-01-12 13:05:38 -05:00
Rob Speer
115c74583e
Separate tokens with spaces, not line breaks, in intermediate files
2016-01-12 12:59:18 -05:00
Rob Speer
973caca253
builder: Use an optional cutoff when merging counts
...
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.
2015-12-15 14:44:34 -05:00
Rob Speer
9a5d9d66bb
gzip the intermediate step of Reddit word counting
2015-12-09 13:30:08 -05:00
Rob Speer
95f53e295b
no Thai because we can't tokenize it
2015-12-02 12:38:03 -05:00
Rob Speer
8f6cd0e57b
forgot about Italian
2015-11-30 18:18:24 -05:00
Rob Speer
5ef807117d
add tokenizer for Reddit
2015-11-30 18:16:54 -05:00
Rob Speer
b2d7546d2d
add word frequencies from the Reddit 2007-2015 corpus
2015-11-30 16:38:11 -05:00
Rob Speer
9b1c4d66cd
fix missing word in rules.ninja comment
2015-09-24 17:56:06 -04:00
Rob Speer
f224b8dbba
describe the use of lang in read_values
2015-09-22 17:22:38 -04:00
Rob Speer
7c12f2aca1
Make the jieba_deps comment make sense
2015-09-22 17:19:00 -04:00
Rob Speer
3cb3061e06
Merge branch 'greek-and-turkish' into chinese-and-more
...
Conflicts:
README.md
wordfreq_builder/wordfreq_builder/ninja.py
2015-09-10 15:27:33 -04:00
Rob Speer
a4f8d11427
In ninja deps, remove 'startrow' as a variable
2015-09-10 13:46:19 -04:00
Rob Speer
2277ad3116
fix spelling of Marc
2015-09-09 13:35:02 -04:00
Rob Speer
354555514f
fixes based on code review notes
2015-09-09 13:10:18 -04:00
Rob Speer
d9c44d5fcc
take out OpenSubtitles for Chinese
2015-09-08 17:25:05 -04:00
Rob Speer
bc323eccaf
update comments in wordfreq_builder.config; remove unused 'version'
2015-09-08 16:15:29 -04:00
Rob Speer
0ab23f8a28
sort Jieba wordlists consistently; update data files
2015-09-08 16:09:53 -04:00
Rob Speer
bc8ebd23e9
don't do language-specific tokenization in freqs_to_cBpack
...
Tokenizing in the 'merge' step is sufficient.
2015-09-08 14:46:04 -04:00
Rob Speer
715361ca0d
actually fix logic of apostrophe-fixing
2015-09-08 13:50:34 -04:00
Rob Speer
c4c1af8213
fix logic of apostrophe-fixing
2015-09-08 13:47:58 -04:00
Rob Speer
912171f8e7
fix '--language' option definition
2015-09-08 13:27:20 -04:00
Rob Speer
77a9b5c55b
Avoid Chinese tokenizer when building
2015-09-08 12:59:03 -04:00
Rob Speer
9071defb33
language-specific frequency reading; fix 't in English
2015-09-08 12:49:21 -04:00
Rob Speer
20f2828d0a
Merge branch 'apostrophe-fix' into chinese-scripts
...
Conflicts:
wordfreq_builder/wordfreq_builder/word_counts.py
2015-09-08 12:29:00 -04:00
Rob Speer
e39d345c4b
WIP: fix apostrophe trimming
2015-09-08 12:28:28 -04:00
Rob Speer
2327f2e4d6
tokenize Chinese using jieba and our own frequencies
2015-09-05 03:16:56 -04:00
Rob Speer
7906a671ea
WIP: Traditional Chinese
2015-09-04 18:52:37 -04:00
Rob Speer
447d7e5134
add Polish and Swedish, which have sufficient data
2015-09-04 17:10:40 -04:00
Rob Speer
5c7a7ea83e
We can put the cutoff back now
...
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.
2015-09-04 16:16:52 -04:00
Rob Speer
56318a3ca3
remove subtlex-gr from README
2015-09-04 16:11:46 -04:00
Rob Speer
77c60c29b0
Use SUBTLEX for German, but OpenSubtitles for Greek
...
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.
2015-09-04 15:52:21 -04:00
Rob Speer
0d3ee869c1
Exclude angle brackets from CLD2 detection
2015-09-04 14:56:06 -04:00
Rob Speer
34474939f2
add more SUBTLEX and fix its build rules
2015-09-04 12:37:35 -04:00
Rob Speer
531db64288
Note on next languages to support
2015-09-04 01:50:15 -04:00
Rob Speer
d94428d454
support Turkish and more Greek; document more
2015-09-04 00:57:04 -04:00
Rob Speer
45d871a815
Merge branch 'add-subtlex' into greek-and-turkish
2015-09-03 23:26:14 -04:00
Rob Speer
40d82541ba
refer to merge_freqs command correctly
2015-09-03 23:25:46 -04:00
Rob Speer
a3daba81eb
expand Greek and enable Turkish in config
2015-09-03 23:23:31 -04:00