Commit Graph

156 Commits

Rob Speer
8ddc19a5ca fix documentation in wordfreq_builder.tokenizers 2016-01-13 15:18:12 -05:00
Rob Speer
511fcb6f91 reformat some argparse argument definitions 2016-01-13 12:05:07 -05:00
Rob Speer
8d9668d8ab fix usage text: one comment, not one tweet 2016-01-12 13:05:38 -05:00
Rob Speer
115c74583e Separate tokens with spaces, not line breaks, in intermediate files 2016-01-12 12:59:18 -05:00
Rob Speer
973caca253 builder: Use an optional cutoff when merging counts
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.
2015-12-15 14:44:34 -05:00
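The cutoff described above can be sketched as follows. This is a hypothetical illustration, not the actual `wordfreq_builder` code: merging word counts from many shards and discarding words below an optional minimum count keeps the merged mapping (and memory use) bounded.

```python
from collections import Counter


def merge_counts(count_dicts, cutoff=0):
    # Hypothetical sketch: merge several word-count mappings into one.
    # Words whose merged count falls below `cutoff` are dropped, which
    # bounds memory when merging counts from a very large corpus.
    merged = Counter()
    for counts in count_dicts:
        merged.update(counts)
    return {word: n for word, n in merged.items() if n >= cutoff}
```

With `cutoff=0` (the default) nothing is dropped, so the step behaves as before; passing a positive cutoff trades a long tail of rare words for a much smaller result.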
Rob Speer
9a5d9d66bb gzip the intermediate step of Reddit word counting 2015-12-09 13:30:08 -05:00
Rob Speer
95f53e295b no Thai because we can't tokenize it 2015-12-02 12:38:03 -05:00
Rob Speer
8f6cd0e57b forgot about Italian 2015-11-30 18:18:24 -05:00
Rob Speer
5ef807117d add tokenizer for Reddit 2015-11-30 18:16:54 -05:00
Rob Speer
b2d7546d2d add word frequencies from the Reddit 2007-2015 corpus 2015-11-30 16:38:11 -05:00
Rob Speer
9b1c4d66cd fix missing word in rules.ninja comment 2015-09-24 17:56:06 -04:00
Rob Speer
f224b8dbba describe the use of lang in read_values 2015-09-22 17:22:38 -04:00
Rob Speer
7c12f2aca1 Make the jieba_deps comment make sense 2015-09-22 17:19:00 -04:00
Rob Speer
3cb3061e06 Merge branch 'greek-and-turkish' into chinese-and-more
Conflicts:
	README.md
	wordfreq_builder/wordfreq_builder/ninja.py
2015-09-10 15:27:33 -04:00
Rob Speer
a4f8d11427 In ninja deps, remove 'startrow' as a variable 2015-09-10 13:46:19 -04:00
Rob Speer
2277ad3116 fix spelling of Marc 2015-09-09 13:35:02 -04:00
Rob Speer
354555514f fixes based on code review notes 2015-09-09 13:10:18 -04:00
Rob Speer
d9c44d5fcc take out OpenSubtitles for Chinese 2015-09-08 17:25:05 -04:00
Rob Speer
bc323eccaf update comments in wordfreq_builder.config; remove unused 'version' 2015-09-08 16:15:29 -04:00
Rob Speer
0ab23f8a28 sort Jieba wordlists consistently; update data files 2015-09-08 16:09:53 -04:00
Rob Speer
bc8ebd23e9 don't do language-specific tokenization in freqs_to_cBpack
Tokenizing in the 'merge' step is sufficient.
2015-09-08 14:46:04 -04:00
Rob Speer
715361ca0d actually fix logic of apostrophe-fixing 2015-09-08 13:50:34 -04:00
Rob Speer
c4c1af8213 fix logic of apostrophe-fixing 2015-09-08 13:47:58 -04:00
Rob Speer
912171f8e7 fix '--language' option definition 2015-09-08 13:27:20 -04:00
Rob Speer
77a9b5c55b Avoid Chinese tokenizer when building 2015-09-08 12:59:03 -04:00
Rob Speer
9071defb33 language-specific frequency reading; fix 't in English 2015-09-08 12:49:21 -04:00
Rob Speer
20f2828d0a Merge branch 'apostrophe-fix' into chinese-scripts
Conflicts:
	wordfreq_builder/wordfreq_builder/word_counts.py
2015-09-08 12:29:00 -04:00
Rob Speer
e39d345c4b WIP: fix apostrophe trimming 2015-09-08 12:28:28 -04:00
Rob Speer
2327f2e4d6 tokenize Chinese using jieba and our own frequencies 2015-09-05 03:16:56 -04:00
Rob Speer
7906a671ea WIP: Traditional Chinese 2015-09-04 18:52:37 -04:00
Rob Speer
447d7e5134 add Polish and Swedish, which have sufficient data 2015-09-04 17:10:40 -04:00
Rob Speer
5c7a7ea83e We can put the cutoff back now
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.
2015-09-04 16:16:52 -04:00
Rob Speer
56318a3ca3 remove subtlex-gr from README 2015-09-04 16:11:46 -04:00
Rob Speer
77c60c29b0 Use SUBTLEX for German, but OpenSubtitles for Greek
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.
2015-09-04 15:52:21 -04:00
Rob Speer
0d3ee869c1 Exclude angle brackets from CLD2 detection 2015-09-04 14:56:06 -04:00
Rob Speer
34474939f2 add more SUBTLEX and fix its build rules 2015-09-04 12:37:35 -04:00
Rob Speer
531db64288 Note on next languages to support 2015-09-04 01:50:15 -04:00
Rob Speer
d94428d454 support Turkish and more Greek; document more 2015-09-04 00:57:04 -04:00
Rob Speer
45d871a815 Merge branch 'add-subtlex' into greek-and-turkish 2015-09-03 23:26:14 -04:00
Rob Speer
40d82541ba refer to merge_freqs command correctly 2015-09-03 23:25:46 -04:00
Rob Speer
a3daba81eb expand Greek and enable Turkish in config 2015-09-03 23:23:31 -04:00
Rob Speer
2d58ba94f2 Add SUBTLEX as a source of English and Chinese data
Meanwhile, fix up the dependency graph thingy. It's actually kind of
legible now.
2015-09-03 18:13:13 -04:00
Rob Speer
5def3a7897 update the build diagram and its script 2015-08-28 17:47:04 -04:00
Rob Speer
c4a2594217 fix URL expression 2015-08-26 15:00:46 -04:00
Rob Speer
a893823d6e un-flake wordfreq_builder.tokenizers, and edit docstrings 2015-08-26 13:03:23 -04:00
Rob Speer
5a1fc00aaa Strip apostrophes from edges of tokens
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.
2015-08-25 12:41:48 -04:00
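The idempotence argument in this commit message can be sketched in a few lines. This is an illustrative stand-in, not the tokenizer in `wordfreq_builder.tokenizers`: stripping apostrophes only from the edges of a token means that re-tokenizing an already-processed token like "d'" cannot change it again.

```python
def strip_edge_apostrophes(token):
    # Hypothetical sketch: remove apostrophes from the edges of a token,
    # leaving internal apostrophes (as in "don't") untouched. Applying
    # the function twice gives the same result as applying it once.
    return token.strip("'")
```

So the French token "d'" (split off from "d'un") becomes "d", and processing "d" again leaves it as "d", which is the idempotence the commit is after.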
Rob Speer
de73888a76 use better regexes in wordfreq_builder tokenizer 2015-08-24 19:05:46 -04:00
Rob Speer
140ca6c050 remove Hangul fillers that confuse cld2 2015-08-24 17:11:18 -04:00
Andrew Lin
6d40912ef9 Stylistic cleanups to word_counts.py. 2015-07-31 19:26:18 -04:00
Andrew Lin
53621c34df Remove redundant reference to wikipedia in builder README. 2015-07-31 19:12:59 -04:00