Rob Speer
cb5b696ffa
Add SUBTLEX as a source of English and Chinese data
...
Meanwhile, fix up the dependency graph thingy. It's actually kind of
legible now.
Former-commit-id: 2d58ba94f2
2015-09-03 18:13:13 -04:00
Rob Speer
4aac7bdd65
update the build diagram and its script
...
Former-commit-id: 5def3a7897
2015-08-28 17:47:04 -04:00
Rob Speer
49bd631632
fix URL expression
...
Former-commit-id: c4a2594217
2015-08-26 15:00:46 -04:00
Rob Speer
40d6b85d67
un-flake wordfreq_builder.tokenizers, and edit docstrings
...
Former-commit-id: a893823d6e
2015-08-26 13:03:23 -04:00
Rob Speer
a3b37f6619
Strip apostrophes from edges of tokens
...
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.
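
A minimal sketch of the idempotence issue described above. It does not use
the real wordfreq_builder tokenizer: `TOKEN_RE`, `naive_tokenize`,
`strip_tokenize`, and `is_stable` are hypothetical names, and the regex is
only an assumption chosen to reproduce the "d'un" behavior.

```python
import re

# Hypothetical pattern (an assumption, not wordfreq_builder's regex):
# a run of letters keeps a trailing apostrophe only when another letter
# follows it; otherwise the token is just the run of letters.
LETTER = r"[^\W\d_]"
TOKEN_RE = re.compile(rf"{LETTER}+'(?={LETTER})|{LETTER}+")


def naive_tokenize(text):
    """Tokenize without any clean-up of apostrophes at token edges."""
    return TOKEN_RE.findall(text.lower())


def strip_tokenize(text):
    """Tokenize, then strip apostrophes from the edges of each token."""
    stripped = (token.strip("'") for token in naive_tokenize(text))
    return [token for token in stripped if token]


def is_stable(tokenizer, text):
    """True if re-tokenizing each output token yields that same token."""
    return all(tokenizer(token) == [token] for token in tokenizer(text))


if __name__ == "__main__":
    print(naive_tokenize("d'un"))             # first pass keeps "d'"
    print(naive_tokenize("d'"))               # second pass changes it
    print(strip_tokenize("d'un"))             # edges stripped up front
    print(is_stable(naive_tokenize, "d'un"))
    print(is_stable(strip_tokenize, "d'un"))
```

With the stripped variant, a token that is fed back through the tokenizer
comes out unchanged, which is the property the commit is after.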
Former-commit-id: 5a1fc00aaa
2015-08-25 12:41:48 -04:00
Rob Speer
6647cf9035
use better regexes in wordfreq_builder tokenizer
...
Former-commit-id: de73888a76
2015-08-24 19:05:46 -04:00
Rob Speer
6a33b46cfd
remove Hangul fillers that confuse cld2
...
Former-commit-id: 140ca6c050
2015-08-24 17:11:18 -04:00
Andrew Lin
581dcbcae5
Stylistic cleanups to word_counts.py.
...
Former-commit-id: 6d40912ef9
2015-07-31 19:26:18 -04:00
Andrew Lin
f393086253
Remove redundant reference to wikipedia in builder README.
...
Former-commit-id: 53621c34df
2015-07-31 19:12:59 -04:00
Rob Speer
0f0aca8320
Don't use the file-reading cutoff when writing centibels
...
Former-commit-id: e9f9c94e36
2015-07-28 18:45:26 -04:00
Rob Speer
4350bc3ed7
put back the freqs_to_cBpack cutoff; prepare for 1.0
...
Former-commit-id: c5708b24e4
2015-07-28 18:01:12 -04:00
Rob Speer
b537f4ecfb
Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17
...
Code review fixes 2015 07 17
Former-commit-id: 32102ba3c2
2015-07-22 15:09:00 -04:00
Joshua Chin
8004ecb790
updated read_freqs docs
...
Former-commit-id: 93cd902899
2015-07-22 10:06:16 -04:00
Joshua Chin
0d8bf35fab
fixed style
...
Former-commit-id: 4fe9d110e1
2015-07-22 10:05:11 -04:00
Joshua Chin
78324e74eb
reordered command line args
...
Former-commit-id: 6453d864c4
2015-07-22 10:04:14 -04:00
Joshua Chin
6f47f76458
bugfix
...
Former-commit-id: 8081145922
2015-07-21 10:12:56 -04:00
Joshua Chin
0a2f2877af
fixed rules.ninja
...
Former-commit-id: c5f82ecac1
2015-07-20 17:20:29 -04:00
Joshua Chin
c1f56f5c96
fixed build bug
...
Former-commit-id: 643571c69c
2015-07-20 16:51:25 -04:00
Joshua Chin
423b2d8443
ensure removal of tatweels (hopefully)
...
Former-commit-id: 173278fdd3
2015-07-20 16:48:36 -04:00
Joshua Chin
efe7bc3720
unhoisted if statement
...
Former-commit-id: 298d3c1d24
2015-07-20 11:10:41 -04:00
Joshua Chin
b5a358012b
ninja.py is now pep8 compliant
...
Former-commit-id: accb7e398c
2015-07-20 11:06:58 -04:00
Joshua Chin
a3880608b9
fixed build
...
Former-commit-id: 221acf7921
2015-07-17 17:44:01 -04:00
Rob Speer
176223bd5d
mention the Wikipedia data, and credit Hermit Dave
...
Former-commit-id: 2d1020daac
2015-07-17 17:09:36 -04:00
Joshua Chin
c3a14a8a09
fixed tokenize_twitter
...
Former-commit-id: f31f9a1bcd
2015-07-17 16:37:47 -04:00
Joshua Chin
af73f813be
added cld2 tokenizer comments
...
Former-commit-id: a44927e98e
2015-07-17 16:03:33 -04:00
Joshua Chin
5c7e0dd0dd
fix arabic tokens
...
Former-commit-id: 11a1c51321
2015-07-17 15:52:12 -04:00
Joshua Chin
a868c99839
fixed syntax
...
Former-commit-id: c75c735d8d
2015-07-17 15:43:24 -04:00
Joshua Chin
f2546d8d33
renamed tokenize file to tokenize twitter
...
Former-commit-id: 303bd88ba2
2015-07-17 15:27:26 -04:00
Joshua Chin
d3a5191fb0
created last_tab flag
...
Former-commit-id: d6519cf736
2015-07-17 15:19:09 -04:00
Joshua Chin
4b81f8c938
removed unnecessary if statement
...
Former-commit-id: 620becb7e8
2015-07-17 15:14:06 -04:00
Joshua Chin
9812a2a08c
generated freq dict in place
...
Former-commit-id: d988b1b42e
2015-07-17 15:13:25 -04:00
Joshua Chin
53dd1e91c5
corrected docstring
...
Former-commit-id: e37c689031
2015-07-17 15:12:23 -04:00
Joshua Chin
bb706b65f4
removed unnecessary strip
...
Former-commit-id: 002351bace
2015-07-17 15:11:28 -04:00
Joshua Chin
919f2f5912
moved last_tab to tokenize_twitter
...
Former-commit-id: 7fc23666a9
2015-07-17 15:10:17 -04:00
Joshua Chin
4e87458242
removed unused function
...
Former-commit-id: 528285a982
2015-07-17 15:03:14 -04:00
Joshua Chin
8dd4ffee8a
fixed spacing
...
Former-commit-id: 59d3c72758
2015-07-17 15:02:34 -04:00
Joshua Chin
09dff0186c
removed unnecessary format
...
Former-commit-id: 10028be212
2015-07-17 15:01:25 -04:00
Joshua Chin
117e06d5a4
cleaned up BAD_CHAR_RANGE
...
Former-commit-id: 3b368b66dd
2015-07-17 15:00:59 -04:00
Joshua Chin
cc2f748b05
moved test tokenizers
...
Former-commit-id: c2d1cdcb31
2015-07-17 14:58:58 -04:00
Joshua Chin
2180f71296
added docstring and moved to scripts
...
Former-commit-id: 5d26c9f57f
2015-07-17 14:56:18 -04:00
Joshua Chin
2335369f86
style changes
...
Former-commit-id: bdc791af8f
2015-07-17 14:54:32 -04:00
Joshua Chin
6083219fe5
removed bad comment
...
Former-commit-id: 4d5ec57144
2015-07-17 14:54:09 -04:00
Joshua Chin
4fa4060036
removed unused scripts
...
Former-commit-id: 39f01b0485
2015-07-17 14:53:18 -04:00
Joshua Chin
631a5f1b71
removed mkdir -p for many cases
...
Former-commit-id: 98a7a8093b
2015-07-17 14:45:22 -04:00
Joshua Chin
bc4cedf85a
removed TOKENIZE_TWITTER
...
Former-commit-id: 449a656edd
2015-07-17 14:43:14 -04:00
Joshua Chin
c80943c677
removed TOKENIZE_TWITTER option
...
Former-commit-id: 00e18b7d4b
2015-07-17 14:40:49 -04:00
Joshua Chin
753d241b6a
more README fixes
...
Former-commit-id: 772c0cddd1
2015-07-17 14:40:33 -04:00
Joshua Chin
0f92367e3d
fixed README
...
Former-commit-id: 0a085132f4
2015-07-17 14:35:43 -04:00
Rob Speer
7f9b7bb5d0
update the wordfreq_builder README
...
Former-commit-id: 8633e8c2a9
2015-07-13 11:58:48 -04:00
Rob Speer
41dba74da2
add docstrings and remove some brackets
2015-07-07 18:22:51 -04:00