Commit Graph

126 Commits

Author SHA1 Message Date
Rob Speer
a555e5dc13 add Polish and Swedish, which have sufficient data
Former-commit-id: 447d7e5134
2015-09-04 17:10:40 -04:00
Rob Speer
0441a81bbe We can put the cutoff back now
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.

Former-commit-id: 5c7a7ea83e
2015-09-04 16:16:52 -04:00
Rob Speer
917ce398a2 remove subtlex-gr from README
Former-commit-id: 56318a3ca3
2015-09-04 16:11:46 -04:00
Rob Speer
c08e593234 Use SUBTLEX for German, but OpenSubtitles for Greek
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.

Former-commit-id: 77c60c29b0
2015-09-04 15:52:21 -04:00
Rob Speer
3a8b2c2c81 Exclude angle brackets from CLD2 detection
Former-commit-id: 0d3ee869c1
2015-09-04 14:56:06 -04:00
Rob Speer
b1d158ab41 add more SUBTLEX and fix its build rules
Former-commit-id: 34474939f2
2015-09-04 12:37:35 -04:00
Rob Speer
25e24f9c32 Note on next languages to support
Former-commit-id: 531db64288
2015-09-04 01:50:15 -04:00
Rob Speer
a6ef3224a6 support Turkish and more Greek; document more
Former-commit-id: d94428d454
2015-09-04 00:57:04 -04:00
Rob Speer
89763679de Merge branch 'add-subtlex' into greek-and-turkish
Former-commit-id: 45d871a815
2015-09-03 23:26:14 -04:00
Rob Speer
ad4b12bee9 refer to merge_freqs command correctly
Former-commit-id: 40d82541ba
2015-09-03 23:25:46 -04:00
Rob Speer
7a2f2035ab expand Greek and enable Turkish in config
Former-commit-id: a3daba81eb
2015-09-03 23:23:31 -04:00
Rob Speer
cb5b696ffa Add SUBTLEX as a source of English and Chinese data
Meanwhile, fix up the dependency graph thingy. It's actually kind of
legible now.

Former-commit-id: 2d58ba94f2
2015-09-03 18:13:13 -04:00
Rob Speer
4aac7bdd65 update the build diagram and its script
Former-commit-id: 5def3a7897
2015-08-28 17:47:04 -04:00
Rob Speer
49bd631632 fix URL expression
Former-commit-id: c4a2594217
2015-08-26 15:00:46 -04:00
Rob Speer
40d6b85d67 un-flake wordfreq_builder.tokenizers, and edit docstrings
Former-commit-id: a893823d6e
2015-08-26 13:03:23 -04:00
Rob Speer
a3b37f6619 Strip apostrophes from edges of tokens
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.

Former-commit-id: 5a1fc00aaa
2015-08-25 12:41:48 -04:00
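
The idempotence problem described in commit a3b37f6619 can be illustrated with a small sketch. This is not the wordfreq_builder tokenizer: the regular expression and the names TOKEN_RE, tokenize, and is_idempotent are assumptions made up for this example. The pattern is chosen only so that it reproduces the behavior the message describes, keeping an apostrophe when a letter follows it ("d'un" gives "d'" and "un") and dropping it otherwise ("d'" alone gives "d").

```python
import re

# Hypothetical pattern, not the project's real one: a run of word characters
# keeps a trailing apostrophe only when another word character follows it.
TOKEN_RE = re.compile(r"\w+'(?=\w)|\w+")


def tokenize(text, strip_apostrophes):
    """Split text into tokens, optionally stripping apostrophes from token edges."""
    tokens = TOKEN_RE.findall(text)
    if strip_apostrophes:
        tokens = [tok.strip("'") for tok in tokens]
    return tokens


def is_idempotent(text, strip_apostrophes):
    """Check whether re-tokenizing each token gives back the same token list."""
    first = tokenize(text, strip_apostrophes)
    second = [t for tok in first for t in tokenize(tok, strip_apostrophes)]
    return first == second


print(tokenize("d'un", strip_apostrophes=False))
print(tokenize("d'un", strip_apostrophes=True))
print(is_idempotent("d'un", strip_apostrophes=False))
print(is_idempotent("d'un", strip_apostrophes=True))
```

Without stripping, the first pass yields "d'" but a second pass over that token yields "d", so the two passes disagree. With stripping, both passes yield "d" and "un".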
Rob Speer
6647cf9035 use better regexes in wordfreq_builder tokenizer
Former-commit-id: de73888a76
2015-08-24 19:05:46 -04:00
Rob Speer
6a33b46cfd remove Hangul fillers that confuse cld2
Former-commit-id: 140ca6c050
2015-08-24 17:11:18 -04:00
Andrew Lin
581dcbcae5 Stylistic cleanups to word_counts.py.
Former-commit-id: 6d40912ef9
2015-07-31 19:26:18 -04:00
Andrew Lin
f393086253 Remove redundant reference to wikipedia in builder README.
Former-commit-id: 53621c34df
2015-07-31 19:12:59 -04:00
Rob Speer
0f0aca8320 Don't use the file-reading cutoff when writing centibels
Former-commit-id: e9f9c94e36
2015-07-28 18:45:26 -04:00
Rob Speer
4350bc3ed7 put back the freqs_to_cBpack cutoff; prepare for 1.0
Former-commit-id: c5708b24e4
2015-07-28 18:01:12 -04:00
Rob Speer
b537f4ecfb Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17
Code review fixes 2015 07 17

Former-commit-id: 32102ba3c2
2015-07-22 15:09:00 -04:00
Joshua Chin
8004ecb790 updated read_freqs docs
Former-commit-id: 93cd902899
2015-07-22 10:06:16 -04:00
Joshua Chin
0d8bf35fab fixed style
Former-commit-id: 4fe9d110e1
2015-07-22 10:05:11 -04:00
Joshua Chin
78324e74eb reordered command line args
Former-commit-id: 6453d864c4
2015-07-22 10:04:14 -04:00
Joshua Chin
6f47f76458 bugfix
Former-commit-id: 8081145922
2015-07-21 10:12:56 -04:00
Joshua Chin
0a2f2877af fixed rules.ninja
Former-commit-id: c5f82ecac1
2015-07-20 17:20:29 -04:00
Joshua Chin
c1f56f5c96 fixed build bug
Former-commit-id: 643571c69c
2015-07-20 16:51:25 -04:00
Joshua Chin
423b2d8443 ensure removal of tatweels (hopefully)
Former-commit-id: 173278fdd3
2015-07-20 16:48:36 -04:00
Joshua Chin
efe7bc3720 unhoisted if statement
Former-commit-id: 298d3c1d24
2015-07-20 11:10:41 -04:00
Joshua Chin
b5a358012b ninja.py is now pep8 compliant
Former-commit-id: accb7e398c
2015-07-20 11:06:58 -04:00
Joshua Chin
a3880608b9 fixed build
Former-commit-id: 221acf7921
2015-07-17 17:44:01 -04:00
Rob Speer
176223bd5d mention the Wikipedia data, and credit Hermit Dave
Former-commit-id: 2d1020daac
2015-07-17 17:09:36 -04:00
Joshua Chin
c3a14a8a09 fixed tokenize_twitter
Former-commit-id: f31f9a1bcd
2015-07-17 16:37:47 -04:00
Joshua Chin
af73f813be added cld2 tokenizer comments
Former-commit-id: a44927e98e
2015-07-17 16:03:33 -04:00
Joshua Chin
5c7e0dd0dd fix arabic tokens
Former-commit-id: 11a1c51321
2015-07-17 15:52:12 -04:00
Joshua Chin
a868c99839 fixed syntax
Former-commit-id: c75c735d8d
2015-07-17 15:43:24 -04:00
Joshua Chin
f2546d8d33 renamed tokenize file to tokenize twitter
Former-commit-id: 303bd88ba2
2015-07-17 15:27:26 -04:00
Joshua Chin
d3a5191fb0 created last_tab flag
Former-commit-id: d6519cf736
2015-07-17 15:19:09 -04:00
Joshua Chin
4b81f8c938 removed unnecessary if statement
Former-commit-id: 620becb7e8
2015-07-17 15:14:06 -04:00
Joshua Chin
9812a2a08c generated freq dict in place
Former-commit-id: d988b1b42e
2015-07-17 15:13:25 -04:00
Joshua Chin
53dd1e91c5 corrected docstring
Former-commit-id: e37c689031
2015-07-17 15:12:23 -04:00
Joshua Chin
bb706b65f4 removed unnecessary strip
Former-commit-id: 002351bace
2015-07-17 15:11:28 -04:00
Joshua Chin
919f2f5912 moved last_tab to tokenize_twitter
Former-commit-id: 7fc23666a9
2015-07-17 15:10:17 -04:00
Joshua Chin
4e87458242 removed unused function
Former-commit-id: 528285a982
2015-07-17 15:03:14 -04:00
Joshua Chin
8dd4ffee8a fixed spacing
Former-commit-id: 59d3c72758
2015-07-17 15:02:34 -04:00
Joshua Chin
09dff0186c removed unnecessary format
Former-commit-id: 10028be212
2015-07-17 15:01:25 -04:00
Joshua Chin
117e06d5a4 cleaned up BAD_CHAR_RANGE
Former-commit-id: 3b368b66dd
2015-07-17 15:00:59 -04:00
Joshua Chin
cc2f748b05 moved test tokenizers
Former-commit-id: c2d1cdcb31
2015-07-17 14:58:58 -04:00