Robyn Speer
a75a95658b
We can put the cutoff back now
...
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.
Former-commit-id: 5c7a7ea83e
2015-09-04 16:16:52 -04:00
Robyn Speer
f330d6d130
remove subtlex-gr from README
...
Former-commit-id: 56318a3ca3
2015-09-04 16:11:46 -04:00
Robyn Speer
8277b34571
Use SUBTLEX for German, but OpenSubtitles for Greek
...
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.
Former-commit-id: 77c60c29b0
2015-09-04 15:52:21 -04:00
Robyn Speer
a69b66b210
Exclude angle brackets from CLD2 detection
...
Former-commit-id: 0d3ee869c1
2015-09-04 14:56:06 -04:00
Robyn Speer
d0ada70355
add more SUBTLEX and fix its build rules
...
Former-commit-id: 34474939f2
2015-09-04 12:37:35 -04:00
Robyn Speer
14136d2a01
Note on next languages to support
...
Former-commit-id: 531db64288
2015-09-04 01:50:15 -04:00
Robyn Speer
574c383202
support Turkish and more Greek; document more
...
Former-commit-id: d94428d454
2015-09-04 00:57:04 -04:00
Robyn Speer
f168c37417
Merge branch 'add-subtlex' into greek-and-turkish
...
Former-commit-id: 45d871a815
2015-09-03 23:26:14 -04:00
Robyn Speer
76c751652e
refer to merge_freqs command correctly
...
Former-commit-id: 40d82541ba
2015-09-03 23:25:46 -04:00
Robyn Speer
3446a393c5
expand Greek and enable Turkish in config
...
Former-commit-id: a3daba81eb
2015-09-03 23:23:31 -04:00
Robyn Speer
f66d03b1b9
Add SUBTLEX as a source of English and Chinese data
...
Meanwhile, fix up the dependency graph thingy. It's actually kind of
legible now.
Former-commit-id: 2d58ba94f2
2015-09-03 18:13:13 -04:00
Robyn Speer
247d7c6579
update the build diagram and its script
...
Former-commit-id: 5def3a7897
2015-08-28 17:47:04 -04:00
Robyn Speer
af29fc4f88
fix URL expression
...
Former-commit-id: c4a2594217
2015-08-26 15:00:46 -04:00
Robyn Speer
3a140ee02f
un-flake wordfreq_builder.tokenizers, and edit docstrings
...
Former-commit-id: a893823d6e
2015-08-26 13:03:23 -04:00
Robyn Speer
b22a4b0f02
Strip apostrophes from edges of tokens
...
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.
Former-commit-id: 5a1fc00aaa
2015-08-25 12:41:48 -04:00
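Illustration for commit b22a4b0f02 above: a minimal sketch of why stripping edge apostrophes makes tokenization idempotent. This is not the wordfreq_builder tokenizer. The pattern TOKEN_RE and the helpers tokenize_raw, tokenize_stripped, and retokenize are hypothetical names chosen only to reproduce the "d'un" behaviour the commit message describes, assuming a tokenizer that keeps an apostrophe on a token only when a letter follows it.

```python
import re

# Hypothetical pattern (an assumption, not the project's regex): a run of
# word characters, plus one trailing apostrophe only if a word character
# follows it. This mimics the elision split "d'un" -> "d'", "un".
TOKEN_RE = re.compile(r"\w+(?:'(?=\w))?")


def tokenize_raw(text):
    """Tokenize without any cleanup; edge apostrophes are kept."""
    return TOKEN_RE.findall(text)


def tokenize_stripped(text):
    """Tokenize, then strip apostrophes from the edges of each token."""
    return [token.strip("'") for token in tokenize_raw(text)]


def retokenize(tokens, tokenizer):
    """Run the tokenizer again over each token it already produced."""
    result = []
    for token in tokens:
        result.extend(tokenizer(token))
    return result


raw = tokenize_raw("d'un")
# Re-tokenizing "d'" on its own drops the apostrophe, so the second pass
# disagrees with the first: the raw tokenizer is not idempotent.
raw_is_idempotent = retokenize(raw, tokenize_raw) == raw

stripped = tokenize_stripped("d'un")
# With edge apostrophes stripped, a second pass changes nothing.
stripped_is_idempotent = retokenize(stripped, tokenize_stripped) == stripped

print(raw, raw_is_idempotent)
print(stripped, stripped_is_idempotent)
```

In this sketch the first pass over "d'un" differs between the two variants only in the "d'" token, which is the case the commit message singles out.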
Robyn Speer
8637aaef9e
use better regexes in wordfreq_builder tokenizer
...
Former-commit-id: de73888a76
2015-08-24 19:05:46 -04:00
Robyn Speer
4ec128adae
remove Hangul fillers that confuse cld2
...
Former-commit-id: 140ca6c050
2015-08-24 17:11:18 -04:00
Andrew Lin
77610f57e1
Stylistic cleanups to word_counts.py.
...
Former-commit-id: 6d40912ef9
2015-07-31 19:26:18 -04:00
Andrew Lin
0711fb3c43
Remove redundant reference to wikipedia in builder README.
...
Former-commit-id: 53621c34df
2015-07-31 19:12:59 -04:00
Robyn Speer
e9dd253f1d
Don't use the file-reading cutoff when writing centibels
...
Former-commit-id: e9f9c94e36
2015-07-28 18:45:26 -04:00
Robyn Speer
3ff0f30218
put back the freqs_to_cBpack cutoff; prepare for 1.0
...
Former-commit-id: c5708b24e4
2015-07-28 18:01:12 -04:00
Robyn Speer
33e0493fd5
Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17
...
Code review fixes 2015 07 17
Former-commit-id: 32102ba3c2
2015-07-22 15:09:00 -04:00
Joshua Chin
292fc96142
updated read_freqs docs
...
Former-commit-id: 93cd902899
2015-07-22 10:06:16 -04:00
Joshua Chin
d629e8b6cc
fixed style
...
Former-commit-id: 4fe9d110e1
2015-07-22 10:05:11 -04:00
Joshua Chin
f9742c94ca
reordered command line args
...
Former-commit-id: 6453d864c4
2015-07-22 10:04:14 -04:00
Joshua Chin
474ae0da35
bugfix
...
Former-commit-id: 8081145922
2015-07-21 10:12:56 -04:00
Joshua Chin
34504eed80
fixed rules.ninja
...
Former-commit-id: c5f82ecac1
2015-07-20 17:20:29 -04:00
Joshua Chin
61a03b87bc
fixed build bug
...
Former-commit-id: 643571c69c
2015-07-20 16:51:25 -04:00
Joshua Chin
af8050f1b8
ensure removal of tatweels (hopefully)
...
Former-commit-id: 173278fdd3
2015-07-20 16:48:36 -04:00
Joshua Chin
675a02ac11
unhoisted if statement
...
Former-commit-id: 298d3c1d24
2015-07-20 11:10:41 -04:00
Joshua Chin
98cbef4ecf
ninja.py is now pep8 compliant
...
Former-commit-id: accb7e398c
2015-07-20 11:06:58 -04:00
Joshua Chin
44669bd3a9
fixed build
...
Former-commit-id: 221acf7921
2015-07-17 17:44:01 -04:00
Robyn Speer
ea2c6adbc4
mention the Wikipedia data, and credit Hermit Dave
...
Former-commit-id: 2d1020daac
2015-07-17 17:09:36 -04:00
Joshua Chin
ec871bb6ca
fixed tokenize_twitter
...
Former-commit-id: f31f9a1bcd
2015-07-17 16:37:47 -04:00
Joshua Chin
71ff0c62d6
added cld2 tokenizer comments
...
Former-commit-id: a44927e98e
2015-07-17 16:03:33 -04:00
Joshua Chin
c2f3928433
fix arabic tokens
...
Former-commit-id: 11a1c51321
2015-07-17 15:52:12 -04:00
Joshua Chin
d283183743
fixed syntax
...
Former-commit-id: c75c735d8d
2015-07-17 15:43:24 -04:00
Joshua Chin
3962b475c1
renamed tokenize file to tokenize twitter
...
Former-commit-id: 303bd88ba2
2015-07-17 15:27:26 -04:00
Joshua Chin
2f73cc535c
created last_tab flag
...
Former-commit-id: d6519cf736
2015-07-17 15:19:09 -04:00
Joshua Chin
4117480a0e
removed unnecessary if statement
...
Former-commit-id: 620becb7e8
2015-07-17 15:14:06 -04:00
Joshua Chin
b6d03324b9
generated freq dict in place
...
Former-commit-id: d988b1b42e
2015-07-17 15:13:25 -04:00
Joshua Chin
c0650a6893
corrected docstring
...
Former-commit-id: e37c689031
2015-07-17 15:12:23 -04:00
Joshua Chin
be8921869a
removed unnecessary strip
...
Former-commit-id: 002351bace
2015-07-17 15:11:28 -04:00
Joshua Chin
d7feab1c28
moved last_tab to tokenize_twitter
...
Former-commit-id: 7fc23666a9
2015-07-17 15:10:17 -04:00
Joshua Chin
200c271083
removed unused function
...
Former-commit-id: 528285a982
2015-07-17 15:03:14 -04:00
Joshua Chin
c84ac8d62a
fixed spacing
...
Former-commit-id: 59d3c72758
2015-07-17 15:02:34 -04:00
Joshua Chin
2258c4e55b
removed unnecessary format
...
Former-commit-id: 10028be212
2015-07-17 15:01:25 -04:00
Joshua Chin
368e4f3cca
cleaned up BAD_CHAR_RANGE
...
Former-commit-id: 3b368b66dd
2015-07-17 15:00:59 -04:00
Joshua Chin
78e9cf5d8f
moved test tokenizers
...
Former-commit-id: c2d1cdcb31
2015-07-17 14:58:58 -04:00
Joshua Chin
4bfdd263b7
added docstring and moved to scripts
...
Former-commit-id: 5d26c9f57f
2015-07-17 14:56:18 -04:00