113 Commits

Author SHA1 Message Date
Robyn Speer
af29fc4f88 fix URL expression
Former-commit-id: c4a2594217
2015-08-26 15:00:46 -04:00
Robyn Speer
3a140ee02f un-flake wordfreq_builder.tokenizers, and edit docstrings
Former-commit-id: a893823d6e
2015-08-26 13:03:23 -04:00
Robyn Speer
b22a4b0f02 Strip apostrophes from edges of tokens
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.


Former-commit-id: 5a1fc00aaa
2015-08-25 12:41:48 -04:00
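
The idempotence problem described in b22a4b0f02 can be reproduced with a toy tokenizer. This is only a sketch: `TOKEN_RE`, `naive_tokenize`, and `tokenize` are hypothetical names, and the regex is a stand-in chosen to show the same behavior, not the actual wordfreq_builder pattern. It assumes only the ASCII apostrophe matters.

```python
import re

# Hypothetical toy pattern (NOT wordfreq's real regex): a token keeps a
# trailing apostrophe only when another word character follows it.
TOKEN_RE = re.compile(r"\w+'(?=\w)|\w+")


def naive_tokenize(text):
    """Tokenize without stripping apostrophes from token edges."""
    return TOKEN_RE.findall(text.lower())


def tokenize(text):
    """Tokenize, then strip apostrophes from the edges of each token."""
    stripped = (token.strip("'") for token in naive_tokenize(text))
    return [token for token in stripped if token]


def retokenize(tokens, tokenizer):
    """Run the tokenizer again over each token it already produced."""
    return [piece for token in tokens for piece in tokenizer(token)]


first_pass = naive_tokenize("d'un")
# The naive version is not idempotent: "d'" becomes "d" on the second pass.
print(first_pass, retokenize(first_pass, naive_tokenize))

stripped_pass = tokenize("d'un")
# With edge apostrophes stripped, a second pass changes nothing.
print(stripped_pass, retokenize(stripped_pass, tokenize))
```

Stripping at the edges leaves apostrophes in the middle of a token alone, so only the unstable "d'" form is affected.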
Robyn Speer
8637aaef9e use better regexes in wordfreq_builder tokenizer
Former-commit-id: de73888a76
2015-08-24 19:05:46 -04:00
Robyn Speer
4ec128adae remove Hangul fillers that confuse cld2
Former-commit-id: 140ca6c050
2015-08-24 17:11:18 -04:00
Andrew Lin
77610f57e1 Stylistic cleanups to word_counts.py.
Former-commit-id: 6d40912ef9
2015-07-31 19:26:18 -04:00
Andrew Lin
0711fb3c43 Remove redundant reference to wikipedia in builder README.
Former-commit-id: 53621c34df
2015-07-31 19:12:59 -04:00
Robyn Speer
e9dd253f1d Don't use the file-reading cutoff when writing centibels
Former-commit-id: e9f9c94e36
2015-07-28 18:45:26 -04:00
Robyn Speer
3ff0f30218 put back the freqs_to_cBpack cutoff; prepare for 1.0
Former-commit-id: c5708b24e4
2015-07-28 18:01:12 -04:00
Robyn Speer
33e0493fd5 Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17
Code review fixes 2015 07 17

Former-commit-id: 32102ba3c2
2015-07-22 15:09:00 -04:00
Joshua Chin
292fc96142 updated read_freqs docs
Former-commit-id: 93cd902899
2015-07-22 10:06:16 -04:00
Joshua Chin
d629e8b6cc fixed style
Former-commit-id: 4fe9d110e1
2015-07-22 10:05:11 -04:00
Joshua Chin
f9742c94ca reordered command line args
Former-commit-id: 6453d864c4
2015-07-22 10:04:14 -04:00
Joshua Chin
474ae0da35 bugfix
Former-commit-id: 8081145922
2015-07-21 10:12:56 -04:00
Joshua Chin
34504eed80 fixed rules.ninja
Former-commit-id: c5f82ecac1
2015-07-20 17:20:29 -04:00
Joshua Chin
61a03b87bc fixed build bug
Former-commit-id: 643571c69c
2015-07-20 16:51:25 -04:00
Joshua Chin
af8050f1b8 ensure removal of tatweels (hopefully)
Former-commit-id: 173278fdd3
2015-07-20 16:48:36 -04:00
Joshua Chin
675a02ac11 unhoisted if statement
Former-commit-id: 298d3c1d24
2015-07-20 11:10:41 -04:00
Joshua Chin
98cbef4ecf ninja.py is now pep8 compliant
Former-commit-id: accb7e398c
2015-07-20 11:06:58 -04:00
Joshua Chin
44669bd3a9 fixed build
Former-commit-id: 221acf7921
2015-07-17 17:44:01 -04:00
Robyn Speer
ea2c6adbc4 mention the Wikipedia data, and credit Hermit Dave
Former-commit-id: 2d1020daac
2015-07-17 17:09:36 -04:00
Joshua Chin
ec871bb6ca fixed tokenize_twitter
Former-commit-id: f31f9a1bcd
2015-07-17 16:37:47 -04:00
Joshua Chin
71ff0c62d6 added cld2 tokenizer comments
Former-commit-id: a44927e98e
2015-07-17 16:03:33 -04:00
Joshua Chin
c2f3928433 fix arabic tokens
Former-commit-id: 11a1c51321
2015-07-17 15:52:12 -04:00
Joshua Chin
d283183743 fixed syntax
Former-commit-id: c75c735d8d
2015-07-17 15:43:24 -04:00
Joshua Chin
3962b475c1 renamed tokenize file to tokenize twitter
Former-commit-id: 303bd88ba2
2015-07-17 15:27:26 -04:00
Joshua Chin
2f73cc535c created last_tab flag
Former-commit-id: d6519cf736
2015-07-17 15:19:09 -04:00
Joshua Chin
4117480a0e removed unnecessary if statement
Former-commit-id: 620becb7e8
2015-07-17 15:14:06 -04:00
Joshua Chin
b6d03324b9 generated freq dict in place
Former-commit-id: d988b1b42e
2015-07-17 15:13:25 -04:00
Joshua Chin
c0650a6893 corrected docstring
Former-commit-id: e37c689031
2015-07-17 15:12:23 -04:00
Joshua Chin
be8921869a removed unnecessary strip
Former-commit-id: 002351bace
2015-07-17 15:11:28 -04:00
Joshua Chin
d7feab1c28 moved last_tab to tokenize_twitter
Former-commit-id: 7fc23666a9
2015-07-17 15:10:17 -04:00
Joshua Chin
200c271083 removed unused function
Former-commit-id: 528285a982
2015-07-17 15:03:14 -04:00
Joshua Chin
c84ac8d62a fixed spacing
Former-commit-id: 59d3c72758
2015-07-17 15:02:34 -04:00
Joshua Chin
2258c4e55b removed unnecessary format
Former-commit-id: 10028be212
2015-07-17 15:01:25 -04:00
Joshua Chin
368e4f3cca cleaned up BAD_CHAR_RANGE
Former-commit-id: 3b368b66dd
2015-07-17 15:00:59 -04:00
Joshua Chin
78e9cf5d8f moved test tokenizers
Former-commit-id: c2d1cdcb31
2015-07-17 14:58:58 -04:00
Joshua Chin
4bfdd263b7 added docstring and moved to scripts
Former-commit-id: 5d26c9f57f
2015-07-17 14:56:18 -04:00
Joshua Chin
09ccb862ba style changes
Former-commit-id: bdc791af8f
2015-07-17 14:54:32 -04:00
Joshua Chin
85fe540a06 removed bad comment
Former-commit-id: 4d5ec57144
2015-07-17 14:54:09 -04:00
Joshua Chin
eb9add9d71 removed unused scripts
Former-commit-id: 39f01b0485
2015-07-17 14:53:18 -04:00
Joshua Chin
a340a15870 removed mkdir -p for many cases
Former-commit-id: 98a7a8093b
2015-07-17 14:45:22 -04:00
Joshua Chin
354f09ec24 removed TOKENIZE_TWITTER
Former-commit-id: 449a656edd
2015-07-17 14:43:14 -04:00
Joshua Chin
d0df4cc9a4 removed TOKENIZE_TWITTER option
Former-commit-id: 00e18b7d4b
2015-07-17 14:40:49 -04:00
Joshua Chin
46b2730601 more README fixes
Former-commit-id: 772c0cddd1
2015-07-17 14:40:33 -04:00
Joshua Chin
3e4643f9c4 fixed README
Former-commit-id: 0a085132f4
2015-07-17 14:35:43 -04:00
Robyn Speer
73bacc659d update the wordfreq_builder README
Former-commit-id: 8633e8c2a9
2015-07-13 11:58:48 -04:00
Robyn Speer
e9d88bf35e add docstrings and remove some brackets
2015-07-07 18:22:51 -04:00
Joshua Chin
1c365e6a50 Removes mention of Rosette from README
2015-07-07 10:32:16 -04:00
Robyn Speer
3eb3e7c388 add 'twitter' as a final build, and a new build dir
The `data/dist` directory is now a convenient place to find the final
built files that can be copied into wordfreq.
2015-07-01 17:45:39 -04:00