Commit Graph

364 Commits

Author SHA1 Message Date
Rob Speer
5a1fc00aaa Strip apostrophes from edges of tokens
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.
2015-08-25 12:41:48 -04:00
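Note on 5a1fc00aaa: the idempotence problem described in that commit message can be reproduced with a toy tokenizer. The sketch below is illustrative only. The pattern `TOY_TOKEN_RE` and the helpers `tokenize_naive`, `tokenize_stripped`, and `retokenize` are hypothetical names invented for this example, not the regexes or functions used in the repository. It assumes a tokenizer that keeps a trailing apostrophe only when another word character follows it, which is enough to show the "d'un" behavior the message describes.

```python
import re

# Toy pattern (an assumption, not the project's regex): a word keeps a
# trailing apostrophe only when another word character follows it.
TOY_TOKEN_RE = re.compile(r"\w+'(?=\w)|\w+")


def tokenize_naive(text):
    """Return tokens exactly as the toy pattern matches them."""
    return TOY_TOKEN_RE.findall(text)


def tokenize_stripped(text):
    """Same pattern, but strip apostrophes from the edges of each token."""
    tokens = (tok.strip("'") for tok in TOY_TOKEN_RE.findall(text))
    return [tok for tok in tokens if tok]


def retokenize(tokenizer, tokens):
    """Run the tokenizer again over each token it already produced."""
    return [piece for tok in tokens for piece in tokenizer(tok)]


naive = tokenize_naive("d'un")
stripped = tokenize_stripped("d'un")

# Without stripping, a second pass changes the tokens ("d'" becomes "d").
print(naive, retokenize(tokenize_naive, naive))

# With stripping, a second pass leaves the tokens unchanged.
print(stripped, retokenize(tokenize_stripped, stripped))
```

In this sketch the naive tokenizer yields a different token list on the second pass, while the stripped version yields the same list both times, which is the sense of "more idempotent" in the commit message.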
Rob Speer
a8e7c29068 exclude 'extenders' from the start of the token 2015-08-25 12:33:12 -04:00
Rob Speer
0d600bdf27 update frequency lists 2015-08-25 11:43:59 -04:00
Rob Speer
8f3c9f576c Exclude math and modifier symbols as tokens 2015-08-25 11:43:22 -04:00
Rob Speer
de73888a76 use better regexes in wordfreq_builder tokenizer 2015-08-24 19:05:46 -04:00
Rob Speer
554455699d also NFKC-normalize Japanese input 2015-08-24 18:13:03 -04:00
Rob Speer
1d055edc1c only NFKC-normalize in Arabic 2015-08-24 17:55:17 -04:00
Rob Speer
140ca6c050 remove Hangul fillers that confuse cld2 2015-08-24 17:11:18 -04:00
Rob Speer
102bc715ae remove obsolete gen_regex.py 2015-08-24 17:11:18 -04:00
Rob Speer
95998205ad Use the regex implementation of Unicode segmentation 2015-08-24 17:11:08 -04:00
Rob Speer
2b8089e2b1 Merge pull request #21 from LuminosoInsight/review-notes
Review notes
2015-08-03 14:48:15 -04:00
Andrew Lin
41e1dd41d8 Document the NFKC-normalized ligature in the Arabic test. 2015-08-03 11:09:44 -04:00
Andrew Lin
6d40912ef9 Stylistic cleanups to word_counts.py. 2015-07-31 19:26:18 -04:00
Andrew Lin
66c69e6fac Switch to more explanatory Unicode escapes when testing NFKC normalization. 2015-07-31 19:23:42 -04:00
Andrew Lin
53621c34df Remove redundant reference to wikipedia in builder README. 2015-07-31 19:12:59 -04:00
Andrew Lin
742e2b3374 Merge pull request #20 from LuminosoInsight/cutoff-fix
put back the freqs_to_cBpack cutoff; prepare for 1.0
2015-07-29 11:43:41 -04:00
Rob Speer
e9f9c94e36 Don't use the file-reading cutoff when writing centibels 2015-07-28 18:45:26 -04:00
Rob Speer
eb4b3cad50 update wordlists with cutoff fix 2015-07-28 18:03:12 -04:00
Rob Speer
c5708b24e4 put back the freqs_to_cBpack cutoff; prepare for 1.0 2015-07-28 18:01:12 -04:00
Rob Speer
32102ba3c2 Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17
Code review fixes 2015 07 17
2015-07-22 15:09:00 -04:00
Joshua Chin
93cd902899 updated read_freqs docs 2015-07-22 10:06:16 -04:00
Joshua Chin
4fe9d110e1 fixed style 2015-07-22 10:05:11 -04:00
Joshua Chin
6453d864c4 reordered command line args 2015-07-22 10:04:14 -04:00
Joshua Chin
be29243cec added updated wordfreq data 2015-07-21 10:32:53 -04:00
Joshua Chin
8081145922 bugfix 2015-07-21 10:12:56 -04:00
Joshua Chin
c5f82ecac1 fixed rules.ninja 2015-07-20 17:20:29 -04:00
Joshua Chin
643571c69c fixed build bug 2015-07-20 16:51:25 -04:00
Joshua Chin
173278fdd3 ensure removal of tatweels (hopefully) 2015-07-20 16:48:36 -04:00
Joshua Chin
298d3c1d24 unhoisted if statement 2015-07-20 11:10:41 -04:00
Joshua Chin
accb7e398c ninja.py is now pep8 compliant 2015-07-20 11:06:58 -04:00
Joshua Chin
c70ddf00ea made single line docstring single line 2015-07-20 10:29:02 -04:00
Joshua Chin
01b286e801 updated word_frequency docstring for Chinese 2015-07-20 10:28:11 -04:00
Joshua Chin
465afb854c updated datafiles 2015-07-20 10:05:27 -04:00
Joshua Chin
221acf7921 fixed build 2015-07-17 17:44:01 -04:00
Rob Speer
2d1020daac mention the Wikipedia data, and credit Hermit Dave 2015-07-17 17:09:36 -04:00
Joshua Chin
f31f9a1bcd fixed tokenize_twitter 2015-07-17 16:37:47 -04:00
Joshua Chin
a44927e98e added cld2 tokenizer comments 2015-07-17 16:03:33 -04:00
Joshua Chin
11a1c51321 fix arabic tokens 2015-07-17 15:52:12 -04:00
Joshua Chin
c75c735d8d fixed syntax 2015-07-17 15:43:24 -04:00
Joshua Chin
4e3a5263c3 factored out fixing arabic 2015-07-17 15:39:12 -04:00
Joshua Chin
303bd88ba2 renamed tokenize file to tokenize twitter 2015-07-17 15:27:26 -04:00
Joshua Chin
d6519cf736 created last_tab flag 2015-07-17 15:19:09 -04:00
Joshua Chin
620becb7e8 removed uncessary if statement 2015-07-17 15:14:06 -04:00
Joshua Chin
d988b1b42e generated freq dict in place 2015-07-17 15:13:25 -04:00
Joshua Chin
e37c689031 corrected docstring 2015-07-17 15:12:23 -04:00
Joshua Chin
002351bace removed unnecessary strip 2015-07-17 15:11:28 -04:00
Joshua Chin
7fc23666a9 moved last_tab to tokenize_twitter 2015-07-17 15:10:17 -04:00
Joshua Chin
528285a982 removed unused function 2015-07-17 15:03:14 -04:00
Joshua Chin
59d3c72758 fixed spacing 2015-07-17 15:02:34 -04:00
Joshua Chin
10028be212 removed unnecessary format 2015-07-17 15:01:25 -04:00