Rob Speer
5a1fc00aaa
Strip apostrophes from edges of tokens
...
The issue: given French text with an apostrophe, such as "d'un",
the tokenizer would split it into "d'" and "un", but re-tokenizing
"d'" on its own would yield "d". Stripping apostrophes from the
edges of tokens brings the process closer to idempotent.
2015-08-25 12:41:48 -04:00
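The idempotence problem described in the 5a1fc00aaa message can be illustrated with a small sketch. This is not wordfreq's actual tokenizer: `TOKEN_RE`, `toy_tokenize`, and `retokenize` are hypothetical names, and the regex is a simplified stand-in that only reproduces the "d'" / "un" split mentioned above.

```python
import re

# Hypothetical toy pattern (an assumption, not the project's real regex):
# a run of word characters, optionally keeping one trailing apostrophe
# when that apostrophe is immediately followed by another word character.
TOKEN_RE = re.compile(r"\w+(?:'(?=\w))?")


def toy_tokenize(text, strip_apostrophes=True):
    """Split text into lowercase tokens, optionally stripping edge apostrophes."""
    tokens = TOKEN_RE.findall(text.lower())
    if strip_apostrophes:
        tokens = [token.strip("'") for token in tokens]
    return [token for token in tokens if token]


def retokenize(tokens, strip_apostrophes=True):
    """Run every token back through the tokenizer and flatten the result."""
    return [
        piece
        for token in tokens
        for piece in toy_tokenize(token, strip_apostrophes)
    ]


# Without stripping, a second pass changes the tokens (not idempotent).
unstripped = toy_tokenize("d'un", strip_apostrophes=False)
changed = retokenize(unstripped, strip_apostrophes=False) != unstripped

# With stripping, a second pass leaves the tokens unchanged.
stripped = toy_tokenize("d'un")
stable = retokenize(stripped) == stripped
```

In this sketch the fix is applied after matching rather than inside the regex, which keeps the pattern simple; the real commit may have done it differently.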
Rob Speer
a8e7c29068
exclude 'extenders' from the start of the token
2015-08-25 12:33:12 -04:00
Rob Speer
0d600bdf27
update frequency lists
2015-08-25 11:43:59 -04:00
Rob Speer
8f3c9f576c
Exclude math and modifier symbols as tokens
2015-08-25 11:43:22 -04:00
Rob Speer
de73888a76
use better regexes in wordfreq_builder tokenizer
2015-08-24 19:05:46 -04:00
Rob Speer
554455699d
also NFKC-normalize Japanese input
2015-08-24 18:13:03 -04:00
Rob Speer
1d055edc1c
only NFKC-normalize in Arabic
2015-08-24 17:55:17 -04:00
Rob Speer
140ca6c050
remove Hangul fillers that confuse cld2
2015-08-24 17:11:18 -04:00
Rob Speer
102bc715ae
remove obsolete gen_regex.py
2015-08-24 17:11:18 -04:00
Rob Speer
95998205ad
Use the regex implementation of Unicode segmentation
2015-08-24 17:11:08 -04:00
Rob Speer
2b8089e2b1
Merge pull request #21 from LuminosoInsight/review-notes
...
Review notes
2015-08-03 14:48:15 -04:00
Andrew Lin
41e1dd41d8
Document the NFKC-normalized ligature in the Arabic test.
2015-08-03 11:09:44 -04:00
Andrew Lin
6d40912ef9
Stylistic cleanups to word_counts.py.
2015-07-31 19:26:18 -04:00
Andrew Lin
66c69e6fac
Switch to more explanatory Unicode escapes when testing NFKC normalization.
2015-07-31 19:23:42 -04:00
Andrew Lin
53621c34df
Remove redundant reference to wikipedia in builder README.
2015-07-31 19:12:59 -04:00
Andrew Lin
742e2b3374
Merge pull request #20 from LuminosoInsight/cutoff-fix
...
put back the freqs_to_cBpack cutoff; prepare for 1.0
2015-07-29 11:43:41 -04:00
Rob Speer
e9f9c94e36
Don't use the file-reading cutoff when writing centibels
2015-07-28 18:45:26 -04:00
Rob Speer
eb4b3cad50
update wordlists with cutoff fix
2015-07-28 18:03:12 -04:00
Rob Speer
c5708b24e4
put back the freqs_to_cBpack cutoff; prepare for 1.0
2015-07-28 18:01:12 -04:00
Rob Speer
32102ba3c2
Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17
...
Code review fixes 2015 07 17
2015-07-22 15:09:00 -04:00
Joshua Chin
93cd902899
updated read_freqs docs
2015-07-22 10:06:16 -04:00
Joshua Chin
4fe9d110e1
fixed style
2015-07-22 10:05:11 -04:00
Joshua Chin
6453d864c4
reordered command line args
2015-07-22 10:04:14 -04:00
Joshua Chin
be29243cec
added updated wordfreq data
2015-07-21 10:32:53 -04:00
Joshua Chin
8081145922
bugfix
2015-07-21 10:12:56 -04:00
Joshua Chin
c5f82ecac1
fixed rules.ninja
2015-07-20 17:20:29 -04:00
Joshua Chin
643571c69c
fixed build bug
2015-07-20 16:51:25 -04:00
Joshua Chin
173278fdd3
ensure removal of tatweels (hopefully)
2015-07-20 16:48:36 -04:00
Joshua Chin
298d3c1d24
unhoisted if statement
2015-07-20 11:10:41 -04:00
Joshua Chin
accb7e398c
ninja.py is now pep8 compliant
2015-07-20 11:06:58 -04:00
Joshua Chin
c70ddf00ea
made single line docstring single line
2015-07-20 10:29:02 -04:00
Joshua Chin
01b286e801
updated word_frequency docstring for Chinese
2015-07-20 10:28:11 -04:00
Joshua Chin
465afb854c
updated datafiles
2015-07-20 10:05:27 -04:00
Joshua Chin
221acf7921
fixed build
2015-07-17 17:44:01 -04:00
Rob Speer
2d1020daac
mention the Wikipedia data, and credit Hermit Dave
2015-07-17 17:09:36 -04:00
Joshua Chin
f31f9a1bcd
fixed tokenize_twitter
2015-07-17 16:37:47 -04:00
Joshua Chin
a44927e98e
added cld2 tokenizer comments
2015-07-17 16:03:33 -04:00
Joshua Chin
11a1c51321
fix Arabic tokens
2015-07-17 15:52:12 -04:00
Joshua Chin
c75c735d8d
fixed syntax
2015-07-17 15:43:24 -04:00
Joshua Chin
4e3a5263c3
factored out fixing Arabic
2015-07-17 15:39:12 -04:00
Joshua Chin
303bd88ba2
renamed tokenize file to tokenize twitter
2015-07-17 15:27:26 -04:00
Joshua Chin
d6519cf736
created last_tab flag
2015-07-17 15:19:09 -04:00
Joshua Chin
620becb7e8
removed unnecessary if statement
2015-07-17 15:14:06 -04:00
Joshua Chin
d988b1b42e
generated freq dict in place
2015-07-17 15:13:25 -04:00
Joshua Chin
e37c689031
corrected docstring
2015-07-17 15:12:23 -04:00
Joshua Chin
002351bace
removed unnecessary strip
2015-07-17 15:11:28 -04:00
Joshua Chin
7fc23666a9
moved last_tab to tokenize_twitter
2015-07-17 15:10:17 -04:00
Joshua Chin
528285a982
removed unused function
2015-07-17 15:03:14 -04:00
Joshua Chin
59d3c72758
fixed spacing
2015-07-17 15:02:34 -04:00
Joshua Chin
10028be212
removed unnecessary format
2015-07-17 15:01:25 -04:00