Rob Speer
49bd631632
fix URL expression
Former-commit-id: c4a2594217
2015-08-26 15:00:46 -04:00
Rob Speer
6286946cc3
correct the simple_tokenize docstring
Former-commit-id: f7babea352
2015-08-26 13:54:50 -04:00
Rob Speer
232aee9c66
refactor the token expression
Former-commit-id: 01b6403ef4
2015-08-26 13:40:47 -04:00
Rob Speer
40d6b85d67
un-flake wordfreq_builder.tokenizers, and edit docstrings
Former-commit-id: a893823d6e
2015-08-26 13:03:23 -04:00
Rob Speer
7a757d9ec9
remove regex files that are no longer needed
Former-commit-id: 94467a6563
2015-08-26 11:48:11 -04:00
Rob Speer
1f5c828642
bump to version 1.1
Former-commit-id: 694c28d5e4
2015-08-25 17:44:52 -04:00
Rob Speer
d064fbec7d
update the README
Former-commit-id: 573dd1ec79
2015-08-25 17:44:34 -04:00
Rob Speer
244735ce4d
updated data
Former-commit-id: 353b8045da
2015-08-25 17:16:03 -04:00
Rob Speer
a3b37f6619
Strip apostrophes from edges of tokens
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.
Former-commit-id: 5a1fc00aaa
2015-08-25 12:41:48 -04:00
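A minimal sketch of the idempotence problem described in commit a3b37f6619 above. This is not the wordfreq tokenizer: the pattern `TOKEN_RE` and the functions `naive_tokenize` and `tokenize_stripped` are hypothetical, chosen only so that "d'un" splits into "d'" and "un" while "d'" on its own re-tokenizes to "d", as the commit message describes.

```python
import re

# Hypothetical pattern (an assumption, not the project's real expression):
# keep a trailing apostrophe only when another word character follows it.
TOKEN_RE = re.compile(r"\w+'(?=\w)|\w+")


def naive_tokenize(text):
    """Tokenize without any cleanup of token edges."""
    return TOKEN_RE.findall(text.lower())


def tokenize_stripped(text):
    """Tokenize, then strip apostrophes from the edges of each token."""
    stripped = (tok.strip("'") for tok in naive_tokenize(text))
    return [tok for tok in stripped if tok]


# Without stripping, re-tokenizing a token can change it.
first_pass = naive_tokenize("d'un")
second_pass = [naive_tokenize(tok) for tok in first_pass]

# With stripping, every token re-tokenizes to itself.
stable = tokenize_stripped("d'un")
restable = [tokenize_stripped(tok) for tok in stable]
```

In this sketch, `first_pass` contains "d'" but `second_pass` turns it into "d", so the unstripped output is not stable under re-tokenization. The stripped tokens survive a second pass unchanged, which is the property the commit is after.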
Rob Speer
1042f87efe
exclude 'extenders' from the start of the token
Former-commit-id: a8e7c29068
2015-08-25 12:33:12 -04:00
Rob Speer
a5b8c5a745
update frequency lists
Former-commit-id: 0d600bdf27
2015-08-25 11:43:59 -04:00
Rob Speer
99a312ce06
Exclude math and modifier symbols as tokens
Former-commit-id: 8f3c9f576c
2015-08-25 11:43:22 -04:00
Rob Speer
6647cf9035
use better regexes in wordfreq_builder tokenizer
Former-commit-id: de73888a76
2015-08-24 19:05:46 -04:00
Rob Speer
decd7dae60
also NFKC-normalize Japanese input
Former-commit-id: 554455699d
2015-08-24 18:13:03 -04:00
Rob Speer
9178c6de37
only NFKC-normalize in Arabic
Former-commit-id: 1d055edc1c
2015-08-24 17:55:17 -04:00
Rob Speer
6a33b46cfd
remove Hangul fillers that confuse cld2
Former-commit-id: 140ca6c050
2015-08-24 17:11:18 -04:00
Rob Speer
759a8199fb
remove obsolete gen_regex.py
Former-commit-id: 102bc715ae
2015-08-24 17:11:18 -04:00
Rob Speer
f4cf46ab9c
Use the regex implementation of Unicode segmentation
Former-commit-id: 95998205ad
2015-08-24 17:11:08 -04:00
Rob Speer
0721707d92
Merge pull request #21 from LuminosoInsight/review-notes
Review notes
Former-commit-id: 2b8089e2b1
2015-08-03 14:48:15 -04:00
Andrew Lin
10bddfe09f
Document the NFKC-normalized ligature in the Arabic test.
Former-commit-id: 41e1dd41d8
2015-08-03 11:09:44 -04:00
Andrew Lin
581dcbcae5
Stylistic cleanups to word_counts.py.
Former-commit-id: 6d40912ef9
2015-07-31 19:26:18 -04:00
Andrew Lin
a5553676e4
Switch to more explanatory Unicode escapes when testing NFKC normalization.
Former-commit-id: 66c69e6fac
2015-07-31 19:23:42 -04:00
Andrew Lin
f393086253
Remove redundant reference to Wikipedia in builder README.
Former-commit-id: 53621c34df
2015-07-31 19:12:59 -04:00
Andrew Lin
be7bc11cad
Merge pull request #20 from LuminosoInsight/cutoff-fix
put back the freqs_to_cBpack cutoff; prepare for 1.0
Former-commit-id: 742e2b3374
2015-07-29 11:43:41 -04:00
Rob Speer
0f0aca8320
Don't use the file-reading cutoff when writing centibels
Former-commit-id: e9f9c94e36
2015-07-28 18:45:26 -04:00
Rob Speer
3892c36b16
update wordlists with cutoff fix
Former-commit-id: eb4b3cad50
2015-07-28 18:03:12 -04:00
Rob Speer
4350bc3ed7
put back the freqs_to_cBpack cutoff; prepare for 1.0
Former-commit-id: c5708b24e4
2015-07-28 18:01:12 -04:00
Rob Speer
b537f4ecfb
Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17
Code review fixes 2015 07 17
Former-commit-id: 32102ba3c2
2015-07-22 15:09:00 -04:00
Joshua Chin
8004ecb790
updated read_freqs docs
Former-commit-id: 93cd902899
2015-07-22 10:06:16 -04:00
Joshua Chin
0d8bf35fab
fixed style
Former-commit-id: 4fe9d110e1
2015-07-22 10:05:11 -04:00
Joshua Chin
78324e74eb
reordered command line args
Former-commit-id: 6453d864c4
2015-07-22 10:04:14 -04:00
Joshua Chin
dcde7c8a28
added updated wordfreq data
Former-commit-id: be29243cec
2015-07-21 10:32:53 -04:00
Joshua Chin
6f47f76458
bugfix
Former-commit-id: 8081145922
2015-07-21 10:12:56 -04:00
Joshua Chin
0a2f2877af
fixed rules.ninja
Former-commit-id: c5f82ecac1
2015-07-20 17:20:29 -04:00
Joshua Chin
c1f56f5c96
fixed build bug
Former-commit-id: 643571c69c
2015-07-20 16:51:25 -04:00
Joshua Chin
423b2d8443
ensure removal of tatweels (hopefully)
Former-commit-id: 173278fdd3
2015-07-20 16:48:36 -04:00
Joshua Chin
efe7bc3720
unhoisted if statement
Former-commit-id: 298d3c1d24
2015-07-20 11:10:41 -04:00
Joshua Chin
b5a358012b
ninja.py is now pep8 compliant
Former-commit-id: accb7e398c
2015-07-20 11:06:58 -04:00
Joshua Chin
40ba602c10
made single line docstring single line
Former-commit-id: c70ddf00ea
2015-07-20 10:29:02 -04:00
Joshua Chin
b787d9104e
updated word_frequency docstring for Chinese
Former-commit-id: 01b286e801
2015-07-20 10:28:11 -04:00
Joshua Chin
8be9d15a80
updated datafiles
Former-commit-id: 465afb854c
2015-07-20 10:05:27 -04:00
Joshua Chin
a3880608b9
fixed build
Former-commit-id: 221acf7921
2015-07-17 17:44:01 -04:00
Rob Speer
176223bd5d
mention the Wikipedia data, and credit Hermit Dave
Former-commit-id: 2d1020daac
2015-07-17 17:09:36 -04:00
Joshua Chin
c3a14a8a09
fixed tokenize_twitter
Former-commit-id: f31f9a1bcd
2015-07-17 16:37:47 -04:00
Joshua Chin
af73f813be
added cld2 tokenizer comments
Former-commit-id: a44927e98e
2015-07-17 16:03:33 -04:00
Joshua Chin
5c7e0dd0dd
fix Arabic tokens
Former-commit-id: 11a1c51321
2015-07-17 15:52:12 -04:00
Joshua Chin
a868c99839
fixed syntax
Former-commit-id: c75c735d8d
2015-07-17 15:43:24 -04:00
Joshua Chin
8006f80a99
factored out fixing Arabic
Former-commit-id: 4e3a5263c3
2015-07-17 15:39:12 -04:00
Joshua Chin
f2546d8d33
renamed tokenize file to tokenize twitter
Former-commit-id: 303bd88ba2
2015-07-17 15:27:26 -04:00
Joshua Chin
d3a5191fb0
created last_tab flag
Former-commit-id: d6519cf736
2015-07-17 15:19:09 -04:00