Andrew Lin
9fedede771
Merge pull request #22 from LuminosoInsight/standard-tokenizer
...
Use a more standard Unicode tokenizer
Former-commit-id: e6d9b36203
2015-08-27 11:56:19 -04:00
Robyn Speer
4edfab23ef
update data files
...
Former-commit-id: b952676679
2015-08-27 03:58:54 -04:00
Robyn Speer
2c688b8238
copyedit regex comments
...
Former-commit-id: d5fcf4407e
2015-08-26 17:04:56 -04:00
Robyn Speer
0b5d2cdca9
fix typo in docstring
...
Former-commit-id: 34375958ef
2015-08-26 16:24:35 -04:00
Robyn Speer
af29fc4f88
fix URL expression
...
Former-commit-id: c4a2594217
2015-08-26 15:00:46 -04:00
Robyn Speer
e463397edf
correct the simple_tokenize docstring
...
Former-commit-id: f7babea352
2015-08-26 13:54:50 -04:00
Robyn Speer
7fa449729b
refactor the token expression
...
Former-commit-id: 01b6403ef4
2015-08-26 13:40:47 -04:00
Robyn Speer
3a140ee02f
un-flake wordfreq_builder.tokenizers, and edit docstrings
...
Former-commit-id: a893823d6e
2015-08-26 13:03:23 -04:00
Robyn Speer
769d8c627c
remove regex files that are no longer needed
...
Former-commit-id: 94467a6563
2015-08-26 11:48:11 -04:00
Robyn Speer
6f10e71d29
bump to version 1.1
...
Former-commit-id: 694c28d5e4
2015-08-25 17:44:52 -04:00
Robyn Speer
a3a3180bb9
update the README
...
Former-commit-id: 573dd1ec79
2015-08-25 17:44:34 -04:00
Robyn Speer
e3658e0e42
updated data
...
Former-commit-id: 353b8045da
2015-08-25 17:16:03 -04:00
Robyn Speer
b22a4b0f02
Strip apostrophes from edges of tokens
...
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.
Former-commit-id: 5a1fc00aaa
2015-08-25 12:41:48 -04:00
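Note on b22a4b0f02: the idempotence issue described in that commit message can be reproduced with a small sketch. This is not the project's tokenizer; `naive_tokenize`, `tokenize`, and the regular expression are hypothetical stand-ins, chosen only because they split "d'un" the same way the message describes.

```python
import re

# Hypothetical stand-in for the token expression: a word keeps a trailing
# apostrophe only when another word character follows it (as in "d'un").
TOKEN_RE = re.compile(r"\w+'(?=\w)|\w+")


def naive_tokenize(text):
    """Split text into tokens, leaving edge apostrophes in place."""
    return TOKEN_RE.findall(text)


def tokenize(text):
    """Same split, but with apostrophes stripped from the edges of each token."""
    return [token.strip("'") for token in naive_tokenize(text)]


if __name__ == "__main__":
    first_pass = naive_tokenize("d'un")
    print(first_pass)
    # Re-tokenizing the first token changes it, so the naive version
    # is not idempotent.
    print(naive_tokenize(first_pass[0]))
    # With edge apostrophes stripped, every token survives a second pass.
    print([tokenize(token) for token in tokenize("d'un")])
```

In this sketch, stripping at the edges means a token that has already been produced tokenizes to itself, which is the property the commit message is after.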
Robyn Speer
0b282c5055
exclude 'extenders' from the start of the token
...
Former-commit-id: a8e7c29068
2015-08-25 12:33:12 -04:00
Robyn Speer
4801b0d876
update frequency lists
...
Former-commit-id: 0d600bdf27
2015-08-25 11:43:59 -04:00
Robyn Speer
070c89c00c
Exclude math and modifier symbols as tokens
...
Former-commit-id: 8f3c9f576c
2015-08-25 11:43:22 -04:00
Robyn Speer
8637aaef9e
use better regexes in wordfreq_builder tokenizer
...
Former-commit-id: de73888a76
2015-08-24 19:05:46 -04:00
Robyn Speer
7bdfb74720
also NFKC-normalize Japanese input
...
Former-commit-id: 554455699d
2015-08-24 18:13:03 -04:00
Robyn Speer
13096b26bd
only NFKC-normalize in Arabic
...
Former-commit-id: 1d055edc1c
2015-08-24 17:55:17 -04:00
Robyn Speer
4ec128adae
remove Hangul fillers that confuse cld2
...
Former-commit-id: 140ca6c050
2015-08-24 17:11:18 -04:00
Robyn Speer
3674d35501
remove obsolete gen_regex.py
...
Former-commit-id: 102bc715ae
2015-08-24 17:11:18 -04:00
Robyn Speer
8795525372
Use the regex implementation of Unicode segmentation
...
Former-commit-id: 95998205ad
2015-08-24 17:11:08 -04:00
Robyn Speer
e15fc14b8e
Merge pull request #21 from LuminosoInsight/review-notes
...
Review notes
Former-commit-id: 2b8089e2b1
2015-08-03 14:48:15 -04:00
Andrew Lin
e88cf3fdaf
Document the NFKC-normalized ligature in the Arabic test.
...
Former-commit-id: 41e1dd41d8
2015-08-03 11:09:44 -04:00
Andrew Lin
77610f57e1
Stylistic cleanups to word_counts.py.
...
Former-commit-id: 6d40912ef9
2015-07-31 19:26:18 -04:00
Andrew Lin
b0fac15f98
Switch to more explanatory Unicode escapes when testing NFKC normalization.
...
Former-commit-id: 66c69e6fac
2015-07-31 19:23:42 -04:00
Andrew Lin
0711fb3c43
Remove redundant reference to wikipedia in builder README.
...
Former-commit-id: 53621c34df
2015-07-31 19:12:59 -04:00
Andrew Lin
ba565ee838
Merge pull request #20 from LuminosoInsight/cutoff-fix
...
put back the freqs_to_cBpack cutoff; prepare for 1.0
Former-commit-id: 742e2b3374
2015-07-29 11:43:41 -04:00
Robyn Speer
e9dd253f1d
Don't use the file-reading cutoff when writing centibels
...
Former-commit-id: e9f9c94e36
2015-07-28 18:45:26 -04:00
Robyn Speer
0a032dfa97
update wordlists with cutoff fix
...
Former-commit-id: eb4b3cad50
2015-07-28 18:03:12 -04:00
Robyn Speer
3ff0f30218
put back the freqs_to_cBpack cutoff; prepare for 1.0
...
Former-commit-id: c5708b24e4
2015-07-28 18:01:12 -04:00
Robyn Speer
33e0493fd5
Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17
...
Code review fixes 2015 07 17
Former-commit-id: 32102ba3c2
2015-07-22 15:09:00 -04:00
Joshua Chin
292fc96142
updated read_freqs docs
...
Former-commit-id: 93cd902899
2015-07-22 10:06:16 -04:00
Joshua Chin
d629e8b6cc
fixed style
...
Former-commit-id: 4fe9d110e1
2015-07-22 10:05:11 -04:00
Joshua Chin
f9742c94ca
reordered command line args
...
Former-commit-id: 6453d864c4
2015-07-22 10:04:14 -04:00
Joshua Chin
b19bba38ad
added updated wordfreq data
...
Former-commit-id: be29243cec
2015-07-21 10:32:53 -04:00
Joshua Chin
474ae0da35
bugfix
...
Former-commit-id: 8081145922
2015-07-21 10:12:56 -04:00
Joshua Chin
34504eed80
fixed rules.ninja
...
Former-commit-id: c5f82ecac1
2015-07-20 17:20:29 -04:00
Joshua Chin
61a03b87bc
fixed build bug
...
Former-commit-id: 643571c69c
2015-07-20 16:51:25 -04:00
Joshua Chin
af8050f1b8
ensure removal of tatweels (hopefully)
...
Former-commit-id: 173278fdd3
2015-07-20 16:48:36 -04:00
Joshua Chin
675a02ac11
unhoisted if statement
...
Former-commit-id: 298d3c1d24
2015-07-20 11:10:41 -04:00
Joshua Chin
98cbef4ecf
ninja.py is now pep8 compliant
...
Former-commit-id: accb7e398c
2015-07-20 11:06:58 -04:00
Joshua Chin
3b6b8d3ab1
made single line docstring single line
...
Former-commit-id: c70ddf00ea
2015-07-20 10:29:02 -04:00
Joshua Chin
532b953839
updated word_frequency docstring for Chinese
...
Former-commit-id: 01b286e801
2015-07-20 10:28:11 -04:00
Joshua Chin
360f66bbaf
updated datafiles
...
Former-commit-id: 465afb854c
2015-07-20 10:05:27 -04:00
Joshua Chin
44669bd3a9
fixed build
...
Former-commit-id: 221acf7921
2015-07-17 17:44:01 -04:00
Robyn Speer
ea2c6adbc4
mention the Wikipedia data, and credit Hermit Dave
...
Former-commit-id: 2d1020daac
2015-07-17 17:09:36 -04:00
Joshua Chin
ec871bb6ca
fixed tokenize_twitter
...
Former-commit-id: f31f9a1bcd
2015-07-17 16:37:47 -04:00
Joshua Chin
71ff0c62d6
added cld2 tokenizer comments
...
Former-commit-id: a44927e98e
2015-07-17 16:03:33 -04:00
Joshua Chin
c2f3928433
fix arabic tokens
...
Former-commit-id: 11a1c51321
2015-07-17 15:52:12 -04:00