Commit Graph

481 Commits

Author SHA1 Message Date
Andrew Lin
b693715663 Merge pull request #23 from LuminosoInsight/readme
Put documentation and examples in the README

Former-commit-id: e43b5ebf7b
2015-08-28 17:59:17 -04:00
Rob Speer
d883eaeca5 fix heading
Former-commit-id: 00a2812907
2015-08-28 17:49:38 -04:00
Rob Speer
390a431181 fix list formatting
Former-commit-id: 93f44683c5
2015-08-28 17:49:07 -04:00
Rob Speer
4aac7bdd65 update the build diagram and its script
Former-commit-id: 5def3a7897
2015-08-28 17:47:04 -04:00
Rob Speer
43fd15c938 improve README with function documentation and examples
Former-commit-id: 2370287539
2015-08-28 17:45:50 -04:00
Andrew Lin
5a47427f6e Merge pull request #22 from LuminosoInsight/standard-tokenizer
Use a more standard Unicode tokenizer

Former-commit-id: e6d9b36203
2015-08-27 11:56:19 -04:00
Rob Speer
db5a4502b8 update data files
Former-commit-id: b952676679
2015-08-27 03:58:54 -04:00
Rob Speer
001180ca86 copyedit regex comments
Former-commit-id: d5fcf4407e
2015-08-26 17:04:56 -04:00
Rob Speer
dae953525e fix typo in docstring
Former-commit-id: 34375958ef
2015-08-26 16:24:35 -04:00
Rob Speer
49bd631632 fix URL expression
Former-commit-id: c4a2594217
2015-08-26 15:00:46 -04:00
Rob Speer
6286946cc3 correct the simple_tokenize docstring
Former-commit-id: f7babea352
2015-08-26 13:54:50 -04:00
Rob Speer
232aee9c66 refactor the token expression
Former-commit-id: 01b6403ef4
2015-08-26 13:40:47 -04:00
Rob Speer
40d6b85d67 un-flake wordfreq_builder.tokenizers, and edit docstrings
Former-commit-id: a893823d6e
2015-08-26 13:03:23 -04:00
Rob Speer
7a757d9ec9 remove regex files that are no longer needed
Former-commit-id: 94467a6563
2015-08-26 11:48:11 -04:00
Rob Speer
1f5c828642 bump to version 1.1
Former-commit-id: 694c28d5e4
2015-08-25 17:44:52 -04:00
Rob Speer
d064fbec7d update the README
Former-commit-id: 573dd1ec79
2015-08-25 17:44:34 -04:00
Rob Speer
244735ce4d updated data
Former-commit-id: 353b8045da
2015-08-25 17:16:03 -04:00
Rob Speer
a3b37f6619 Strip apostrophes from edges of tokens
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.


Former-commit-id: 5a1fc00aaa
2015-08-25 12:41:48 -04:00
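
Note on a3b37f6619: the idempotence problem described in that commit message can be reproduced with a small sketch. This is not wordfreq's actual tokenizer. The pattern TOY_TOKEN_RE and the helper names below are hypothetical, and the sketch uses the standard-library re module rather than the regex package the project switched to. It only assumes a tokenizer that keeps a trailing apostrophe when a letter follows it, which is enough to show "d'un" -> "d'", "un" and then "d'" -> "d" on a second pass.

```python
import re

# Hypothetical toy pattern (an assumption, not wordfreq's real expression):
# keep a trailing apostrophe only when another word character follows it.
TOY_TOKEN_RE = re.compile(r"\w+'(?=\w)|\w+")


def naive_tokenize(text):
    """Tokenize without touching apostrophes at token edges."""
    return TOY_TOKEN_RE.findall(text)


def stripping_tokenize(text):
    """Tokenize, then strip apostrophes from the edges of each token."""
    tokens = (tok.strip("'") for tok in naive_tokenize(text))
    return [tok for tok in tokens if tok]


def retokenize(tokens, tokenizer):
    """Run the tokenizer again over each token it already produced."""
    return [piece for tok in tokens for piece in tokenizer(tok)]


naive = naive_tokenize("d'un")
stripped = stripping_tokenize("d'un")

# A second pass changes the naive output but leaves the stripped output alone.
print(naive, "->", retokenize(naive, naive_tokenize))
print(stripped, "->", retokenize(stripped, stripping_tokenize))
```

In this sketch, stripping the edges means a second tokenization pass returns the same tokens as the first, which is the property the commit message is after.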
Rob Speer
1042f87efe exclude 'extenders' from the start of the token
Former-commit-id: a8e7c29068
2015-08-25 12:33:12 -04:00
Rob Speer
a5b8c5a745 update frequency lists
Former-commit-id: 0d600bdf27
2015-08-25 11:43:59 -04:00
Rob Speer
99a312ce06 Exclude math and modifier symbols as tokens
Former-commit-id: 8f3c9f576c
2015-08-25 11:43:22 -04:00
Rob Speer
6647cf9035 use better regexes in wordfreq_builder tokenizer
Former-commit-id: de73888a76
2015-08-24 19:05:46 -04:00
Rob Speer
decd7dae60 also NFKC-normalize Japanese input
Former-commit-id: 554455699d
2015-08-24 18:13:03 -04:00
Rob Speer
9178c6de37 only NFKC-normalize in Arabic
Former-commit-id: 1d055edc1c
2015-08-24 17:55:17 -04:00
Rob Speer
6a33b46cfd remove Hangul fillers that confuse cld2
Former-commit-id: 140ca6c050
2015-08-24 17:11:18 -04:00
Rob Speer
759a8199fb remove obsolete gen_regex.py
Former-commit-id: 102bc715ae
2015-08-24 17:11:18 -04:00
Rob Speer
f4cf46ab9c Use the regex implementation of Unicode segmentation
Former-commit-id: 95998205ad
2015-08-24 17:11:08 -04:00
Rob Speer
0721707d92 Merge pull request #21 from LuminosoInsight/review-notes
Review notes

Former-commit-id: 2b8089e2b1
2015-08-03 14:48:15 -04:00
Andrew Lin
10bddfe09f Document the NFKC-normalized ligature in the Arabic test.
Former-commit-id: 41e1dd41d8
2015-08-03 11:09:44 -04:00
Andrew Lin
581dcbcae5 Stylistic cleanups to word_counts.py.
Former-commit-id: 6d40912ef9
2015-07-31 19:26:18 -04:00
Andrew Lin
a5553676e4 Switch to more explanatory Unicode escapes when testing NFKC normalization.
Former-commit-id: 66c69e6fac
2015-07-31 19:23:42 -04:00
Andrew Lin
f393086253 Remove redundant reference to wikipedia in builder README.
Former-commit-id: 53621c34df
2015-07-31 19:12:59 -04:00
Andrew Lin
be7bc11cad Merge pull request #20 from LuminosoInsight/cutoff-fix
put back the freqs_to_cBpack cutoff; prepare for 1.0

Former-commit-id: 742e2b3374
2015-07-29 11:43:41 -04:00
Rob Speer
0f0aca8320 Don't use the file-reading cutoff when writing centibels
Former-commit-id: e9f9c94e36
2015-07-28 18:45:26 -04:00
Rob Speer
3892c36b16 update wordlists with cutoff fix
Former-commit-id: eb4b3cad50
2015-07-28 18:03:12 -04:00
Rob Speer
4350bc3ed7 put back the freqs_to_cBpack cutoff; prepare for 1.0
Former-commit-id: c5708b24e4
2015-07-28 18:01:12 -04:00
Rob Speer
b537f4ecfb Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17
Code review fixes 2015 07 17

Former-commit-id: 32102ba3c2
2015-07-22 15:09:00 -04:00
Joshua Chin
8004ecb790 updated read_freqs docs
Former-commit-id: 93cd902899
2015-07-22 10:06:16 -04:00
Joshua Chin
0d8bf35fab fixed style
Former-commit-id: 4fe9d110e1
2015-07-22 10:05:11 -04:00
Joshua Chin
78324e74eb reordered command line args
Former-commit-id: 6453d864c4
2015-07-22 10:04:14 -04:00
Joshua Chin
dcde7c8a28 added updated wordfreq data
Former-commit-id: be29243cec
2015-07-21 10:32:53 -04:00
Joshua Chin
6f47f76458 bugfix
Former-commit-id: 8081145922
2015-07-21 10:12:56 -04:00
Joshua Chin
0a2f2877af fixed rules.ninja
Former-commit-id: c5f82ecac1
2015-07-20 17:20:29 -04:00
Joshua Chin
c1f56f5c96 fixed build bug
Former-commit-id: 643571c69c
2015-07-20 16:51:25 -04:00
Joshua Chin
423b2d8443 ensure removal of tatweels (hopefully)
Former-commit-id: 173278fdd3
2015-07-20 16:48:36 -04:00
Joshua Chin
efe7bc3720 unhoisted if statement
Former-commit-id: 298d3c1d24
2015-07-20 11:10:41 -04:00
Joshua Chin
b5a358012b ninja.py is now pep8 compliant
Former-commit-id: accb7e398c
2015-07-20 11:06:58 -04:00
Joshua Chin
40ba602c10 made single line docstring single line
Former-commit-id: c70ddf00ea
2015-07-20 10:29:02 -04:00
Joshua Chin
b787d9104e updated word_frequency docstring for Chinese
Former-commit-id: 01b286e801
2015-07-20 10:28:11 -04:00
Joshua Chin
8be9d15a80 updated datafiles
Former-commit-id: 465afb854c
2015-07-20 10:05:27 -04:00