327 Commits

Author SHA1 Message Date
Rob Speer
2370287539 improve README with function documentation and examples 2015-08-28 17:45:50 -04:00
Andrew Lin
e6d9b36203 Merge pull request #22 from LuminosoInsight/standard-tokenizer
Use a more standard Unicode tokenizer
2015-08-27 11:56:19 -04:00
Rob Speer
b952676679 update data files 2015-08-27 03:58:54 -04:00
Rob Speer
d5fcf4407e copyedit regex comments 2015-08-26 17:04:56 -04:00
Rob Speer
34375958ef fix typo in docstring 2015-08-26 16:24:35 -04:00
Rob Speer
c4a2594217 fix URL expression 2015-08-26 15:00:46 -04:00
Rob Speer
f7babea352 correct the simple_tokenize docstring 2015-08-26 13:54:50 -04:00
Rob Speer
01b6403ef4 refactor the token expression 2015-08-26 13:40:47 -04:00
Rob Speer
a893823d6e un-flake wordfreq_builder.tokenizers, and edit docstrings 2015-08-26 13:03:23 -04:00
Rob Speer
94467a6563 remove regex files that are no longer needed 2015-08-26 11:48:11 -04:00
Rob Speer
694c28d5e4 bump to version 1.1 2015-08-25 17:44:52 -04:00
Rob Speer
573dd1ec79 update the README 2015-08-25 17:44:34 -04:00
Rob Speer
353b8045da updated data 2015-08-25 17:16:03 -04:00
Rob Speer
5a1fc00aaa Strip apostrophes from edges of tokens
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.
2015-08-25 12:41:48 -04:00
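The idempotence problem described in commit 5a1fc00aaa can be illustrated with a minimal sketch. This is not the wordfreq tokenizer: the regular expression and the function names `naive_tokenize` and `stripped_tokenize` are hypothetical, chosen only to reproduce the behavior the commit message describes, and only the ASCII apostrophe is handled.

```python
import re

# Hypothetical toy pattern, NOT the wordfreq token expression.
# It keeps a trailing apostrophe only when a word character follows,
# which mimics the behavior described in the commit message.
TOKEN_RE = re.compile(r"\w+'(?=\w)|\w+")


def naive_tokenize(text):
    """Tokenize without stripping apostrophes (not idempotent)."""
    return TOKEN_RE.findall(text)


def stripped_tokenize(text):
    """Tokenize, then strip apostrophes from the edges of each token."""
    return [token.strip("'") for token in TOKEN_RE.findall(text)]


def retokenize(tokenizer, tokens):
    """Run the tokenizer again over each token it already produced."""
    return [piece for token in tokens for piece in tokenizer(token)]


first_pass = naive_tokenize("d'un")
# Re-tokenizing "d'" drops the apostrophe, so the second pass differs.
print(first_pass, retokenize(naive_tokenize, first_pass))

first_pass = stripped_tokenize("d'un")
# With edge apostrophes stripped, the second pass matches the first.
print(first_pass, retokenize(stripped_tokenize, first_pass))
```

In this sketch, the naive tokenizer produces a token that it would not produce again when given that token as input, while the stripped version returns the same tokens on a second pass.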
Rob Speer
a8e7c29068 exclude 'extenders' from the start of the token 2015-08-25 12:33:12 -04:00
Rob Speer
0d600bdf27 update frequency lists 2015-08-25 11:43:59 -04:00
Rob Speer
8f3c9f576c Exclude math and modifier symbols as tokens 2015-08-25 11:43:22 -04:00
Rob Speer
de73888a76 use better regexes in wordfreq_builder tokenizer 2015-08-24 19:05:46 -04:00
Rob Speer
554455699d also NFKC-normalize Japanese input 2015-08-24 18:13:03 -04:00
Rob Speer
1d055edc1c only NFKC-normalize in Arabic 2015-08-24 17:55:17 -04:00
Rob Speer
140ca6c050 remove Hangul fillers that confuse cld2 2015-08-24 17:11:18 -04:00
Rob Speer
102bc715ae remove obsolete gen_regex.py 2015-08-24 17:11:18 -04:00
Rob Speer
95998205ad Use the regex implementation of Unicode segmentation 2015-08-24 17:11:08 -04:00
Rob Speer
2b8089e2b1 Merge pull request #21 from LuminosoInsight/review-notes
Review notes
2015-08-03 14:48:15 -04:00
Andrew Lin
41e1dd41d8 Document the NFKC-normalized ligature in the Arabic test. 2015-08-03 11:09:44 -04:00
Andrew Lin
6d40912ef9 Stylistic cleanups to word_counts.py. 2015-07-31 19:26:18 -04:00
Andrew Lin
66c69e6fac Switch to more explanatory Unicode escapes when testing NFKC normalization. 2015-07-31 19:23:42 -04:00
Andrew Lin
53621c34df Remove redundant reference to wikipedia in builder README. 2015-07-31 19:12:59 -04:00
Andrew Lin
742e2b3374 Merge pull request #20 from LuminosoInsight/cutoff-fix
put back the freqs_to_cBpack cutoff; prepare for 1.0
2015-07-29 11:43:41 -04:00
Rob Speer
e9f9c94e36 Don't use the file-reading cutoff when writing centibels 2015-07-28 18:45:26 -04:00
Rob Speer
eb4b3cad50 update wordlists with cutoff fix 2015-07-28 18:03:12 -04:00
Rob Speer
c5708b24e4 put back the freqs_to_cBpack cutoff; prepare for 1.0 2015-07-28 18:01:12 -04:00
Rob Speer
32102ba3c2 Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17
Code review fixes 2015 07 17
2015-07-22 15:09:00 -04:00
Joshua Chin
93cd902899 updated read_freqs docs 2015-07-22 10:06:16 -04:00
Joshua Chin
4fe9d110e1 fixed style 2015-07-22 10:05:11 -04:00
Joshua Chin
6453d864c4 reordered command line args 2015-07-22 10:04:14 -04:00
Joshua Chin
be29243cec added updated wordfreq data 2015-07-21 10:32:53 -04:00
Joshua Chin
8081145922 bugfix 2015-07-21 10:12:56 -04:00
Joshua Chin
c5f82ecac1 fixed rules.ninja 2015-07-20 17:20:29 -04:00
Joshua Chin
643571c69c fixed build bug 2015-07-20 16:51:25 -04:00
Joshua Chin
173278fdd3 ensure removal of tatweels (hopefully) 2015-07-20 16:48:36 -04:00
Joshua Chin
298d3c1d24 unhoisted if statement 2015-07-20 11:10:41 -04:00
Joshua Chin
accb7e398c ninja.py is now pep8 compliant 2015-07-20 11:06:58 -04:00
Joshua Chin
c70ddf00ea made single line docstring single line 2015-07-20 10:29:02 -04:00
Joshua Chin
01b286e801 updated word_frequency docstring for Chinese 2015-07-20 10:28:11 -04:00
Joshua Chin
465afb854c updated datafiles 2015-07-20 10:05:27 -04:00
Joshua Chin
221acf7921 fixed build 2015-07-17 17:44:01 -04:00
Rob Speer
2d1020daac mention the Wikipedia data, and credit Hermit Dave 2015-07-17 17:09:36 -04:00
Joshua Chin
f31f9a1bcd fixed tokenize_twitter 2015-07-17 16:37:47 -04:00
Joshua Chin
a44927e98e added cld2 tokenizer comments 2015-07-17 16:03:33 -04:00