Commit Graph

341 Commits

Author SHA1 Message Date
Rob Speer
34474939f2 add more SUBTLEX and fix its build rules 2015-09-04 12:37:35 -04:00
Rob Speer
c11e3b7a9d update the data 2015-09-04 02:07:50 -04:00
Rob Speer
531db64288 Note on next languages to support 2015-09-04 01:50:15 -04:00
Rob Speer
d9a1c34d00 expand list of sources and supported languages 2015-09-04 01:03:36 -04:00
Rob Speer
d94428d454 support Turkish and more Greek; document more 2015-09-04 00:57:04 -04:00
Rob Speer
45d871a815 Merge branch 'add-subtlex' into greek-and-turkish 2015-09-03 23:26:14 -04:00
Rob Speer
40d82541ba refer to merge_freqs command correctly 2015-09-03 23:25:46 -04:00
Rob Speer
a3daba81eb expand Greek and enable Turkish in config 2015-09-03 23:23:31 -04:00
Rob Speer
e6a2886a66 add SUBTLEX to the readme 2015-09-03 18:56:56 -04:00
Rob Speer
2d58ba94f2 Add SUBTLEX as a source of English and Chinese data
Meanwhile, fix up the dependency graph thingy. It's actually kind of
legible now.
2015-09-03 18:13:13 -04:00
Andrew Lin
e43b5ebf7b Merge pull request #23 from LuminosoInsight/readme
Put documentation and examples in the README
2015-08-28 17:59:17 -04:00
Rob Speer
00a2812907 fix heading 2015-08-28 17:49:38 -04:00
Rob Speer
93f44683c5 fix list formatting 2015-08-28 17:49:07 -04:00
Rob Speer
5def3a7897 update the build diagram and its script 2015-08-28 17:47:04 -04:00
Rob Speer
2370287539 improve README with function documentation and examples 2015-08-28 17:45:50 -04:00
Andrew Lin
e6d9b36203 Merge pull request #22 from LuminosoInsight/standard-tokenizer
Use a more standard Unicode tokenizer
2015-08-27 11:56:19 -04:00
Rob Speer
b952676679 update data files 2015-08-27 03:58:54 -04:00
Rob Speer
d5fcf4407e copyedit regex comments 2015-08-26 17:04:56 -04:00
Rob Speer
34375958ef fix typo in docstring 2015-08-26 16:24:35 -04:00
Rob Speer
c4a2594217 fix URL expression 2015-08-26 15:00:46 -04:00
Rob Speer
f7babea352 correct the simple_tokenize docstring 2015-08-26 13:54:50 -04:00
Rob Speer
01b6403ef4 refactor the token expression 2015-08-26 13:40:47 -04:00
Rob Speer
a893823d6e un-flake wordfreq_builder.tokenizers, and edit docstrings 2015-08-26 13:03:23 -04:00
Rob Speer
94467a6563 remove regex files that are no longer needed 2015-08-26 11:48:11 -04:00
Rob Speer
694c28d5e4 bump to version 1.1 2015-08-25 17:44:52 -04:00
Rob Speer
573dd1ec79 update the README 2015-08-25 17:44:34 -04:00
Rob Speer
353b8045da updated data 2015-08-25 17:16:03 -04:00
Rob Speer
5a1fc00aaa Strip apostrophes from edges of tokens
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.
2015-08-25 12:41:48 -04:00
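Note on 5a1fc00aaa: a minimal sketch of the idempotence problem that commit message describes. The pattern `TOKEN_RE` and both function names are hypothetical, chosen only to reproduce the described behavior with the standard `re` module; they are not the regexes or functions in wordfreq itself.

```python
import re

# Hypothetical token pattern (an assumption, not wordfreq's actual expression):
# keep a trailing apostrophe only when a letter follows it, otherwise match
# a plain run of word characters.
TOKEN_RE = re.compile(r"\w+'(?=\w)|\w+")


def tokenize_raw(text):
    """Tokenize without any cleanup; edge apostrophes are kept."""
    return TOKEN_RE.findall(text)


def tokenize_stripped(text):
    """Tokenize, then strip apostrophes from the edges of each token."""
    return [token.strip("'") for token in TOKEN_RE.findall(text)]


def retokenize(tokens, tokenizer):
    """Run the tokenizer again over each token it previously produced."""
    return [piece for token in tokens for piece in tokenizer(token)]


raw = tokenize_raw("d'un")
stripped = tokenize_stripped("d'un")

# Without stripping, a second pass changes the tokens (not idempotent).
raw_is_stable = retokenize(raw, tokenize_raw) == raw

# With stripping, a second pass leaves the tokens unchanged.
stripped_is_stable = retokenize(stripped, tokenize_stripped) == stripped
```

Here `raw` keeps the apostrophe on "d'", which a second pass reduces to "d", while `stripped` already holds the form that a second pass would produce.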
Rob Speer
a8e7c29068 exclude 'extenders' from the start of the token 2015-08-25 12:33:12 -04:00
Rob Speer
0d600bdf27 update frequency lists 2015-08-25 11:43:59 -04:00
Rob Speer
8f3c9f576c Exclude math and modifier symbols as tokens 2015-08-25 11:43:22 -04:00
Rob Speer
de73888a76 use better regexes in wordfreq_builder tokenizer 2015-08-24 19:05:46 -04:00
Rob Speer
554455699d also NFKC-normalize Japanese input 2015-08-24 18:13:03 -04:00
Rob Speer
1d055edc1c only NFKC-normalize in Arabic 2015-08-24 17:55:17 -04:00
Rob Speer
140ca6c050 remove Hangul fillers that confuse cld2 2015-08-24 17:11:18 -04:00
Rob Speer
102bc715ae remove obsolete gen_regex.py 2015-08-24 17:11:18 -04:00
Rob Speer
95998205ad Use the regex implementation of Unicode segmentation 2015-08-24 17:11:08 -04:00
Rob Speer
2b8089e2b1 Merge pull request #21 from LuminosoInsight/review-notes
Review notes
2015-08-03 14:48:15 -04:00
Andrew Lin
41e1dd41d8 Document the NFKC-normalized ligature in the Arabic test. 2015-08-03 11:09:44 -04:00
Andrew Lin
6d40912ef9 Stylistic cleanups to word_counts.py. 2015-07-31 19:26:18 -04:00
Andrew Lin
66c69e6fac Switch to more explanatory Unicode escapes when testing NFKC normalization. 2015-07-31 19:23:42 -04:00
Andrew Lin
53621c34df Remove redundant reference to wikipedia in builder README. 2015-07-31 19:12:59 -04:00
Andrew Lin
742e2b3374 Merge pull request #20 from LuminosoInsight/cutoff-fix
put back the freqs_to_cBpack cutoff; prepare for 1.0
2015-07-29 11:43:41 -04:00
Rob Speer
e9f9c94e36 Don't use the file-reading cutoff when writing centibels 2015-07-28 18:45:26 -04:00
Rob Speer
eb4b3cad50 update wordlists with cutoff fix 2015-07-28 18:03:12 -04:00
Rob Speer
c5708b24e4 put back the freqs_to_cBpack cutoff; prepare for 1.0 2015-07-28 18:01:12 -04:00
Rob Speer
32102ba3c2 Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17
Code review fixes 2015 07 17
2015-07-22 15:09:00 -04:00
Joshua Chin
93cd902899 updated read_freqs docs 2015-07-22 10:06:16 -04:00
Joshua Chin
4fe9d110e1 fixed style 2015-07-22 10:05:11 -04:00
Joshua Chin
6453d864c4 reordered command line args 2015-07-22 10:04:14 -04:00