447 Commits

Author SHA1 Message Date
Rob Speer
77c60c29b0 Use SUBTLEX for German, but OpenSubtitles for Greek
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.
2015-09-04 15:52:21 -04:00
Rob Speer
a47497c908 update data files (without the CLD2 fix yet) 2015-09-04 14:58:20 -04:00
Rob Speer
0d3ee869c1 Exclude angle brackets from CLD2 detection 2015-09-04 14:56:06 -04:00
Rob Speer
81bbe663fb update README with additional SUBTLEX support 2015-09-04 13:23:33 -04:00
Rob Speer
34474939f2 add more SUBTLEX and fix its build rules 2015-09-04 12:37:35 -04:00
Rob Speer
c11e3b7a9d update the data 2015-09-04 02:07:50 -04:00
Rob Speer
531db64288 Note on next languages to support 2015-09-04 01:50:15 -04:00
Rob Speer
d9a1c34d00 expand list of sources and supported languages 2015-09-04 01:03:36 -04:00
Rob Speer
d94428d454 support Turkish and more Greek; document more 2015-09-04 00:57:04 -04:00
Rob Speer
45d871a815 Merge branch 'add-subtlex' into greek-and-turkish 2015-09-03 23:26:14 -04:00
Rob Speer
40d82541ba refer to merge_freqs command correctly 2015-09-03 23:25:46 -04:00
Rob Speer
a3daba81eb expand Greek and enable Turkish in config 2015-09-03 23:23:31 -04:00
Rob Speer
e6a2886a66 add SUBTLEX to the readme 2015-09-03 18:56:56 -04:00
Rob Speer
2d58ba94f2 Add SUBTLEX as a source of English and Chinese data
Meanwhile, fix up the dependency graph thingy. It's actually kind of
legible now.
2015-09-03 18:13:13 -04:00
Rob Speer
07228fdf1d Merge pull request #24 from LuminosoInsight/manifest
Remove the no-longer-existent .txt files from the MANIFEST.
2015-09-02 14:34:17 -04:00
Andrew Lin
db41bc7902 Remove the no-longer-existent .txt files from the MANIFEST. 2015-09-02 14:27:15 -04:00
Andrew Lin
e43b5ebf7b Merge pull request #23 from LuminosoInsight/readme
Put documentation and examples in the README
2015-08-28 17:59:17 -04:00
Rob Speer
00a2812907 fix heading 2015-08-28 17:49:38 -04:00
Rob Speer
93f44683c5 fix list formatting 2015-08-28 17:49:07 -04:00
Rob Speer
5def3a7897 update the build diagram and its script 2015-08-28 17:47:04 -04:00
Rob Speer
2370287539 improve README with function documentation and examples 2015-08-28 17:45:50 -04:00
Andrew Lin
e6d9b36203 Merge pull request #22 from LuminosoInsight/standard-tokenizer
Use a more standard Unicode tokenizer
2015-08-27 11:56:19 -04:00
Rob Speer
b952676679 update data files 2015-08-27 03:58:54 -04:00
Rob Speer
d5fcf4407e copyedit regex comments 2015-08-26 17:04:56 -04:00
Rob Speer
34375958ef fix typo in docstring 2015-08-26 16:24:35 -04:00
Rob Speer
c4a2594217 fix URL expression 2015-08-26 15:00:46 -04:00
Rob Speer
f7babea352 correct the simple_tokenize docstring 2015-08-26 13:54:50 -04:00
Rob Speer
01b6403ef4 refactor the token expression 2015-08-26 13:40:47 -04:00
Rob Speer
a893823d6e un-flake wordfreq_builder.tokenizers, and edit docstrings 2015-08-26 13:03:23 -04:00
Rob Speer
94467a6563 remove regex files that are no longer needed 2015-08-26 11:48:11 -04:00
Rob Speer
694c28d5e4 bump to version 1.1 2015-08-25 17:44:52 -04:00
Rob Speer
573dd1ec79 update the README 2015-08-25 17:44:34 -04:00
Rob Speer
353b8045da updated data 2015-08-25 17:16:03 -04:00
Rob Speer
5a1fc00aaa Strip apostrophes from edges of tokens
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.
2015-08-25 12:41:48 -04:00
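Note on the commit above: the idempotence problem it describes can be reproduced with a toy tokenizer. The sketch below is illustrative only. The pattern `TOKEN_RE` and the helpers `naive_tokenize`, `stripping_tokenize`, and `is_idempotent` are hypothetical names invented for this example and are not wordfreq's actual token expression or API. It assumes a rule in which an apostrophe stays attached to the preceding letters only when more letters follow, which is enough to show why "d'" changes when tokenized a second time and why stripping edge apostrophes avoids that.

```python
import re

# Toy pattern (an assumption, not wordfreq's real expression): a run of word
# characters keeps a trailing apostrophe only when more word characters follow.
TOKEN_RE = re.compile(r"\w+'(?=\w)|\w+")


def naive_tokenize(text):
    """Split text into tokens, leaving apostrophes on token edges."""
    return TOKEN_RE.findall(text)


def stripping_tokenize(text):
    """Split text into tokens, then strip apostrophes from token edges."""
    stripped = (token.strip("'") for token in TOKEN_RE.findall(text))
    return [token for token in stripped if token]


def is_idempotent(tokenize, text):
    """True if re-tokenizing each token of `text` yields that same token."""
    return all(tokenize(token) == [token] for token in tokenize(text))


print(naive_tokenize("d'un"))                     # "d'" keeps its apostrophe
print(naive_tokenize("d'"))                       # the apostrophe is lost here
print(stripping_tokenize("d'un"))                 # edges are already clean
print(is_idempotent(naive_tokenize, "d'un"))
print(is_idempotent(stripping_tokenize, "d'un"))
```

With this toy rule, the naive tokenizer fails the idempotence check on "d'un" while the stripping version passes it, which matches the behavior the commit message describes.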
Rob Speer
a8e7c29068 exclude 'extenders' from the start of the token 2015-08-25 12:33:12 -04:00
Rob Speer
0d600bdf27 update frequency lists 2015-08-25 11:43:59 -04:00
Rob Speer
8f3c9f576c Exclude math and modifier symbols as tokens 2015-08-25 11:43:22 -04:00
Rob Speer
de73888a76 use better regexes in wordfreq_builder tokenizer 2015-08-24 19:05:46 -04:00
Rob Speer
554455699d also NFKC-normalize Japanese input 2015-08-24 18:13:03 -04:00
Rob Speer
1d055edc1c only NFKC-normalize in Arabic 2015-08-24 17:55:17 -04:00
Rob Speer
140ca6c050 remove Hangul fillers that confuse cld2 2015-08-24 17:11:18 -04:00
Rob Speer
102bc715ae remove obsolete gen_regex.py 2015-08-24 17:11:18 -04:00
Rob Speer
95998205ad Use the regex implementation of Unicode segmentation 2015-08-24 17:11:08 -04:00
Rob Speer
2b8089e2b1 Merge pull request #21 from LuminosoInsight/review-notes
Review notes
2015-08-03 14:48:15 -04:00
Andrew Lin
41e1dd41d8 Document the NFKC-normalized ligature in the Arabic test. 2015-08-03 11:09:44 -04:00
Andrew Lin
6d40912ef9 Stylistic cleanups to word_counts.py. 2015-07-31 19:26:18 -04:00
Andrew Lin
66c69e6fac Switch to more explanatory Unicode escapes when testing NFKC normalization. 2015-07-31 19:23:42 -04:00
Andrew Lin
53621c34df Remove redundant reference to wikipedia in builder README. 2015-07-31 19:12:59 -04:00
Andrew Lin
742e2b3374 Merge pull request #20 from LuminosoInsight/cutoff-fix
put back the freqs_to_cBpack cutoff; prepare for 1.0
2015-07-29 11:43:41 -04:00
Rob Speer
e9f9c94e36 Don't use the file-reading cutoff when writing centibels 2015-07-28 18:45:26 -04:00