Rob Speer
77c60c29b0
Use SUBTLEX for German, but OpenSubtitles for Greek
...
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.
2015-09-04 15:52:21 -04:00
Rob Speer
a47497c908
update data files (without the CLD2 fix yet)
2015-09-04 14:58:20 -04:00
Rob Speer
0d3ee869c1
Exclude angle brackets from CLD2 detection
2015-09-04 14:56:06 -04:00
Rob Speer
81bbe663fb
update README with additional SUBTLEX support
2015-09-04 13:23:33 -04:00
Rob Speer
34474939f2
add more SUBTLEX and fix its build rules
2015-09-04 12:37:35 -04:00
Rob Speer
c11e3b7a9d
update the data
2015-09-04 02:07:50 -04:00
Rob Speer
531db64288
Note on next languages to support
2015-09-04 01:50:15 -04:00
Rob Speer
d9a1c34d00
expand list of sources and supported languages
2015-09-04 01:03:36 -04:00
Rob Speer
d94428d454
support Turkish and more Greek; document more
2015-09-04 00:57:04 -04:00
Rob Speer
45d871a815
Merge branch 'add-subtlex' into greek-and-turkish
2015-09-03 23:26:14 -04:00
Rob Speer
40d82541ba
refer to merge_freqs command correctly
2015-09-03 23:25:46 -04:00
Rob Speer
a3daba81eb
expand Greek and enable Turkish in config
2015-09-03 23:23:31 -04:00
Rob Speer
e6a2886a66
add SUBTLEX to the readme
2015-09-03 18:56:56 -04:00
Rob Speer
2d58ba94f2
Add SUBTLEX as a source of English and Chinese data
...
Meanwhile, fix up the dependency graph thingy. It's actually kind of
legible now.
2015-09-03 18:13:13 -04:00
Rob Speer
07228fdf1d
Merge pull request #24 from LuminosoInsight/manifest
...
Remove the no-longer-existent .txt files from the MANIFEST.
2015-09-02 14:34:17 -04:00
Andrew Lin
db41bc7902
Remove the no-longer-existent .txt files from the MANIFEST.
2015-09-02 14:27:15 -04:00
Andrew Lin
e43b5ebf7b
Merge pull request #23 from LuminosoInsight/readme
...
Put documentation and examples in the README
2015-08-28 17:59:17 -04:00
Rob Speer
00a2812907
fix heading
2015-08-28 17:49:38 -04:00
Rob Speer
93f44683c5
fix list formatting
2015-08-28 17:49:07 -04:00
Rob Speer
5def3a7897
update the build diagram and its script
2015-08-28 17:47:04 -04:00
Rob Speer
2370287539
improve README with function documentation and examples
2015-08-28 17:45:50 -04:00
Andrew Lin
e6d9b36203
Merge pull request #22 from LuminosoInsight/standard-tokenizer
...
Use a more standard Unicode tokenizer
2015-08-27 11:56:19 -04:00
Rob Speer
b952676679
update data files
2015-08-27 03:58:54 -04:00
Rob Speer
d5fcf4407e
copyedit regex comments
2015-08-26 17:04:56 -04:00
Rob Speer
34375958ef
fix typo in docstring
2015-08-26 16:24:35 -04:00
Rob Speer
c4a2594217
fix URL expression
2015-08-26 15:00:46 -04:00
Rob Speer
f7babea352
correct the simple_tokenize docstring
2015-08-26 13:54:50 -04:00
Rob Speer
01b6403ef4
refactor the token expression
2015-08-26 13:40:47 -04:00
Rob Speer
a893823d6e
un-flake wordfreq_builder.tokenizers, and edit docstrings
2015-08-26 13:03:23 -04:00
Rob Speer
94467a6563
remove regex files that are no longer needed
2015-08-26 11:48:11 -04:00
Rob Speer
694c28d5e4
bump to version 1.1
2015-08-25 17:44:52 -04:00
Rob Speer
573dd1ec79
update the README
2015-08-25 17:44:34 -04:00
Rob Speer
353b8045da
updated data
2015-08-25 17:16:03 -04:00
Rob Speer
5a1fc00aaa
Strip apostrophes from edges of tokens
...
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.
2015-08-25 12:41:48 -04:00
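A minimal sketch of the idempotence problem described in 5a1fc00aaa, assuming a simplified token pattern. This is not the actual wordfreq tokenizer: `TOKEN_RE`, `naive_tokenize`, `stripping_tokenize`, and `is_idempotent` are hypothetical names made up for illustration, and only the ASCII apostrophe is handled.

```python
import re

# Hypothetical pattern, NOT the real wordfreq expression: a run of letters
# keeps a trailing apostrophe only when another letter follows it.
TOKEN_RE = re.compile(r"[^\W\d_]+'(?=[^\W\d_])|[^\W\d_]+")


def naive_tokenize(text):
    """Split text into tokens, leaving elision apostrophes attached."""
    return TOKEN_RE.findall(text)


def stripping_tokenize(text):
    """Same split, but strip apostrophes from the edges of each token."""
    stripped = (token.strip("'") for token in TOKEN_RE.findall(text))
    return [token for token in stripped if token]


def is_idempotent(tokenize, text):
    """True if re-tokenizing each token yields that token unchanged."""
    return all(tokenize(token) == [token] for token in tokenize(text))


if __name__ == "__main__":
    print(naive_tokenize("d'un"))      # first pass keeps the apostrophe
    print(naive_tokenize("d'"))        # second pass drops it
    print(stripping_tokenize("d'un"))  # stable across passes
    print(is_idempotent(naive_tokenize, "d'un"))
    print(is_idempotent(stripping_tokenize, "d'un"))
```

In this sketch the unstripped tokenizer gives a different result when its own output is fed back in, while the stripping variant returns the same tokens on every pass.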
Rob Speer
a8e7c29068
exclude 'extenders' from the start of the token
2015-08-25 12:33:12 -04:00
Rob Speer
0d600bdf27
update frequency lists
2015-08-25 11:43:59 -04:00
Rob Speer
8f3c9f576c
Exclude math and modifier symbols as tokens
2015-08-25 11:43:22 -04:00
Rob Speer
de73888a76
use better regexes in wordfreq_builder tokenizer
2015-08-24 19:05:46 -04:00
Rob Speer
554455699d
also NFKC-normalize Japanese input
2015-08-24 18:13:03 -04:00
Rob Speer
1d055edc1c
only NFKC-normalize in Arabic
2015-08-24 17:55:17 -04:00
Rob Speer
140ca6c050
remove Hangul fillers that confuse cld2
2015-08-24 17:11:18 -04:00
Rob Speer
102bc715ae
remove obsolete gen_regex.py
2015-08-24 17:11:18 -04:00
Rob Speer
95998205ad
Use the regex implementation of Unicode segmentation
2015-08-24 17:11:08 -04:00
Rob Speer
2b8089e2b1
Merge pull request #21 from LuminosoInsight/review-notes
...
Review notes
2015-08-03 14:48:15 -04:00
Andrew Lin
41e1dd41d8
Document the NFKC-normalized ligature in the Arabic test.
2015-08-03 11:09:44 -04:00
Andrew Lin
6d40912ef9
Stylistic cleanups to word_counts.py.
2015-07-31 19:26:18 -04:00
Andrew Lin
66c69e6fac
Switch to more explanatory Unicode escapes when testing NFKC normalization.
2015-07-31 19:23:42 -04:00
Andrew Lin
53621c34df
Remove redundant reference to wikipedia in builder README.
2015-07-31 19:12:59 -04:00
Andrew Lin
742e2b3374
Merge pull request #20 from LuminosoInsight/cutoff-fix
...
put back the freqs_to_cBpack cutoff; prepare for 1.0
2015-07-29 11:43:41 -04:00
Rob Speer
e9f9c94e36
Don't use the file-reading cutoff when writing centibels
2015-07-28 18:45:26 -04:00