Rob Speer
62f5a8eb1e
add Polish and Swedish to README
...
Former-commit-id: 3c3371a9ff
2015-09-04 17:10:40 -04:00
Rob Speer
a555e5dc13
add Polish and Swedish, which have sufficient data
...
Former-commit-id: 447d7e5134
2015-09-04 17:10:40 -04:00
Rob Speer
1d4a18ead2
update data files
...
Former-commit-id: 25edaad962
2015-09-04 17:00:55 -04:00
Rob Speer
63295fc397
add tests for Turkish
...
Former-commit-id: fc93c8dc9c
2015-09-04 17:00:05 -04:00
Rob Speer
0441a81bbe
We can put the cutoff back now
...
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.
Former-commit-id: 5c7a7ea83e
2015-09-04 16:16:52 -04:00
Rob Speer
917ce398a2
remove subtlex-gr from README
...
Former-commit-id: 56318a3ca3
2015-09-04 16:11:46 -04:00
Rob Speer
138e8aaa3f
add more citations
...
Former-commit-id: 8196643509
2015-09-04 15:57:40 -04:00
Rob Speer
c08e593234
Use SUBTLEX for German, but OpenSubtitles for Greek
...
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.
Former-commit-id: 77c60c29b0
2015-09-04 15:52:21 -04:00
Rob Speer
a8161b1067
update data files (without the CLD2 fix yet)
...
Former-commit-id: a47497c908
2015-09-04 14:58:20 -04:00
Rob Speer
3a8b2c2c81
Exclude angle brackets from CLD2 detection
...
Former-commit-id: 0d3ee869c1
2015-09-04 14:56:06 -04:00
Rob Speer
a0997a79a4
update README with additional SUBTLEX support
...
Former-commit-id: 81bbe663fb
2015-09-04 13:23:33 -04:00
Rob Speer
b1d158ab41
add more SUBTLEX and fix its build rules
...
Former-commit-id: 34474939f2
2015-09-04 12:37:35 -04:00
Rob Speer
f993ffcdf2
update the data
...
Former-commit-id: c11e3b7a9d
2015-09-04 02:07:50 -04:00
Rob Speer
25e24f9c32
Note on next languages to support
...
Former-commit-id: 531db64288
2015-09-04 01:50:15 -04:00
Rob Speer
bf88f97744
expand list of sources and supported languages
...
Former-commit-id: d9a1c34d00
2015-09-04 01:03:36 -04:00
Rob Speer
a6ef3224a6
support Turkish and more Greek; document more
...
Former-commit-id: d94428d454
2015-09-04 00:57:04 -04:00
Rob Speer
89763679de
Merge branch 'add-subtlex' into greek-and-turkish
...
Former-commit-id: 45d871a815
2015-09-03 23:26:14 -04:00
Rob Speer
ad4b12bee9
refer to merge_freqs command correctly
...
Former-commit-id: 40d82541ba
2015-09-03 23:25:46 -04:00
Rob Speer
7a2f2035ab
expand Greek and enable Turkish in config
...
Former-commit-id: a3daba81eb
2015-09-03 23:23:31 -04:00
Rob Speer
a92c398258
add SUBTLEX to the readme
...
Former-commit-id: e6a2886a66
2015-09-03 18:56:56 -04:00
Rob Speer
cb5b696ffa
Add SUBTLEX as a source of English and Chinese data
...
Meanwhile, fix up the dependency graph thingy. It's actually kind of
legible now.
Former-commit-id: 2d58ba94f2
2015-09-03 18:13:13 -04:00
Andrew Lin
b693715663
Merge pull request #23 from LuminosoInsight/readme
...
Put documentation and examples in the README
Former-commit-id: e43b5ebf7b
2015-08-28 17:59:17 -04:00
Rob Speer
d883eaeca5
fix heading
...
Former-commit-id: 00a2812907
2015-08-28 17:49:38 -04:00
Rob Speer
390a431181
fix list formatting
...
Former-commit-id: 93f44683c5
2015-08-28 17:49:07 -04:00
Rob Speer
4aac7bdd65
update the build diagram and its script
...
Former-commit-id: 5def3a7897
2015-08-28 17:47:04 -04:00
Rob Speer
43fd15c938
improve README with function documentation and examples
...
Former-commit-id: 2370287539
2015-08-28 17:45:50 -04:00
Andrew Lin
5a47427f6e
Merge pull request #22 from LuminosoInsight/standard-tokenizer
...
Use a more standard Unicode tokenizer
Former-commit-id: e6d9b36203
2015-08-27 11:56:19 -04:00
Rob Speer
db5a4502b8
update data files
...
Former-commit-id: b952676679
2015-08-27 03:58:54 -04:00
Rob Speer
001180ca86
copyedit regex comments
...
Former-commit-id: d5fcf4407e
2015-08-26 17:04:56 -04:00
Rob Speer
dae953525e
fix typo in docstring
...
Former-commit-id: 34375958ef
2015-08-26 16:24:35 -04:00
Rob Speer
49bd631632
fix URL expression
...
Former-commit-id: c4a2594217
2015-08-26 15:00:46 -04:00
Rob Speer
6286946cc3
correct the simple_tokenize docstring
...
Former-commit-id: f7babea352
2015-08-26 13:54:50 -04:00
Rob Speer
232aee9c66
refactor the token expression
...
Former-commit-id: 01b6403ef4
2015-08-26 13:40:47 -04:00
Rob Speer
40d6b85d67
un-flake wordfreq_builder.tokenizers, and edit docstrings
...
Former-commit-id: a893823d6e
2015-08-26 13:03:23 -04:00
Rob Speer
7a757d9ec9
remove regex files that are no longer needed
...
Former-commit-id: 94467a6563
2015-08-26 11:48:11 -04:00
Rob Speer
1f5c828642
bump to version 1.1
...
Former-commit-id: 694c28d5e4
2015-08-25 17:44:52 -04:00
Rob Speer
d064fbec7d
update the README
...
Former-commit-id: 573dd1ec79
2015-08-25 17:44:34 -04:00
Rob Speer
244735ce4d
updated data
...
Former-commit-id: 353b8045da
2015-08-25 17:16:03 -04:00
Rob Speer
a3b37f6619
Strip apostrophes from edges of tokens
...
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.
Former-commit-id: 5a1fc00aaa
2015-08-25 12:41:48 -04:00
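Note on a3b37f6619: the idempotence problem described in that commit message can be reproduced with a small sketch. The pattern and helper names below (`TOKEN_RE`, `toy_tokenize`, `toy_tokenize_stripped`, `retokenize`) are hypothetical stand-ins, not the project's actual token expression. They only assume a tokenizer that keeps an apostrophe on a token when more letters follow it.

```python
import re

# Hypothetical toy pattern, NOT wordfreq's real token expression:
# a run of word characters keeps a trailing apostrophe only when
# another word character follows it; otherwise the bare run matches.
TOKEN_RE = re.compile(r"\w+'(?=\w)|\w+")


def toy_tokenize(text):
    """Tokenize without stripping apostrophes (the pre-fix behaviour, as assumed here)."""
    return TOKEN_RE.findall(text.lower())


def toy_tokenize_stripped(text):
    """Tokenize, then strip apostrophes from the edges of each token."""
    return [tok.strip("'") for tok in toy_tokenize(text)]


def retokenize(tokens, tokenizer):
    """Run the tokenizer again over each token it already produced."""
    out = []
    for tok in tokens:
        out.extend(tokenizer(tok))
    return out


first_pass = toy_tokenize("d'un")
second_pass = retokenize(first_pass, toy_tokenize)
# The two passes disagree: "d'" loses its apostrophe when tokenized alone.
print(first_pass, second_pass)

stripped_first = toy_tokenize_stripped("d'un")
stripped_second = retokenize(stripped_first, toy_tokenize_stripped)
# With edge apostrophes stripped, a second pass changes nothing.
print(stripped_first, stripped_second)
```

In this sketch, stripping at the token edges is what makes the second pass a no-op. Whether the real tokenizer strips during matching or in a later step is not stated in the log.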
Rob Speer
1042f87efe
exclude 'extenders' from the start of the token
...
Former-commit-id: a8e7c29068
2015-08-25 12:33:12 -04:00
Rob Speer
a5b8c5a745
update frequency lists
...
Former-commit-id: 0d600bdf27
2015-08-25 11:43:59 -04:00
Rob Speer
99a312ce06
Exclude math and modifier symbols as tokens
...
Former-commit-id: 8f3c9f576c
2015-08-25 11:43:22 -04:00
Rob Speer
6647cf9035
use better regexes in wordfreq_builder tokenizer
...
Former-commit-id: de73888a76
2015-08-24 19:05:46 -04:00
Rob Speer
decd7dae60
also NFKC-normalize Japanese input
...
Former-commit-id: 554455699d
2015-08-24 18:13:03 -04:00
Rob Speer
9178c6de37
only NFKC-normalize in Arabic
...
Former-commit-id: 1d055edc1c
2015-08-24 17:55:17 -04:00
Rob Speer
6a33b46cfd
remove Hangul fillers that confuse cld2
...
Former-commit-id: 140ca6c050
2015-08-24 17:11:18 -04:00
Rob Speer
759a8199fb
remove obsolete gen_regex.py
...
Former-commit-id: 102bc715ae
2015-08-24 17:11:18 -04:00
Rob Speer
f4cf46ab9c
Use the regex implementation of Unicode segmentation
...
Former-commit-id: 95998205ad
2015-08-24 17:11:08 -04:00
Rob Speer
0721707d92
Merge pull request #21 from LuminosoInsight/review-notes
...
Review notes
Former-commit-id: 2b8089e2b1
2015-08-03 14:48:15 -04:00
Andrew Lin
10bddfe09f
Document the NFKC-normalized ligature in the Arabic test.
...
Former-commit-id: 41e1dd41d8
2015-08-03 11:09:44 -04:00