Commit Graph

600 Commits

Author SHA1 Message Date
Robyn Speer
a75a95658b We can put the cutoff back now
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.

Former-commit-id: 5c7a7ea83e
2015-09-04 16:16:52 -04:00
Robyn Speer
f330d6d130 remove subtlex-gr from README
Former-commit-id: 56318a3ca3
2015-09-04 16:11:46 -04:00
Robyn Speer
032fea27c3 add more citations
Former-commit-id: 8196643509
2015-09-04 15:57:40 -04:00
Robyn Speer
8277b34571 Use SUBTLEX for German, but OpenSubtitles for Greek
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.

Former-commit-id: 77c60c29b0
2015-09-04 15:52:21 -04:00
Robyn Speer
69d65dfda3 update data files (without the CLD2 fix yet)
Former-commit-id: a47497c908
2015-09-04 14:58:20 -04:00
Robyn Speer
a69b66b210 Exclude angle brackets from CLD2 detection
Former-commit-id: 0d3ee869c1
2015-09-04 14:56:06 -04:00
Robyn Speer
37e510345d update README with additional SUBTLEX support
Former-commit-id: 81bbe663fb
2015-09-04 13:23:33 -04:00
Robyn Speer
d0ada70355 add more SUBTLEX and fix its build rules
Former-commit-id: 34474939f2
2015-09-04 12:37:35 -04:00
Robyn Speer
8035df998a update the data
Former-commit-id: c11e3b7a9d
2015-09-04 02:07:50 -04:00
Robyn Speer
14136d2a01 Note on next languages to support
Former-commit-id: 531db64288
2015-09-04 01:50:15 -04:00
Robyn Speer
3cb4dd777e expand list of sources and supported languages
Former-commit-id: d9a1c34d00
2015-09-04 01:03:36 -04:00
Robyn Speer
574c383202 support Turkish and more Greek; document more
Former-commit-id: d94428d454
2015-09-04 00:57:04 -04:00
Robyn Speer
f168c37417 Merge branch 'add-subtlex' into greek-and-turkish
Former-commit-id: 45d871a815
2015-09-03 23:26:14 -04:00
Robyn Speer
76c751652e refer to merge_freqs command correctly
Former-commit-id: 40d82541ba
2015-09-03 23:25:46 -04:00
Robyn Speer
3446a393c5 expand Greek and enable Turkish in config
Former-commit-id: a3daba81eb
2015-09-03 23:23:31 -04:00
Robyn Speer
d267e0967c add SUBTLEX to the readme
Former-commit-id: e6a2886a66
2015-09-03 18:56:56 -04:00
Robyn Speer
f66d03b1b9 Add SUBTLEX as a source of English and Chinese data
Meanwhile, fix up the dependency graph thingy. It's actually kind of
legible now.

Former-commit-id: 2d58ba94f2
2015-09-03 18:13:13 -04:00
Robyn Speer
42a7d5a439 Merge pull request #24 from LuminosoInsight/manifest
Remove the no-longer-existent .txt files from the MANIFEST.

Former-commit-id: 07228fdf1d
2015-09-02 14:34:17 -04:00
Andrew Lin
2089090151 Remove the no-longer-existent .txt files from the MANIFEST.
Former-commit-id: db41bc7902
2015-09-02 14:27:15 -04:00
Andrew Lin
4e8c15cb71 Merge pull request #23 from LuminosoInsight/readme
Put documentation and examples in the README

Former-commit-id: e43b5ebf7b
2015-08-28 17:59:17 -04:00
Robyn Speer
942761d2f6 fix heading
Former-commit-id: 00a2812907
2015-08-28 17:49:38 -04:00
Robyn Speer
7bdffaae5c fix list formatting
Former-commit-id: 93f44683c5
2015-08-28 17:49:07 -04:00
Robyn Speer
247d7c6579 update the build diagram and its script
Former-commit-id: 5def3a7897
2015-08-28 17:47:04 -04:00
Robyn Speer
44c655d9a6 improve README with function documentation and examples
Former-commit-id: 2370287539
2015-08-28 17:45:50 -04:00
Andrew Lin
9fedede771 Merge pull request #22 from LuminosoInsight/standard-tokenizer
Use a more standard Unicode tokenizer

Former-commit-id: e6d9b36203
2015-08-27 11:56:19 -04:00
Robyn Speer
4edfab23ef update data files
Former-commit-id: b952676679
2015-08-27 03:58:54 -04:00
Robyn Speer
2c688b8238 copyedit regex comments
Former-commit-id: d5fcf4407e
2015-08-26 17:04:56 -04:00
Robyn Speer
0b5d2cdca9 fix typo in docstring
Former-commit-id: 34375958ef
2015-08-26 16:24:35 -04:00
Robyn Speer
af29fc4f88 fix URL expression
Former-commit-id: c4a2594217
2015-08-26 15:00:46 -04:00
Robyn Speer
e463397edf correct the simple_tokenize docstring
Former-commit-id: f7babea352
2015-08-26 13:54:50 -04:00
Robyn Speer
7fa449729b refactor the token expression
Former-commit-id: 01b6403ef4
2015-08-26 13:40:47 -04:00
Robyn Speer
3a140ee02f un-flake wordfreq_builder.tokenizers, and edit docstrings
Former-commit-id: a893823d6e
2015-08-26 13:03:23 -04:00
Robyn Speer
769d8c627c remove regex files that are no longer needed
Former-commit-id: 94467a6563
2015-08-26 11:48:11 -04:00
Robyn Speer
6f10e71d29 bump to version 1.1
Former-commit-id: 694c28d5e4
2015-08-25 17:44:52 -04:00
Robyn Speer
a3a3180bb9 update the README
Former-commit-id: 573dd1ec79
2015-08-25 17:44:34 -04:00
Robyn Speer
e3658e0e42 updated data
Former-commit-id: 353b8045da
2015-08-25 17:16:03 -04:00
Robyn Speer
b22a4b0f02 Strip apostrophes from edges of tokens
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.

Former-commit-id: 5a1fc00aaa
2015-08-25 12:41:48 -04:00
Robyn Speer
0b282c5055 exclude 'extenders' from the start of the token
Former-commit-id: a8e7c29068
2015-08-25 12:33:12 -04:00
Robyn Speer
4801b0d876 update frequency lists
Former-commit-id: 0d600bdf27
2015-08-25 11:43:59 -04:00
Robyn Speer
070c89c00c Exclude math and modifier symbols as tokens
Former-commit-id: 8f3c9f576c
2015-08-25 11:43:22 -04:00
Robyn Speer
8637aaef9e use better regexes in wordfreq_builder tokenizer
Former-commit-id: de73888a76
2015-08-24 19:05:46 -04:00
Robyn Speer
7bdfb74720 also NFKC-normalize Japanese input
Former-commit-id: 554455699d
2015-08-24 18:13:03 -04:00
Robyn Speer
13096b26bd only NFKC-normalize in Arabic
Former-commit-id: 1d055edc1c
2015-08-24 17:55:17 -04:00
Robyn Speer
4ec128adae remove Hangul fillers that confuse cld2
Former-commit-id: 140ca6c050
2015-08-24 17:11:18 -04:00
Robyn Speer
3674d35501 remove obsolete gen_regex.py
Former-commit-id: 102bc715ae
2015-08-24 17:11:18 -04:00
Robyn Speer
8795525372 Use the regex implementation of Unicode segmentation
Former-commit-id: 95998205ad
2015-08-24 17:11:08 -04:00
Robyn Speer
e15fc14b8e Merge pull request #21 from LuminosoInsight/review-notes
Review notes

Former-commit-id: 2b8089e2b1
2015-08-03 14:48:15 -04:00
Andrew Lin
e88cf3fdaf Document the NFKC-normalized ligature in the Arabic test.
Former-commit-id: 41e1dd41d8
2015-08-03 11:09:44 -04:00
Andrew Lin
77610f57e1 Stylistic cleanups to word_counts.py.
Former-commit-id: 6d40912ef9
2015-07-31 19:26:18 -04:00
Andrew Lin
b0fac15f98 Switch to more explanatory Unicode escapes when testing NFKC normalization.
Former-commit-id: 66c69e6fac
2015-07-31 19:23:42 -04:00