Robyn Speer
523806d6db
fix '--language' option definition
...
Former-commit-id: 912171f8e7
2015-09-08 13:27:20 -04:00
Robyn Speer
099d90b700
Avoid Chinese tokenizer when building
...
Former-commit-id: 77a9b5c55b
2015-09-08 12:59:03 -04:00
Robyn Speer
3fa14ded28
language-specific frequency reading; fix 't in English
...
Former-commit-id: 9071defb33
2015-09-08 12:49:21 -04:00
Robyn Speer
1b35ff6b4c
Merge branch 'apostrophe-fix' into chinese-scripts
...
Conflicts:
wordfreq_builder/wordfreq_builder/word_counts.py
Former-commit-id: 20f2828d0a
2015-09-08 12:29:00 -04:00
Robyn Speer
319c3abaab
WIP: fix apostrophe trimming
...
Former-commit-id: e39d345c4b
2015-09-08 12:28:28 -04:00
Robyn Speer
c1f27d3095
update the README for Chinese
...
Former-commit-id: d576e3294b
2015-09-05 03:42:54 -04:00
Robyn Speer
a4554fb87c
tokenize Chinese using jieba and our own frequencies
...
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Robyn Speer
7d1c2e72e4
WIP: Traditional Chinese
...
Former-commit-id: 7906a671ea
2015-09-04 18:52:37 -04:00
Robyn Speer
e77c2dbca8
add Polish and Swedish to README
...
Former-commit-id: 3c3371a9ff
2015-09-04 17:10:40 -04:00
Robyn Speer
5b9b2d2d02
add Polish and Swedish, which have sufficient data
...
Former-commit-id: 447d7e5134
2015-09-04 17:10:40 -04:00
Robyn Speer
f7a4e2c444
update data files
...
Former-commit-id: 25edaad962
2015-09-04 17:00:55 -04:00
Robyn Speer
4704131e13
add tests for Turkish
...
Former-commit-id: fc93c8dc9c
2015-09-04 17:00:05 -04:00
Robyn Speer
a75a95658b
We can put the cutoff back now
...
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.
Former-commit-id: 5c7a7ea83e
2015-09-04 16:16:52 -04:00
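A minimal sketch of why a cutoff cannot stay in place while a step emits frequencies instead of counts. The function name `apply_cutoff`, the threshold value, and the sample data are all hypothetical and are not taken from wordfreq_builder; the assumption is that the cutoff is a minimum raw count.

```python
def apply_cutoff(values, cutoff=2):
    """Keep only words whose value is at least `cutoff`.

    Hypothetical helper: assumes `values` maps words to raw counts.
    """
    return {word: value for word, value in values.items() if value >= cutoff}


# Illustrative data only.
counts = {"the": 1500, "cat": 30, "zyzzyva": 1}
total = sum(counts.values())
freqs = {word: count / total for word, count in counts.items()}

kept_from_counts = apply_cutoff(counts)  # rare words dropped, common ones kept
kept_from_freqs = apply_cutoff(freqs)    # every frequency is below 1, so nothing survives
```

Under these assumptions, a count-based threshold applied to frequencies discards the whole word list, which is consistent with removing the cutoff until the step outputs counts again.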
Robyn Speer
f330d6d130
remove subtlex-gr from README
...
Former-commit-id: 56318a3ca3
2015-09-04 16:11:46 -04:00
Robyn Speer
032fea27c3
add more citations
...
Former-commit-id: 8196643509
2015-09-04 15:57:40 -04:00
Robyn Speer
8277b34571
Use SUBTLEX for German, but OpenSubtitles for Greek
...
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.
Former-commit-id: 77c60c29b0
2015-09-04 15:52:21 -04:00
Robyn Speer
69d65dfda3
update data files (without the CLD2 fix yet)
...
Former-commit-id: a47497c908
2015-09-04 14:58:20 -04:00
Robyn Speer
a69b66b210
Exclude angle brackets from CLD2 detection
...
Former-commit-id: 0d3ee869c1
2015-09-04 14:56:06 -04:00
Robyn Speer
37e510345d
update README with additional SUBTLEX support
...
Former-commit-id: 81bbe663fb
2015-09-04 13:23:33 -04:00
Robyn Speer
d0ada70355
add more SUBTLEX and fix its build rules
...
Former-commit-id: 34474939f2
2015-09-04 12:37:35 -04:00
Robyn Speer
8035df998a
update the data
...
Former-commit-id: c11e3b7a9d
2015-09-04 02:07:50 -04:00
Robyn Speer
14136d2a01
Note on next languages to support
...
Former-commit-id: 531db64288
2015-09-04 01:50:15 -04:00
Robyn Speer
3cb4dd777e
expand list of sources and supported languages
...
Former-commit-id: d9a1c34d00
2015-09-04 01:03:36 -04:00
Robyn Speer
574c383202
support Turkish and more Greek; document more
...
Former-commit-id: d94428d454
2015-09-04 00:57:04 -04:00
Robyn Speer
f168c37417
Merge branch 'add-subtlex' into greek-and-turkish
...
Former-commit-id: 45d871a815
2015-09-03 23:26:14 -04:00
Robyn Speer
76c751652e
refer to merge_freqs command correctly
...
Former-commit-id: 40d82541ba
2015-09-03 23:25:46 -04:00
Robyn Speer
3446a393c5
expand Greek and enable Turkish in config
...
Former-commit-id: a3daba81eb
2015-09-03 23:23:31 -04:00
Robyn Speer
d267e0967c
add SUBTLEX to the readme
...
Former-commit-id: e6a2886a66
2015-09-03 18:56:56 -04:00
Robyn Speer
f66d03b1b9
Add SUBTLEX as a source of English and Chinese data
...
Meanwhile, fix up the dependency graph thingy. It's actually kind of
legible now.
Former-commit-id: 2d58ba94f2
2015-09-03 18:13:13 -04:00
Robyn Speer
42a7d5a439
Merge pull request #24 from LuminosoInsight/manifest
...
Remove the no-longer-existent .txt files from the MANIFEST.
Former-commit-id: 07228fdf1d
2015-09-02 14:34:17 -04:00
Andrew Lin
2089090151
Remove the no-longer-existent .txt files from the MANIFEST.
...
Former-commit-id: db41bc7902
2015-09-02 14:27:15 -04:00
Andrew Lin
4e8c15cb71
Merge pull request #23 from LuminosoInsight/readme
...
Put documentation and examples in the README
Former-commit-id: e43b5ebf7b
2015-08-28 17:59:17 -04:00
Robyn Speer
942761d2f6
fix heading
...
Former-commit-id: 00a2812907
2015-08-28 17:49:38 -04:00
Robyn Speer
7bdffaae5c
fix list formatting
...
Former-commit-id: 93f44683c5
2015-08-28 17:49:07 -04:00
Robyn Speer
247d7c6579
update the build diagram and its script
...
Former-commit-id: 5def3a7897
2015-08-28 17:47:04 -04:00
Robyn Speer
44c655d9a6
improve README with function documentation and examples
...
Former-commit-id: 2370287539
2015-08-28 17:45:50 -04:00
Andrew Lin
9fedede771
Merge pull request #22 from LuminosoInsight/standard-tokenizer
...
Use a more standard Unicode tokenizer
Former-commit-id: e6d9b36203
2015-08-27 11:56:19 -04:00
Robyn Speer
4edfab23ef
update data files
...
Former-commit-id: b952676679
2015-08-27 03:58:54 -04:00
Robyn Speer
2c688b8238
copyedit regex comments
...
Former-commit-id: d5fcf4407e
2015-08-26 17:04:56 -04:00
Robyn Speer
0b5d2cdca9
fix typo in docstring
...
Former-commit-id: 34375958ef
2015-08-26 16:24:35 -04:00
Robyn Speer
af29fc4f88
fix URL expression
...
Former-commit-id: c4a2594217
2015-08-26 15:00:46 -04:00
Robyn Speer
e463397edf
correct the simple_tokenize docstring
...
Former-commit-id: f7babea352
2015-08-26 13:54:50 -04:00
Robyn Speer
7fa449729b
refactor the token expression
...
Former-commit-id: 01b6403ef4
2015-08-26 13:40:47 -04:00
Robyn Speer
3a140ee02f
un-flake wordfreq_builder.tokenizers, and edit docstrings
...
Former-commit-id: a893823d6e
2015-08-26 13:03:23 -04:00
Robyn Speer
769d8c627c
remove regex files that are no longer needed
...
Former-commit-id: 94467a6563
2015-08-26 11:48:11 -04:00
Robyn Speer
6f10e71d29
bump to version 1.1
...
Former-commit-id: 694c28d5e4
2015-08-25 17:44:52 -04:00
Robyn Speer
a3a3180bb9
update the README
...
Former-commit-id: 573dd1ec79
2015-08-25 17:44:34 -04:00
Robyn Speer
e3658e0e42
updated data
...
Former-commit-id: 353b8045da
2015-08-25 17:16:03 -04:00
Robyn Speer
b22a4b0f02
Strip apostrophes from edges of tokens
...
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.
Former-commit-id: 5a1fc00aaa
2015-08-25 12:41:48 -04:00
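The idempotence issue described in the commit above can be sketched roughly as follows. The regular expression here is a toy stand-in chosen to reproduce the "d'un" behavior; it is an assumption for illustration and is not the tokenizer wordfreq actually uses, and the function names are hypothetical.

```python
import re

# Toy pattern (assumption): a word keeps a trailing apostrophe only when
# another word character follows it, so "d'un" splits into "d'" and "un".
TOKEN_RE = re.compile(r"\w+'(?=\w)|\w+")


def naive_tokenize(text):
    """Tokenize without any apostrophe cleanup."""
    return TOKEN_RE.findall(text.lower())


def stripped_tokenize(text):
    """Tokenize, then strip apostrophes from the edges of each token."""
    stripped = (token.strip("'") for token in naive_tokenize(text))
    return [token for token in stripped if token]


def is_idempotent(tokenize, text):
    """True if re-tokenizing each token yields that same single token."""
    return all(tokenize(token) == [token] for token in tokenize(text))


naive_tokens = naive_tokenize("d'un")        # "d'" would re-tokenize to "d"
stripped_tokens = stripped_tokenize("d'un")  # tokens are stable when re-tokenized
```

With this toy pattern, `is_idempotent(naive_tokenize, "d'un")` is false and `is_idempotent(stripped_tokenize, "d'un")` is true, matching the motivation given in the commit message.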
Robyn Speer
0b282c5055
exclude 'extenders' from the start of the token
...
Former-commit-id: a8e7c29068
2015-08-25 12:33:12 -04:00