Commit Graph

418 Commits

Author SHA1 Message Date
Rob Speer
0f9497d864 take out OpenSubtitles for Chinese
Former-commit-id: d9c44d5fcc
2015-09-08 17:25:05 -04:00
Rob Speer
5e86394c4c update comments in wordfreq_builder.config; remove unused 'version'
Former-commit-id: bc323eccaf
2015-09-08 16:15:29 -04:00
Rob Speer
2dfaf7798d sort Jieba wordlists consistently; update data files
Former-commit-id: 0ab23f8a28
2015-09-08 16:09:53 -04:00
Rob Speer
01332f1ed5 don't do language-specific tokenization in freqs_to_cBpack
Tokenizing in the 'merge' step is sufficient.

Former-commit-id: bc8ebd23e9
2015-09-08 14:46:04 -04:00
Rob Speer
86475d6b5f actually fix logic of apostrophe-fixing
Former-commit-id: 715361ca0d
2015-09-08 13:50:34 -04:00
Rob Speer
6bd0979ad2 fix logic of apostrophe-fixing
Former-commit-id: c4c1af8213
2015-09-08 13:47:58 -04:00
Rob Speer
8c3fb9f716 fix '--language' option definition
Former-commit-id: 912171f8e7
2015-09-08 13:27:20 -04:00
Rob Speer
67bb55988e Avoid Chinese tokenizer when building
Former-commit-id: 77a9b5c55b
2015-09-08 12:59:03 -04:00
Rob Speer
11202ad7f5 language-specific frequency reading; fix 't in English
Former-commit-id: 9071defb33
2015-09-08 12:49:21 -04:00
Rob Speer
30237cf73d Merge branch 'apostrophe-fix' into chinese-scripts
Conflicts:
	wordfreq_builder/wordfreq_builder/word_counts.py

Former-commit-id: 20f2828d0a
2015-09-08 12:29:00 -04:00
Rob Speer
854247bf8b WIP: fix apostrophe trimming
Former-commit-id: e39d345c4b
2015-09-08 12:28:28 -04:00
Rob Speer
b4100b5bfb update the README for Chinese
Former-commit-id: d576e3294b
2015-09-05 03:42:54 -04:00
Rob Speer
91cc82f76d tokenize Chinese using jieba and our own frequencies
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Rob Speer
e2a3758832 WIP: Traditional Chinese
Former-commit-id: 7906a671ea
2015-09-04 18:52:37 -04:00
Rob Speer
62f5a8eb1e add Polish and Swedish to README
Former-commit-id: 3c3371a9ff
2015-09-04 17:10:40 -04:00
Rob Speer
a555e5dc13 add Polish and Swedish, which have sufficient data
Former-commit-id: 447d7e5134
2015-09-04 17:10:40 -04:00
Rob Speer
1d4a18ead2 update data files
Former-commit-id: 25edaad962
2015-09-04 17:00:55 -04:00
Rob Speer
63295fc397 add tests for Turkish
Former-commit-id: fc93c8dc9c
2015-09-04 17:00:05 -04:00
Rob Speer
0441a81bbe We can put the cutoff back now
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.

Former-commit-id: 5c7a7ea83e
2015-09-04 16:16:52 -04:00
Rob Speer
917ce398a2 remove subtlex-gr from README
Former-commit-id: 56318a3ca3
2015-09-04 16:11:46 -04:00
Rob Speer
138e8aaa3f add more citations
Former-commit-id: 8196643509
2015-09-04 15:57:40 -04:00
Rob Speer
c08e593234 Use SUBTLEX for German, but OpenSubtitles for Greek
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.

Former-commit-id: 77c60c29b0
2015-09-04 15:52:21 -04:00
Rob Speer
a8161b1067 update data files (without the CLD2 fix yet)
Former-commit-id: a47497c908
2015-09-04 14:58:20 -04:00
Rob Speer
3a8b2c2c81 Exclude angle brackets from CLD2 detection
Former-commit-id: 0d3ee869c1
2015-09-04 14:56:06 -04:00
Rob Speer
a0997a79a4 update README with additional SUBTLEX support
Former-commit-id: 81bbe663fb
2015-09-04 13:23:33 -04:00
Rob Speer
b1d158ab41 add more SUBTLEX and fix its build rules
Former-commit-id: 34474939f2
2015-09-04 12:37:35 -04:00
Rob Speer
f993ffcdf2 update the data
Former-commit-id: c11e3b7a9d
2015-09-04 02:07:50 -04:00
Rob Speer
25e24f9c32 Note on next languages to support
Former-commit-id: 531db64288
2015-09-04 01:50:15 -04:00
Rob Speer
bf88f97744 expand list of sources and supported languages
Former-commit-id: d9a1c34d00
2015-09-04 01:03:36 -04:00
Rob Speer
a6ef3224a6 support Turkish and more Greek; document more
Former-commit-id: d94428d454
2015-09-04 00:57:04 -04:00
Rob Speer
89763679de Merge branch 'add-subtlex' into greek-and-turkish
Former-commit-id: 45d871a815
2015-09-03 23:26:14 -04:00
Rob Speer
ad4b12bee9 refer to merge_freqs command correctly
Former-commit-id: 40d82541ba
2015-09-03 23:25:46 -04:00
Rob Speer
7a2f2035ab expand Greek and enable Turkish in config
Former-commit-id: a3daba81eb
2015-09-03 23:23:31 -04:00
Rob Speer
a92c398258 add SUBTLEX to the readme
Former-commit-id: e6a2886a66
2015-09-03 18:56:56 -04:00
Rob Speer
cb5b696ffa Add SUBTLEX as a source of English and Chinese data
Meanwhile, fix up the dependency graph thingy. It's actually kind of
legible now.

Former-commit-id: 2d58ba94f2
2015-09-03 18:13:13 -04:00
Rob Speer
531cedbf55 Merge pull request #24 from LuminosoInsight/manifest
Remove the no-longer-existent .txt files from the MANIFEST.

Former-commit-id: 07228fdf1d
2015-09-02 14:34:17 -04:00
Andrew Lin
65d6645e81 Remove the no-longer-existent .txt files from the MANIFEST.
Former-commit-id: db41bc7902
2015-09-02 14:27:15 -04:00
Andrew Lin
b693715663 Merge pull request #23 from LuminosoInsight/readme
Put documentation and examples in the README

Former-commit-id: e43b5ebf7b
2015-08-28 17:59:17 -04:00
Rob Speer
d883eaeca5 fix heading
Former-commit-id: 00a2812907
2015-08-28 17:49:38 -04:00
Rob Speer
390a431181 fix list formatting
Former-commit-id: 93f44683c5
2015-08-28 17:49:07 -04:00
Rob Speer
4aac7bdd65 update the build diagram and its script
Former-commit-id: 5def3a7897
2015-08-28 17:47:04 -04:00
Rob Speer
43fd15c938 improve README with function documentation and examples
Former-commit-id: 2370287539
2015-08-28 17:45:50 -04:00
Andrew Lin
5a47427f6e Merge pull request #22 from LuminosoInsight/standard-tokenizer
Use a more standard Unicode tokenizer

Former-commit-id: e6d9b36203
2015-08-27 11:56:19 -04:00
Rob Speer
db5a4502b8 update data files
Former-commit-id: b952676679
2015-08-27 03:58:54 -04:00
Rob Speer
001180ca86 copyedit regex comments
Former-commit-id: d5fcf4407e
2015-08-26 17:04:56 -04:00
Rob Speer
dae953525e fix typo in docstring
Former-commit-id: 34375958ef
2015-08-26 16:24:35 -04:00
Rob Speer
49bd631632 fix URL expression
Former-commit-id: c4a2594217
2015-08-26 15:00:46 -04:00
Rob Speer
6286946cc3 correct the simple_tokenize docstring
Former-commit-id: f7babea352
2015-08-26 13:54:50 -04:00
Rob Speer
232aee9c66 refactor the token expression
Former-commit-id: 01b6403ef4
2015-08-26 13:40:47 -04:00
Rob Speer
40d6b85d67 un-flake wordfreq_builder.tokenizers, and edit docstrings
Former-commit-id: a893823d6e
2015-08-26 13:03:23 -04:00