Commit Graph

559 Commits

Author SHA1 Message Date
Rob Speer
30237cf73d Merge branch 'apostrophe-fix' into chinese-scripts
Conflicts:
	wordfreq_builder/wordfreq_builder/word_counts.py

Former-commit-id: 20f2828d0a
2015-09-08 12:29:00 -04:00
Rob Speer
854247bf8b WIP: fix apostrophe trimming
Former-commit-id: e39d345c4b
2015-09-08 12:28:28 -04:00
Rob Speer
b4100b5bfb update the README for Chinese
Former-commit-id: d576e3294b
2015-09-05 03:42:54 -04:00
Rob Speer
91cc82f76d tokenize Chinese using jieba and our own frequencies
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Rob Speer
e2a3758832 WIP: Traditional Chinese
Former-commit-id: 7906a671ea
2015-09-04 18:52:37 -04:00
Rob Speer
62f5a8eb1e add Polish and Swedish to README
Former-commit-id: 3c3371a9ff
2015-09-04 17:10:40 -04:00
Rob Speer
a555e5dc13 add Polish and Swedish, which have sufficient data
Former-commit-id: 447d7e5134
2015-09-04 17:10:40 -04:00
Rob Speer
1d4a18ead2 update data files
Former-commit-id: 25edaad962
2015-09-04 17:00:55 -04:00
Rob Speer
63295fc397 add tests for Turkish
Former-commit-id: fc93c8dc9c
2015-09-04 17:00:05 -04:00
Rob Speer
0441a81bbe We can put the cutoff back now
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.

Former-commit-id: 5c7a7ea83e
2015-09-04 16:16:52 -04:00
Rob Speer
917ce398a2 remove subtlex-gr from README
Former-commit-id: 56318a3ca3
2015-09-04 16:11:46 -04:00
Rob Speer
138e8aaa3f add more citations
Former-commit-id: 8196643509
2015-09-04 15:57:40 -04:00
Rob Speer
c08e593234 Use SUBTLEX for German, but OpenSubtitles for Greek
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.

Former-commit-id: 77c60c29b0
2015-09-04 15:52:21 -04:00
Rob Speer
a8161b1067 update data files (without the CLD2 fix yet)
Former-commit-id: a47497c908
2015-09-04 14:58:20 -04:00
Rob Speer
3a8b2c2c81 Exclude angle brackets from CLD2 detection
Former-commit-id: 0d3ee869c1
2015-09-04 14:56:06 -04:00
Rob Speer
a0997a79a4 update README with additional SUBTLEX support
Former-commit-id: 81bbe663fb
2015-09-04 13:23:33 -04:00
Rob Speer
b1d158ab41 add more SUBTLEX and fix its build rules
Former-commit-id: 34474939f2
2015-09-04 12:37:35 -04:00
Rob Speer
f993ffcdf2 update the data
Former-commit-id: c11e3b7a9d
2015-09-04 02:07:50 -04:00
Rob Speer
25e24f9c32 Note on next languages to support
Former-commit-id: 531db64288
2015-09-04 01:50:15 -04:00
Rob Speer
bf88f97744 expand list of sources and supported languages
Former-commit-id: d9a1c34d00
2015-09-04 01:03:36 -04:00
Rob Speer
a6ef3224a6 support Turkish and more Greek; document more
Former-commit-id: d94428d454
2015-09-04 00:57:04 -04:00
Rob Speer
89763679de Merge branch 'add-subtlex' into greek-and-turkish
Former-commit-id: 45d871a815
2015-09-03 23:26:14 -04:00
Rob Speer
ad4b12bee9 refer to merge_freqs command correctly
Former-commit-id: 40d82541ba
2015-09-03 23:25:46 -04:00
Rob Speer
7a2f2035ab expand Greek and enable Turkish in config
Former-commit-id: a3daba81eb
2015-09-03 23:23:31 -04:00
Rob Speer
a92c398258 add SUBTLEX to the readme
Former-commit-id: e6a2886a66
2015-09-03 18:56:56 -04:00
Rob Speer
cb5b696ffa Add SUBTLEX as a source of English and Chinese data
Meanwhile, fix up the dependency graph thingy. It's actually kind of
legible now.

Former-commit-id: 2d58ba94f2
2015-09-03 18:13:13 -04:00
Rob Speer
531cedbf55 Merge pull request #24 from LuminosoInsight/manifest
Remove the no-longer-existent .txt files from the MANIFEST.

Former-commit-id: 07228fdf1d
2015-09-02 14:34:17 -04:00
Andrew Lin
65d6645e81 Remove the no-longer-existent .txt files from the MANIFEST.
Former-commit-id: db41bc7902
2015-09-02 14:27:15 -04:00
Andrew Lin
b693715663 Merge pull request #23 from LuminosoInsight/readme
Put documentation and examples in the README

Former-commit-id: e43b5ebf7b
2015-08-28 17:59:17 -04:00
Rob Speer
d883eaeca5 fix heading
Former-commit-id: 00a2812907
2015-08-28 17:49:38 -04:00
Rob Speer
390a431181 fix list formatting
Former-commit-id: 93f44683c5
2015-08-28 17:49:07 -04:00
Rob Speer
4aac7bdd65 update the build diagram and its script
Former-commit-id: 5def3a7897
2015-08-28 17:47:04 -04:00
Rob Speer
43fd15c938 improve README with function documentation and examples
Former-commit-id: 2370287539
2015-08-28 17:45:50 -04:00
Andrew Lin
5a47427f6e Merge pull request #22 from LuminosoInsight/standard-tokenizer
Use a more standard Unicode tokenizer

Former-commit-id: e6d9b36203
2015-08-27 11:56:19 -04:00
Rob Speer
db5a4502b8 update data files
Former-commit-id: b952676679
2015-08-27 03:58:54 -04:00
Rob Speer
001180ca86 copyedit regex comments
Former-commit-id: d5fcf4407e
2015-08-26 17:04:56 -04:00
Rob Speer
dae953525e fix typo in docstring
Former-commit-id: 34375958ef
2015-08-26 16:24:35 -04:00
Rob Speer
49bd631632 fix URL expression
Former-commit-id: c4a2594217
2015-08-26 15:00:46 -04:00
Rob Speer
6286946cc3 correct the simple_tokenize docstring
Former-commit-id: f7babea352
2015-08-26 13:54:50 -04:00
Rob Speer
232aee9c66 refactor the token expression
Former-commit-id: 01b6403ef4
2015-08-26 13:40:47 -04:00
Rob Speer
40d6b85d67 un-flake wordfreq_builder.tokenizers, and edit docstrings
Former-commit-id: a893823d6e
2015-08-26 13:03:23 -04:00
Rob Speer
7a757d9ec9 remove regex files that are no longer needed
Former-commit-id: 94467a6563
2015-08-26 11:48:11 -04:00
Rob Speer
1f5c828642 bump to version 1.1
Former-commit-id: 694c28d5e4
2015-08-25 17:44:52 -04:00
Rob Speer
d064fbec7d update the README
Former-commit-id: 573dd1ec79
2015-08-25 17:44:34 -04:00
Rob Speer
244735ce4d updated data
Former-commit-id: 353b8045da
2015-08-25 17:16:03 -04:00
Rob Speer
a3b37f6619 Strip apostrophes from edges of tokens
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.

Former-commit-id: 5a1fc00aaa
2015-08-25 12:41:48 -04:00
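The idempotence problem described in the commit above can be illustrated with a minimal sketch. This is not the tokenizer from wordfreq itself: the pattern `TOKEN_RE` and the helpers `naive_tokenize`, `tokenize`, and `retokenize` are hypothetical names invented for this illustration, and the regex is a deliberately simplified stand-in that happens to reproduce the "d'un" behavior the message describes.

```python
import re

# Hypothetical pattern (an assumption, not wordfreq's actual regex): a word
# keeps a trailing apostrophe only when more word characters follow it.
TOKEN_RE = re.compile(r"\w+'(?=\w)|\w+")


def naive_tokenize(text):
    """Tokenize without stripping apostrophes from token edges."""
    return TOKEN_RE.findall(text.lower())


def tokenize(text):
    """Tokenize, then strip apostrophes from the edges of each token."""
    stripped = (tok.strip("'") for tok in naive_tokenize(text))
    return [tok for tok in stripped if tok]


def retokenize(tokens, tokenizer):
    """Run the tokenizer again over each token it already produced."""
    return [piece for tok in tokens for piece in tokenizer(tok)]


first_naive = naive_tokenize("d'un")
second_naive = retokenize(first_naive, naive_tokenize)

first_fixed = tokenize("d'un")
second_fixed = retokenize(first_fixed, tokenize)
```

In this sketch the naive tokenizer gives a different result on its second pass, because "d'" on its own no longer has anything after the apostrophe, while the stripping version returns the same tokens both times.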
Rob Speer
1042f87efe exclude 'extenders' from the start of the token
Former-commit-id: a8e7c29068
2015-08-25 12:33:12 -04:00
Rob Speer
a5b8c5a745 update frequency lists
Former-commit-id: 0d600bdf27
2015-08-25 11:43:59 -04:00
Rob Speer
99a312ce06 Exclude math and modifier symbols as tokens
Former-commit-id: 8f3c9f576c
2015-08-25 11:43:22 -04:00
Rob Speer
6647cf9035 use better regexes in wordfreq_builder tokenizer
Former-commit-id: de73888a76
2015-08-24 19:05:46 -04:00