366 Commits

Author SHA1 Message Date
Rob Speer
0ab23f8a28 sort Jieba wordlists consistently; update data files 2015-09-08 16:09:53 -04:00
Rob Speer
bc8ebd23e9 don't do language-specific tokenization in freqs_to_cBpack
Tokenizing in the 'merge' step is sufficient.
2015-09-08 14:46:04 -04:00
Rob Speer
715361ca0d actually fix logic of apostrophe-fixing 2015-09-08 13:50:34 -04:00
Rob Speer
c4c1af8213 fix logic of apostrophe-fixing 2015-09-08 13:47:58 -04:00
Rob Speer
912171f8e7 fix '--language' option definition 2015-09-08 13:27:20 -04:00
Rob Speer
77a9b5c55b Avoid Chinese tokenizer when building 2015-09-08 12:59:03 -04:00
Rob Speer
9071defb33 language-specific frequency reading; fix 't in English 2015-09-08 12:49:21 -04:00
Rob Speer
20f2828d0a Merge branch 'apostrophe-fix' into chinese-scripts
Conflicts:
	wordfreq_builder/wordfreq_builder/word_counts.py
2015-09-08 12:29:00 -04:00
Rob Speer
e39d345c4b WIP: fix apostrophe trimming 2015-09-08 12:28:28 -04:00
Rob Speer
d576e3294b update the README for Chinese 2015-09-05 03:42:54 -04:00
Rob Speer
2327f2e4d6 tokenize Chinese using jieba and our own frequencies 2015-09-05 03:16:56 -04:00
Rob Speer
7906a671ea WIP: Traditional Chinese 2015-09-04 18:52:37 -04:00
Rob Speer
3c3371a9ff add Polish and Swedish to README 2015-09-04 17:10:40 -04:00
Rob Speer
447d7e5134 add Polish and Swedish, which have sufficient data 2015-09-04 17:10:40 -04:00
Rob Speer
25edaad962 update data files 2015-09-04 17:00:55 -04:00
Rob Speer
fc93c8dc9c add tests for Turkish 2015-09-04 17:00:05 -04:00
Rob Speer
5c7a7ea83e We can put the cutoff back now
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.
2015-09-04 16:16:52 -04:00
Rob Speer
56318a3ca3 remove subtlex-gr from README 2015-09-04 16:11:46 -04:00
Rob Speer
8196643509 add more citations 2015-09-04 15:57:40 -04:00
Rob Speer
77c60c29b0 Use SUBTLEX for German, but OpenSubtitles for Greek
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.
2015-09-04 15:52:21 -04:00
Rob Speer
a47497c908 update data files (without the CLD2 fix yet) 2015-09-04 14:58:20 -04:00
Rob Speer
0d3ee869c1 Exclude angle brackets from CLD2 detection 2015-09-04 14:56:06 -04:00
Rob Speer
81bbe663fb update README with additional SUBTLEX support 2015-09-04 13:23:33 -04:00
Rob Speer
34474939f2 add more SUBTLEX and fix its build rules 2015-09-04 12:37:35 -04:00
Rob Speer
c11e3b7a9d update the data 2015-09-04 02:07:50 -04:00
Rob Speer
531db64288 Note on next languages to support 2015-09-04 01:50:15 -04:00
Rob Speer
d9a1c34d00 expand list of sources and supported languages 2015-09-04 01:03:36 -04:00
Rob Speer
d94428d454 support Turkish and more Greek; document more 2015-09-04 00:57:04 -04:00
Rob Speer
45d871a815 Merge branch 'add-subtlex' into greek-and-turkish 2015-09-03 23:26:14 -04:00
Rob Speer
40d82541ba refer to merge_freqs command correctly 2015-09-03 23:25:46 -04:00
Rob Speer
a3daba81eb expand Greek and enable Turkish in config 2015-09-03 23:23:31 -04:00
Rob Speer
e6a2886a66 add SUBTLEX to the readme 2015-09-03 18:56:56 -04:00
Rob Speer
2d58ba94f2 Add SUBTLEX as a source of English and Chinese data
Meanwhile, fix up the dependency graph thingy. It's actually kind of
legible now.
2015-09-03 18:13:13 -04:00
Rob Speer
07228fdf1d Merge pull request #24 from LuminosoInsight/manifest
Remove the no-longer-existent .txt files from the MANIFEST.
2015-09-02 14:34:17 -04:00
Andrew Lin
db41bc7902 Remove the no-longer-existent .txt files from the MANIFEST. 2015-09-02 14:27:15 -04:00
Andrew Lin
e43b5ebf7b Merge pull request #23 from LuminosoInsight/readme
Put documentation and examples in the README
2015-08-28 17:59:17 -04:00
Rob Speer
00a2812907 fix heading 2015-08-28 17:49:38 -04:00
Rob Speer
93f44683c5 fix list formatting 2015-08-28 17:49:07 -04:00
Rob Speer
5def3a7897 update the build diagram and its script 2015-08-28 17:47:04 -04:00
Rob Speer
2370287539 improve README with function documentation and examples 2015-08-28 17:45:50 -04:00
Andrew Lin
e6d9b36203 Merge pull request #22 from LuminosoInsight/standard-tokenizer
Use a more standard Unicode tokenizer
2015-08-27 11:56:19 -04:00
Rob Speer
b952676679 update data files 2015-08-27 03:58:54 -04:00
Rob Speer
d5fcf4407e copyedit regex comments 2015-08-26 17:04:56 -04:00
Rob Speer
34375958ef fix typo in docstring 2015-08-26 16:24:35 -04:00
Rob Speer
c4a2594217 fix URL expression 2015-08-26 15:00:46 -04:00
Rob Speer
f7babea352 correct the simple_tokenize docstring 2015-08-26 13:54:50 -04:00
Rob Speer
01b6403ef4 refactor the token expression 2015-08-26 13:40:47 -04:00
Rob Speer
a893823d6e un-flake wordfreq_builder.tokenizers, and edit docstrings 2015-08-26 13:03:23 -04:00
Rob Speer
94467a6563 remove regex files that are no longer needed 2015-08-26 11:48:11 -04:00
Rob Speer
694c28d5e4 bump to version 1.1 2015-08-25 17:44:52 -04:00