Rob Speer
5c8c36f4e3
Lower the frequency of phrases with inferred token boundaries
2015-09-10 14:16:22 -04:00
Rob Speer
6502f15e9b
fix SUBTLEX citations
2015-09-08 17:45:25 -04:00
Rob Speer
d9c44d5fcc
take out OpenSubtitles for Chinese
2015-09-08 17:25:05 -04:00
Rob Speer
bc323eccaf
update comments in wordfreq_builder.config; remove unused 'version'
2015-09-08 16:15:29 -04:00
Rob Speer
0ab23f8a28
sort Jieba wordlists consistently; update data files
2015-09-08 16:09:53 -04:00
Rob Speer
bc8ebd23e9
don't do language-specific tokenization in freqs_to_cBpack
Tokenizing in the 'merge' step is sufficient.
2015-09-08 14:46:04 -04:00
Rob Speer
715361ca0d
actually fix logic of apostrophe-fixing
2015-09-08 13:50:34 -04:00
Rob Speer
c4c1af8213
fix logic of apostrophe-fixing
2015-09-08 13:47:58 -04:00
Rob Speer
912171f8e7
fix '--language' option definition
2015-09-08 13:27:20 -04:00
Rob Speer
77a9b5c55b
Avoid Chinese tokenizer when building
2015-09-08 12:59:03 -04:00
Rob Speer
9071defb33
language-specific frequency reading; fix 't in English
2015-09-08 12:49:21 -04:00
Rob Speer
20f2828d0a
Merge branch 'apostrophe-fix' into chinese-scripts
Conflicts:
    wordfreq_builder/wordfreq_builder/word_counts.py
2015-09-08 12:29:00 -04:00
Rob Speer
e39d345c4b
WIP: fix apostrophe trimming
2015-09-08 12:28:28 -04:00
Rob Speer
d576e3294b
update the README for Chinese
2015-09-05 03:42:54 -04:00
Rob Speer
2327f2e4d6
tokenize Chinese using jieba and our own frequencies
2015-09-05 03:16:56 -04:00
Rob Speer
7906a671ea
WIP: Traditional Chinese
2015-09-04 18:52:37 -04:00
Rob Speer
3c3371a9ff
add Polish and Swedish to README
2015-09-04 17:10:40 -04:00
Rob Speer
447d7e5134
add Polish and Swedish, which have sufficient data
2015-09-04 17:10:40 -04:00
Rob Speer
25edaad962
update data files
2015-09-04 17:00:55 -04:00
Rob Speer
fc93c8dc9c
add tests for Turkish
2015-09-04 17:00:05 -04:00
Rob Speer
5c7a7ea83e
We can put the cutoff back now
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.
2015-09-04 16:16:52 -04:00
Rob Speer
56318a3ca3
remove subtlex-gr from README
2015-09-04 16:11:46 -04:00
Rob Speer
8196643509
add more citations
2015-09-04 15:57:40 -04:00
Rob Speer
77c60c29b0
Use SUBTLEX for German, but OpenSubtitles for Greek
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.
2015-09-04 15:52:21 -04:00
Rob Speer
a47497c908
update data files (without the CLD2 fix yet)
2015-09-04 14:58:20 -04:00
Rob Speer
0d3ee869c1
Exclude angle brackets from CLD2 detection
2015-09-04 14:56:06 -04:00
Rob Speer
81bbe663fb
update README with additional SUBTLEX support
2015-09-04 13:23:33 -04:00
Rob Speer
34474939f2
add more SUBTLEX and fix its build rules
2015-09-04 12:37:35 -04:00
Rob Speer
c11e3b7a9d
update the data
2015-09-04 02:07:50 -04:00
Rob Speer
531db64288
Note on next languages to support
2015-09-04 01:50:15 -04:00
Rob Speer
d9a1c34d00
expand list of sources and supported languages
2015-09-04 01:03:36 -04:00
Rob Speer
d94428d454
support Turkish and more Greek; document more
2015-09-04 00:57:04 -04:00
Rob Speer
45d871a815
Merge branch 'add-subtlex' into greek-and-turkish
2015-09-03 23:26:14 -04:00
Rob Speer
40d82541ba
refer to merge_freqs command correctly
2015-09-03 23:25:46 -04:00
Rob Speer
a3daba81eb
expand Greek and enable Turkish in config
2015-09-03 23:23:31 -04:00
Rob Speer
e6a2886a66
add SUBTLEX to the readme
2015-09-03 18:56:56 -04:00
Rob Speer
2d58ba94f2
Add SUBTLEX as a source of English and Chinese data
Meanwhile, fix up the dependency graph thingy. It's actually kind of
legible now.
2015-09-03 18:13:13 -04:00
Rob Speer
07228fdf1d
Merge pull request #24 from LuminosoInsight/manifest
Remove the no-longer-existent .txt files from the MANIFEST.
2015-09-02 14:34:17 -04:00
Andrew Lin
db41bc7902
Remove the no-longer-existent .txt files from the MANIFEST.
2015-09-02 14:27:15 -04:00
Andrew Lin
e43b5ebf7b
Merge pull request #23 from LuminosoInsight/readme
Put documentation and examples in the README
2015-08-28 17:59:17 -04:00
Rob Speer
00a2812907
fix heading
2015-08-28 17:49:38 -04:00
Rob Speer
93f44683c5
fix list formatting
2015-08-28 17:49:07 -04:00
Rob Speer
5def3a7897
update the build diagram and its script
2015-08-28 17:47:04 -04:00
Rob Speer
2370287539
improve README with function documentation and examples
2015-08-28 17:45:50 -04:00
Andrew Lin
e6d9b36203
Merge pull request #22 from LuminosoInsight/standard-tokenizer
Use a more standard Unicode tokenizer
2015-08-27 11:56:19 -04:00
Rob Speer
b952676679
update data files
2015-08-27 03:58:54 -04:00
Rob Speer
d5fcf4407e
copyedit regex comments
2015-08-26 17:04:56 -04:00
Rob Speer
34375958ef
fix typo in docstring
2015-08-26 16:24:35 -04:00
Rob Speer
c4a2594217
fix URL expression
2015-08-26 15:00:46 -04:00
Rob Speer
f7babea352
correct the simple_tokenize docstring
2015-08-26 13:54:50 -04:00