146 Commits

Author SHA1 Message Date
Robyn Speer
7494ae27a7 fix missing word in rules.ninja comment
Former-commit-id: 9b1c4d66cd
2015-09-24 17:56:06 -04:00
Robyn Speer
d215f79ea3 describe the use of lang in read_values
Former-commit-id: f224b8dbba
2015-09-22 17:22:38 -04:00
Robyn Speer
e6e29a1c03 Make the jieba_deps comment make sense
Former-commit-id: 7c12f2aca1
2015-09-22 17:19:00 -04:00
Robyn Speer
f2be213933 Merge branch 'greek-and-turkish' into chinese-and-more
Conflicts:
	README.md
	wordfreq_builder/wordfreq_builder/ninja.py

Former-commit-id: 3cb3061e06
2015-09-10 15:27:33 -04:00
Robyn Speer
c5d5b0b1fe In ninja deps, remove 'startrow' as a variable
Former-commit-id: a4f8d11427
2015-09-10 13:46:19 -04:00
Robyn Speer
acddc3ca05 fix spelling of Marc
Former-commit-id: 2277ad3116
2015-09-09 13:35:02 -04:00
Robyn Speer
872556f7bb fixes based on code review notes
Former-commit-id: 354555514f
2015-09-09 13:10:18 -04:00
Robyn Speer
1d3521dfda take out OpenSubtitles for Chinese
Former-commit-id: d9c44d5fcc
2015-09-08 17:25:05 -04:00
Robyn Speer
59363c8c44 update comments in wordfreq_builder.config; remove unused 'version'
Former-commit-id: bc323eccaf
2015-09-08 16:15:29 -04:00
Robyn Speer
48f9d4520c sort Jieba wordlists consistently; update data files
Former-commit-id: 0ab23f8a28
2015-09-08 16:09:53 -04:00
Robyn Speer
4aef1dc338 don't do language-specific tokenization in freqs_to_cBpack
Tokenizing in the 'merge' step is sufficient.

Former-commit-id: bc8ebd23e9
2015-09-08 14:46:04 -04:00
Robyn Speer
64b0b76ee1 actually fix logic of apostrophe-fixing
Former-commit-id: 715361ca0d
2015-09-08 13:50:34 -04:00
Robyn Speer
d6d2eac920 fix logic of apostrophe-fixing
Former-commit-id: c4c1af8213
2015-09-08 13:47:58 -04:00
Robyn Speer
523806d6db fix '--language' option definition
Former-commit-id: 912171f8e7
2015-09-08 13:27:20 -04:00
Robyn Speer
099d90b700 Avoid Chinese tokenizer when building
Former-commit-id: 77a9b5c55b
2015-09-08 12:59:03 -04:00
Robyn Speer
3fa14ded28 language-specific frequency reading; fix 't in English
Former-commit-id: 9071defb33
2015-09-08 12:49:21 -04:00
Robyn Speer
1b35ff6b4c Merge branch 'apostrophe-fix' into chinese-scripts
Conflicts:
	wordfreq_builder/wordfreq_builder/word_counts.py

Former-commit-id: 20f2828d0a
2015-09-08 12:29:00 -04:00
Robyn Speer
319c3abaab WIP: fix apostrophe trimming
Former-commit-id: e39d345c4b
2015-09-08 12:28:28 -04:00
Robyn Speer
a4554fb87c tokenize Chinese using jieba and our own frequencies
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Robyn Speer
7d1c2e72e4 WIP: Traditional Chinese
Former-commit-id: 7906a671ea
2015-09-04 18:52:37 -04:00
Robyn Speer
5b9b2d2d02 add Polish and Swedish, which have sufficient data
Former-commit-id: 447d7e5134
2015-09-04 17:10:40 -04:00
Robyn Speer
a75a95658b We can put the cutoff back now
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.

Former-commit-id: 5c7a7ea83e
2015-09-04 16:16:52 -04:00
Robyn Speer
f330d6d130 remove subtlex-gr from README
Former-commit-id: 56318a3ca3
2015-09-04 16:11:46 -04:00
Robyn Speer
8277b34571 Use SUBTLEX for German, but OpenSubtitles for Greek
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.

Former-commit-id: 77c60c29b0
2015-09-04 15:52:21 -04:00
Robyn Speer
a69b66b210 Exclude angle brackets from CLD2 detection
Former-commit-id: 0d3ee869c1
2015-09-04 14:56:06 -04:00
Robyn Speer
d0ada70355 add more SUBTLEX and fix its build rules
Former-commit-id: 34474939f2
2015-09-04 12:37:35 -04:00
Robyn Speer
14136d2a01 Note on next languages to support
Former-commit-id: 531db64288
2015-09-04 01:50:15 -04:00
Robyn Speer
574c383202 support Turkish and more Greek; document more
Former-commit-id: d94428d454
2015-09-04 00:57:04 -04:00
Robyn Speer
f168c37417 Merge branch 'add-subtlex' into greek-and-turkish
Former-commit-id: 45d871a815
2015-09-03 23:26:14 -04:00
Robyn Speer
76c751652e refer to merge_freqs command correctly
Former-commit-id: 40d82541ba
2015-09-03 23:25:46 -04:00
Robyn Speer
3446a393c5 expand Greek and enable Turkish in config
Former-commit-id: a3daba81eb
2015-09-03 23:23:31 -04:00
Robyn Speer
f66d03b1b9 Add SUBTLEX as a source of English and Chinese data
Meanwhile, fix up the dependency graph thingy. It's actually kind of
legible now.

Former-commit-id: 2d58ba94f2
2015-09-03 18:13:13 -04:00
Robyn Speer
247d7c6579 update the build diagram and its script
Former-commit-id: 5def3a7897
2015-08-28 17:47:04 -04:00
Robyn Speer
af29fc4f88 fix URL expression
Former-commit-id: c4a2594217
2015-08-26 15:00:46 -04:00
Robyn Speer
3a140ee02f un-flake wordfreq_builder.tokenizers, and edit docstrings
Former-commit-id: a893823d6e
2015-08-26 13:03:23 -04:00
Robyn Speer
b22a4b0f02 Strip apostrophes from edges of tokens
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.

Former-commit-id: 5a1fc00aaa
2015-08-25 12:41:48 -04:00
Robyn Speer
8637aaef9e use better regexes in wordfreq_builder tokenizer
Former-commit-id: de73888a76
2015-08-24 19:05:46 -04:00
Robyn Speer
4ec128adae remove Hangul fillers that confuse cld2
Former-commit-id: 140ca6c050
2015-08-24 17:11:18 -04:00
Andrew Lin
77610f57e1 Stylistic cleanups to word_counts.py.
Former-commit-id: 6d40912ef9
2015-07-31 19:26:18 -04:00
Andrew Lin
0711fb3c43 Remove redundant reference to wikipedia in builder README.
Former-commit-id: 53621c34df
2015-07-31 19:12:59 -04:00
Robyn Speer
e9dd253f1d Don't use the file-reading cutoff when writing centibels
Former-commit-id: e9f9c94e36
2015-07-28 18:45:26 -04:00
Robyn Speer
3ff0f30218 put back the freqs_to_cBpack cutoff; prepare for 1.0
Former-commit-id: c5708b24e4
2015-07-28 18:01:12 -04:00
Robyn Speer
33e0493fd5 Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17
Code review fixes 2015 07 17

Former-commit-id: 32102ba3c2
2015-07-22 15:09:00 -04:00
Joshua Chin
292fc96142 updated read_freqs docs
Former-commit-id: 93cd902899
2015-07-22 10:06:16 -04:00
Joshua Chin
d629e8b6cc fixed style
Former-commit-id: 4fe9d110e1
2015-07-22 10:05:11 -04:00
Joshua Chin
f9742c94ca reordered command line args
Former-commit-id: 6453d864c4
2015-07-22 10:04:14 -04:00
Joshua Chin
474ae0da35 bugfix
Former-commit-id: 8081145922
2015-07-21 10:12:56 -04:00
Joshua Chin
34504eed80 fixed rules.ninja
Former-commit-id: c5f82ecac1
2015-07-20 17:20:29 -04:00
Joshua Chin
61a03b87bc fixed build bug
Former-commit-id: 643571c69c
2015-07-20 16:51:25 -04:00
Joshua Chin
af8050f1b8 ensure removal of tatweels (hopefully)
Former-commit-id: 173278fdd3
2015-07-20 16:48:36 -04:00