Robyn Speer
95cdf41fe8
reformat some argparse argument definitions
...
Former-commit-id: 511fcb6f91
2016-01-13 12:05:07 -05:00
Robyn Speer
2069e30c89
fix usage text: one comment, not one tweet
...
Former-commit-id: 8d9668d8ab
2016-01-12 13:05:38 -05:00
Robyn Speer
883aa5baeb
Separate tokens with spaces, not line breaks, in intermediate files
...
Former-commit-id: 115c74583e
2016-01-12 12:59:18 -05:00
Robyn Speer
7d1719cfb4
builder: Use an optional cutoff when merging counts
...
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.
Former-commit-id: 973caca253
2015-12-15 14:44:34 -05:00
Robyn Speer
f5e09f3f3d
gzip the intermediate step of Reddit word counting
...
Former-commit-id: 9a5d9d66bb
2015-12-09 13:30:08 -05:00
Robyn Speer
682e08fee2
no Thai because we can't tokenize it
...
Former-commit-id: 95f53e295b
2015-12-02 12:38:03 -05:00
Robyn Speer
064ee22a33
forgot about Italian
...
Former-commit-id: 8f6cd0e57b
2015-11-30 18:18:24 -05:00
Robyn Speer
ab8c2e2331
add tokenizer for Reddit
...
Former-commit-id: 5ef807117d
2015-11-30 18:16:54 -05:00
Robyn Speer
6d2709f064
add word frequencies from the Reddit 2007-2015 corpus
...
Former-commit-id: b2d7546d2d
2015-11-30 16:38:11 -05:00
Robyn Speer
7494ae27a7
fix missing word in rules.ninja comment
...
Former-commit-id: 9b1c4d66cd
2015-09-24 17:56:06 -04:00
Robyn Speer
d215f79ea3
describe the use of lang
in read_values
...
Former-commit-id: f224b8dbba
2015-09-22 17:22:38 -04:00
Robyn Speer
e6e29a1c03
Make the jieba_deps comment make sense
...
Former-commit-id: 7c12f2aca1
2015-09-22 17:19:00 -04:00
Robyn Speer
f2be213933
Merge branch 'greek-and-turkish' into chinese-and-more
...
Conflicts:
README.md
wordfreq_builder/wordfreq_builder/ninja.py
Former-commit-id: 3cb3061e06
2015-09-10 15:27:33 -04:00
Robyn Speer
c5d5b0b1fe
In ninja deps, remove 'startrow' as a variable
...
Former-commit-id: a4f8d11427
2015-09-10 13:46:19 -04:00
Robyn Speer
acddc3ca05
fix spelling of Marc
...
Former-commit-id: 2277ad3116
2015-09-09 13:35:02 -04:00
Robyn Speer
872556f7bb
fixes based on code review notes
...
Former-commit-id: 354555514f
2015-09-09 13:10:18 -04:00
Robyn Speer
1d3521dfda
take out OpenSubtitles for Chinese
...
Former-commit-id: d9c44d5fcc
2015-09-08 17:25:05 -04:00
Robyn Speer
59363c8c44
update comments in wordfreq_builder.config; remove unused 'version'
...
Former-commit-id: bc323eccaf
2015-09-08 16:15:29 -04:00
Robyn Speer
48f9d4520c
sort Jieba wordlists consistently; update data files
...
Former-commit-id: 0ab23f8a28
2015-09-08 16:09:53 -04:00
Robyn Speer
4aef1dc338
don't do language-specific tokenization in freqs_to_cBpack
...
Tokenizing in the 'merge' step is sufficient.
Former-commit-id: bc8ebd23e9
2015-09-08 14:46:04 -04:00
Robyn Speer
64b0b76ee1
actually fix logic of apostrophe-fixing
...
Former-commit-id: 715361ca0d
2015-09-08 13:50:34 -04:00
Robyn Speer
d6d2eac920
fix logic of apostrophe-fixing
...
Former-commit-id: c4c1af8213
2015-09-08 13:47:58 -04:00
Robyn Speer
523806d6db
fix '--language' option definition
...
Former-commit-id: 912171f8e7
2015-09-08 13:27:20 -04:00
Robyn Speer
099d90b700
Avoid Chinese tokenizer when building
...
Former-commit-id: 77a9b5c55b
2015-09-08 12:59:03 -04:00
Robyn Speer
3fa14ded28
language-specific frequency reading; fix 't in English
...
Former-commit-id: 9071defb33
2015-09-08 12:49:21 -04:00
Robyn Speer
1b35ff6b4c
Merge branch 'apostrophe-fix' into chinese-scripts
...
Conflicts:
wordfreq_builder/wordfreq_builder/word_counts.py
Former-commit-id: 20f2828d0a
2015-09-08 12:29:00 -04:00
Robyn Speer
319c3abaab
WIP: fix apostrophe trimming
...
Former-commit-id: e39d345c4b
2015-09-08 12:28:28 -04:00
Robyn Speer
a4554fb87c
tokenize Chinese using jieba and our own frequencies
...
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Robyn Speer
7d1c2e72e4
WIP: Traditional Chinese
...
Former-commit-id: 7906a671ea
2015-09-04 18:52:37 -04:00
Robyn Speer
5b9b2d2d02
add Polish and Swedish, which have sufficient data
...
Former-commit-id: 447d7e5134
2015-09-04 17:10:40 -04:00
Robyn Speer
a75a95658b
We can put the cutoff back now
...
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.
Former-commit-id: 5c7a7ea83e
2015-09-04 16:16:52 -04:00
Robyn Speer
f330d6d130
remove subtlex-gr from README
...
Former-commit-id: 56318a3ca3
2015-09-04 16:11:46 -04:00
Robyn Speer
8277b34571
Use SUBTLEX for German, but OpenSubtitles for Greek
...
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.
Former-commit-id: 77c60c29b0
2015-09-04 15:52:21 -04:00
Robyn Speer
a69b66b210
Exclude angle brackets from CLD2 detection
...
Former-commit-id: 0d3ee869c1
2015-09-04 14:56:06 -04:00
Robyn Speer
d0ada70355
add more SUBTLEX and fix its build rules
...
Former-commit-id: 34474939f2
2015-09-04 12:37:35 -04:00
Robyn Speer
14136d2a01
Note on next languages to support
...
Former-commit-id: 531db64288
2015-09-04 01:50:15 -04:00
Robyn Speer
574c383202
support Turkish and more Greek; document more
...
Former-commit-id: d94428d454
2015-09-04 00:57:04 -04:00
Robyn Speer
f168c37417
Merge branch 'add-subtlex' into greek-and-turkish
...
Former-commit-id: 45d871a815
2015-09-03 23:26:14 -04:00
Robyn Speer
76c751652e
refer to merge_freqs command correctly
...
Former-commit-id: 40d82541ba
2015-09-03 23:25:46 -04:00
Robyn Speer
3446a393c5
expand Greek and enable Turkish in config
...
Former-commit-id: a3daba81eb
2015-09-03 23:23:31 -04:00
Robyn Speer
f66d03b1b9
Add SUBTLEX as a source of English and Chinese data
...
Meanwhile, fix up the dependency graph thingy. It's actually kind of
legible now.
Former-commit-id: 2d58ba94f2
2015-09-03 18:13:13 -04:00
Robyn Speer
247d7c6579
update the build diagram and its script
...
Former-commit-id: 5def3a7897
2015-08-28 17:47:04 -04:00
Robyn Speer
af29fc4f88
fix URL expression
...
Former-commit-id: c4a2594217
2015-08-26 15:00:46 -04:00
Robyn Speer
3a140ee02f
un-flake wordfreq_builder.tokenizers, and edit docstrings
...
Former-commit-id: a893823d6e
2015-08-26 13:03:23 -04:00
Robyn Speer
b22a4b0f02
Strip apostrophes from edges of tokens
...
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.
Former-commit-id: 5a1fc00aaa
2015-08-25 12:41:48 -04:00
Robyn Speer
8637aaef9e
use better regexes in wordfreq_builder tokenizer
...
Former-commit-id: de73888a76
2015-08-24 19:05:46 -04:00
Robyn Speer
4ec128adae
remove Hangul fillers that confuse cld2
...
Former-commit-id: 140ca6c050
2015-08-24 17:11:18 -04:00
Andrew Lin
77610f57e1
Stylistic cleanups to word_counts.py.
...
Former-commit-id: 6d40912ef9
2015-07-31 19:26:18 -04:00
Andrew Lin
0711fb3c43
Remove redundant reference to wikipedia in builder README.
...
Former-commit-id: 53621c34df
2015-07-31 19:12:59 -04:00
Robyn Speer
e9dd253f1d
Don't use the file-reading cutoff when writing centibels
...
Former-commit-id: e9f9c94e36
2015-07-28 18:45:26 -04:00