Robyn Speer
2a41d4dc5e
Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5. Things done in this update include:
* include Common Crawl; support 11 more languages
* new frequency-merging strategy
* New sources: Chinese from Wikipedia (mostly Traditional), a large Dutch word list
* Remove low-quality sources, namely Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is dropped as an available language, and Greek is no longer a 'large' language.
* Add Korean tokenization, and include MeCab files in data
* Remove marks from more languages
* Deal with commas and cedillas in Turkish and Romanian
Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
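The "remove marks" step above can be sketched with the standard library: decompose the text, drop combining marks (Unicode category `Mn`, which covers Arabic vowel points and Hebrew niqqud), and recompose. This is a minimal illustration, not the builder's actual code, and note that it also strips Latin accents.

```python
import unicodedata

def remove_marks(text: str) -> str:
    """Strip combining marks (Unicode category Mn) after canonical
    decomposition, then recompose what remains."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)

# Arabic 'kataba' with vowel points reduces to its bare consonants
print(remove_marks("كَتَبَ"))  # كتب
```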
Robyn Speer
8d09b68d37
wordfreq_builder: Document the extract_reddit pipeline
Former-commit-id: 88626aafee
2016-06-02 15:19:25 -04:00
Robyn Speer
2840ca55aa
filter out downvoted Reddit posts
Former-commit-id: 5b98794b86
2016-03-24 18:05:13 -04:00
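The downvote filter can be sketched as a pass over the Reddit JSON dumps that keeps only comments with a non-negative net score. The `score` and `body` field names follow the public Reddit dumps; the threshold of 1 here is an assumption for illustration.

```python
import json

def keep_upvoted(lines, min_score=1):
    """Yield the text of Reddit comments whose net score is at least
    min_score, skipping downvoted posts."""
    for line in lines:
        post = json.loads(line)
        if post.get("score", 0) >= min_score:
            yield post["body"]

sample = [
    '{"body": "useful comment", "score": 42}',
    '{"body": "downvoted noise", "score": -3}',
]
print(list(keep_upvoted(sample)))  # ['useful comment']
```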
Robyn Speer
969a024dea
actually use the results of language-detection on Reddit
Former-commit-id: 75a4a92110
2016-03-24 16:27:24 -04:00
Robyn Speer
738243e244
build a bigger wordlist that we can optionally use
Former-commit-id: df8caaff7d
2016-01-12 14:05:57 -05:00
Robyn Speer
7d1719cfb4
builder: Use an optional cutoff when merging counts
This keeps the Reddit-merging step from using a ludicrous amount of memory.
Former-commit-id: 973caca253
2015-12-15 14:44:34 -05:00
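A minimal sketch of the cutoff idea, assuming word counts held as plain Python dicts (the builder's actual merge code is not shown here): merge the counts, then discard entries below the cutoff, so the long tail of one-off tokens from a huge corpus never has to be kept.

```python
from collections import Counter

def merge_counts(freq_dicts, cutoff=0):
    """Merge several word-count dictionaries, then drop entries whose
    merged count falls below `cutoff`. Discarding the long tail keeps
    the merged table from holding millions of hapax legomena."""
    merged = Counter()
    for counts in freq_dicts:
        merged.update(counts)
    if cutoff > 0:
        merged = Counter({w: c for w, c in merged.items() if c >= cutoff})
    return merged

merged = merge_counts([{"the": 5, "rare": 1}, {"the": 7}], cutoff=2)
print(merged)  # Counter({'the': 12})
```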
Robyn Speer
f5e09f3f3d
gzip the intermediate step of Reddit word counting
Former-commit-id: 9a5d9d66bb
2015-12-09 13:30:08 -05:00
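Gzipping an intermediate file is straightforward with the standard library: opening in text mode (`'wt'` / `'rt'`) lets the pipeline keep writing and reading plain tab-separated lines while the bytes on disk stay compressed. The file name and line format below are illustrative, not the builder's actual layout.

```python
import gzip
import os
import tempfile

# Illustrative path; the real pipeline writes into its build directory.
path = os.path.join(tempfile.mkdtemp(), "reddit_counts.txt.gz")

# Write word counts as compressed tab-separated text.
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("the\t1042\n")
    f.write("frequency\t37\n")

# Read them back the same way, transparently decompressed.
with gzip.open(path, "rt", encoding="utf-8") as f:
    rows = [line.rstrip("\n").split("\t") for line in f]

print(rows)  # [['the', '1042'], ['frequency', '37']]
```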
Robyn Speer
6d2709f064
add word frequencies from the Reddit 2007-2015 corpus
Former-commit-id: b2d7546d2d
2015-11-30 16:38:11 -05:00
Robyn Speer
7494ae27a7
fix missing word in rules.ninja comment
Former-commit-id: 9b1c4d66cd
2015-09-24 17:56:06 -04:00
Robyn Speer
4aef1dc338
don't do language-specific tokenization in freqs_to_cBpack
Tokenizing in the 'merge' step is sufficient.
Former-commit-id: bc8ebd23e9
2015-09-08 14:46:04 -04:00
Robyn Speer
3fa14ded28
language-specific frequency reading; fix 't in English
Former-commit-id: 9071defb33
2015-09-08 12:49:21 -04:00
Robyn Speer
a4554fb87c
tokenize Chinese using jieba and our own frequencies
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Robyn Speer
7d1c2e72e4
WIP: Traditional Chinese
Former-commit-id: 7906a671ea
2015-09-04 18:52:37 -04:00
Robyn Speer
d0ada70355
add more SUBTLEX and fix its build rules
Former-commit-id: 34474939f2
2015-09-04 12:37:35 -04:00
Robyn Speer
76c751652e
refer to merge_freqs command correctly
Former-commit-id: 40d82541ba
2015-09-03 23:25:46 -04:00
Robyn Speer
f66d03b1b9
Add SUBTLEX as a source of English and Chinese data
Meanwhile, fix up the dependency graph; it's actually fairly legible now.
Former-commit-id: 2d58ba94f2
2015-09-03 18:13:13 -04:00
Joshua Chin
f9742c94ca
reordered command line args
Former-commit-id: 6453d864c4
2015-07-22 10:04:14 -04:00
Joshua Chin
34504eed80
fixed rules.ninja
Former-commit-id: c5f82ecac1
2015-07-20 17:20:29 -04:00
Joshua Chin
c2f3928433
fix Arabic tokens
Former-commit-id: 11a1c51321
2015-07-17 15:52:12 -04:00
Joshua Chin
a340a15870
removed mkdir -p for many cases
Former-commit-id: 98a7a8093b
2015-07-17 14:45:22 -04:00
Robyn Speer
deed2f767c
remove wiki2tokens and tokenize_wikipedia
These components are no longer necessary. Wikipedia output can and
should be tokenized with the standard tokenizer, instead of the
almost-equivalent one in the Nim code.
2015-06-30 15:28:01 -04:00
Robyn Speer
f17a04aa84
fix comment and whitespace involving tokenize_twitter
2015-06-30 15:18:37 -04:00
Robyn Speer
91d6edd55b
Switch to a centibel scale, add a header to the data
2015-06-22 17:38:13 -04:00
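On a centibel scale, a word frequency is stored as 100 times its base-10 logarithm, rounded to an integer, so a frequency of 10^-3 becomes -300 cB. A minimal sketch of the conversion (the exact rounding used in the data format is an assumption here):

```python
import math

def freq_to_centibels(freq: float) -> int:
    """Convert a word frequency (0 < freq <= 1) to centibels:
    100 * log10(freq), rounded to the nearest integer."""
    return round(100 * math.log10(freq))

def centibels_to_freq(cb: int) -> float:
    """Invert the conversion: 10 ** (cb / 100)."""
    return 10 ** (cb / 100)

print(freq_to_centibels(1e-3))  # -300
```

Storing integer centibels instead of floating-point frequencies makes the packed wordlists small while keeping about 2% relative precision per step.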
Joshua Chin
6f0a082007
removed intermediate twitter file rules
2015-06-16 17:28:09 -04:00
Robyn Speer
a5954d14df
give mecab a larger buffer
2015-05-26 19:34:46 -04:00
Robyn Speer
4f738ad78c
correct a Leeds bug; add some comments to rules.ninja
2015-05-26 18:08:04 -04:00
Robyn Speer
4513fed60c
add Google Books data for English
2015-05-11 18:44:28 -04:00
Robyn Speer
aa55e32450
Makefile should only be needed for bootstrapping Ninja
2015-05-08 12:39:31 -04:00
Robyn Speer
a5f6113824
a reasonably complete build process
2015-05-07 19:38:33 -04:00
Robyn Speer
04bde8d617
WIP on more build steps
2015-05-07 16:49:53 -04:00
Robyn Speer
7c09fec692
add rules to count wikipedia tokens
2015-05-05 15:21:24 -04:00
Robyn Speer
c55e44e486
fix the 'count' ninja rule
2015-05-05 14:06:13 -04:00
Robyn Speer
59409266ca
add and adjust some build steps
...
- more build steps for Wikipedia
- rename 'tokenize_twitter' to 'pretokenize_twitter' to indicate that
the results are preliminary
2015-05-05 13:59:21 -04:00
Robyn Speer
efcf436112
WIP on new build system
2015-04-30 16:24:28 -04:00
Robyn Speer
76ea7f1bd5
define some ninja rules
2015-04-29 17:13:58 -04:00
Robyn Speer
524f7c760b
WIP on Ninja build automation
2015-04-29 15:59:06 -04:00