Rob Speer
9758c69ff0
Add Common Crawl data and more languages ( #39 )
...
This changes the version from 1.4.2 to 1.5. Things done in this update include:
* Include Common Crawl; support 11 more languages
* New frequency-merging strategy
* New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list
* Remove low-quality sources: Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. This drops Ukrainian as an available language and means Greek is not a 'large' language after all.
* Add Korean tokenization, and include MeCab files in data
* Remove marks from more languages
* Deal with commas and cedillas in Turkish and Romanian
Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
Rob Speer
f539eecdd6
wordfreq_builder: Document the extract_reddit pipeline
...
Former-commit-id: 88626aafee
2016-06-02 15:19:25 -04:00
Rob Speer
cebf99f7ba
filter out downvoted Reddit posts
...
Former-commit-id: 5b98794b86
2016-03-24 18:05:13 -04:00
Rob Speer
c3364ef821
actually use the results of language-detection on Reddit
...
Former-commit-id: 75a4a92110
2016-03-24 16:27:24 -04:00
Rob Speer
f4761029d0
build a bigger wordlist that we can optionally use
...
Former-commit-id: df8caaff7d
2016-01-12 14:05:57 -05:00
Rob Speer
6d62a8ff51
builder: Use an optional cutoff when merging counts
...
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.
Former-commit-id: 973caca253
2015-12-15 14:44:34 -05:00
Rob Speer
4e985e3bca
gzip the intermediate step of Reddit word counting
...
Former-commit-id: 9a5d9d66bb
2015-12-09 13:30:08 -05:00
Rob Speer
d1b667909d
add word frequencies from the Reddit 2007-2015 corpus
...
Former-commit-id: b2d7546d2d
2015-11-30 16:38:11 -05:00
Rob Speer
7435c8f57a
fix missing word in rules.ninja comment
...
Former-commit-id: 9b1c4d66cd
2015-09-24 17:56:06 -04:00
Rob Speer
01332f1ed5
don't do language-specific tokenization in freqs_to_cBpack
...
Tokenizing in the 'merge' step is sufficient.
Former-commit-id: bc8ebd23e9
2015-09-08 14:46:04 -04:00
Rob Speer
11202ad7f5
language-specific frequency reading; fix 't in English
...
Former-commit-id: 9071defb33
2015-09-08 12:49:21 -04:00
Rob Speer
91cc82f76d
tokenize Chinese using jieba and our own frequencies
...
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Rob Speer
e2a3758832
WIP: Traditional Chinese
...
Former-commit-id: 7906a671ea
2015-09-04 18:52:37 -04:00
Rob Speer
b1d158ab41
add more SUBTLEX and fix its build rules
...
Former-commit-id: 34474939f2
2015-09-04 12:37:35 -04:00
Rob Speer
ad4b12bee9
refer to merge_freqs command correctly
...
Former-commit-id: 40d82541ba
2015-09-03 23:25:46 -04:00
Rob Speer
cb5b696ffa
Add SUBTLEX as a source of English and Chinese data
...
Meanwhile, fix up the dependency graph thingy. It's actually kind of
legible now.
Former-commit-id: 2d58ba94f2
2015-09-03 18:13:13 -04:00
Joshua Chin
78324e74eb
reordered command line args
...
Former-commit-id: 6453d864c4
2015-07-22 10:04:14 -04:00
Joshua Chin
0a2f2877af
fixed rules.ninja
...
Former-commit-id: c5f82ecac1
2015-07-20 17:20:29 -04:00
Joshua Chin
5c7e0dd0dd
fix Arabic tokens
...
Former-commit-id: 11a1c51321
2015-07-17 15:52:12 -04:00
Joshua Chin
631a5f1b71
removed mkdir -p for many cases
...
Former-commit-id: 98a7a8093b
2015-07-17 14:45:22 -04:00
Rob Speer
4771c12814
remove wiki2tokens and tokenize_wikipedia
...
These components are no longer necessary. Wikipedia output can and
should be tokenized with the standard tokenizer, instead of the
almost-equivalent one in the Nim code.
2015-06-30 15:28:01 -04:00
Rob Speer
9a2855394d
fix comment and whitespace involving tokenize_twitter
2015-06-30 15:18:37 -04:00
Rob Speer
f305679caf
Switch to a centibel scale, add a header to the data
2015-06-22 17:38:13 -04:00
Joshua Chin
da93bc89c2
removed intermediate Twitter file rules
2015-06-16 17:28:09 -04:00
Rob Speer
536c15fbdb
give MeCab a larger buffer
2015-05-26 19:34:46 -04:00
Rob Speer
ffd352f148
correct a Leeds bug; add some comments to rules.ninja
2015-05-26 18:08:04 -04:00
Rob Speer
50ff85ce19
add Google Books data for English
2015-05-11 18:44:28 -04:00
Rob Speer
d6cc90792f
Makefile should only be needed for bootstrapping Ninja
2015-05-08 12:39:31 -04:00
Rob Speer
abb0e059c8
a reasonably complete build process
2015-05-07 19:38:33 -04:00
Rob Speer
d2f9c60776
WIP on more build steps
2015-05-07 16:49:53 -04:00
Rob Speer
16928ed182
add rules to count Wikipedia tokens
2015-05-05 15:21:24 -04:00
Rob Speer
bd579e2319
fix the 'count' ninja rule
2015-05-05 14:06:13 -04:00
Rob Speer
5787b6bb73
add and adjust some build steps
...
- more build steps for Wikipedia
- rename 'tokenize_twitter' to 'pretokenize_twitter' to indicate that
the results are preliminary
2015-05-05 13:59:21 -04:00
Rob Speer
5437bb4e85
WIP on new build system
2015-04-30 16:24:28 -04:00
Rob Speer
4dae2f8caf
define some ninja rules
2015-04-29 17:13:58 -04:00
Rob Speer
14e445a937
WIP on Ninja build automation
2015-04-29 15:59:06 -04:00