Rob Speer
c3364ef821
actually use the results of language-detection on Reddit
Former-commit-id: 75a4a92110
2016-03-24 16:27:24 -04:00
Joshua Chin
4fa4060036
removed unused scripts
Former-commit-id: 39f01b0485
2015-07-17 14:53:18 -04:00
Rob Speer
4771c12814
remove wiki2tokens and tokenize_wikipedia
These components are no longer necessary. Wikipedia output can and
should be tokenized with the standard tokenizer instead of the
almost-equivalent one in the Nim code.
2015-06-30 15:28:01 -04:00
Joshua Chin
1cf7e3d2b9
added pycld2 dependency
2015-06-16 15:06:22 -04:00
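
[Editor's note: pycld2 is the Python binding for Google's Compact Language
Detector 2, and is presumably what the "actually use the results of
language-detection on Reddit" commit above relies on. A minimal,
illustrative sketch of its detect() call follows; this is an assumption
about usage, not code from this repository:]

    import pycld2

    # pycld2.detect() returns (isReliable, textBytesFound, details);
    # details holds up to three (languageName, languageCode, percent, score)
    # tuples, best guess first.
    is_reliable, bytes_found, details = pycld2.detect(
        "Ceci n'est pas une phrase en anglais."
    )
    if is_reliable:
        # details[0][1] is the ISO 639-1 code of the top guess,
        # 'fr' for the sample sentence above.
        print(details[0][1])
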
Rob Speer
1b7a2b9d0b
fix dependency
2015-05-07 23:55:57 -04:00
Rob Speer
abb0e059c8
a reasonably complete build process
2015-05-07 19:38:33 -04:00
Rob Speer
d2f9c60776
WIP on more build steps
2015-05-07 16:49:53 -04:00
Rob Speer
5787b6bb73
add and adjust some build steps
- more build steps for Wikipedia
- rename 'tokenize_twitter' to 'pretokenize_twitter' to indicate that
the results are preliminary
2015-05-05 13:59:21 -04:00
Rob Speer
5437bb4e85
WIP on new build system
2015-04-30 16:24:28 -04:00
Rob Speer
693c35476f
Initial commit
2015-02-04 20:19:36 -05:00