more README fixes

Author: Joshua Chin
Date:   2015-07-17 14:40:33 -04:00
parent 0a085132f4
commit 772c0cddd1


@@ -47,8 +47,7 @@ Start the build, and find something else to do for a few hours:

     ninja -v

-You can copy the results into wordfreq with this command (supposing that
-$WORDFREQ points to your wordfreq repo):
+You can copy the results into wordfreq with this command:

     cp data/dist/*.msgpack.gz ../wordfreq/data/
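The copy step in this hunk can be sketched as a short runnable sequence. The `../wordfreq` sibling-checkout location comes from the `cp` command itself; the `mkdir -p` guard is an added assumption, not part of the commit:

```shell
# Sketch: copy the built wordlists into a wordfreq checkout.
# Assumes this runs from the builder repo root, with wordfreq
# checked out as a sibling directory (as the cp path implies).
mkdir -p ../wordfreq/data
cp data/dist/*.msgpack.gz ../wordfreq/data/
```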
@@ -90,9 +89,11 @@ Wikipedia is a "free-access, free-content Internet encyclopedia".

 These files can be downloaded from [wikimedia dump][wikipedia]

 The original files are in `data/raw-input/wikipedia`, and they're processed
-by the `wiki2text` rule in `rules.ninja`.
+by the `wiki2text` rule in `rules.ninja`. Parsing wikipedia requires the
+[wiki2text][] package.

 [wikipedia]: https://dumps.wikimedia.org/backup-index.html
+[wiki2text]: https://github.com/rspeer/wiki2text

 ### Leeds Internet Corpus
@@ -113,7 +114,7 @@ by the `convert_leeds` rule in `rules.ninja`.

 The file `data/raw-input/twitter/all-2014.txt` contains about 72 million tweets
 collected by the `ftfy.streamtester` package in 2014.

-It's not possible to distribute the text of tweets. However, this process could
+We are not allowed to distribute the text of tweets. However, this process could
 be reproduced by running `ftfy.streamtester`, part of the [ftfy][] package, for
 a couple of weeks.