more README fixes

Joshua Chin 2015-07-17 14:40:33 -04:00
parent 0a085132f4
commit 772c0cddd1


@@ -47,8 +47,7 @@ Start the build, and find something else to do for a few hours:
 ninja -v
-You can copy the results into wordfreq with this command (supposing that
-$WORDFREQ points to your wordfreq repo):
+You can copy the results into wordfreq with this command:
 cp data/dist/*.msgpack.gz ../wordfreq/data/
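For context, the build-and-copy sequence this hunk is editing looks roughly like the sketch below. The `$WORDFREQ` variable comes from the removed wording; the checkout path is only a placeholder.

```sh
# Build the frequency data; expect this to take a few hours.
ninja -v

# Copy the results using the relative path from the new wording...
cp data/dist/*.msgpack.gz ../wordfreq/data/

# ...or point WORDFREQ at your wordfreq checkout, as the removed wording suggested.
WORDFREQ=~/code/wordfreq   # placeholder path
cp data/dist/*.msgpack.gz "$WORDFREQ/data/"
```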
@@ -90,9 +89,11 @@ Wikipedia is a "free-access, free-content Internet encyclopedia".
 These files can be downloaded from [wikimedia dump][wikipedia]
 The original files are in `data/raw-input/wikipedia`, and they're processed
-by the `wiki2text` rule in `rules.ninja`.
+by the `wiki2text` rule in `rules.ninja`. Parsing wikipedia requires the
+[wiki2text][] package.
 [wikipedia]: https://dumps.wikimedia.org/backup-index.html
+[wiki2text]: https://github.com/rspeer/wiki2text
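The added lines point to the [wiki2text][] package for parsing the dumps. A rough sketch of the conversion step that the `wiki2text` rule drives is shown below; the dump filename and output path are assumptions, not copied from `rules.ninja`.

```sh
# Stream a downloaded Wikipedia XML dump through wiki2text to get plain text.
# The dump filename and output path here are examples only.
bunzip2 -c data/raw-input/wikipedia/enwiki-latest-pages-articles.xml.bz2 \
    | wiki2text > enwiki.txt
```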
### Leeds Internet Corpus
@@ -113,7 +114,7 @@ by the `convert_leeds` rule in `rules.ninja`.
 The file `data/raw-input/twitter/all-2014.txt` contains about 72 million tweets
 collected by the `ftfy.streamtester` package in 2014.
-It's not possible to distribute the text of tweets. However, this process could
+We are not allowed to distribute the text of tweets. However, this process could
 be reproduced by running `ftfy.streamtester`, part of the [ftfy][] package, for
 a couple of weeks.
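Reproducing the Twitter input means leaving the stream tester running for a long time. The invocation below is a hypothetical illustration; the real entry point depends on the ftfy version, so check its documentation.

```sh
# Hypothetical invocation: collect tweets from the stream for a couple of weeks,
# appending their text to the raw-input file. The module path is an assumption.
python -m ftfy.streamtester.twitter >> data/raw-input/twitter/all-2014.txt
```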