From 772c0cddd1b5ea4db3f74f64dec02226ece50865 Mon Sep 17 00:00:00 2001
From: Joshua Chin
Date: Fri, 17 Jul 2015 14:40:33 -0400
Subject: [PATCH] more README fixes

---
 wordfreq_builder/README.md | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/wordfreq_builder/README.md b/wordfreq_builder/README.md
index 5d748c4..f384caf 100644
--- a/wordfreq_builder/README.md
+++ b/wordfreq_builder/README.md
@@ -47,8 +47,7 @@ Start the build, and find something else to do for a few hours:
 
     ninja -v
 
-You can copy the results into wordfreq with this command (supposing that
-$WORDFREQ points to your wordfreq repo):
+You can copy the results into wordfreq with this command:
 
     cp data/dist/*.msgpack.gz ../wordfreq/data/
 
@@ -90,9 +89,11 @@ Wikipedia is a "free-access, free-content Internet encyclopedia".
 These files can be downloaded from [wikimedia dump][wikipedia]
 
 The original files are in `data/raw-input/wikipedia`, and they're processed
-by the `wiki2text` rule in `rules.ninja`.
+by the `wiki2text` rule in `rules.ninja`. Parsing wikipedia requires the
+[wiki2text][] package.
 
 [wikipedia]: https://dumps.wikimedia.org/backup-index.html
+[wiki2text]: https://github.com/rspeer/wiki2text
 
 ### Leeds Internet Corpus
 
@@ -113,7 +114,7 @@ by the `convert_leeds` rule in `rules.ninja`.
 The file `data/raw-input/twitter/all-2014.txt` contains about 72 million tweets
 collected by the `ftfy.streamtester` package in 2014.
 
-It's not possible to distribute the text of tweets. However, this process could
+We are not allowed to distribute the text of tweets. However, this process could
 be reproduced by running `ftfy.streamtester`, part of the [ftfy][] package, for
 a couple of weeks.
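
The first hunk documents the build-and-copy step of the README. A minimal
sketch of that workflow, assuming it is run from the `wordfreq_builder`
checkout and that the wordfreq repo is a sibling directory (which the relative
`../wordfreq` path in the patch implies):

    # Run the full ninja build; this takes a few hours.
    ninja -v

    # Copy the built frequency files into the wordfreq package's data directory.
    cp data/dist/*.msgpack.gz ../wordfreq/data/

The second hunk notes that parsing Wikipedia requires the [wiki2text][]
package. A sketch of how a dump could be fed through it, assuming wiki2text
reads a decompressed MediaWiki XML dump on stdin and writes plain text to
stdout; the dump filename is a placeholder, and the exact invocation used by
the build lives in the `wiki2text` rule in `rules.ninja`:

    # Hypothetical example: stream a compressed dump through wiki2text.
    bunzip2 -c enwiki-latest-pages-articles.xml.bz2 | wiki2text > enwiki.txt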