more README fixes

This commit is contained in:
parent 0a085132f4
commit 772c0cddd1
@@ -47,8 +47,7 @@ Start the build, and find something else to do for a few hours:

     ninja -v

-You can copy the results into wordfreq with this command (supposing that
-$WORDFREQ points to your wordfreq repo):
+You can copy the results into wordfreq with this command:

     cp data/dist/*.msgpack.gz ../wordfreq/data/
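Taken together, the build-and-copy step looks roughly like the sketch below. It assumes you run it from the builder checkout, and it uses a hypothetical $WORDFREQ variable for the path to your wordfreq repo (the variable the old wording referred to):

    # Hypothetical location of your wordfreq checkout; adjust the path as needed.
    WORDFREQ=../wordfreq

    # Build the frequency data, then copy the packed results into wordfreq.
    ninja -v
    cp data/dist/*.msgpack.gz "$WORDFREQ/data/"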
@@ -90,9 +89,11 @@ Wikipedia is a "free-access, free-content Internet encyclopedia".

 These files can be downloaded from [wikimedia dump][wikipedia]

 The original files are in `data/raw-input/wikipedia`, and they're processed
-by the `wiki2text` rule in `rules.ninja`.
+by the `wiki2text` rule in `rules.ninja`. Parsing wikipedia requires the
+[wiki2text][] package.

 [wikipedia]: https://dumps.wikimedia.org/backup-index.html
+[wiki2text]: https://github.com/rspeer/wiki2text

 ### Leeds Internet Corpus
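For the Wikipedia input, the conversion that the `wiki2text` rule runs during the build amounts to roughly the following sketch. The dump filename is illustrative, and it assumes wiki2text is on $PATH and acts as a filter that reads a MediaWiki XML dump on standard input and writes plain text on standard output:

    # Illustrative dump name; the raw dumps live under data/raw-input/wikipedia.
    wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

    # Assumed behaviour: wiki2text as a stdin-to-stdout filter.
    bunzip2 -c enwiki-latest-pages-articles.xml.bz2 | wiki2text > enwiki.txt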
@@ -113,7 +114,7 @@ by the `convert_leeds` rule in `rules.ninja`.

 The file `data/raw-input/twitter/all-2014.txt` contains about 72 million tweets
 collected by the `ftfy.streamtester` package in 2014.

-It's not possible to distribute the text of tweets. However, this process could
+We are not allowed to distribute the text of tweets. However, this process could
 be reproduced by running `ftfy.streamtester`, part of the [ftfy][] package, for
 a couple of weeks.
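No exact command is given for reproducing the Twitter collection; the line below is only a hypothetical invocation, assuming ftfy's stream tester can be run as a module and that Twitter streaming credentials are configured. The real entry point depends on the ftfy version installed:

    # Hypothetical invocation; check your ftfy version for the actual entry point.
    # Left running for a couple of weeks, this collects a sample comparable
    # to all-2014.txt.
    python -m ftfy.streamtester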