mirror of
https://github.com/rspeer/wordfreq.git
synced 2024-12-24 09:51:38 +00:00
more README fixes
parent 0a085132f4
commit 772c0cddd1
@@ -47,8 +47,7 @@ Start the build, and find something else to do for a few hours:

 ninja -v

-You can copy the results into wordfreq with this command (supposing that
-$WORDFREQ points to your wordfreq repo):
+You can copy the results into wordfreq with this command:

 cp data/dist/*.msgpack.gz ../wordfreq/data/

@@ -90,9 +89,11 @@ Wikipedia is a "free-access, free-content Internet encyclopedia".
 These files can be downloaded from [wikimedia dump][wikipedia]

 The original files are in `data/raw-input/wikipedia`, and they're processed
-by the `wiki2text` rule in `rules.ninja`.
+by the `wiki2text` rule in `rules.ninja`. Parsing wikipedia requires the
+[wiki2text][] package.

 [wikipedia]: https://dumps.wikimedia.org/backup-index.html
+[wiki2text]: https://github.com/rspeer/wiki2text

 ### Leeds Internet Corpus

@@ -113,7 +114,7 @@ by the `convert_leeds` rule in `rules.ninja`.
 The file `data/raw-input/twitter/all-2014.txt` contains about 72 million tweets
 collected by the `ftfy.streamtester` package in 2014.

-It's not possible to distribute the text of tweets. However, this process could
+We are not allowed to distribute the text of tweets. However, this process could
 be reproduced by running `ftfy.streamtester`, part of the [ftfy][] package, for
 a couple of weeks.
