update the wordfreq_builder README

Rob Speer 2015-07-13 11:58:48 -04:00
parent 668f0fa67f
commit 8633e8c2a9


@@ -3,9 +3,15 @@
This package builds the data files for [wordfreq](https://github.com/LuminosoInsight/wordfreq).
It requires a fair amount of external input data (42 GB of it, as of this
writing), which unfortunately we don't have a plan for how to distribute
outside of Luminoso yet.

The data can be publicly obtained in various ways, so here we'll at least
document where it comes from. We hope to come up with a process that's more
reproducible eventually.

The good news is that you don't need to be able to run this process to use
wordfreq. The built results are already in the `wordfreq/data` directory.

## How to build it
@@ -13,6 +19,10 @@ Set up your external hard disk, your networked file system, or whatever thing
you have that's got a couple hundred GB of space free. Let's suppose the
directory of it that you want to use is called `/ext/data`.

Get the input data. At Luminoso, this is available in the directory
`/nfs/broadway/data/wordfreq_builder`. The sections below explain where the
data comes from.

Copy the input data:

    cp -rv /nfs/broadway/data/wordfreq_builder /ext/data/
@@ -40,35 +50,7 @@ Start the build, and find something else to do for a few hours:
You can copy the results into wordfreq with this command (supposing that
$WORDFREQ points to your wordfreq repo):

    cp data/dist/*.msgpack.gz $WORDFREQ/wordfreq/data/

## The dBpack data format

We pack the wordlists into a small amount of space using a format that I
call "dBpack". This is the data found in the .msgpack.gz files that are
output at the end. The format is as follows:

- The file on disk is a gzipped file in msgpack format, which decodes to a
  list of lists of words.
- Each inner list of words corresponds to a particular word frequency,
  rounded to the nearest decibel. 0 dB represents a word that occurs with
  probability 1, so it would be the only word in the data (this of course
  doesn't happen); -20 dB represents a word that occurs once per 100 tokens,
  -30 dB represents a word that occurs once per 1000 tokens, and so on.
- The index of each inner list within the overall list is the negative of
  its frequency in decibels.
- Each inner list is sorted in alphabetical order.

As an example, consider a corpus consisting only of the words "red fish
blue fish". The word "fish" occurs as 50% of tokens (-3 dB), while "red"
and "blue" each occur as 25% of tokens (-6 dB). The dBpack file of their
word frequencies would decode to this list:

    [[], [], [], ['fish'], [], [], ['blue', 'red']]
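
To make the format concrete, here is a minimal Python sketch of the encoding
and decoding described above. The helper names are ours for illustration, not
part of wordfreq_builder, and the msgpack/gzip layer is left out:

    import math

    def freqs_to_dbpack(freqs):
        """Convert a {word: probability} dict to the list-of-lists structure."""
        lists = []
        for word, prob in freqs.items():
            # Frequency in decibels is 10 * log10(probability); a word goes
            # in the inner list indexed by the negative of that, rounded.
            index = -int(round(10 * math.log10(prob)))
            while len(lists) <= index:
                lists.append([])
            lists[index].append(word)
        for inner in lists:
            inner.sort()  # keep each inner list in alphabetical order
        return lists

    def dbpack_to_freqs(lists):
        """Recover a {word: approximate probability} dict from the structure."""
        return {word: 10 ** (-index / 10.0)
                for index, inner in enumerate(lists)
                for word in inner}

    # The "red fish blue fish" example round-trips:
    assert freqs_to_dbpack({'red': 0.25, 'fish': 0.5, 'blue': 0.25}) == \
        [[], [], [], ['fish'], [], [], ['blue', 'red']]
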
## The Ninja build process
@@ -108,17 +90,23 @@ project that collected wordlists in assorted languages by crawling the Web.
The results are messy, but they're something. We've been using them for quite
a while.

These files can be downloaded from the [Leeds corpus page][leeds].
The original files are in `data/source-lists/leeds`, and they're processed
by the `convert_leeds` rule in `rules.ninja`.

[leeds]: http://corpus.leeds.ac.uk/list.html

### Twitter
The file `data/raw-input/twitter/all-2014.txt` contains about 72 million tweets
collected by the `ftfy.streamtester` package in 2014.

It takes a lot of work to convert these tweets into data that's usable for
wordfreq: they have to be language-detected and then tokenized. Because that
step is expensive, its results are stored in `data/intermediate/twitter`; a
rough sketch of the step appears below.

It's not possible to distribute the text of tweets. However, this process could
be reproduced by running `ftfy.streamtester`, part of the [ftfy][] package, for
a couple of weeks.

[ftfy]: https://github.com/LuminosoInsight/python-ftfy
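
As an illustration of that intermediate step, here is a sketch that splits a
file of tweets into one token file per detected language. The third-party
`langdetect` package and a naive whitespace tokenizer are stand-ins here; the
real pipeline uses its own language detector and tokenizer:

    import os
    from langdetect import detect  # stand-in language detector

    def split_tweets_by_language(tweets_path, out_dir):
        """Write one whitespace-tokenized file per detected language."""
        os.makedirs(out_dir, exist_ok=True)
        out_files = {}
        with open(tweets_path, encoding='utf-8') as tweets:
            for line in tweets:
                text = line.strip()
                if not text:
                    continue
                try:
                    lang = detect(text)
                except Exception:  # langdetect raises on undetectable text
                    continue
                if lang not in out_files:
                    path = os.path.join(out_dir, lang + '.txt')
                    out_files[lang] = open(path, 'w', encoding='utf-8')
                # Naive tokenization: just split on whitespace.
                out_files[lang].write(' '.join(text.split()) + '\n')
        for f in out_files.values():
            f.close()

    # For example (paths from this README):
    # split_tweets_by_language('data/raw-input/twitter/all-2014.txt',
    #                          'data/intermediate/twitter')
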
### Google Books