update the wordfreq_builder README
commit 8633e8c2a9 (parent 668f0fa67f)

This package builds the data files for [wordfreq](https://github.com/LuminosoInsight/wordfreq).

It requires a fair amount of external input data (42 GB of it, as of this
writing), which is unfortunately not version-controlled, and which we don't
yet have a plan for distributing outside of Luminoso.

The data can be obtained publicly in various ways, so here we'll at least
document where it comes from. We hope to eventually come up with a more
reproducible process.

The good news is that you don't need to be able to run this process to use
wordfreq. The built results are already in the `wordfreq/data` directory.
## How to build it

Set up your external hard disk, your networked file system, or whatever thing
you have that's got a couple hundred GB of space free. Let's suppose the
directory of it that you want to use is called `/ext/data`.

Get the input data. At Luminoso, this is available in the directory
`/nfs/broadway/data/wordfreq_builder`. The sections below explain where the
data comes from.

Copy the input data:

    cp -rv /nfs/broadway/data/wordfreq_builder /ext/data/

Start the build, and find something else to do for a few hours.

You can copy the results into wordfreq with this command (supposing that
`$WORDFREQ` points to your wordfreq repo):

    cp data/generated/combined/*.msgpack.gz $WORDFREQ/wordfreq/data/

## The dBpack data format

We pack the wordlists into a small amount of space using a format that I
call "dBpack". This is the data found in the `.msgpack.gz` files that are
output at the end. The format is as follows:

- The file on disk is a gzipped file in msgpack format, which decodes to a
  list of lists of words.

- Each inner list of words corresponds to a particular word frequency,
  rounded to the nearest decibel. 0 dB represents a word that occurs with
  probability 1, so it would be the only word in the data (this of course
  doesn't happen). -20 dB represents a word that occurs once per 100 tokens,
  -30 dB represents a word that occurs once per 1000 tokens, and so on.

- The index of each list within the overall list is the negative of its
  frequency in decibels.

- Each inner list is sorted in alphabetical order.

As an example, consider a corpus consisting only of the words "red fish
blue fish". The word "fish" occurs as 50% of tokens (-3 dB), while "red"
and "blue" each occur as 25% of tokens (-6 dB). The dBpack file of their
word frequencies would decode to this list:

    [[], [], [], ['fish'], [], [], ['blue', 'red']]
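
To make the format concrete, here is a minimal Python sketch of decoding
such a file into a dictionary of word frequencies. It assumes the `msgpack`
package is installed, and `read_dbpack` is a hypothetical name used for this
README, not a function in wordfreq's API:

    import gzip
    import msgpack

    def read_dbpack(filename):
        # Hypothetical helper, not part of wordfreq's API.
        # The file is a gzipped msgpack blob that decodes to a list of
        # lists of words.
        with gzip.open(filename, 'rb') as infile:
            word_lists = msgpack.unpackb(infile.read(), raw=False)
        # The list at index i holds the words whose rounded frequency
        # is -i dB, i.e. 10 ** (-i / 10).
        return {word: 10 ** (-index / 10)
                for index, words in enumerate(word_lists)
                for word in words}

Decoding the example list above this way gives 'fish' a frequency of
10^(-0.3) ≈ 0.5, and 'blue' and 'red' frequencies of 10^(-0.6) ≈ 0.25 each.
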
## The Ninja build process

### Leeds Internet Corpus

The Leeds Internet Corpus is a project that collected wordlists in assorted
languages by crawling the Web. The results are messy, but they're something.
We've been using them for quite a while.

These files can be downloaded from the [Leeds corpus page][leeds].

The original files are in `data/source-lists/leeds`, and they're processed
by the `convert_leeds` rule in `rules.ninja`.

[leeds]: http://corpus.leeds.ac.uk/list.html
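
As a rough illustration of what that conversion involves, here's a
hypothetical Python reader for one of these lists. It assumes the data lines
are whitespace-separated as rank, frequency per million, and word; check the
actual files, since the real processing is whatever `convert_leeds`
specifies in `rules.ninja`:

    def read_leeds_list(filename):
        # Hypothetical reader, assuming each data line looks like
        # "<rank> <frequency per million> <word>".
        freqs = {}
        with open(filename, encoding='utf-8') as infile:
            for line in infile:
                parts = line.split()
                if len(parts) != 3 or not parts[0].isdigit():
                    continue  # skip headers and malformed lines
                rank, freq_per_million, word = parts
                freqs[word] = float(freq_per_million) / 1e6
        return freqs
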
### Twitter

The file `data/raw-input/twitter/all-2014.txt` contains about 72 million
tweets collected by the `ftfy.streamtester` package in 2014.

It takes a lot of work to convert these tweets into data that's usable for
wordfreq: they have to be language-detected and then tokenized, so the
results of language detection and tokenization are stored in
`data/intermediate/twitter`. It's not possible to distribute the text of
tweets. However, this process could be reproduced by running
`ftfy.streamtester`, part of the [ftfy][] package, for a couple of weeks.

[ftfy]: https://github.com/LuminosoInsight/python-ftfy
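
For a sense of what that intermediate step looks like, here is a sketch
that detects each tweet's language, tokenizes it crudely, and appends it to
a per-language file. The `langdetect` package is a stand-in assumption, not
necessarily the detector the build actually uses, and the real tokenizer is
more careful than this regex:

    import re
    from langdetect import detect  # assumed stand-in detector

    def split_tweets_by_language(input_path, output_dir):
        # Append each tokenized tweet to a file named for its language.
        handles = {}
        try:
            with open(input_path, encoding='utf-8') as tweets:
                for tweet in tweets:
                    tweet = tweet.strip()
                    if not tweet:
                        continue
                    try:
                        lang = detect(tweet)
                    except Exception:
                        continue  # langdetect raises on unclassifiable text
                    tokens = re.findall(r"[\w']+", tweet.lower())
                    if lang not in handles:
                        handles[lang] = open(
                            f'{output_dir}/tweets.{lang}.txt',
                            'w', encoding='utf-8')
                    handles[lang].write(' '.join(tokens) + '\n')
        finally:
            for handle in handles.values():
                handle.close()
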
### Google Books