update the wordfreq_builder README
commit 8633e8c2a9 (parent 668f0fa67f)

This package builds the data files for [wordfreq](https://github.com/LuminosoInsight/wordfreq).

It requires a fair amount of external input data (42 GB of it, as of this
writing), which is unfortunately not version-controlled, and which we don't
yet have a plan for distributing outside of Luminoso.

The data can be obtained publicly in various ways, so here we'll at least
document where it comes from. We hope to eventually come up with a more
reproducible process.

The good news is that you don't need to be able to run this process to use
wordfreq. The built results are already in the `wordfreq/data` directory.
## How to build it

Set up your external hard disk, your networked file system, or whatever thing
you have that's got a couple hundred GB of space free. Let's suppose the
directory of it that you want to use is called `/ext/data`.

Get the input data. At Luminoso, this is available in the directory
`/nfs/broadway/data/wordfreq_builder`. The sections below explain where the
data comes from.

Copy the input data:

    cp -rv /nfs/broadway/data/wordfreq_builder /ext/data/

Start the build, and find something else to do for a few hours.

You can copy the results into wordfreq with this command (supposing that
`$WORDFREQ` points to your wordfreq repo):

    cp data/generated/combined/*.msgpack.gz $WORDFREQ/wordfreq/data/

## The dBpack data format

We pack the wordlists into a small amount of space using a format that I
call "dBpack". This is the data found in the `.msgpack.gz` files that are
output at the end. The format is as follows:

- The file on disk is a gzipped file in msgpack format, which decodes to a
  list of lists of words.

- Each inner list of words corresponds to a particular word frequency,
  rounded to the nearest decibel. 0 dB represents a word that occurs with
  probability 1, so it would be the only word in the data (this of course
  doesn't happen). -20 dB represents a word that occurs once per 100 tokens,
  -30 dB represents a word that occurs once per 1000 tokens, and so on.

- The index of each list within the overall list is the negative of its
  frequency in decibels.

- Each inner list is sorted in alphabetical order.

As an example, consider a corpus consisting only of the words "red fish
blue fish". The word "fish" occurs as 50% of tokens (-3 dB), while "red"
and "blue" each occur as 25% of tokens (-6 dB). The dBpack file of their
word frequencies would decode to this list:

    [[], [], [], ['fish'], [], [], ['blue', 'red']]
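
To make the format concrete, here is a minimal Python sketch of decoding
such a file into a dictionary of word frequencies. It assumes the `msgpack`
package is installed, and `read_dbpack` is a hypothetical name used for this
README, not a function in wordfreq's API:

    import gzip
    import msgpack

    def read_dbpack(filename):
        # Hypothetical helper, not part of wordfreq's API.
        # The file is a gzipped msgpack blob that decodes to a list of
        # lists of words.
        with gzip.open(filename, 'rb') as infile:
            word_lists = msgpack.unpackb(infile.read(), raw=False)
        # The list at index i holds the words whose rounded frequency
        # is -i dB, i.e. 10 ** (-i / 10).
        return {word: 10 ** (-index / 10)
                for index, words in enumerate(word_lists)
                for word in words}

Decoding the example list above this way gives 'fish' a frequency of
10^(-0.3) ≈ 0.5, and 'blue' and 'red' frequencies of 10^(-0.6) ≈ 0.25 each.
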
## The Ninja build process

### Leeds Internet Corpus

The Leeds Internet Corpus is a project that collected wordlists in assorted
languages by crawling the Web. The results are messy, but they're something.
We've been using them for quite a while.

These files can be downloaded from the [Leeds corpus page][leeds].

The original files are in `data/source-lists/leeds`, and they're processed
by the `convert_leeds` rule in `rules.ninja`.

[leeds]: http://corpus.leeds.ac.uk/list.html
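
As a rough illustration of what that conversion involves, here's a
hypothetical Python reader for one of these lists. It assumes the data lines
are whitespace-separated as rank, frequency per million, and word; check the
actual files, since the real processing is whatever `convert_leeds`
specifies in `rules.ninja`:

    def read_leeds_list(filename):
        # Hypothetical reader, assuming each data line looks like
        # "<rank> <frequency per million> <word>".
        freqs = {}
        with open(filename, encoding='utf-8') as infile:
            for line in infile:
                parts = line.split()
                if len(parts) != 3 or not parts[0].isdigit():
                    continue  # skip headers and malformed lines
                rank, freq_per_million, word = parts
                freqs[word] = float(freq_per_million) / 1e6
        return freqs
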
### Twitter

The file `data/raw-input/twitter/all-2014.txt` contains about 72 million
tweets collected by the `ftfy.streamtester` package in 2014.

It takes a lot of work to convert these tweets into data that's usable for
wordfreq: they have to be language-detected and then tokenized, so the
results of language detection and tokenization are stored in
`data/intermediate/twitter`. It's not possible to distribute the text of
tweets. However, this process could be reproduced by running
`ftfy.streamtester`, part of the [ftfy][] package, for a couple of weeks.

[ftfy]: https://github.com/LuminosoInsight/python-ftfy
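
For a sense of what that intermediate step looks like, here is a sketch
that detects each tweet's language, tokenizes it crudely, and appends it to
a per-language file. The `langdetect` package is a stand-in assumption, not
necessarily the detector the build actually uses, and the real tokenizer is
more careful than this regex:

    import re
    from langdetect import detect  # assumed stand-in detector

    def split_tweets_by_language(input_path, output_dir):
        # Append each tokenized tweet to a file named for its language.
        handles = {}
        try:
            with open(input_path, encoding='utf-8') as tweets:
                for tweet in tweets:
                    tweet = tweet.strip()
                    if not tweet:
                        continue
                    try:
                        lang = detect(tweet)
                    except Exception:
                        continue  # langdetect raises on unclassifiable text
                    tokens = re.findall(r"[\w']+", tweet.lower())
                    if lang not in handles:
                        handles[lang] = open(
                            f'{output_dir}/tweets.{lang}.txt',
                            'w', encoding='utf-8')
                    handles[lang].write(' '.join(tokens) + '\n')
        finally:
            for handle in handles.values():
                handle.close()
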
### Google Books