fixed README

Former-commit-id: 0a085132f4
2024-12-23 17:31:41 +00:00 · 2015-07-17 14:35:43 -04:00 · 3e4643f9c4
commit 3e4643f9c4
parent 73bacc659d
1 changed file with 12 additions and 2 deletions
--- a/wordfreq_builder/README.md
+++ b/wordfreq_builder/README.md
@@ -83,6 +83,17 @@ The specific rules are described by the comments in `rules.ninja`.

 ## Data sources

+### Wikipedia
+
+Wikipedia is a "free-access, free-content Internet encyclopedia".
+
+Wikipedia dump files can be downloaded from [Wikimedia dumps][wikipedia].
+
+The original files are in `data/raw-input/wikipedia`, and they're processed
+by the `wiki2text` rule in `rules.ninja`.
+
+[wikipedia]: https://dumps.wikimedia.org/backup-index.html
+
 ### Leeds Internet Corpus

 Also known as the "Web as Corpus" project, this is a University of Leeds
@@ -119,7 +130,7 @@ because it's cleaner. The data comes in the form of 99 gzipped text files in

 ### OpenSubtitles

-[Some guy](https://invokeit.wordpress.com/frequency-word-lists/) made word
+[Hermit Dave](https://invokeit.wordpress.com/frequency-word-lists/) made word
 frequency lists out of the subtitle text on OpenSubtitles. This data was
 used to make Wiktionary word frequency lists at one point, but it's been
 updated significantly since the version Wiktionary got.
@@ -145,4 +156,3 @@ longer represents the words 'don' and 'won', as we assume most of their
 frequency comes from "don't" and "won't". Words that turned into similarly
 common words, however, were left alone: this list doesn't represent "can't"
 because the word was left as "can".
-