Merge pull request #23 from LuminosoInsight/readme

Put documentation and examples in the README

Former-commit-id: e43b5ebf7b
2024-12-23 17:31:41 +00:00 · 2015-08-28 17:59:17 -04:00 · 4e8c15cb71
commit 4e8c15cb71
parent 9fedede771 942761d2f6
4 changed files with 160 additions and 7 deletions
--- a/README.md
+++ b/README.md
@@ -23,6 +23,135 @@ install them on Ubuntu:
    pip3 install mecab-python3


+## Usage
+
+wordfreq provides estimates of how frequently a word is used, in 15 languages
+(see *Sources and supported languages* below). It loads efficiently-packed data
+structures that contain all words that appear at least once per million words.
+
+The most useful function is:
+
+    word_frequency(word, lang, wordlist='combined', minimum=0.0)
+
+This function looks up a word's frequency in the given language, returning its
+frequency as a decimal between 0 and 1. In these examples, we'll multiply the
+frequencies by a million (1e6) to get more readable numbers:
+
+    >>> from wordfreq import word_frequency
+    >>> word_frequency('cafe', 'en') * 1e6
+    14.45439770745928
+
+    >>> word_frequency('café', 'en') * 1e6
+    4.7863009232263805
+
+    >>> word_frequency('cafe', 'fr') * 1e6
+    2.0417379446695274
+
+    >>> word_frequency('café', 'fr') * 1e6
+    77.62471166286912
+
+The parameters are:
+
+- `word`: a Unicode string containing the word to look up. Ideally the word
+  is a single token according to our tokenizer, but if not, there is still
+  hope -- see *Tokenization* below.
+
+- `lang`: the BCP 47 or ISO 639 code of the language to use, such as 'en'.
+
+- `wordlist`: which set of word frequencies to use. Current options are
+  'combined', which combines up to five different sources, and
+  'twitter', which returns frequencies observed on Twitter alone.
+
+- `minimum`: if the word is not in the list or has a frequency lower than
+  `minimum`, return `minimum` instead. In some applications, you'll want
+  to set `minimum=1e-6` to avoid a discontinuity where the list ends, because
+  a frequency of 1e-6 (1 per million) is the threshold for being included in
+  the list at all.
+
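The effect of `minimum` can be sketched without loading wordfreq at all. The snippet below is a minimal illustration, not the library's implementation: `frequency_with_floor` and the sample dictionary are hypothetical names and made-up numbers, used only to show how a floor of 1e-6 removes the jump to zero for words that fall off the end of the list.

```python
def frequency_with_floor(freqs, word, minimum=0.0):
    """Return the stored frequency of `word`, but never less than `minimum`.

    `freqs` is a plain dict standing in for a loaded wordlist (hypothetical).
    """
    return max(freqs.get(word, 0.0), minimum)


# Made-up frequencies, for illustration only.
sample = {'common': 2e-4, 'rare': 1.2e-6}

in_list = frequency_with_floor(sample, 'rare', minimum=1e-6)
missing_no_floor = frequency_with_floor(sample, 'unlisted')
missing_with_floor = frequency_with_floor(sample, 'unlisted', minimum=1e-6)
```

With the floor in place, an unlisted word is treated as occurring once per million words instead of never, which is the discontinuity the parameter description refers to.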
+Other functions:
+
+`tokenize(text, lang)` splits text in the given language into words, in the same
+way that the words in wordfreq's data were counted in the first place. See
+*Tokenization*. Tokenizing Japanese requires the optional dependency `mecab-python3`
+to be installed.
+
+`top_n_list(lang, n, wordlist='combined')` returns the most common *n* words in
+the list, in descending frequency order.
+
+    >>> from wordfreq import top_n_list
+    >>> top_n_list('en', 10)
+    ['the', 'of', 'to', 'in', 'and', 'a', 'i', 'you', 'is', 'it']
+
+    >>> top_n_list('es', 10)
+    ['de', 'la', 'que', 'el', 'en', 'y', 'a', 'no', 'los', 'es']
+
+`iter_wordlist(lang, wordlist='combined')` iterates through all the words in a
+wordlist, in descending frequency order.
+
+`get_frequency_dict(lang, wordlist='combined')` returns all the frequencies in
+a wordlist as a dictionary, for cases where you'll want to look up a lot of
+words and don't need the wrapper that `word_frequency` provides.
+
+`supported_languages(wordlist='combined')` returns a dictionary whose keys are
+language codes, and whose values are the data file that will be loaded to
+provide the requested wordlist in each language.
+
+`random_words(lang='en', wordlist='combined', nwords=5, bits_per_word=12)`
+returns a selection of random words, separated by spaces. `bits_per_word=n`
+will select each random word from 2^n words.
+
+If you happen to want an easy way to get [a memorable, xkcd-style
+password][xkcd936] with 60 bits of entropy, this function will almost do the
+job. For that purpose, you should actually run the similar function
+`random_ascii_words`, which limits the selection to words that can be typed in
+ASCII.
+
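The 60-bit figure follows from the defaults: 5 words at 12 bits each. A rough sketch of the idea, assuming the pool is the first 2^n entries of a frequency-ranked list (the README only says "from 2^n words"), is shown below. `pick_random_words` and `entropy_bits` are hypothetical helpers, and the sketch uses Python's `secrets` module for the random choice, which is a substitution made here and not necessarily what wordfreq does.

```python
import secrets


def pick_random_words(ranked_words, nwords=5, bits_per_word=12):
    """Choose `nwords` words uniformly from the first 2**bits_per_word entries."""
    pool_size = 2 ** bits_per_word
    if len(ranked_words) < pool_size:
        raise ValueError('need at least %d words, got %d'
                         % (pool_size, len(ranked_words)))
    pool = ranked_words[:pool_size]
    return ' '.join(secrets.choice(pool) for _ in range(nwords))


def entropy_bits(nwords=5, bits_per_word=12):
    """Each uniform pick from 2**bits_per_word words adds bits_per_word bits."""
    return nwords * bits_per_word


# Placeholder vocabulary standing in for a real ranked wordlist.
vocabulary = ['word%d' % i for i in range(4096)]
phrase = pick_random_words(vocabulary)
```

The entropy estimate only holds if every pick is uniform and independent, which is why the pool is cut to an exact power of two before choosing.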
+[xkcd936]: https://xkcd.com/936/
+
+
+## Sources and supported languages
+
+We compiled word frequencies from five different sources, giving us examples
+of word usage on different topics at different levels of formality. The sources
+(and the abbreviations we'll use for them) are:
+
+- **GBooks**: Google Books Ngrams 2013
+- **LeedsIC**: The Leeds Internet Corpus
+- **OpenSub**: OpenSubtitles
+- **Twitter**: Messages sampled from Twitter's public stream
+- **Wikipedia**: The full text of Wikipedia in 2015
+
+The following 12 languages are well-supported, using at least 3 different sources
+of word frequencies:
+
+    Language    Code    GBooks  LeedsIC OpenSub Twitter Wikipedia
+    ──────────────────┼──────────────────────────────────────────
+    Arabic      ar    │ -       Yes     Yes     Yes     Yes
+    German      de    │ -       Yes     Yes     Yes[1]  Yes
+    English     en    │ Yes     Yes     Yes     Yes     Yes
+    Spanish     es    │ -       Yes     Yes     Yes     Yes
+    French      fr    │ -       Yes     Yes     Yes     Yes
+    Indonesian  id    │ -       -       Yes     Yes     Yes
+    Italian     it    │ -       Yes     Yes     Yes     Yes
+    Japanese    ja    │ -       Yes     -       Yes     Yes
+    Malay       ms    │ -       -       Yes     Yes     Yes
+    Dutch       nl    │ -       -       Yes     Yes     Yes
+    Portuguese  pt    │ -       Yes     Yes     Yes     Yes
+    Russian     ru    │ -       Yes     Yes     Yes     Yes
+
+These 3 languages are only marginally supported so far:
+
+    Language    Code    GBooks  LeedsIC OpenSub Twitter Wikipedia
+    ──────────────────┼──────────────────────────────────────────
+    Greek       el    │ -       Yes     Yes     -       -
+    Korean      ko    │ -       -       -       Yes     Yes
+    Chinese     zh    │ -       Yes     Yes     -       -
+
+[1] We've counted the frequencies from tweets in German, such as they are, but
+you should be aware that German is not a frequently-used language on Twitter.
+Germans just don't tweet that much.
+
+
 ## Tokenization

 wordfreq uses the Python package `regex`, which is a more advanced
@@ -41,6 +170,27 @@ There are language-specific exceptions:

 [uax29]: http://unicode.org/reports/tr29/

+When wordfreq's frequency lists are built in the first place, the words are
+tokenized according to this function.
+
+Because tokenization in the real world is far from consistent, wordfreq will
+also try to deal gracefully with queries that actually break into multiple
+tokens:
+
+    >>> word_frequency('New York', 'en')
+    0.0002632772081925718
+
+The word frequencies are combined with the half-harmonic-mean function in order
+to provide an estimate of what their combined frequency would be.
+
+This implicitly assumes that you're asking about words that frequently appear
+together. It's not multiplying the frequencies, because that would assume they
+are statistically unrelated. So if you give it an uncommon combination of
+tokens, it will hugely over-estimate their frequency:
+
+    >>> word_frequency('owl-flavored', 'en')
+    1.3557098723512335e-06
+
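To make the combination rule concrete, here is a small sketch that takes "half-harmonic mean" of two values a and b to mean (a * b) / (a + b), i.e. half of their harmonic mean, folded over the tokens from left to right. The function names and the input frequencies are illustrative assumptions, not values or code taken from wordfreq.

```python
from functools import reduce


def half_harmonic_mean(a, b):
    """Half of the harmonic mean of two positive numbers: (a * b) / (a + b)."""
    return (a * b) / (a + b)


def combined_frequency(token_freqs):
    """Fold the half-harmonic mean over per-token frequencies (all assumed > 0)."""
    return reduce(half_harmonic_mean, token_freqs)


# Two made-up token frequencies of 400 per million each.
token_freqs = [4e-4, 4e-4]
combined = combined_frequency(token_freqs)        # half of either input
independent = token_freqs[0] * token_freqs[1]     # what multiplication would give
```

For two equally common tokens the combined estimate is half the frequency of either one, while multiplying would give a value over a thousand times smaller. That gap is the reason an unlikely pairing such as 'owl-flavored' comes out over-estimated.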

 ## License

@@ -64,7 +214,7 @@ sources:
 - The Leeds Internet Corpus, from the University of Leeds Centre for Translation
  Studies (http://corpus.leeds.ac.uk/list.html)

-- The OpenSubtitles Frequency Word Lists, by Invoke IT Limited
+- The OpenSubtitles Frequency Word Lists, compiled by Hermit Dave
  (https://invokeit.wordpress.com/frequency-word-lists/)

 - Wikipedia, the free encyclopedia (http://www.wikipedia.org)
--- a/scripts/ninja2dot.py
+++ b/scripts/ninja2dot.py
@@ -11,17 +11,19 @@ def ninja_to_dot():
    print('rankdir="LR";')
    for line in sys.stdin:
        line = line.rstrip()
-        parts = line.split(' ')
-        if parts[0] == 'build':
+        if line.startswith('build'):
            # the output file is the first argument; strip off the colon that
            # comes from ninja syntax
-            outfile = last_component(parts[1][:-1])
-            operation = parts[2]
-            infiles = [last_component(part) for part in parts[3:]]
+            output_text, input_text = line.split(':')
+            outfiles = [last_component(part) for part in output_text.split(' ')[1:]]
+            inputs = input_text.strip().split(' ')
+            infiles = [last_component(part) for part in inputs[1:]]
+            operation = inputs[0]
            for infile in infiles:
                if infile == '|':
                    # external dependencies start here; let's not graph those
                    break
+                for outfile in outfiles:
                    print('"%s" -> "%s" [label="%s"]' % (infile, outfile, operation))
    print("}")
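
The parsing introduced in this hunk can be tried in isolation. The sketch below pulls it into a hypothetical `parse_build_line` function; `last_component` is a stand-in for the script's helper of the same name, whose body is not part of this diff and is assumed here to keep only the final path segment. Unlike the committed code, the sketch splits on the first colon only and drops everything after `|` while parsing.

```python
def last_component(path):
    # Stand-in for the script's helper (not shown in the diff); assumed to
    # return the part of the path after the final '/'.
    return path.rsplit('/', 1)[-1]


def parse_build_line(line):
    """Split a ninja `build` statement into (outfiles, operation, infiles)."""
    output_text, input_text = line.rstrip().split(':', 1)
    # The first word is the 'build' keyword; the rest are output paths.
    outfiles = [last_component(part) for part in output_text.split(' ')[1:]]
    inputs = input_text.strip().split(' ')
    operation = inputs[0]
    infiles = []
    for part in inputs[1:]:
        if part == '|':
            # external dependencies start here; leave them out of the graph
            break
        infiles.append(last_component(part))
    return outfiles, operation, infiles


example = 'build data/a.txt data/b.txt: convert src/x.txt | scripts/tool.py'
outfiles, operation, infiles = parse_build_line(example)
edges = ['"%s" -> "%s" [label="%s"]' % (infile, outfile, operation)
         for infile in infiles for outfile in outfiles]
```

One edge is produced per (input, output) pair, which matches the nested loop the commit adds for build statements with several outputs.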

--- a/wordfreq_builder/build.png
+++ b/wordfreq_builder/build.png
--- a/wordfreq_builder/build.png.REMOVED.git-id
+++ b/wordfreq_builder/build.png.REMOVED.git-id
@@ -0,0 +1 @@
+ef54b21e931c530f5b75c1cd87c5841cc4691e43