Merge pull request #23 from LuminosoInsight/readme

Put documentation and examples in the README

Former-commit-id: e43b5ebf7b
This commit is contained in:
Andrew Lin 2015-08-28 17:59:17 -04:00
commit 4e8c15cb71
4 changed files with 160 additions and 7 deletions

152
README.md
View File

@ -23,6 +23,135 @@ install them on Ubuntu:
pip3 install mecab-python3
## Usage
wordfreq provides access to estimates of the frequency with which a word is
used, in 15 languages (see *Supported languages* below). It loads
efficiently-packed data structures that contain all words that appear at least
once per million words.
The most useful function is:
word_frequency(word, lang, wordlist='combined', minimum=0.0)
This function looks up a word's frequency in the given language, returning its
frequency as a decimal between 0 and 1. In these examples, we'll multiply the
frequencies by a million (1e6) to get more readable numbers:
>>> from wordfreq import word_frequency
>>> word_frequency('cafe', 'en') * 1e6
14.45439770745928
>>> word_frequency('café', 'en') * 1e6
4.7863009232263805
>>> word_frequency('cafe', 'fr') * 1e6
2.0417379446695274
>>> word_frequency('café', 'fr') * 1e6
77.62471166286912
The parameters are:
- `word`: a Unicode string containing the word to look up. Ideally the word
is a single token according to our tokenizer, but if not, there is still
hope -- see *Tokenization* below.
- `lang`: the BCP 47 or ISO 639 code of the language to use, such as 'en'.
- `wordlist`: which set of word frequencies to use. Current options are
'combined', which combines up to five different sources, and
'twitter', which returns frequencies observed on Twitter alone.
- `minimum`: If the word is not in the list or has a frequency lower than
`minimum`, return `minimum` instead. In some applications, you'll want
to set `minimum=1e-6` to avoid a discontinuity where the list ends, because
a frequency of 1e-6 (1 per million) is the threshold for being included in
the list at all.
Other functions:
`tokenize(text, lang)` splits text in the given language into words, in the same
way that the words in wordfreq's data were counted in the first place. See
*Tokenization*. Tokenizing Japanese requires the optional dependency `mecab-python3`
to be installed.
`top_n_list(lang, n, wordlist='combined')` returns the most common *n* words in
the list, in descending frequency order.
>>> from wordfreq import top_n_list
>>> top_n_list('en', 10)
['the', 'of', 'to', 'in', 'and', 'a', 'i', 'you', 'is', 'it']
>>> top_n_list('es', 10)
['de', 'la', 'que', 'el', 'en', 'y', 'a', 'no', 'los', 'es']
`iter_wordlist(lang, wordlist='combined')` iterates through all the words in a
wordlist, in descending frequency order.
`get_frequency_dict(lang, wordlist='combined')` returns all the frequencies in
a wordlist as a dictionary, for cases where you'll want to look up a lot of
words and don't need the wrapper that `word_frequency` provides.
`supported_languages(wordlist='combined')` returns a dictionary whose keys are
language codes, and whose values are the data file that will be loaded to
provide the requested wordlist in each language.
`random_words(lang='en', wordlist='combined', nwords=5, bits_per_word=12)`
returns a selection of random words, separated by spaces. `bits_per_word=n`
will select each random word from 2^n words.
If you happen to want an easy way to get [a memorable, xkcd-style
password][xkcd936] with 60 bits of entropy, this function will almost do the
job. In this case, you should actually run the similar function `random_ascii_words`,
limiting the selection to words that can be typed in ASCII.
[xkcd936]: https://xkcd.com/936/
## Sources and supported languages
We compiled word frequencies from five different sources, providing us examples
of word usage on different topics at different levels of formality. The sources
(and the abbreviations we'll use for them) are:
- **GBooks**: Google Books Ngrams 2013
- **LeedsIC**: The Leeds Internet Corpus
- **OpenSub**: OpenSubtitles
- **Twitter**: Messages sampled from Twitter's public stream
- **Wikipedia**: The full text of Wikipedia in 2015
The following 12 languages are well-supported, using at least 3 different sources
of word frequencies:
Language Code GBooks LeedsIC OpenSub Twitter Wikipedia
──────────────────┼──────────────────────────────────────────
Arabic ar │ - Yes Yes Yes Yes
German de │ - Yes Yes Yes[1] Yes
English en │ Yes Yes Yes Yes Yes
Spanish es │ - Yes Yes Yes Yes
French fr │ - Yes Yes Yes Yes
Indonesian id │ - - Yes Yes Yes
Italian it │ - Yes Yes Yes Yes
Japanese ja │ - Yes - Yes Yes
Malay ms │ - - Yes Yes Yes
Dutch nl │ - - Yes Yes Yes
Portuguese pt │ - Yes Yes Yes Yes
Russian ru │ - Yes Yes Yes Yes
These 3 languages are only marginally supported so far:
Language Code GBooks LeedsIC OpenSub Twitter Wikipedia
──────────────────┼──────────────────────────────────────────
Greek el │ - Yes Yes - -
Korean ko │ - - - Yes Yes
Chinese zh │ - Yes Yes - -
[1] We've counted the frequencies from tweets in German, such as they are, but
you should be aware that German is not a frequently-used language on Twitter.
Germans just don't tweet that much.
## Tokenization
wordfreq uses the Python package `regex`, which is a more advanced
@ -41,6 +170,27 @@ There are language-specific exceptions:
[uax29]: http://unicode.org/reports/tr29/
When wordfreq's frequency lists are built in the first place, the words are
tokenized according to this function.
Because tokenization in the real world is far from consistent, wordfreq will
also try to deal gracefully when you query it with texts that actually break
into multiple tokens:
>>> word_frequency('New York', 'en')
0.0002632772081925718
The word frequencies are combined with the half-harmonic-mean function in order
to provide an estimate of what their combined frequency would be.
This implicitly assumes that you're asking about words that frequently appear
together. It's not multiplying the frequencies, because that would assume they
are statistically unrelated. So if you give it an uncommon combination of
tokens, it will hugely over-estimate their frequency:
>>> word_frequency('owl-flavored', 'en')
1.3557098723512335e-06
## License
@ -64,7 +214,7 @@ sources:
- The Leeds Internet Corpus, from the University of Leeds Centre for Translation
Studies (http://corpus.leeds.ac.uk/list.html)
- The OpenSubtitles Frequency Word Lists, by Invoke IT Limited
- The OpenSubtitles Frequency Word Lists, compiled by Hermit Dave
(https://invokeit.wordpress.com/frequency-word-lists/)
- Wikipedia, the free encyclopedia (http://www.wikipedia.org)

View File

@ -11,18 +11,20 @@ def ninja_to_dot():
print('rankdir="LR";')
for line in sys.stdin:
line = line.rstrip()
parts = line.split(' ')
if parts[0] == 'build':
if line.startswith('build'):
# the output file is the first argument; strip off the colon that
# comes from ninja syntax
outfile = last_component(parts[1][:-1])
operation = parts[2]
infiles = [last_component(part) for part in parts[3:]]
output_text, input_text = line.split(':')
outfiles = [last_component(part) for part in output_text.split(' ')[1:]]
inputs = input_text.strip().split(' ')
infiles = [last_component(part) for part in inputs[1:]]
operation = inputs[0]
for infile in infiles:
if infile == '|':
# external dependencies start here; let's not graph those
break
print('"%s" -> "%s" [label="%s"]' % (infile, outfile, operation))
for outfile in outfiles:
print('"%s" -> "%s" [label="%s"]' % (infile, outfile, operation))
print("}")

Binary file not shown.

Before

Width:  |  Height:  |  Size: 3.3 MiB

View File

@ -0,0 +1 @@
ef54b21e931c530f5b75c1cd87c5841cc4691e43