update README and CHANGELOG
This commit is contained in:
parent 7a32b56c1c · commit c6552f923f

CHANGELOG.md (17 changed lines)

@@ -1,3 +1,20 @@
+## Version 2.1 (2018-06-18)
+
+Data changes:
+
+- Updated to the data from the latest Exquisite Corpus, which adds the
+  ParaCrawl web crawl and updates to OpenSubtitles 2018
+- Added small word list for Latvian
+- Added large word list for Czech
+- The Dutch large word list once again has 5 data sources
+
+Library change:
+
+- The output of `word_frequency` is rounded to three significant digits. This
+  provides friendlier output, and better reflects the precision of the
+  underlying data anyway.
+
+
 ## Version 2.0.1 (2018-05-01)
 
 Fixed edge cases that inserted spurious token boundaries when Japanese text is
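
A minimal sketch of the three-significant-digit rounding described above, assuming Python's `%g` formatting (the helper name is made up for illustration, and this is not necessarily how wordfreq implements it):

    >>> def round_3_sig_figs(value):
    ...     # '%.3g' keeps three significant digits
    ...     return float('%.3g' % value)
    >>> round_3_sig_figs(1.1748975549395302e-05)
    1.17e-05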

README.md (99 changed lines)

@@ -1,4 +1,5 @@
-wordfreq is a Python library for looking up the frequencies of words in many languages, based on many sources of data.
+wordfreq is a Python library for looking up the frequencies of words in many
+languages, based on many sources of data.
 
 Author: Robyn Speer
 
@@ -22,7 +23,8 @@ steps that are necessary to get Chinese, Japanese, and Korean word frequencies.
 ## Usage
 
 wordfreq provides access to estimates of the frequency with which a word is
-used, in 35 languages (see *Supported languages* below).
+used, in 36 languages (see *Supported languages* below). It uses many different
+data sources, not just one corpus.
 
 It provides both 'small' and 'large' wordlists:
 
@@ -39,21 +41,20 @@ The most straightforward function for looking up frequencies is:
     word_frequency(word, lang, wordlist='best', minimum=0.0)
 
 This function looks up a word's frequency in the given language, returning its
-frequency as a decimal between 0 and 1. In these examples, we'll multiply the
-frequencies by a million (1e6) to get more readable numbers:
+frequency as a decimal between 0 and 1.
 
     >>> from wordfreq import word_frequency
-    >>> word_frequency('cafe', 'en') * 1e6
-    11.748975549395302
+    >>> word_frequency('cafe', 'en')
+    1.07e-05
 
-    >>> word_frequency('café', 'en') * 1e6
-    3.890451449942805
+    >>> word_frequency('café', 'en')
+    5.89e-06
 
-    >>> word_frequency('cafe', 'fr') * 1e6
-    1.4454397707459279
+    >>> word_frequency('cafe', 'fr')
+    1.51e-06
 
-    >>> word_frequency('café', 'fr') * 1e6
-    53.70317963702532
+    >>> word_frequency('café', 'fr')
+    5.25e-05
 
 
 `zipf_frequency` is a variation on `word_frequency` that aims to return the
@@ -74,13 +75,13 @@ one occurrence per billion words.
     7.77
 
     >>> zipf_frequency('word', 'en')
-    5.32
+    5.29
 
     >>> zipf_frequency('frequency', 'en')
-    4.38
+    4.42
 
     >>> zipf_frequency('zipf', 'en')
-    1.32
+    1.55
 
     >>> zipf_frequency('zipf', 'en', wordlist='small')
     0.0
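
The Zipf scale is the base-10 logarithm of a word's frequency per billion words, so these values can be approximated from `word_frequency`. A sketch under that assumption (`approx_zipf` is a hypothetical helper, not part of the library):

    >>> from math import log10
    >>> from wordfreq import word_frequency
    >>> def approx_zipf(word, lang):
    ...     freq = word_frequency(word, lang)
    ...     # log10 of occurrences per billion words, to two decimal places
    ...     return round(log10(freq * 1e9), 2) if freq > 0 else 0.0
    >>> approx_zipf('word', 'en')  # closely matches zipf_frequency('word', 'en')
    5.29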
@@ -112,13 +113,29 @@ packing the words into frequency bins.
 In wordfreq, all words that have the same Zipf frequency rounded to the nearest
 hundredth have the same frequency. We don't store any more precision than that.
 So instead of having to store that the frequency of a word is
-.000011748975549395302, information that is mostly meaningless, we just store
-the 600 possible frequency bins and the words they contain.
+.000011748975549395302, where most of those digits are meaningless, we just store
+the frequency bins and the words they contain.
 
 Because the Zipf scale is a logarithmic scale, this preserves the same relative
 precision no matter how far down you are in the word list. The frequency of any
-word is precise to within 1%. (This is not a claim about _accuracy_, which it's
-unclear how you'd even measure, just about _precision_.)
+word is precise to within 1%.
+
+(This is not a claim about _accuracy_, but about _precision_. We believe that
+the way we use multiple data sources and discard outliers makes wordfreq a
+more accurate measurement of the way these words are really used in written
+language, but it's unclear how one would measure this accuracy.)
+
+
+## The figure-skating metric
+
+We combine word frequencies from different sources in a way that's designed
+to minimize the impact of outliers. The method reminds me of the scoring system
+in Olympic figure skating:
+
+- Find the frequency of each word according to each data source.
+- For each word, drop the sources that give it the highest and lowest frequency.
+- Average the remaining frequencies.
+- Rescale the resulting frequency list to add up to 1.
 
 
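
A minimal sketch of the figure-skating combination described above, assuming each source is a dict mapping words to frequencies (an illustration of the method, not wordfreq's actual build code):

    def combine_sources(sources):
        """Combine several {word: frequency} dicts, figure-skating style."""
        combined = {}
        for word in set().union(*sources):
            freqs = sorted(src[word] for src in sources if word in src)
            if len(freqs) > 2:
                # Drop the highest and lowest estimate, like skating judges
                freqs = freqs[1:-1]
            combined[word] = sum(freqs) / len(freqs)
        # Rescale the resulting frequency list so it adds up to 1
        total = sum(combined.values())
        return {word: freq / total for word, freq in combined.items()}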
 ## Sources and supported languages
@@ -133,14 +150,16 @@ Exquisite Corpus compiles 8 different domains of text, some of which themselves
 come from multiple sources:
 
 - **Wikipedia**, representing encyclopedic text
-- **Subtitles**, from OPUS OpenSubtitles 2016 and SUBTLEX
+- **Subtitles**, from OPUS OpenSubtitles 2018 and SUBTLEX
 - **News**, from NewsCrawl 2014 and GlobalVoices
 - **Books**, from Google Books Ngrams 2012
-- **Web** text, from the Leeds Internet Corpus and the MOKK Hungarian Webcorpus
+- **Web** text, from ParaCrawl, the Leeds Internet Corpus, and the MOKK
+  Hungarian Webcorpus
 - **Twitter**, representing short-form social media
 - **Reddit**, representing potentially longer Internet comments
-- **Miscellaneous** word frequencies: in Chinese, we import a free wordlist
-  that comes with the Jieba word segmenter, whose provenance we don't really know
+- **Miscellaneous** word frequencies: in Chinese, we import a free wordlist
+  that comes with the Jieba word segmenter, whose provenance we don't really
+  know
 
 The following languages are supported, with reasonable tokenization and at
 least 3 different sources of word frequencies:
@@ -225,10 +244,15 @@ wordlist, in descending frequency order.
 a wordlist as a dictionary, for cases where you'll want to look up a lot of
 words and don't need the wrapper that `word_frequency` provides.
 
-`supported_languages(wordlist='best')` returns a dictionary whose keys are
+`available_languages(wordlist='best')` returns a dictionary whose keys are
 language codes, and whose values are the data file that will be loaded to
 provide the requested wordlist in each language.
 
+`get_language_info(lang)` returns a dictionary of information about how we
+preprocess text in this language, such as what script we expect it to be
+written in, which characters we normalize together, and how we tokenize it.
+See its docstring for more information.
+
 `random_words(lang='en', wordlist='best', nwords=5, bits_per_word=12)`
 returns a selection of random words, separated by spaces. `bits_per_word=n`
 will select each random word from 2^n words.
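
A short usage sketch of these functions; the comments describe expected behavior, and the exact return values depend on the installed data:

    >>> from wordfreq import available_languages, get_language_info, random_words
    >>> 'cs' in available_languages(wordlist='large')  # Czech gained a large list in 2.1
    True
    >>> info = get_language_info('ja')  # script, normalization, and tokenizer details
    >>> words = random_words(lang='en', bits_per_word=12)  # 5 words, each from a 2^12-word pool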
@@ -274,9 +298,9 @@ also try to deal gracefully when you query it with texts that actually break
 into multiple tokens:
 
     >>> zipf_frequency('New York', 'en')
-    5.35
+    5.28
     >>> zipf_frequency('北京地铁', 'zh')  # "Beijing Subway"
-    3.54
+    3.57
 
 The word frequencies are combined with the half-harmonic-mean function in order
 to provide an estimate of what their combined frequency would be. In Chinese,
@@ -291,7 +315,7 @@ you give it an uncommon combination of tokens, it will hugely over-estimate
 their frequency:
 
     >>> zipf_frequency('owl-flavored', 'en')
-    3.18
+    3.2
 
 
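
A sketch of the half-harmonic-mean combination in its simplest form, assuming the per-token frequencies are already looked up (the library may apply further adjustments, for example in languages written without spaces):

    def half_harmonic_mean(token_freqs):
        # Any unknown token makes the whole phrase unknown
        if any(freq == 0.0 for freq in token_freqs):
            return 0.0
        # For two tokens this is f1*f2 / (f1 + f2): half their harmonic mean
        return 1.0 / sum(1.0 / freq for freq in token_freqs)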
 ## Multi-script languages
@@ -430,14 +454,14 @@ sources:
 - The Leeds Internet Corpus, from the University of Leeds Centre for Translation
   Studies (http://corpus.leeds.ac.uk/list.html)
 
-- The OpenSubtitles Frequency Word Lists, compiled by Hermit Dave
-  (https://invokeit.wordpress.com/frequency-word-lists/)
-
 - Wikipedia, the free encyclopedia (http://www.wikipedia.org)
 
+- ParaCrawl, a multilingual Web crawl (https://paracrawl.eu)
+
 It contains data from OPUS OpenSubtitles 2018
 (http://opus.nlpl.eu/OpenSubtitles.php), whose data originates from the
-OpenSubtitles project (http://www.opensubtitles.org/).
+OpenSubtitles project (http://www.opensubtitles.org/) and may be used with
+attribution to OpenSubtitles.
 
 It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK,
 SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al.
@@ -502,10 +526,6 @@ The same citation in BibTex format:
   Methods, 41 (4), 977-990.
   http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf
 
-- Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A. M., Bölte, J., & Böhl, A.
-  (2015). The word frequency effect. Experimental Psychology.
-  http://econtent.hogrefe.com/doi/abs/10.1027/1618-3169/a000123?journalCode=zea
-
 - Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A.
   (2011). The word frequency effect: A review of recent developments and
   implications for the choice of frequency estimates in German. Experimental
@@ -515,9 +535,6 @@ The same citation in BibTex format:
   frequencies based on film subtitles. PLoS One, 5(6), e10729.
   http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729
 
-- Dave, H. (2011). Frequency word lists.
-  https://invokeit.wordpress.com/frequency-word-lists/
-
 - Davis, M. (2012). Unicode text segmentation. Unicode Standard Annex, 29.
   http://unicode.org/reports/tr29/
 
@@ -535,11 +552,19 @@ The same citation in BibTex format:
   analyzer.
   http://mecab.sourceforge.net/
 
+- Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W., and Petrov,
+  S. (2012). Syntactic annotations for the Google Books Ngram Corpus.
+  Proceedings of the ACL 2012 system demonstrations, 169-174.
+  http://aclweb.org/anthology/P12-3029
+
 - Lison, P. and Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large
   Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th
   International Conference on Language Resources and Evaluation (LREC 2016).
   http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf
 
+- ParaCrawl (2018). Provision of Web-Scale Parallel Corpora for Official
+  European Languages. https://paracrawl.eu/
+
 - van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014).
   SUBTLEX-UK: A new and improved word frequency database for British English.
   The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.