wordfreq
Copyright 2022 Robyn Speer
# Attribution notes
Robyn Speer must be credited as Robyn Speer, which is her maiden name, used on academic work.
Crediting her as Elia Robyn Lake (her married name) will make the credit less effective, as it will
not line up with other work.
Crediting Robyn Speer by a different name than one of the above is a serious violation of the license,
in which case you do not have permission to use, copy, or redistribute wordfreq.
If you use wordfreq in academic work, you must cite it. See "Citing wordfreq" in README.md.
# Included licenses
`wordfreq` is freely redistributable under the Apache license (see
`LICENSE.txt`), and it includes data files that may be
redistributed under a Creative Commons Attribution-ShareAlike 4.0
license ().
`wordfreq` contains data extracted from Google Books Ngrams
() and Google Books Syntactic Ngrams
().
The terms of use of this data are:
> Ngram Viewer graphs and data may be freely used for any purpose, although
> acknowledgement of Google Books Ngram Viewer as the source, and inclusion
> of a link to http://books.google.com/ngrams, would be appreciated.
`wordfreq` also contains data derived from the following Creative Commons-licensed
sources:
- The Leeds Internet Corpus, from the University of Leeds Centre for Translation
Studies ()
- Wikipedia, the free encyclopedia ()
- ParaCrawl, a multilingual Web crawl ()
It contains data from OPUS OpenSubtitles 2018
(), whose data originates from the
OpenSubtitles project () and may be used with
attribution to OpenSubtitles.
It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK,
SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al.
(see citations below) and available at
.
I (Robyn Speer) have obtained permission by e-mail from Marc Brysbaert to
distribute these wordlists in wordfreq, to be used for any purpose, not just
for academic use, under these conditions:
- Wordfreq and code derived from it must credit the SUBTLEX authors.
- It must remain clear that SUBTLEX is freely available data.
These terms are similar to the Creative Commons Attribution-ShareAlike license.
Some additional data was collected by a custom application that watched the
streaming Twitter API, in accordance with Twitter's Developer Agreement &
Policy. This software gives statistics about words that were commonly used on
Twitter; it does not display or republish any Twitter content, and does not
contain any content from after Twitter's sale.