mirror of
https://github.com/rspeer/wordfreq.git
synced 2024-12-23 09:21:37 +00:00
Add wordfreq_data files.
Now the build process is repeatable from scratch, even if something goes
wrong with the download server.
Former-commit-id: 26c0d7dd28
parent 101e767ad9
commit 9163a67a9f
1
.gitignore
vendored
@@ -6,5 +6,4 @@ dist
pip-log.txt
.coverage
*~
wordfreq_data/
wordfreq-data.tar.gz
10
wordfreq_data/google/README.txt
Normal file
@@ -0,0 +1,10 @@
This data was compiled from the Google Books Ngram Viewer data, particularly
the 2012 English dataset.

The data is available from https://books.google.com/ngrams. The terms of use of
this data are:

"Ngram Viewer graphs and data may be freely used for any purpose, although
acknowledgement of Google Books Ngram Viewer as the source, and inclusion of a
link to http://books.google.com/ngrams, would be appreciated."
@@ -0,0 +1 @@
48b238cc5b3d359d0e8ac48f6321aca27c1ec098
5
wordfreq_data/leeds/README.txt
Normal file
@@ -0,0 +1,5 @@
These wordlists come from the University of Leeds Centre for Translation
Studies, and are provided for free under a Creative Commons Attribution
license.

For more information, see: http://corpus.leeds.ac.uk/list.html
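As an illustration of how one of these wordlists might be read, here is a minimal sketch. The diffs for the `.num` files are suppressed on this page, so the line layout used here (`<rank> <frequency> <word>`, separated by single spaces, with any other lines treated as headers and skipped) is an assumption, and `parse_num_lines` is a hypothetical helper rather than part of wordfreq.

```python
import re

# ASSUMPTION: each data line looks like "<rank> <frequency> <word>".
# The real files are not shown in this diff, so check before relying on this.
ENTRY_RE = re.compile(r"^(\d+) (\d+(?:\.\d+)?) (\S+)$")


def parse_num_lines(lines):
    """Yield (word, frequency) pairs, skipping lines that do not match."""
    for line in lines:
        match = ENTRY_RE.match(line.strip())
        if match:
            _rank, freq, word = match.groups()
            yield word, float(freq)


# Made-up sample data, not taken from the Leeds files.
sample = [
    "# hypothetical header line",
    "",
    "1 100.5 the",
    "2 50.25 of",
]
parsed = list(parse_num_lines(sample))
```

Skipping non-matching lines instead of raising keeps the sketch tolerant of whatever header or footer the real files contain.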
New normal files whose diffs are suppressed because they are too large
(line count, path):

100012  wordfreq_data/leeds/internet-ar-forms.num
 25004  wordfreq_data/leeds/internet-de-forms.num
100012  wordfreq_data/leeds/internet-el-forms.num
 25004  wordfreq_data/leeds/internet-en-forms.num
 50004  wordfreq_data/leeds/internet-es-forms.num
 50004  wordfreq_data/leeds/internet-fr-forms.num
 50004  wordfreq_data/leeds/internet-it-forms.num
 45004  wordfreq_data/leeds/internet-ja-forms.num
 25004  wordfreq_data/leeds/internet-pt-forms.num
 25004  wordfreq_data/leeds/internet-ru-forms.num
 50004  wordfreq_data/leeds/internet-zh-forms.num
 50015  wordfreq_data/leeds/rnc-modern.num.html
17
wordfreq_data/luminoso/README.txt
Normal file
@@ -0,0 +1,17 @@
This directory contains two wordlists we've put together at Luminoso for our
own purposes. You might find them useful as well.

* `twitter-52M` collects the unigram word frequencies from 52 million tweets.
  The words are not distinguished by language.

* `multi` combines various sources of data in different languages, including:

  * Google Books, for English
  * A smaller corpus of tweets that supposedly come from English speakers
    (there's still a lot of non-English text in there)
  * the Leeds corpora for various languages (see `../leeds/README.txt`)

We would like to release the tools that built `twitter-52M` as soon as they are
less sloppy. `multi` is a dataset that is mainly relevant because it's the data
we happen to already be using, but you might find it useful as well.
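To make "unigram word frequencies" concrete, here is a minimal sketch of counting them over a handful of texts. It is not the tooling that built `twitter-52M`, which this commit does not include. The whitespace tokenization, the lowercasing, and the names `unigram_frequencies` and `relative_frequencies` are all assumptions made for illustration.

```python
from collections import Counter


def unigram_frequencies(texts):
    """Count lowercased, whitespace-separated tokens across all texts.

    ASSUMPTION: naive whitespace tokenization; real tweets would need
    more careful handling of punctuation, URLs, and non-spaced scripts.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def relative_frequencies(counts):
    """Convert raw counts into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


# Made-up sample texts, not real tweets.
tweets = ["the cat sat", "The dog sat down"]
counts = unigram_frequencies(tweets)
freqs = relative_frequencies(counts)
```

Because the words are not distinguished by language, a single counter over all texts is enough here; a per-language list would need a language tag on each text.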
1
wordfreq_data/luminoso/multilingual.csv.REMOVED.git-id
Normal file
@@ -0,0 +1 @@
f24577ba6807c884bca4464a8624beda68d8df79
1
wordfreq_data/luminoso/twitter-52M.csv.REMOVED.git-id
Normal file
@@ -0,0 +1 @@
4c5a66db8a4190a173814a4d7b31b925c5b131d1