mirror of https://github.com/rspeer/wordfreq.git
synced 2024-12-23 17:31:41 +00:00
parent f7a4e2c444
commit 872556f7bb
README.md (10 lines changed)
@@ -26,7 +26,7 @@ install them on Ubuntu:
 ## Usage

 wordfreq provides access to estimates of the frequency with which a word is
-used, in 15 languages (see *Supported languages* below). It loads
+used, in 16 languages (see *Supported languages* below). It loads
 efficiently-packed data structures that contain all words that appear at least
 once per million words.

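Note on the hunk above: "at least once per million words" corresponds to a relative frequency cutoff of 1e-6. A minimal sketch of such a cutoff, assuming raw token counts held in a plain dict. `filter_by_frequency` and the sample counts are hypothetical and are not taken from wordfreq itself.

```python
def filter_by_frequency(counts, min_per_million=1.0):
    """Return {word: relative frequency} for words at or above the cutoff.

    Hypothetical helper for illustration; not wordfreq's implementation.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    kept = {}
    for word, count in counts.items():
        # Occurrences per million tokens of the corpus.
        per_million = count * 1_000_000 / total
        if per_million >= min_per_million:
            kept[word] = count / total
    return kept


# Made-up counts from an imaginary 2,000,000-token corpus.
sample_counts = {"the": 120_000, "zyzzyva": 1, "frequency": 40, "filler": 1_879_959}
frequent = filter_by_frequency(sample_counts)
```

Here "zyzzyva" occurs 0.5 times per million tokens and is dropped, while "frequency" (20 per million) is kept.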
@@ -122,13 +122,13 @@ of word usage on different topics at different levels of formality. The sources
 - **Twitter**: Messages sampled from Twitter's public stream
 - **Wikipedia**: The full text of Wikipedia in 2015

-The following 12 languages are well-supported, with reasonable tokenization and
+The following 14 languages are well-supported, with reasonable tokenization and
 at least 3 different sources of word frequencies:

 Language   Code     GBooks  SUBTLEX LeedsIC OpenSub Twitter Wikipedia
 ──────────────────┼──────────────────────────────────────────────────
 Arabic     ar     │ -       -       Yes     Yes     Yes     Yes
-German     de     │ -       Yes     Yes     Yes     Yes[1]  Yes
+German     de     │ -       Yes     Yes     -       Yes[1]  Yes
 Greek      el     │ -       -       Yes     Yes     Yes     Yes
 English    en     │ Yes     Yes     Yes     Yes     Yes     Yes
 Spanish    es     │ -       -       Yes     Yes     Yes     Yes
@@ -225,9 +225,7 @@ sources:

 It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK, and
 SUBTLEX-CH, created by Marc Brysbaert et al. and available at
-http://crr.ugent.be/programs-data/subtitle-frequencies. SUBTLEX was first
-published in this paper:
-
+http://crr.ugent.be/programs-data/subtitle-frequencies.

 I (Robyn Speer) have
 obtained permission by e-mail from Marc Brysbaert to distribute these wordlists

@@ -253,10 +253,8 @@ def subtlex_other_deps(dirname_in, languages):
 output_file = wordlist_filename('subtlex-other', language, 'counts.txt')
 textcol, freqcol = SUBTLEX_COLUMN_MAP[language]

-# Greek has three extra header lines for no reason
-if language == 'el':
-    startrow = 5
-else:
+# Skip one header line by setting 'startrow' to 2 (because tail is 1-based).
+# I hope we don't need to configure this by language anymore.
 startrow = 2

 add_dep(
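
Note on the `startrow` comment: it relies on 1-based row numbering, as in `tail -n +K`, where output begins at line K and K = 2 therefore drops exactly one header line. A rough Python equivalent, assuming the value is used that way by the build rule. `lines_from` and the sample rows are hypothetical and do not appear in this repository.

```python
from itertools import islice


def lines_from(lines, startrow):
    """Return an iterator over lines starting at 1-based row `startrow`.

    Mirrors `tail -n +startrow`. Hypothetical helper for illustration only.
    """
    if startrow < 1:
        raise ValueError("startrow is 1-based and must be >= 1")
    # islice is 0-based, so row N corresponds to index N - 1.
    return islice(lines, startrow - 1, None)


# Made-up SUBTLEX-style input: one header line followed by data rows.
rows = ["Word\tFREQcount", "the\t1501908", "of\t590439"]
data_rows = list(lines_from(rows, startrow=2))
```

With `startrow=2` only the header is skipped. The removed Greek branch used `startrow = 5`, which under the same rule would skip four lines (one header plus the three extra ones the old comment mentions).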