mirror of https://github.com/rspeer/wordfreq.git
synced 2025-01-15 05:36:01 +00:00
parent 1d4a18ead2
commit 9c08442dc5
10
README.md
@@ -26,7 +26,7 @@ install them on Ubuntu:
 ## Usage
 
 wordfreq provides access to estimates of the frequency with which a word is
-used, in 15 languages (see *Supported languages* below). It loads
+used, in 16 languages (see *Supported languages* below). It loads
 efficiently-packed data structures that contain all words that appear at least
 once per million words.
 
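The "at least once per million words" cutoff in the hunk above can be illustrated with a small sketch. This is not wordfreq's own code: `frequent_words`, `ONE_PER_MILLION`, and the toy counts are hypothetical names and made-up data, used only to show what a one-per-million threshold means for raw counts.

```python
from collections import Counter

# Assumed threshold: a relative frequency of one occurrence per million tokens.
ONE_PER_MILLION = 1e-6


def frequent_words(counts, threshold=ONE_PER_MILLION):
    """Return {word: relative frequency} for words at or above the threshold."""
    total = sum(counts.values())
    return {
        word: count / total
        for word, count in counts.items()
        if count / total >= threshold
    }


# Toy corpus of 2,000,000 tokens (made-up counts, for illustration only).
toy_counts = Counter({"the": 1_999_996, "zymurgy": 3, "qwfp": 1})
kept = frequent_words(toy_counts)
```

In this toy corpus, `zymurgy` (1.5 per million) is kept, while `qwfp` (0.5 per million) falls below the cutoff and is dropped.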
@@ -122,13 +122,13 @@ of word usage on different topics at different levels of formality. The sources
 - **Twitter**: Messages sampled from Twitter's public stream
 - **Wikipedia**: The full text of Wikipedia in 2015
 
-The following 12 languages are well-supported, with reasonable tokenization and
+The following 14 languages are well-supported, with reasonable tokenization and
 at least 3 different sources of word frequencies:
 
 Language    Code    GBooks  SUBTLEX LeedsIC OpenSub Twitter Wikipedia
 ──────────────────┼──────────────────────────────────────────────────
 Arabic      ar    │ -       -       Yes     Yes     Yes     Yes
-German      de    │ -       Yes     Yes     Yes     Yes[1]  Yes
+German      de    │ -       Yes     Yes     -       Yes[1]  Yes
 Greek       el    │ -       -       Yes     Yes     Yes     Yes
 English     en    │ Yes     Yes     Yes     Yes     Yes     Yes
 Spanish     es    │ -       -       Yes     Yes     Yes     Yes
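The "at least 3 different sources" criterion can be checked against the rows visible in this hunk. The sketch below is only an illustration: `TABLE`, `source_count`, and `well_supported` are hypothetical names, the rows are copied from the new side of the table above, and the "reasonable tokenization" half of the criterion is not modeled.

```python
# Column order as shown in the table header.
SOURCES = ["GBooks", "SUBTLEX", "LeedsIC", "OpenSub", "Twitter", "Wikipedia"]

# Rows from the new side of the diff; "Yes[1]" (footnoted) is treated as "Yes".
TABLE = {
    "ar": ["-", "-", "Yes", "Yes", "Yes", "Yes"],
    "de": ["-", "Yes", "Yes", "-", "Yes[1]", "Yes"],
    "el": ["-", "-", "Yes", "Yes", "Yes", "Yes"],
    "en": ["Yes", "Yes", "Yes", "Yes", "Yes", "Yes"],
    "es": ["-", "-", "Yes", "Yes", "Yes", "Yes"],
}


def source_count(cells):
    """Count the cells that mark a source as available."""
    return sum(cell.startswith("Yes") for cell in cells)


def well_supported(table, minimum=3):
    """Return the language codes that have at least `minimum` sources."""
    return sorted(
        code for code, cells in table.items() if source_count(cells) >= minimum
    )
```

With the OpenSub entry dropped, German still has four sources in this table, so it stays above the three-source minimum.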
@@ -225,9 +225,7 @@ sources:
 
 It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK, and
 SUBTLEX-CH, created by Marc Brysbaert et al. and available at
-http://crr.ugent.be/programs-data/subtitle-frequencies. SUBTLEX was first
-published in this paper:
-
+http://crr.ugent.be/programs-data/subtitle-frequencies.
 
 I (Rob Speer) have
 obtained permission by e-mail from Marc Brysbaert to distribute these wordlists
@@ -253,11 +253,9 @@ def subtlex_other_deps(dirname_in, languages):
         output_file = wordlist_filename('subtlex-other', language, 'counts.txt')
         textcol, freqcol = SUBTLEX_COLUMN_MAP[language]
 
-        # Greek has three extra header lines for no reason
-        if language == 'el':
-            startrow = 5
-        else:
-            startrow = 2
+        # Skip one header line by setting 'startrow' to 2 (because tail is 1-based).
+        # I hope we don't need to configure this by language anymore.
+        startrow = 2
 
         add_dep(
             lines, 'convert_subtlex', input_file, processed_file,
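The comment in this hunk relies on `tail` counting rows from 1, so `startrow = 2` skips exactly one header line and the old Greek value of 5 skipped four. A minimal Python sketch of that convention follows. It assumes the `convert_subtlex` rule behaves like `tail -n +startrow`, which the diff does not show, and `skip_to_row` is a hypothetical helper rather than part of the build code.

```python
from itertools import islice


def skip_to_row(lines, startrow):
    """Return an iterator over `lines` beginning at 1-based row `startrow`.

    Mirrors `tail -n +startrow`: row 1 is the first line, so startrow=2
    drops one header line.
    """
    if startrow < 1:
        raise ValueError("startrow is 1-based and must be >= 1")
    return islice(lines, startrow - 1, None)


# One header line followed by two data rows (made-up sample input).
rows = ["Word\tFreq", "de\t100", "la\t80"]
data = list(skip_to_row(rows, 2))
```

Here `data` holds only the two data rows. Calling `skip_to_row(rows, 1)` would keep the header as well.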