update version and documentation

This commit is contained in:
parent 5d6a41499b
commit 2563eb8d72

CHANGELOG.md (45 changed lines)
@@ -1,9 +1,34 @@
# Changelog

## Version 3.0 (2022-03-10)

This is the "handle numbers better" release.

Previously, wordfreq would group all digit sequences of the same 'shape',
with length 2 or more, into a single token and return the frequency of that
token, which would be a vast overestimate.

Now it distributes the frequency over all numbers of that shape, with an
estimated distribution that allows for Benford's law (lower numbers are more
frequent) and a special frequency distribution for 4-digit numbers that look
like years (2010 is more frequent than 1020).
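
For illustration (a doctest-style sketch; the exact frequencies depend on the
wordlist data, but the ordering follows from the distribution described above):

    >>> from wordfreq import word_frequency
    >>> word_frequency('2010', 'en') > word_frequency('1020', 'en')
    True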

Relatedly:

- Functions such as `iter_wordlist` and `top_n_list` no longer return
  multi-digit numbers (they used to return them in their "smashed" form, such
  as "0000").

- `lossy_tokenize` no longer replaces digit sequences with 0s. That happens
  instead in a place that's internal to the `word_frequency` function, so we can
  look at the values of the digits before they're replaced.

## Version 2.5.1 (2021-09-02)

- Import ftfy and use its `uncurl_quotes` method to turn curly quotes into
  straight ones, providing consistency with multiple forms of apostrophes.

-- Set minimum version requierements on `regex`, `jieba`, and `langcodes`
+- Set minimum version requirements on `regex`, `jieba`, and `langcodes`
  so that tokenization will give consistent results.

- Workaround an inconsistency in the `msgpack` API around
@@ -83,7 +108,6 @@ Library changes:

- Fixed calling `msgpack.load` with a deprecated parameter.

## Version 2.2 (2018-07-24)

Library change:
@@ -104,7 +128,6 @@ Data changes:
- The input data includes the change to tokenization described above, giving
  us word frequencies for words such as "l@s".

## Version 2.1 (2018-06-18)

Data changes:
@@ -125,7 +148,6 @@ Library changes:
  in `/usr/lib/x86_64-linux-gnu/mecab`, which is where Ubuntu 18.04 puts them
  when they are installed from source.

## Version 2.0.1 (2018-05-01)

Fixed edge cases that inserted spurious token boundaries when Japanese text is
@@ -148,8 +170,6 @@ use the iteration mark 々.
This change does not affect any word frequencies. (The Japanese word list uses
`wordfreq.mecab` for tokenization, not `simple_tokenize`.)

## Version 2.0 (2018-03-14)

The big change in this version is that text preprocessing, tokenization, and
@@ -212,7 +232,6 @@ Nitty gritty dependency changes:

[exquisite-corpus]: https://github.com/LuminosoInsight/exquisite-corpus

## Version 1.7.0 (2017-08-25)

- Tokenization will always keep Unicode graphemes together, including
@@ -223,7 +242,6 @@ Nitty gritty dependency changes:
- Support Bengali and Macedonian, which passed the threshold of having enough
  source data to be included

## Version 1.6.1 (2017-05-10)

- Depend on langcodes 1.4, with a new language-matching system that does not
@@ -232,13 +250,12 @@ Nitty gritty dependency changes:
  This prevents silly conflicts where langcodes' SQLite connection was
  preventing langcodes from being used in threads.

## Version 1.6.0 (2017-01-05)

- Support Czech, Persian, Ukrainian, and Croatian/Bosnian/Serbian
- Add large lists in Chinese, Finnish, Japanese, and Polish
- Data is now collected and built using Exquisite Corpus
-  (https://github.com/LuminosoInsight/exquisite-corpus)
+  (<https://github.com/LuminosoInsight/exquisite-corpus>)
- Add word frequencies from OPUS OpenSubtitles 2016
- Add word frequencies from the MOKK Hungarian Webcorpus
- Expand Google Books Ngrams data to cover 8 languages
@@ -255,13 +272,11 @@ Nitty gritty dependency changes:
- Another new frequency-merging strategy (drop the highest and lowest,
  average the rest)

## Version 1.5.1 (2016-08-19)

- Bug fix: Made it possible to load the Japanese or Korean dictionary when the
  other one is not available

## Version 1.5.0 (2016-08-08)

- Include word frequencies learned from the Common Crawl
@@ -280,7 +295,6 @@ Nitty gritty dependency changes:

[Announcement blog post](https://blog.conceptnet.io/2016/08/22/wordfreq-1-5-more-data-more-languages-more-accuracy)

## Version 1.4 (2016-06-02)

- Add large lists in English, German, Spanish, French, and Portuguese
@@ -288,12 +302,10 @@ Nitty gritty dependency changes:

[Announcement blog post](https://blog.conceptnet.io/2016/06/02/wordfreq-1-4-more-words-plus-word-frequencies-from-reddit/)

## Version 1.3 (2016-01-14)

- Add Reddit comments as an English source

## Version 1.2 (2015-10-29)

- Add SUBTLEX data
@@ -307,14 +319,12 @@ Nitty gritty dependency changes:

[Announcement blog post](https://blog.luminoso.com/2015/10/29/wordfreq-1-2-is-better-at-chinese-english-greek-polish-swedish-and-turkish/)

## Version 1.1 (2015-08-25)

- Use the 'regex' package to implement Unicode tokenization that's mostly
  consistent across languages
- Use NFKC normalization in Japanese and Arabic

## Version 1.0 (2015-07-28)

- Create compact word frequency lists in English, Arabic, German, Spanish,
@@ -322,4 +332,3 @@ Nitty gritty dependency changes:
- Marginal support for Greek, Korean, Chinese
- Fresh start, dropping compatibility with wordfreq 0.x and its unreasonably
  large downloads

README.md (123 changed lines)
@@ -3,7 +3,6 @@ languages, based on many sources of data.

Author: Robyn Speer

## Installation

wordfreq requires Python 3 and depends on a few other Python modules
@@ -19,7 +18,6 @@ or by getting the repository and running its setup.py:
See [Additional CJK installation](#additional-cjk-installation) for extra
steps that are necessary to get Chinese, Japanese, and Korean word frequencies.

## Usage

wordfreq provides access to estimates of the frequency with which a word is
@@ -56,7 +54,6 @@ frequency as a decimal between 0 and 1.

    >>> word_frequency('café', 'fr')
    5.75e-05

`zipf_frequency` is a variation on `word_frequency` that aims to return the
word frequency on a human-friendly logarithmic scale. The Zipf scale was
proposed by Marc Brysbaert, who created the SUBTLEX lists. The Zipf frequency
@@ -86,7 +83,6 @@ one occurrence per billion words.

    >>> zipf_frequency('zipf', 'en', wordlist='small')
    0.0

The parameters to `word_frequency` and `zipf_frequency` are:

- `word`: a Unicode string containing the word to look up. Ideally the word
@@ -103,7 +99,6 @@ The parameters to `word_frequency` and `zipf_frequency` are:
  value contained in the wordlist, to avoid a discontinuity where the wordlist
  ends.

## Frequency bins

wordfreq's wordlists are designed to load quickly and take up little space in
@@ -120,12 +115,11 @@ Because the Zipf scale is a logarithmic scale, this preserves the same relative
precision no matter how far down you are in the word list. The frequency of any
word is precise to within 1%.

-(This is not a claim about _accuracy_, but about _precision_. We believe that
+(This is not a claim about *accuracy*, but about *precision*. We believe that
the way we use multiple data sources and discard outliers makes wordfreq a
more accurate measurement of the way these words are really used in written
language, but it's unclear how one would measure this accuracy.)
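
A small sketch of the arithmetic behind that precision claim (not wordfreq's
actual code; it assumes values are stored as Zipf values rounded to two decimal
places, and uses the standard definition of a Zipf value as the base-10
logarithm of a word's frequency per billion words):

```python
import math

def freq_to_zipf(freq):
    # Zipf value: log10 of occurrences per billion words
    return math.log10(freq) + 9.0

def zipf_to_freq(zipf):
    return 10.0 ** (zipf - 9.0)

# Rounding to two decimal places on the Zipf scale perturbs a frequency by
# at most a factor of 10 ** 0.005, i.e. by about 1%.
freq = 5.75e-05
stored = zipf_to_freq(round(freq_to_zipf(freq), 2))
print(stored, abs(stored - freq) / freq)
```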

## The figure-skating metric

We combine word frequencies from different sources in a way that's designed
@@ -137,6 +131,68 @@ in Olympic figure skating:

- Average the remaining frequencies.
- Rescale the resulting frequency list to add up to 1.
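
Below is a minimal sketch of that merging procedure, assuming each source is a
dict mapping words to frequencies (an illustration of the strategy, not the
code wordfreq actually uses):

```python
def combine_sources(sources):
    """Merge several {word: frequency} dicts, figure-skating style."""
    combined = {}
    for word in set().union(*(source.keys() for source in sources)):
        votes = sorted(source[word] for source in sources if word in source)
        if len(votes) >= 3:
            votes = votes[1:-1]  # drop the highest and lowest "judge"
        combined[word] = sum(votes) / len(votes)
    # Rescale so the merged frequencies add up to 1
    total = sum(combined.values())
    return {word: freq / total for word, freq in combined.items()}
```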

## Numbers

These wordlists would be enormous if they stored a separate frequency for every
number, such as if we separately stored the frequencies of 484977 and 484978
and 98.371 and every other 6-character sequence that could be considered a number.

Instead, we have a frequency-bin entry for every number of the same "shape", such
as `##` or `####` or `#.#####`, with `#` standing in for digits. (For compatibility
with earlier versions of wordfreq, our stand-in character is actually `0`.) This
is the same form of aggregation that the word2vec vocabulary does.

Single-digit numbers are unaffected by this "binning" process; "0" through "9" have
their own entries in each language's wordlist.
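
For intuition, the "smashing" of a number into its shape can be sketched in a
couple of lines (a simplification; wordfreq's own handling also covers
non-ASCII digits and happens internally to `word_frequency`):

```python
import re

def digit_shape(token):
    """Replace every ASCII digit with '0', so '484977' and '484978'
    fall into the same bin, and '98.371' maps to '00.000'."""
    return re.sub(r"[0-9]", "0", token)

assert digit_shape("484977") == digit_shape("484978") == "000000"
assert digit_shape("98.371") == "00.000"
```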

When asked for the frequency of a token containing multiple digits, we multiply
the frequency of that aggregated entry by a distribution estimating the frequency
of those digits. The distribution only looks at two things:

- The value of the first digit
- Whether it is a 4-digit sequence that's likely to represent a year

The first digits are assigned probabilities by Benford's law, and years are assigned
probabilities from a distribution that peaks at the "present". I explored this in
a Twitter thread at <https://twitter.com/r_speer/status/1493715982887571456>.

The part of this distribution representing the "present" is not strictly a peak;
it's a 20-year-long plateau from 2019 to 2039. (2019 is the last time Google Books
Ngrams was updated, and 2039 is a time by which I will probably have figured out
a new distribution.)
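
For reference, Benford's law itself is easy to write down. This snippet just
prints the leading-digit probabilities it assigns; it is not wordfreq's
internal code, and wordfreq combines these weights with the year distribution
described above:

```python
import math

def benford(d):
    """Probability that a number's leading digit is d, under Benford's law."""
    return math.log10(1 + 1 / d)

for d in range(1, 10):
    print(d, round(benford(d), 3))
# 1 comes out near 0.301 and 9 near 0.046, which is why a number starting
# with a low digit gets a larger share of its bin's frequency.
```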

Some examples:

    >>> word_frequency("2022", "en")
    5.15e-05
    >>> word_frequency("1922", "en")
    8.19e-06
    >>> word_frequency("1022", "en")
    1.28e-07

Aside from years, the distribution does **not** care about the meaning of the numbers:

    >>> word_frequency("90210", "en")
    3.34e-10
    >>> word_frequency("92222", "en")
    3.34e-10
    >>> word_frequency("802.11n", "en")
    9.04e-13
    >>> word_frequency("899.19n", "en")
    9.04e-13

The digit rule applies to other systems of digits, and only cares about the numeric
value of the digits:

    >>> word_frequency("٥٤", "ar")
    6.64e-05
    >>> word_frequency("54", "ar")
    6.64e-05

It doesn't know which language uses which writing system for digits:

    >>> word_frequency("٥٤", "en")
    5.4e-05

## Sources and supported languages

@@ -227,7 +283,6 @@ Some languages provide 'large' wordlists, including words with a Zipf frequency
between 1.0 and 3.0. These are available in 14 languages that are covered by
enough data sources.
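
For example, the `wordlist` parameter lets you ask for the large list
explicitly (an illustrative call; which languages have a large list, and the
exact value returned, depend on the packaged data):

```python
from wordfreq import zipf_frequency

# Rare words that fall below the standard list's cutoff can still get a
# nonzero value from the 'large' English list (compare the wordlist='small'
# example above, which returns 0.0 for 'zipf').
print(zipf_frequency('zipf', 'en', wordlist='large'))
```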

## Other functions

`tokenize(text, lang)` splits text in the given language into words, in the same
@@ -273,7 +328,6 @@ ASCII. But maybe you should just use [xkpa][].

[xkcd936]: https://xkcd.com/936/
[xkpa]: https://github.com/beala/xkcd-password

## Tokenization

wordfreq uses the Python package `regex`, which is a more advanced
@@ -335,7 +389,6 @@ their frequency:

    >>> zipf_frequency('owl-flavored', 'en')
    3.3
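
The `tokenize` function described under "Other functions" above is what
performs this splitting; a minimal usage sketch (the output in the comment is
what we would expect from the splitting behavior, not a recorded session):

```python
from wordfreq import tokenize

# tokenize splits text the same way the underlying wordlists were counted;
# we'd expect the hyphenated phrase to come apart into its word tokens.
print(tokenize("owl-flavored coffee", "en"))
# expected: ['owl', 'flavored', 'coffee']
```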

## Multi-script languages

Two of the languages we support, Serbian and Chinese, are written in multiple
@@ -358,7 +411,6 @@ Enumerating the Chinese wordlist will produce some unfamiliar words, because
people don't actually write in Oversimplified Chinese, and because in
practice Traditional and Simplified Chinese also have different word usage.

## Similar, overlapping, and varying languages

As much as we would like to give each language its own distinct code and its
@@ -384,7 +436,6 @@ module to find the best match for a language code. If you ask for word
frequencies in `cmn-Hans` (the fully specific language code for Mandarin in
Simplified Chinese), you will get the `zh` wordlist, for example.
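
In code, this matching is transparent; a small sketch of what the paragraph
above describes (the particular word is just an example; the point is that
both language codes should resolve to the same `zh` wordlist):

```python
from wordfreq import word_frequency

# 'cmn-Hans' (Mandarin in Simplified Chinese) should resolve to the 'zh'
# wordlist, so these two lookups are expected to return the same value.
print(word_frequency('谢谢', 'cmn-Hans'))
print(word_frequency('谢谢', 'zh'))
```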

## Additional CJK installation

Chinese, Japanese, and Korean have additional external dependencies so that
@@ -399,17 +450,16 @@ and `mecab-ko-dic`.

As of version 2.4.2, you no longer have to install dictionaries separately.

## License

`wordfreq` is freely redistributable under the MIT license (see
`MIT-LICENSE.txt`), and it includes data files that may be
redistributed under a Creative Commons Attribution-ShareAlike 4.0
-license (https://creativecommons.org/licenses/by-sa/4.0/).
+license (<https://creativecommons.org/licenses/by-sa/4.0/>).

`wordfreq` contains data extracted from Google Books Ngrams
-(http://books.google.com/ngrams) and Google Books Syntactic Ngrams
-(http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html).
+(<http://books.google.com/ngrams>) and Google Books Syntactic Ngrams
+(<http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html>).
The terms of use of this data are:

    Ngram Viewer graphs and data may be freely used for any purpose, although
@@ -420,21 +470,21 @@ The terms of use of this data are:
sources:

- The Leeds Internet Corpus, from the University of Leeds Centre for Translation
-  Studies (http://corpus.leeds.ac.uk/list.html)
+  Studies (<http://corpus.leeds.ac.uk/list.html>)

-- Wikipedia, the free encyclopedia (http://www.wikipedia.org)
+- Wikipedia, the free encyclopedia (<http://www.wikipedia.org>)

-- ParaCrawl, a multilingual Web crawl (https://paracrawl.eu)
+- ParaCrawl, a multilingual Web crawl (<https://paracrawl.eu>)

It contains data from OPUS OpenSubtitles 2018
-(http://opus.nlpl.eu/OpenSubtitles.php), whose data originates from the
-OpenSubtitles project (http://www.opensubtitles.org/) and may be used with
+(<http://opus.nlpl.eu/OpenSubtitles.php>), whose data originates from the
+OpenSubtitles project (<http://www.opensubtitles.org/>) and may be used with
attribution to OpenSubtitles.

It contains data from various SUBTLEX word lists: SUBTLEX-US, SUBTLEX-UK,
SUBTLEX-CH, SUBTLEX-DE, and SUBTLEX-NL, created by Marc Brysbaert et al.
(see citations below) and available at
-http://crr.ugent.be/programs-data/subtitle-frequencies.
+<http://crr.ugent.be/programs-data/subtitle-frequencies>.

I (Robyn Speer) have obtained permission by e-mail from Marc Brysbaert to
distribute these wordlists in wordfreq, to be used for any purpose, not just
@@ -450,7 +500,6 @@ streaming Twitter API, in accordance with Twitter's Developer Agreement &
Policy. This software gives statistics about words that are commonly used on
Twitter; it does not display or republish any Twitter content.

## Citing wordfreq

If you use wordfreq in your research, please cite it! We publish the code
@@ -459,8 +508,7 @@ citation is:

> Robyn Speer, Joshua Chin, Andrew Lin, Sara Jewett, & Lance Nathan.
> (2018, October 3). LuminosoInsight/wordfreq: v2.2. Zenodo.
-> https://doi.org/10.5281/zenodo.1443582
+> <https://doi.org/10.5281/zenodo.1443582>

The same citation in BibTex format:

@@ -479,20 +527,19 @@ The same citation in BibTex format:
}
```

## Citations to work that wordfreq is built on

- Bojar, O., Chatterjee, R., Federmann, C., Haddow, B., Huck, M., Hokamp, C.,
  Koehn, P., Logacheva, V., Monz, C., Negri, M., Post, M., Scarton, C.,
  Specia, L., & Turchi, M. (2015). Findings of the 2015 Workshop on Statistical
  Machine Translation.
-  http://www.statmt.org/wmt15/results.html
+  <http://www.statmt.org/wmt15/results.html>

- Brysbaert, M. & New, B. (2009). Moving beyond Kucera and Francis: A Critical
  Evaluation of Current Word Frequency Norms and the Introduction of a New and
  Improved Word Frequency Measure for American English. Behavior Research
  Methods, 41 (4), 977-990.
-  http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf
+  <http://sites.google.com/site/borisnew/pub/BrysbaertNew2009.pdf>

- Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A.
  (2011). The word frequency effect: A review of recent developments and
@@ -501,45 +548,45 @@ The same citation in BibTex format:

- Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character
  frequencies based on film subtitles. PLoS One, 5(6), e10729.
-  http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729
+  <http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0010729>

- Davis, M. (2012). Unicode text segmentation. Unicode Standard Annex, 29.
-  http://unicode.org/reports/tr29/
+  <http://unicode.org/reports/tr29/>

- Halácsy, P., Kornai, A., Németh, L., Rung, A., Szakadát, I., & Trón, V.
  (2004). Creating open language resources for Hungarian. In Proceedings of the
  4th international conference on Language Resources and Evaluation (LREC2004).
-  http://mokk.bme.hu/resources/webcorpus/
+  <http://mokk.bme.hu/resources/webcorpus/>

- Keuleers, E., Brysbaert, M. & New, B. (2010). SUBTLEX-NL: A new frequency
  measure for Dutch words based on film subtitles. Behavior Research Methods,
  42(3), 643-650.
-  http://crr.ugent.be/papers/SUBTLEX-NL_BRM.pdf
+  <http://crr.ugent.be/papers/SUBTLEX-NL_BRM.pdf>

- Kudo, T. (2005). Mecab: Yet another part-of-speech and morphological
  analyzer.
-  http://mecab.sourceforge.net/
+  <http://mecab.sourceforge.net/>

- Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W., and Petrov,
  S. (2012). Syntactic annotations for the Google Books Ngram Corpus.
  Proceedings of the ACL 2012 system demonstrations, 169-174.
-  http://aclweb.org/anthology/P12-3029
+  <http://aclweb.org/anthology/P12-3029>

- Lison, P. and Tiedemann, J. (2016). OpenSubtitles2016: Extracting Large
  Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th
  International Conference on Language Resources and Evaluation (LREC 2016).
-  http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf
+  <http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf>

- Ortiz Suárez, P. J., Sagot, B., and Romary, L. (2019). Asynchronous pipelines
  for processing huge corpora on medium to low resource infrastructures. In
  Proceedings of the Workshop on Challenges in the Management of Large Corpora
  (CMLC-7) 2019.
-  https://oscar-corpus.com/publication/2019/clmc7/asynchronous/
+  <https://oscar-corpus.com/publication/2019/clmc7/asynchronous/>

- ParaCrawl (2018). Provision of Web-Scale Parallel Corpora for Official
-  European Languages. https://paracrawl.eu/
+  European Languages. <https://paracrawl.eu/>

- van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014).
  SUBTLEX-UK: A new and improved word frequency database for British English.
  The Quarterly Journal of Experimental Psychology, 67(6), 1176-1190.
-  http://www.tandfonline.com/doi/pdf/10.1080/17470218.2013.850521
+  <http://www.tandfonline.com/doi/pdf/10.1080/17470218.2013.850521>

pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "wordfreq"
-version = "2.6.0"
+version = "3.0.0"
description = "Look up the frequencies of words in many languages, based on many sources of data."
authors = ["Robyn Speer <rspeer@arborelia.net>"]
license = "MIT"

setup.py (2 changed lines)
@@ -33,7 +33,7 @@ dependencies = [

setup(
    name="wordfreq",
-    version='2.6.0',
+    version='3.0.0',
    maintainer='Robyn Speer',
    maintainer_email='rspeer@arborelia.net',
    url='http://github.com/rspeer/wordfreq/',

@@ -16,6 +16,11 @@ def test_decimals():
    assert word_frequency("3,14", "de") == word_frequency("3,15", "de")


def test_eastern_arabic():
    assert word_frequency("٥٤", "ar") == word_frequency("٥٣", "ar")
    assert word_frequency("٤٣", "ar") > word_frequency("٥٤", "ar")


def test_year_distribution():
    assert word_frequency("2010", "en") > word_frequency("1010", "en")
    assert word_frequency("2010", "en") > word_frequency("3010", "en")