Version 1.7.0 (2017-08-25)
- Tokenization will always keep Unicode graphemes together, including complex emoji introduced in Unicode 10
- Update the Wikipedia source data to April 2017
- Remove some non-words, such as the Unicode replacement character and the pilcrow sign, from frequency lists
- Support Bengali and Macedonian, which passed the threshold of having enough source data to be included
Version 1.6.1 (2017-05-10)
- Depend on langcodes 1.4, with a new language-matching system that does not depend on SQLite. This prevents conflicts where langcodes' SQLite connection kept it from being used in threads.
Version 1.6.0 (2017-01-05)
- Support Czech, Persian, Ukrainian, and Croatian/Bosnian/Serbian
- Add large lists in Chinese, Finnish, Japanese, and Polish
- Data is now collected and built using Exquisite Corpus (https://github.com/LuminosoInsight/exquisite-corpus)
- Add word frequencies from OPUS OpenSubtitles 2016
- Add word frequencies from the MOKK Hungarian Webcorpus
- Expand Google Books Ngrams data to cover 8 languages
- Expand language detection on Reddit to cover 13 languages with large enough Reddit communities
- Drop the Common Crawl; we have enough good sources now that we don't have to deal with all that spam
- Add automatic transliteration of Serbian text
- Adjust tokenization of apostrophes next to vowel sounds: the French word "l'heure" is now tokenized similarly to "l'arc"
- Multi-digit numbers of each length are smashed into the same word frequency, to remove meaningless differences and increase compatibility with word2vec. (Internally, their digits are replaced by zeroes.)
- Another new frequency-merging strategy (drop the highest and lowest, average the rest)
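The number-smashing and frequency-merging changes above can be sketched in a few lines of Python. This is an illustrative sketch of the two described strategies, not wordfreq's actual implementation; the function names are hypothetical.

```python
import re

def smash_digits(token):
    """Replace every digit with '0', so multi-digit numbers of the same
    length collapse into one frequency entry (e.g. years, phone numbers)."""
    return re.sub(r"\d", "0", token)

def merge_frequencies(estimates):
    """Merge frequency estimates from several sources by dropping the
    highest and lowest, then averaging the rest."""
    if len(estimates) <= 2:
        return sum(estimates) / len(estimates)
    trimmed = sorted(estimates)[1:-1]
    return sum(trimmed) / len(trimmed)
```

For example, `smash_digits("2016")` gives `"0000"`, so "2016" and "1987" share one frequency entry, while "7" keeps its own.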
Version 1.5.1 (2016-08-19)
- Bug fix: Made it possible to load the Japanese or Korean dictionary when the other one is not available
Version 1.5.0 (2016-08-08)
- Include word frequencies learned from the Common Crawl
- Support Bulgarian, Catalan, Danish, Finnish, Hebrew, Hindi, Hungarian, Norwegian Bokmål, and Romanian
- Improve Korean with MeCab tokenization
- New frequency-merging strategy (weighted median)
- Include Wikipedia as a Chinese source (mostly Traditional)
- Include Reddit as a Spanish source
- Remove Greek Twitter because its data is poorly language-detected
- Add large lists in Arabic, Dutch, Italian
- Remove marks from more languages
- Deal with commas and cedillas in Turkish and Romanian
- Fix tokenization of Southeast and South Asian scripts
- Clean up Git history by removing unused large files
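The weighted-median merging strategy named above can be sketched generically: pick the value at which cumulative source weight reaches half the total. Again a hedged sketch, not wordfreq's exact code.

```python
def weighted_median(values, weights):
    """Return the value whose cumulative weight first reaches half the
    total weight, i.e. the weighted median of the estimates."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    return pairs[-1][0]
```

Compared with a plain average, the weighted median lets a couple of trusted, heavily weighted sources outvote a noisy outlier.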
Version 1.4 (2016-06-02)
- Add large lists in English, German, Spanish, French, and Portuguese
- Add zipf_frequency function
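The Zipf scale that zipf_frequency reports is the base-10 logarithm of a word's occurrences per billion words; the conversion from a raw frequency (a proportion between 0 and 1) can be sketched as follows. The helper name here is illustrative, not part of wordfreq's API.

```python
from math import log10

def zipf_from_frequency(freq):
    """Convert a word frequency (proportion of all tokens) to the Zipf
    scale: log10 of occurrences per billion words. A word occurring once
    per thousand words has Zipf value 6."""
    return log10(freq) + 9.0
```

In wordfreq itself, `zipf_frequency('the', 'en')` returns this value directly for a word, with common English words landing roughly in the 4–8 range.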
Version 1.3 (2016-01-14)
- Add Reddit comments as an English source
Version 1.2 (2015-10-29)
- Add SUBTLEX data
- Better support for Chinese, using Jieba for tokenization, and mapping Traditional Chinese characters to Simplified
- Improve Greek
- Add Polish, Swedish, and Turkish
- Tokenizer can optionally preserve punctuation
- Detect when sources stripped "'t" off of English words, and repair their frequencies
Version 1.1 (2015-08-25)
- Use the 'regex' package to implement Unicode tokenization that's mostly consistent across languages
- Use NFKC normalization in Japanese and Arabic
Version 1.0 (2015-07-28)
- Create compact word frequency lists in English, Arabic, German, Spanish, French, Indonesian, Japanese, Malay, Dutch, Portuguese, and Russian
- Marginal support for Greek, Korean, Chinese
- Fresh start, dropping compatibility with wordfreq 0.x and its unreasonably large downloads