wordfreq/CHANGELOG.md at ff5a8f2a653a47c5a2c16ef0d25d470df853782d

Robyn Speer c0fbd844f6 Add a changelog

2016-08-22 12:41:39 -04:00

Version 1.5.1 (2016-08-19)

Bug fix: Made it possible to load the Japanese or Korean dictionary when the other one is not available

Include word frequencies learned from the Common Crawl
Support Bulgarian, Catalan, Danish, Finnish, Hebrew, Hindi, Hungarian, Norwegian Bokmål, and Romanian
Improve Korean with MeCab tokenization
New frequency-merging strategy (weighted median)
Include Wikipedia as a Chinese source (mostly Traditional)
Include Reddit as a Spanish source
Remove Greek Twitter because its data is poorly language-detected
Add large lists in Arabic, Dutch, Italian
Remove marks from more languages
Deal with commas and cedillas in Turkish and Romanian
Fix tokenization of Southeast and South Asian scripts
Clean up Git history by removing unused large files
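
The comma and cedilla item above can be illustrated with a small sketch. It assumes that the fix amounts to mapping look-alike letters onto the form each language conventionally uses: comma-below letters for Romanian, cedilla letters for Turkish. The helper names `normalize_romanian` and `normalize_turkish` are hypothetical and are not taken from wordfreq itself.

```python
# Illustrative sketch only: hypothetical helpers, not wordfreq's actual code.
# Romanian is conventionally written with comma-below letters, Turkish with cedillas.

CEDILLA_TO_COMMA = str.maketrans({
    "\u015e": "\u0218",  # S with cedilla (capital) -> S with comma below
    "\u015f": "\u0219",  # s with cedilla (small)   -> s with comma below
    "\u0162": "\u021a",  # T with cedilla (capital) -> T with comma below
    "\u0163": "\u021b",  # t with cedilla (small)   -> t with comma below
})

COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # S with comma below (capital) -> S with cedilla
    "\u0219": "\u015f",  # s with comma below (small)   -> s with cedilla
})


def normalize_romanian(text: str) -> str:
    """Replace cedilla letters with their comma-below counterparts."""
    return text.translate(CEDILLA_TO_COMMA)


def normalize_turkish(text: str) -> str:
    """Replace comma-below S with the cedilla S used in Turkish."""
    return text.translate(COMMA_TO_CEDILLA)
```

With a mapping like this, two spellings of the same word that differ only in the choice of code point collapse into one entry, so their frequencies are counted together.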

Add SUBTLEX data
Better support for Chinese, using Jieba for tokenization, and mapping Traditional Chinese characters to Simplified
Improve Greek
Add Polish, Swedish, and Turkish
Tokenizer can optionally preserve punctuation
Detect when sources stripped "'t" off English words, and repair their frequencies

Use the 'regex' package to implement Unicode tokenization that's mostly consistent across languages
Use NFKC normalization in Japanese and Arabic
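
As a rough illustration of what NFKC normalization does for these two languages, the sketch below uses Python's standard `unicodedata` module. The wrapper name `nfkc` and the sample strings are chosen here for illustration and do not come from wordfreq.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization (compatibility decomposition, then composition)."""
    return unicodedata.normalize("NFKC", text)


# Half-width katakana (U+FF76 U+FF80 U+FF76 U+FF85) become ordinary katakana.
halfwidth_katakana = "\uff76\uff80\uff76\uff85"
print(nfkc(halfwidth_katakana))

# The Arabic presentation-form ligature lam-alef (U+FEFB) becomes the two
# ordinary letters lam (U+0644) and alef (U+0627).
lam_alef_ligature = "\ufefb"
print([hex(ord(ch)) for ch in nfkc(lam_alef_ligature)])
```

Normalizing this way means that visually equivalent strings written with compatibility characters are treated as the same word.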

Create compact word frequency lists in English, Arabic, German, Spanish, French, Indonesian, Japanese, Malay, Dutch, Portuguese, and Russian
Marginal support for Greek, Korean, Chinese
Fresh start, dropping compatibility with wordfreq 0.x and its unreasonably large downloads