Version 1.7.0 (2017-08-25)
- Tokenization will always keep Unicode graphemes together, including complex emoji introduced in Unicode 10
- Update the Wikipedia source data to April 2017
- Remove some non-words, such as the Unicode replacement character and the pilcrow sign, from frequency lists
- Support Bengali and Macedonian, which passed the threshold of having enough source data to be included
Version 1.6.1 (2017-05-10)
- Depend on langcodes 1.4, with a new language-matching system that does not depend on SQLite. This prevents conflicts where langcodes' SQLite connection kept it from being used in threads.
Version 1.6.0 (2017-01-05)
- Support Czech, Persian, Ukrainian, and Croatian/Bosnian/Serbian
- Add large lists in Chinese, Finnish, Japanese, and Polish
- Data is now collected and built using Exquisite Corpus (https://github.com/LuminosoInsight/exquisite-corpus)
- Add word frequencies from OPUS OpenSubtitles 2016
- Add word frequencies from the MOKK Hungarian Webcorpus
- Expand Google Books Ngrams data to cover 8 languages
- Expand language detection on Reddit to cover 13 languages with large enough Reddit communities
- Drop the Common Crawl; we have enough good sources now that we don't have to deal with all that spam
- Add automatic transliteration of Serbian text
- Adjust tokenization of apostrophes next to vowel sounds: the French word "l'heure" is now tokenized similarly to "l'arc"
- Multi-digit numbers of each length are smashed into the same word frequency, to remove meaningless differences and increase compatibility with word2vec. (Internally, their digits are replaced by zeroes.)
- Another new frequency-merging strategy (drop the highest and lowest, average the rest)
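The number-smashing and frequency-merging changes above can be sketched in a few lines of Python. This is an illustrative sketch of the two described strategies, not wordfreq's actual implementation; the function names are hypothetical.

```python
import re

def smash_digits(token):
    """Replace every digit with '0', so multi-digit numbers of the same
    length collapse into one frequency entry (e.g. years, phone numbers)."""
    return re.sub(r"\d", "0", token)

def merge_frequencies(estimates):
    """Merge frequency estimates from several sources by dropping the
    highest and lowest, then averaging the rest."""
    if len(estimates) <= 2:
        return sum(estimates) / len(estimates)
    trimmed = sorted(estimates)[1:-1]
    return sum(trimmed) / len(trimmed)
```

For example, `smash_digits("2016")` gives `"0000"`, so "2016" and "1987" share one frequency entry, while "7" keeps its own.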
Version 1.5.1 (2016-08-19)
- Bug fix: Made it possible to load the Japanese or Korean dictionary when the other one is not available
Version 1.5.0 (2016-08-08)
- Include word frequencies learned from the Common Crawl
- Support Bulgarian, Catalan, Danish, Finnish, Hebrew, Hindi, Hungarian, Norwegian Bokmål, and Romanian
- Improve Korean with MeCab tokenization
- New frequency-merging strategy (weighted median)
- Include Wikipedia as a Chinese source (mostly Traditional)
- Include Reddit as a Spanish source
- Remove Greek Twitter because its data is poorly language-detected
- Add large lists in Arabic, Dutch, Italian
- Remove marks from more languages
- Deal with commas and cedillas in Turkish and Romanian
- Fix tokenization of Southeast and South Asian scripts
- Clean up Git history by removing unused large files
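The weighted-median merging strategy named above can be sketched generically: pick the value at which cumulative source weight reaches half the total. Again a hedged sketch, not wordfreq's exact code.

```python
def weighted_median(values, weights):
    """Return the value whose cumulative weight first reaches half the
    total weight, i.e. the weighted median of the estimates."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    return pairs[-1][0]
```

Compared with a plain average, the weighted median lets a couple of trusted, heavily weighted sources outvote a noisy outlier.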
Version 1.4 (2016-06-02)
- Add large lists in English, German, Spanish, French, and Portuguese
- Add zipf_frequency function
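The Zipf scale that zipf_frequency reports is the base-10 logarithm of a word's occurrences per billion words; the conversion from a raw frequency (a proportion between 0 and 1) can be sketched as follows. The helper name here is illustrative, not part of wordfreq's API.

```python
from math import log10

def zipf_from_frequency(freq):
    """Convert a word frequency (proportion of all tokens) to the Zipf
    scale: log10 of occurrences per billion words. A word occurring once
    per thousand words has Zipf value 6."""
    return log10(freq) + 9.0
```

In wordfreq itself, `zipf_frequency('the', 'en')` returns this value directly for a word, with common English words landing roughly in the 4–8 range.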
Version 1.3 (2016-01-14)
- Add Reddit comments as an English source
Version 1.2 (2015-10-29)
- Add SUBTLEX data
- Better support for Chinese, using Jieba for tokenization, and mapping Traditional Chinese characters to Simplified
- Improve Greek
- Add Polish, Swedish, and Turkish
- Tokenizer can optionally preserve punctuation
- Detect when sources stripped "'t" off of English words, and repair their frequencies
Version 1.1 (2015-08-25)
- Use the 'regex' package to implement Unicode tokenization that's mostly consistent across languages
- Use NFKC normalization in Japanese and Arabic
Version 1.0 (2015-07-28)
- Create compact word frequency lists in English, Arabic, German, Spanish, French, Indonesian, Japanese, Malay, Dutch, Portuguese, and Russian
- Marginal support for Greek, Korean, Chinese
- Fresh start, dropping compatibility with wordfreq 0.x and its unreasonably large downloads