69 Commits

Author SHA1 Message Date
Elia Robyn Lake
ed7dccbf8b update version and documentation 2022-03-10 19:12:45 -05:00
Robyn Speer
c244ff0d10 readme update: web text comes from OSCAR 2021-04-15 14:45:29 -04:00
Robyn Speer
16122083b3 XC was built without Russian Web data; reflect this in the table
The Russian sub-corpus of OSCAR is corrupted, so we skipped over it in
the exquisite-corpus build.
2021-04-14 14:28:12 -04:00
Robyn Speer
b6614c1a33 Merge branch 'data-update-2.5' of github.com:LuminosoInsight/wordfreq into data-update-2.5 2021-04-14 14:26:54 -04:00
Robyn Speer
08816a21d1 Remove Malayalam; support for it isn't ready
There are Unicode normalization problems with Malayalam -- as best I understand
it, Unicode simply neglected to include normalization forms for Malayalam "chillu"
characters even though they changed how they're represented in Unicode 5.1 and
again in Unicode 9.

The result is that words that print the same end up with multiple entries, with
different codepoint sequences that don't normalize to each other.

I certainly don't know how to resolve this, and it would need to be resolved to
have something that we could reasonably call Malayalam word frequencies.
2021-03-30 14:10:58 -04:00
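The normalization problem described in the commit above can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch, not code from the repository. The character pair is an assumption chosen for illustration: chillu N as the single code point U+0D7B, and the older spelling NA + virama + zero-width joiner.

```python
import unicodedata

# Atomic chillu N, a single code point (illustrative choice, not from the commit).
ATOMIC_CHILLU_N = "\u0d7b"
# Older spelling of the same letter: NA + virama + zero-width joiner.
SEQUENCE_CHILLU_N = "\u0d28\u0d4d\u200d"


def normalizes_together(a: str, b: str) -> bool:
    """Return True if any standard normalization form maps a and b to the same string."""
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    )


# The two spellings render alike but stay distinct under every normalization form,
# so a frequency list would count them as separate words.
CHILLU_FORMS_UNIFY = normalizes_together(ATOMIC_CHILLU_N, SEQUENCE_CHILLU_N)
```

For contrast, `normalizes_together("é", "e\u0301")` is true, because Unicode does define a canonical mapping between those two spellings.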
Robyn Speer
90f0e0a88e Update table, remove Galician (only two sources) 2021-03-30 13:17:36 -04:00
Robyn Speer
9bab1024b7 add OSCAR citation 2021-03-30 12:56:10 -04:00
Robyn Speer
fea45fd501 Merge remote-tracking branch 'origin/master' into data-update-2.5 2021-03-30 12:53:09 -04:00
Robyn Speer
00e60df106 Merge branch 'master' into data-update-2.5 2021-03-29 16:42:24 -04:00
Robyn Speer
fc5c4cdda8 small documentation fixes 2021-03-29 16:41:47 -04:00
Robyn Speer
ec48c0a123 update data and tests for 2.5 2021-03-29 16:18:08 -04:00
Robyn Speer
168bb2a6ed fix version, update instructions and changelog 2021-02-18 18:25:16 -05:00
Robyn Speer
fd0ac9a272 update README examples 2020-10-01 16:05:43 -04:00
Robyn Speer
563e8f7444 Update my name and the Zenodo citation 2018-10-03 17:27:10 -04:00
Robyn Speer
f73406c69a Update README to describe @ tokenization 2018-07-23 11:21:44 -04:00
Robyn Speer
86b928f967 include data from xc rebuild 2018-07-15 01:01:35 -04:00
Robyn Speer
830157d8e4 Fix instructions and search path for mecab-ko-dic
I'm starting a new Python environment on a new Ubuntu installation. You
never know when a huge yak will show up and demand to be shaved.

I tried following the directions in the README, and found that a couple
of steps were missing. I've added those.

When you follow those steps, it appears to install the MeCab Korean
dictionary in `/usr/lib/x86_64-linux-gnu/mecab/dic`, which was none
of the paths we were checking, so I've added that as a search path.
2018-06-21 15:56:54 -04:00
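The search-path fix in the commit above could look roughly like the sketch below. It is not the repository's code: the function name `find_dictionary` is hypothetical, and only the first directory in the list comes from the commit message. The other two directories are assumptions added for illustration.

```python
import os

# Only the first entry is taken from the commit message; the rest are
# hypothetical examples of other places a dictionary might be installed.
CANDIDATE_DIRS = (
    "/usr/lib/x86_64-linux-gnu/mecab/dic",
    "/usr/lib/mecab/dic",        # hypothetical
    "/usr/local/lib/mecab/dic",  # hypothetical
)


def find_dictionary(name, search_dirs=CANDIDATE_DIRS):
    """Return the first existing directory `<dir>/<name>`, checking search_dirs in order."""
    for base in search_dirs:
        path = os.path.join(base, name)
        if os.path.isdir(path):
            return path
    raise FileNotFoundError(
        f"Couldn't find the MeCab dictionary {name!r} in: {', '.join(search_dirs)}"
    )
```

Listing the searched directories in the error message makes it easier to see when a new installation puts the dictionary somewhere unexpected.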
Robyn Speer
c6552f923f update README and CHANGELOG 2018-06-18 15:21:43 -04:00
Robyn Speer
7a32b56c1c Round frequencies to 3 significant digits 2018-06-18 15:21:33 -04:00
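Rounding to 3 significant digits, as named in the commit above, can be sketched as follows. The helper name `round_significant` is hypothetical, and this is one possible approach (general-format string formatting), not necessarily the one the commit uses.

```python
def round_significant(value: float, digits: int = 3) -> float:
    """Round value to the given number of significant digits."""
    # The "g" format keeps `digits` significant digits regardless of magnitude.
    return float(f"{value:.{digits}g}")
```

For example, `round_significant(0.000123456)` gives `0.000123` and `round_significant(12345.678)` gives `12300.0`.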
Robyn Speer
39a1308770 update table in README: Dutch has 5 sources 2018-06-18 11:43:52 -04:00
Robyn Speer
e4cb9a23b6 update data to include xc's processing of ParaCrawl 2018-05-25 16:12:35 -04:00
Robyn Speer
8907423147 Packaging updates for the new PyPI
I _almost_ got the description and long_description right for 2.0.1. I
even checked it on the test server. But I didn't notice that I was
handling the first line of README.md specially, and ended up setting the
project description to "wordfreq is a Python library for looking up the
frequencies of words in many".

It'll be right in the next version.
2018-05-01 17:16:53 -04:00
Robyn Speer
666f7e51fa Handle Japanese edge cases in simple_tokenize 2018-04-26 15:53:07 -04:00
Robyn Speer
8656688b0b fix mention of dependencies in README 2018-03-14 15:01:08 -04:00
Robyn Speer
d68d4baad2 Subtle changes to CJK frequencies
This is the result of re-running exquisite-corpus via wordfreq 2. The
frequencies for most languages were identical. Small changes that move
words by a few places in the list appeared in Chinese, Japanese, and
Korean. There are also even smaller changes in Bengali and Hindi.

The source of the CJK change is that Roman letters are case-folded
_before_ Jieba or MeCab tokenization, which changes their output in a
few cases.

In Hindi, one word changed frequency in the top 500. In Bengali, none of
those words changed frequency, but the data file is still different.
I'm not sure I have such a solid explanation here, except that these
languages use the regex tokenizer, and we just updated the regex
dependency, which could affect some edge cases of these languages.
2018-03-14 11:36:02 -04:00
Robyn Speer
c5f64a5de8 update the README 2018-03-08 18:16:15 -05:00
Robyn Speer
72646f16a1 minor fixes to README 2018-02-28 16:14:50 -05:00
Robyn Speer
ec9c94be92 update citation to v1.7 2017-09-27 13:36:30 -04:00
Robyn Speer
fb4a7db6f7 update README for 1.7; sort language list in English order 2017-08-25 17:38:31 -04:00
Robyn Speer
19b72132e7 Fix some outdated numbers in English examples 2017-01-31 18:25:41 -05:00
Robyn Speer
93306e55a0 Update README with new examples and URL 2017-01-09 15:13:19 -05:00
Robyn Speer
3cb3c38f47 update the README, citing OpenSubtitles 2016 2017-01-06 19:04:40 -05:00
Robyn Speer
39e459ac71 Update documentation and bump version to 1.6 2017-01-05 19:18:06 -05:00
Robyn Speer
7fabbfef31 Describe how to cite wordfreq
This citation was generated from our GitHub repository by Zenodo. Their
defaults indicate that anyone who's ever accepted a PR for the code
should go on the author line, and that sounds fine to me.
2016-09-12 18:24:55 -04:00
Robyn Speer
2787bfd647 stop including MeCab dictionaries in the package
Former-commit-id: b3dd8479ab
2016-08-01 17:37:41 -04:00
Robyn Speer
94712c8312 Look for MeCab dictionaries in various places besides this package
Former-commit-id: afe6537994
2016-07-29 17:27:15 -04:00
Robyn Speer
2a41d4dc5e Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5. Things done in this update include:

* include Common Crawl; support 11 more languages

* new frequency-merging strategy

* New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list

* Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This drops Ukrainian as an available language, and means Greek is not a 'large' language after all.

* Add Korean tokenization, and include MeCab files in data

* Remove marks from more languages

* Deal with commas and cedillas in Turkish and Romanian

Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
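The "commas and cedillas" item in the commit above refers to letters such as ş (S with cedilla) and ș (S with comma below), which look nearly identical but are separate code points. The sketch below shows one way to unify them. The function name `unify_marks` and the chosen direction (comma-below forms for Romanian, cedilla forms for Turkish) are assumptions made for illustration, not a description of the commit's code.

```python
# Cedilla forms mapped to comma-below forms.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015e": "\u0218",  # Ş -> Ș
    "\u015f": "\u0219",  # ş -> ș
    "\u0162": "\u021a",  # Ţ -> Ț
    "\u0163": "\u021b",  # ţ -> ț
})

# Comma-below forms mapped back to cedilla forms.
COMMA_TO_CEDILLA = str.maketrans({
    "\u0218": "\u015e",  # Ș -> Ş
    "\u0219": "\u015f",  # ș -> ş
    "\u021a": "\u0162",  # Ț -> Ţ
    "\u021b": "\u0163",  # ț -> ţ
})


def unify_marks(text: str, language: str) -> str:
    """Use a single spelling of these letters per language (assumed convention)."""
    if language == "ro":
        return text.translate(CEDILLA_TO_COMMA)
    if language == "tr":
        return text.translate(COMMA_TO_CEDILLA)
    return text
```

Without a step like this, the same word typed with either variant would be split across two entries in a frequency list.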
Robyn Speer
0a2bfb2710 Tokenization in Korean, plus abjad languages (#38)
* Remove marks from more languages

* Add Korean tokenization, and include MeCab files in data

* add a Hebrew tokenization test

* fix terminology in docstrings about abjad scripts

* combine Japanese and Korean tokenization into the same function

Former-commit-id: fec6eddcc3
2016-07-15 15:10:25 -04:00
Robyn Speer
1ac6795709 fix to README: we're only using Reddit in English
Former-commit-id: dcb77a552b
2016-05-11 15:38:29 -04:00
Robyn Speer
a9a4483ca3 fix table showing marginal Korean support
Former-commit-id: 697842b3f9
2016-03-30 15:11:13 -04:00
Robyn Speer
36885b5479 make an example clearer with wordlist='large'
Former-commit-id: ed32b278cc
2016-03-30 15:08:32 -04:00
Robyn Speer
cecf852040 update wordlists for new builder settings
Former-commit-id: a10c1d7ac0
2016-03-28 12:26:47 -04:00
Robyn Speer
6344b38194 Add and document large wordlists
Former-commit-id: d79ee37da9
2016-01-22 16:23:43 -05:00
Robyn Speer
c9693c9502 Merge branch 'master' into chinese-external-wordlist
Conflicts:
	wordfreq/chinese.py

Former-commit-id: 1793c1bb2e
2015-09-28 14:34:59 -04:00
Robyn Speer
f3f66508bd Fix documentation and clean up, based on Sep 25 code review
Former-commit-id: 44b0c4f9ba
2015-09-28 12:58:46 -04:00
Robyn Speer
8e963dc312 describe optional dependencies better in the README
Former-commit-id: b460eef444
2015-09-24 17:54:52 -04:00
Robyn Speer
6802a4f89d fix README conflict
Former-commit-id: 5b918e7bb0
2015-09-22 14:23:55 -04:00
Robyn Speer
f2be213933 Merge branch 'greek-and-turkish' into chinese-and-more
Conflicts:
	README.md
	wordfreq_builder/wordfreq_builder/ninja.py

Former-commit-id: 3cb3061e06
2015-09-10 15:27:33 -04:00
Robyn Speer
f0c7c3a02c Lower the frequency of phrases with inferred token boundaries
Former-commit-id: 5c8c36f4e3
2015-09-10 14:16:22 -04:00
Robyn Speer
872556f7bb fixes based on code review notes
Former-commit-id: 354555514f
2015-09-09 13:10:18 -04:00