wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Robyn Speer	2417ea0d39	XC was built without Russian Web data; reflect this in the table. The Russian sub-corpus of OSCAR is corrupted, so we skipped over it in the exquisite-corpus build.	2021-04-14 14:28:12 -04:00
Robyn Speer	81bb9f4338	Merge branch 'data-update-2.5' of github.com:LuminosoInsight/wordfreq into data-update-2.5	2021-04-14 14:26:54 -04:00
Robyn Speer	f885a60bf0	Remove Malayalam; support for it isn't ready. There are Unicode normalization problems with Malayalam -- as best I understand it, Unicode simply neglected to include normalization forms for Malayalam "chillu" characters even though they changed how they're represented in Unicode 5.1 and again in Unicode 9. The result is that words that print the same end up with multiple entries, with different codepoint sequences that don't normalize to each other. I certainly don't know how to resolve this, and it would need to be resolved to have something that we could reasonably call Malayalam word frequencies.	2021-03-30 14:10:58 -04:00
Robyn Speer	08b6cea451	Update table, remove Galician (only two sources)	2021-03-30 13:17:36 -04:00
Robyn Speer	8fd3d77e4f	add OSCAR citation	2021-03-30 12:56:10 -04:00
Robyn Speer	efdf110351	Merge remote-tracking branch 'origin/master' into data-update-2.5	2021-03-30 12:53:09 -04:00
Robyn Speer	ec2e148f8e	Merge branch 'master' into data-update-2.5	2021-03-29 16:42:24 -04:00
Robyn Speer	4263f1af14	small documentation fixes	2021-03-29 16:41:47 -04:00
Robyn Speer	d1949a486a	update data and tests for 2.5	2021-03-29 16:18:08 -04:00
Robyn Speer	d99ac1051a	fix version, update instructions and changelog	2021-02-18 18:25:16 -05:00
Robyn Speer	5986342bc6	update README examples	2020-10-01 16:05:43 -04:00
Robyn Speer	51ca052b62	Update my name and the Zenodo citation	2018-10-03 17:27:10 -04:00
Rob Speer	0644c8920a	Update README to describe @ tokenization	2018-07-23 11:21:44 -04:00
Rob Speer	d06a6a48c5	include data from xc rebuild	2018-07-15 01:01:35 -04:00
Rob Speer	676686fda1	Fix instructions and search path for mecab-ko-dic. I'm starting a new Python environment on a new Ubuntu installation. You never know when a huge yak will show up and demand to be shaved. I tried following the directions in the README, and found that a couple of steps were missing. I've added those. When you follow those steps, it appears to install the MeCab Korean dictionary in `/usr/lib/x86_64-linux-gnu/mecab/dic`, which was none of the paths we were checking, so I've added that as a search path.	2018-06-21 15:56:54 -04:00
Rob Speer	1dc763c9c5	update README and CHANGELOG	2018-06-18 15:21:43 -04:00
Rob Speer	c3b32b3c4a	Round frequencies to 3 significant digits	2018-06-18 15:21:33 -04:00
Rob Speer	2b85a1cef2	update table in README: Dutch has 5 sources	2018-06-18 11:43:52 -04:00
Rob Speer	cd434b2219	update data to include xc's processing of ParaCrawl	2018-05-25 16:12:35 -04:00
Rob Speer	aa91e1f291	Packaging updates for the new PyPI. I _almost_ got the description and long_description right for 2.0.1. I even checked it on the test server. But I didn't notice that I was handling the first line of README.md specially, and ended up setting the project description to "wordfreq is a Python library for looking up the frequencies of words in many". It'll be right in the next version.	2018-05-01 17:16:53 -04:00
Rob Speer	3ec92a8952	Handle Japanese edge cases in simple_tokenize	2018-04-26 15:53:07 -04:00
Rob Speer	a6bb267f89	fix mention of dependencies in README	2018-03-14 15:01:08 -04:00
Rob Speer	bac3dcb620	Subtle changes to CJK frequencies. This is the result of re-running exquisite-corpus via wordfreq 2. The frequencies for most languages were identical. Small changes that move words by a few places in the list appeared in Chinese, Japanese, and Korean. There are also even smaller changes in Bengali and Hindi. The source of the CJK change is that Roman letters are case-folded _before_ Jieba or MeCab tokenization, which changes their output in a few cases. In Hindi, one word changed frequency in the top 500. In Bengali, none of those words changed frequency, but the data file is still different. I'm not sure I have such a solid explanation here, except that these languages use the regex tokenizer, and we just updated the regex dependency, which could affect some edge cases of these languages.	2018-03-14 11:36:02 -04:00
Rob Speer	49a603ea63	update the README	2018-03-08 18:16:15 -05:00
Rob Speer	846606d892	minor fixes to README	2018-02-28 16:14:50 -05:00
Rob Speer	843ed92223	update citation to v1.7	2017-09-27 13:36:30 -04:00
Rob Speer	396b0f78df	update README for 1.7; sort language list in English order	2017-08-25 17:38:31 -04:00
Rob Speer	7fa5e7fc22	Fix some outdated numbers in English examples	2017-01-31 18:25:41 -05:00
Rob Speer	e6114bf0fa	Update README with new examples and URL	2017-01-09 15:13:19 -05:00
Rob Speer	d2bb5b78f3	update the README, citing OpenSubtitles 2016	2017-01-06 19:04:40 -05:00
Rob Speer	803ebc25bb	Update documentation and bump version to 1.6	2017-01-05 19:18:06 -05:00
Rob Speer	872eeb8848	Describe how to cite wordfreq. This citation was generated from our GitHub repository by Zenodo. Their defaults indicate that anyone who's ever accepted a PR for the code should go on the author line, and that sounds fine to me.	2016-09-12 18:24:55 -04:00
Rob Speer	1519df503c	stop including MeCab dictionaries in the package Former-commit-id: `b3dd8479ab`	2016-08-01 17:37:41 -04:00
Rob Speer	c1927732d3	Look for MeCab dictionaries in various places besides this package Former-commit-id: `afe6537994`	2016-07-29 17:27:15 -04:00
Rob Speer	9758c69ff0	Add Common Crawl data and more languages (#39). This changes the version from 1.4.2 to 1.5. Things done in this update include: * include Common Crawl; support 11 more languages * new frequency-merging strategy * New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list * Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all. * Add Korean tokenization, and include MeCab files in data * Remove marks from more languages * Deal with commas and cedillas in Turkish and Romanian Former-commit-id: `e6a8f028e3`	2016-07-28 19:23:17 -04:00
Rob Speer	a0893af82e	Tokenization in Korean, plus abjad languages (#38) * Remove marks from more languages * Add Korean tokenization, and include MeCab files in data * add a Hebrew tokenization test * fix terminology in docstrings about abjad scripts * combine Japanese and Korean tokenization into the same function Former-commit-id: `fec6eddcc3`	2016-07-15 15:10:25 -04:00
Rob Speer	4e4c77e7d7	fix to README: we're only using Reddit in English Former-commit-id: `dcb77a552b`	2016-05-11 15:38:29 -04:00
Rob Speer	f4aa2cad7b	fix table showing marginal Korean support Former-commit-id: `697842b3f9`	2016-03-30 15:11:13 -04:00
Rob Speer	758e37af07	make an example clearer with wordlist='large' Former-commit-id: `ed32b278cc`	2016-03-30 15:08:32 -04:00
Rob Speer	c82073270b	update wordlists for new builder settings Former-commit-id: `a10c1d7ac0`	2016-03-28 12:26:47 -04:00
Rob Speer	23c5c4adca	Add and document large wordlists Former-commit-id: `d79ee37da9`	2016-01-22 16:23:43 -05:00
Rob Speer	8fea2ca181	Merge branch 'master' into chinese-external-wordlist Conflicts: wordfreq/chinese.py Former-commit-id: `1793c1bb2e`	2015-09-28 14:34:59 -04:00
Rob Speer	3bd1fe2fe6	Fix documentation and clean up, based on Sep 25 code review Former-commit-id: `44b0c4f9ba`	2015-09-28 12:58:46 -04:00
Rob Speer	7c596de98a	describe optional dependencies better in the README Former-commit-id: `b460eef444`	2015-09-24 17:54:52 -04:00
Rob Speer	76c4a8975a	fix README conflict Former-commit-id: `5b918e7bb0`	2015-09-22 14:23:55 -04:00
Rob Speer	7f92557a58	Merge branch 'greek-and-turkish' into chinese-and-more Conflicts: README.md wordfreq_builder/wordfreq_builder/ninja.py Former-commit-id: `3cb3061e06`	2015-09-10 15:27:33 -04:00
Rob Speer	a13f459f88	Lower the frequency of phrases with inferred token boundaries Former-commit-id: `5c8c36f4e3`	2015-09-10 14:16:22 -04:00
Rob Speer	9c08442dc5	fixes based on code review notes Former-commit-id: `354555514f`	2015-09-09 13:10:18 -04:00
Rob Speer	37e5e1009f	fix SUBTLEX citations Former-commit-id: `6502f15e9b`	2015-09-08 17:45:25 -04:00
Rob Speer	0f9497d864	take out OpenSubtitles for Chinese Former-commit-id: `d9c44d5fcc`	2015-09-08 17:25:05 -04:00
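
Note on commit f885a60bf0: the normalization problem it describes can be sketched with Python's standard `unicodedata` module. This is only an illustration under assumptions, not code from wordfreq: the chillu-n pair below is an example chosen here (the commit names no specific characters), and `normalizes_together` is a hypothetical helper. It compares the older sequence (NA + VIRAMA + ZERO WIDTH JOINER) with the atomic code point added in Unicode 5.1, and checks whether any of the four standard normalization forms maps them to the same string.

```python
import unicodedata

# Two encodings of Malayalam chillu-n (example pair chosen for illustration):
# the older sequence NA + VIRAMA + ZERO WIDTH JOINER, and the atomic
# code point U+0D7B that was added in Unicode 5.1.
legacy_chillu_n = "\u0d28\u0d4d\u200d"
atomic_chillu_n = "\u0d7b"


def normalizes_together(a, b):
    """Return True if some standard normalization form maps a and b
    to the same string. Hypothetical helper, not part of wordfreq."""
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    )


# For contrast: precomposed "é" and "e" + combining acute accent
# are canonically equivalent, so normalization unifies them.
print(normalizes_together("\u00e9", "e\u0301"))
print(normalizes_together(legacy_chillu_n, atomic_chillu_n))
```

If the two chillu encodings never compare equal under normalization, a word list keyed on normalized text would hold separate entries for words that print identically, which is the situation the commit message reports.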

67 Commits