There are Unicode normalization problems with Malayalam. As best I understand
it, Unicode simply neglected to define normalization mappings for the Malayalam
"chillu" characters, even though it changed how they're represented in Unicode
5.1 and again in Unicode 9.
The result is that words that display identically end up with multiple entries,
with different codepoint sequences that don't normalize to each other.
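For a concrete picture of what goes wrong, here's a minimal check in Python,
using CHILLU N as the example (U+0D7B is the atomic character from Unicode 5.1;
NA + VIRAMA + ZWJ is the older spelling):

```python
import unicodedata

# Two encodings of the same visible letter.
atomic = "\u0d7b"                # MALAYALAM LETTER CHILLU N (Unicode 5.1)
sequence = "\u0d28\u0d4d\u200d"  # NA + VIRAMA + ZERO WIDTH JOINER (older)

print(atomic == sequence)        # False
# Normalization doesn't help: Unicode defines no canonical mapping
# between the two representations.
nfc = lambda s: unicodedata.normalize("NFC", s)
print(nfc(atomic) == nfc(sequence))  # still False
```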
I certainly don't know how to resolve this, and it would need to be resolved to
have something that we could reasonably call Malayalam word frequencies.
I'm starting a new Python environment on a new Ubuntu installation. You
never know when a huge yak will show up and demand to be shaved.
I tried following the directions in the README, and found that a couple
of steps were missing. I've added those.
When you follow those steps, it appears to install the MeCab Korean
dictionary in `/usr/lib/x86_64-linux-gnu/mecab/dic`, which wasn't one
of the paths we were checking, so I've added it as a search path.
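For reference, a simplified sketch of what the dictionary search looks like;
the path list and function name here are illustrative rather than the exact
code, but the last entry is the newly added path:

```python
import os

MECAB_DIC_PATHS = [
    "/var/lib/mecab/dic",
    "/usr/lib/mecab/dic",
    "/usr/local/lib/mecab/dic",
    "/usr/lib/x86_64-linux-gnu/mecab/dic",  # added: where Ubuntu puts it
]

def find_mecab_dictionary(names):
    """Return the first existing dictionary directory matching `names`."""
    for prefix in MECAB_DIC_PATHS:
        for name in names:
            path = os.path.join(prefix, name)
            if os.path.isdir(path):
                return path
    return None

# Usage for the Korean dictionary, e.g.:
# find_mecab_dictionary(["mecab-ko-dic", "mecab-ko"])
```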
I _almost_ got the description and long_description right for 2.0.1. I
even checked it on the test server. But I didn't notice that I was
handling the first line of README.md specially, and ended up setting the
project description to "wordfreq is a Python library for looking up the
frequencies of words in many".
It'll be right in the next version.
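As for what the fix looks like, here's a hypothetical sketch of the relevant
part of setup.py; the parsing is illustrative, but the idea is to stop
treating the (hard-wrapped) first line of README.md as the whole summary:

```python
# Hypothetical sketch: build the short description from the README's
# whole first paragraph, rejoined, instead of just its first line,
# which is hard-wrapped and therefore ends mid-sentence.
with open("README.md", encoding="utf-8") as readme:
    long_description = readme.read()

first_paragraph = long_description.split("\n\n")[0]
description = " ".join(first_paragraph.split())
```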
This is the result of re-running exquisite-corpus via wordfreq 2. The
frequencies for most languages were identical. Small changes that move
words by a few places in the list appeared in Chinese, Japanese, and
Korean. There are also even smaller changes in Bengali and Hindi.
The source of the CJK change is that Roman letters are case-folded
_before_ Jieba or MeCab tokenization, which changes their output in a
few cases.
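For example, here's a minimal sketch of that ordering for Chinese, assuming
Jieba's default tokenizer:

```python
import jieba

def tokenize_zh(text):
    # Case-fold Roman letters before Jieba sees the text, so e.g.
    # "iPhone" and "IPHONE" tokenize identically as "iphone".
    return list(jieba.cut(text.casefold()))
```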
In Hindi, one word changed frequency in the top 500. In Bengali, none of
those words changed frequency, but the data file is still different.
I don't have as solid an explanation here, except that these languages
use the regex tokenizer, and we just updated the `regex` dependency,
which could affect some edge cases in these languages.
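To show the kind of pattern that's sensitive to this, here's an illustrative
tokenizer (not wordfreq's actual one) built on the `regex` module; Unicode
property classes like these are where a `regex` upgrade can shift edge cases:

```python
import regex  # the third-party 'regex' package, not the stdlib 're'

# Tokens as runs of letters and combining marks; Bengali and Hindi rely
# heavily on combining marks, so property-class changes can move words.
TOKEN_RE = regex.compile(r"[\p{L}\p{M}]+")

print(TOKEN_RE.findall("এটা একটা বাংলা বাক্য।"))
```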
This citation was generated from our GitHub repository by Zenodo. Their
defaults indicate that anyone who's ever had a PR accepted into the code
should go on the author line, and that sounds fine to me.
This changes the version from 1.4.2 to 1.5. Things done in this update include:
* Include Common Crawl as a source; support 11 more languages
* New frequency-merging strategy
* New sources: Chinese from Wikipedia (mostly Traditional), and a big Dutch word list
* Remove some low-quality sources: Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. This drops Ukrainian as an available language, and means Greek no longer counts as a 'large' language.
* Add Korean tokenization, and include MeCab files in the data
* Remove marks from more languages
* Deal with commas and cedillas in Turkish and Romanian
Former-commit-id: e6a8f028e3
* Remove marks from more languages
* Add Korean tokenization, and include MeCab files in data
* Add a Hebrew tokenization test
* Fix terminology in docstrings about abjad scripts
* Combine Japanese and Korean tokenization into the same function
Former-commit-id: fec6eddcc3