wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Robyn Speer	d68d4baad2	Subtle changes to CJK frequencies. This is the result of re-running exquisite-corpus via wordfreq 2. The frequencies for most languages were identical. Small changes that move words by a few places in the list appeared in Chinese, Japanese, and Korean. There are also even smaller changes in Bengali and Hindi. The source of the CJK change is that Roman letters are case-folded _before_ Jieba or MeCab tokenization, which changes their output in a few cases. In Hindi, one word changed frequency in the top 500. In Bengali, none of those words changed frequency, but the data file is still different. I'm not sure I have such a solid explanation here, except that these languages use the regex tokenizer, and we just updated the regex dependency, which could affect some edge cases of these languages.	2018-03-14 11:36:02 -04:00
Robyn Speer	c5f64a5de8	update the README	2018-03-08 18:16:15 -05:00
Robyn Speer	72646f16a1	minor fixes to README	2018-02-28 16:14:50 -05:00
Robyn Speer	ec9c94be92	update citation to v1.7	2017-09-27 13:36:30 -04:00
Robyn Speer	fb4a7db6f7	update README for 1.7; sort language list in English order	2017-08-25 17:38:31 -04:00
Robyn Speer	19b72132e7	Fix some outdated numbers in English examples	2017-01-31 18:25:41 -05:00
Robyn Speer	93306e55a0	Update README with new examples and URL	2017-01-09 15:13:19 -05:00
Robyn Speer	3cb3c38f47	update the README, citing OpenSubtitles 2016	2017-01-06 19:04:40 -05:00
Robyn Speer	39e459ac71	Update documentation and bump version to 1.6	2017-01-05 19:18:06 -05:00
Robyn Speer	7fabbfef31	Describe how to cite wordfreq. This citation was generated from our GitHub repository by Zenodo. Their defaults indicate that anyone who's ever accepted a PR for the code should go on the author line, and that sounds fine to me.	2016-09-12 18:24:55 -04:00
Robyn Speer	2787bfd647	stop including MeCab dictionaries in the package Former-commit-id: `b3dd8479ab`	2016-08-01 17:37:41 -04:00
Robyn Speer	94712c8312	Look for MeCab dictionaries in various places besides this package Former-commit-id: `afe6537994`	2016-07-29 17:27:15 -04:00
Robyn Speer	2a41d4dc5e	Add Common Crawl data and more languages (#39). This changes the version from 1.4.2 to 1.5. Things done in this update include: * include Common Crawl; support 11 more languages * new frequency-merging strategy * New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list * Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all. * Add Korean tokenization, and include MeCab files in data * Remove marks from more languages * Deal with commas and cedillas in Turkish and Romanian Former-commit-id: `e6a8f028e3`	2016-07-28 19:23:17 -04:00
Robyn Speer	0a2bfb2710	Tokenization in Korean, plus abjad languages (#38) * Remove marks from more languages * Add Korean tokenization, and include MeCab files in data * add a Hebrew tokenization test * fix terminology in docstrings about abjad scripts * combine Japanese and Korean tokenization into the same function Former-commit-id: `fec6eddcc3`	2016-07-15 15:10:25 -04:00
Robyn Speer	1ac6795709	fix to README: we're only using Reddit in English Former-commit-id: `dcb77a552b`	2016-05-11 15:38:29 -04:00
Robyn Speer	a9a4483ca3	fix table showing marginal Korean support Former-commit-id: `697842b3f9`	2016-03-30 15:11:13 -04:00
Robyn Speer	36885b5479	make an example clearer with wordlist='large' Former-commit-id: `ed32b278cc`	2016-03-30 15:08:32 -04:00
Robyn Speer	cecf852040	update wordlists for new builder settings Former-commit-id: `a10c1d7ac0`	2016-03-28 12:26:47 -04:00
Robyn Speer	6344b38194	Add and document large wordlists Former-commit-id: `d79ee37da9`	2016-01-22 16:23:43 -05:00
Robyn Speer	c9693c9502	Merge branch 'master' into chinese-external-wordlist Conflicts: wordfreq/chinese.py Former-commit-id: `1793c1bb2e`	2015-09-28 14:34:59 -04:00
Robyn Speer	f3f66508bd	Fix documentation and clean up, based on Sep 25 code review Former-commit-id: `44b0c4f9ba`	2015-09-28 12:58:46 -04:00
Robyn Speer	8e963dc312	describe optional dependencies better in the README Former-commit-id: `b460eef444`	2015-09-24 17:54:52 -04:00
Robyn Speer	6802a4f89d	fix README conflict Former-commit-id: `5b918e7bb0`	2015-09-22 14:23:55 -04:00
Robyn Speer	f2be213933	Merge branch 'greek-and-turkish' into chinese-and-more Conflicts: README.md wordfreq_builder/wordfreq_builder/ninja.py Former-commit-id: `3cb3061e06`	2015-09-10 15:27:33 -04:00
Robyn Speer	f0c7c3a02c	Lower the frequency of phrases with inferred token boundaries Former-commit-id: `5c8c36f4e3`	2015-09-10 14:16:22 -04:00
Robyn Speer	872556f7bb	fixes based on code review notes Former-commit-id: `354555514f`	2015-09-09 13:10:18 -04:00
Robyn Speer	3dd70ed1c2	fix SUBTLEX citations Former-commit-id: `6502f15e9b`	2015-09-08 17:45:25 -04:00
Robyn Speer	1d3521dfda	take out OpenSubtitles for Chinese Former-commit-id: `d9c44d5fcc`	2015-09-08 17:25:05 -04:00
Robyn Speer	c1f27d3095	update the README for Chinese Former-commit-id: `d576e3294b`	2015-09-05 03:42:54 -04:00
Robyn Speer	7d1c2e72e4	WIP: Traditional Chinese Former-commit-id: `7906a671ea`	2015-09-04 18:52:37 -04:00
Robyn Speer	e77c2dbca8	add Polish and Swedish to README Former-commit-id: `3c3371a9ff`	2015-09-04 17:10:40 -04:00
Robyn Speer	032fea27c3	add more citations Former-commit-id: `8196643509`	2015-09-04 15:57:40 -04:00
Robyn Speer	8277b34571	Use SUBTLEX for German, but OpenSubtitles for Greek. In German and Greek, SUBTLEX and Hermit Dave turn out to have been working from the same source data. I looked at the quality of how they processed the data, and chose SUBTLEX for German, and Dave's wordlist for Greek. Former-commit-id: `77c60c29b0`	2015-09-04 15:52:21 -04:00
Robyn Speer	37e510345d	update README with additional SUBTLEX support Former-commit-id: `81bbe663fb`	2015-09-04 13:23:33 -04:00
Robyn Speer	3cb4dd777e	expand list of sources and supported languages Former-commit-id: `d9a1c34d00`	2015-09-04 01:03:36 -04:00
Robyn Speer	574c383202	support Turkish and more Greek; document more Former-commit-id: `d94428d454`	2015-09-04 00:57:04 -04:00
Robyn Speer	d267e0967c	add SUBTLEX to the readme Former-commit-id: `e6a2886a66`	2015-09-03 18:56:56 -04:00
Robyn Speer	942761d2f6	fix heading Former-commit-id: `00a2812907`	2015-08-28 17:49:38 -04:00
Robyn Speer	7bdffaae5c	fix list formatting Former-commit-id: `93f44683c5`	2015-08-28 17:49:07 -04:00
Robyn Speer	44c655d9a6	improve README with function documentation and examples Former-commit-id: `2370287539`	2015-08-28 17:45:50 -04:00
Robyn Speer	a3a3180bb9	update the README Former-commit-id: `573dd1ec79`	2015-08-25 17:44:34 -04:00
Joshua Chin	4c7910246e	no use for use Former-commit-id: `b0a9a2980f`	2015-07-17 14:46:40 -04:00
Andrew Lin	383963f8a9	Document the version of Unicode used to build the regexes. Former-commit-id: `9f8464c2d1`	2015-07-08 18:48:33 -04:00
Robyn Speer	a3cc8d403c	add installation instructions to the readme Former-commit-id: `0f4ca80026`	2015-05-28 14:02:12 -04:00
Robyn Speer	860e929bf8	update Japanese data; test Japanese and token combining Former-commit-id: `611a6a35de`	2015-05-28 14:01:56 -04:00

45 Commits