wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-24 09:51:38 +00:00

Author	SHA1	Message	Date
Rob Speer	cd434b2219	update data to include xc's processing of ParaCrawl	2018-05-25 16:12:35 -04:00
Rob Speer	aa91e1f291	Packaging updates for the new PyPI. I _almost_ got the description and long_description right for 2.0.1. I even checked it on the test server. But I didn't notice that I was handling the first line of README.md specially, and ended up setting the project description to "wordfreq is a Python library for looking up the frequencies of words in many". It'll be right in the next version.	2018-05-01 17:16:53 -04:00
Rob Speer	3ec92a8952	Handle Japanese edge cases in simple_tokenize	2018-04-26 15:53:07 -04:00
Rob Speer	a6bb267f89	fix mention of dependencies in README	2018-03-14 15:01:08 -04:00
Rob Speer	bac3dcb620	Subtle changes to CJK frequencies. This is the result of re-running exquisite-corpus via wordfreq 2. The frequencies for most languages were identical. Small changes that move words by a few places in the list appeared in Chinese, Japanese, and Korean. There are also even smaller changes in Bengali and Hindi. The source of the CJK change is that Roman letters are case-folded _before_ Jieba or MeCab tokenization, which changes their output in a few cases. In Hindi, one word changed frequency in the top 500. In Bengali, none of those words changed frequency, but the data file is still different. I'm not sure I have such a solid explanation here, except that these languages use the regex tokenizer, and we just updated the regex dependency, which could affect some edge cases of these languages.	2018-03-14 11:36:02 -04:00
Rob Speer	49a603ea63	update the README	2018-03-08 18:16:15 -05:00
Rob Speer	846606d892	minor fixes to README	2018-02-28 16:14:50 -05:00
Rob Speer	843ed92223	update citation to v1.7	2017-09-27 13:36:30 -04:00
Rob Speer	396b0f78df	update README for 1.7; sort language list in English order	2017-08-25 17:38:31 -04:00
Rob Speer	7fa5e7fc22	Fix some outdated numbers in English examples	2017-01-31 18:25:41 -05:00
Rob Speer	e6114bf0fa	Update README with new examples and URL	2017-01-09 15:13:19 -05:00
Rob Speer	d2bb5b78f3	update the README, citing OpenSubtitles 2016	2017-01-06 19:04:40 -05:00
Rob Speer	803ebc25bb	Update documentation and bump version to 1.6	2017-01-05 19:18:06 -05:00
Rob Speer	872eeb8848	Describe how to cite wordfreq. This citation was generated from our GitHub repository by Zenodo. Their defaults indicate that anyone who's ever accepted a PR for the code should go on the author line, and that sounds fine to me.	2016-09-12 18:24:55 -04:00
Rob Speer	1519df503c	stop including MeCab dictionaries in the package Former-commit-id: `b3dd8479ab`	2016-08-01 17:37:41 -04:00
Rob Speer	c1927732d3	Look for MeCab dictionaries in various places besides this package Former-commit-id: `afe6537994`	2016-07-29 17:27:15 -04:00
Rob Speer	9758c69ff0	Add Common Crawl data and more languages (#39). This changes the version from 1.4.2 to 1.5. Things done in this update include: * include Common Crawl; support 11 more languages * new frequency-merging strategy * New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list * Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all. * Add Korean tokenization, and include MeCab files in data * Remove marks from more languages * Deal with commas and cedillas in Turkish and Romanian Former-commit-id: `e6a8f028e3`	2016-07-28 19:23:17 -04:00
Rob Speer	a0893af82e	Tokenization in Korean, plus abjad languages (#38) * Remove marks from more languages * Add Korean tokenization, and include MeCab files in data * add a Hebrew tokenization test * fix terminology in docstrings about abjad scripts * combine Japanese and Korean tokenization into the same function Former-commit-id: `fec6eddcc3`	2016-07-15 15:10:25 -04:00
Rob Speer	4e4c77e7d7	fix to README: we're only using Reddit in English Former-commit-id: `dcb77a552b`	2016-05-11 15:38:29 -04:00
Rob Speer	f4aa2cad7b	fix table showing marginal Korean support Former-commit-id: `697842b3f9`	2016-03-30 15:11:13 -04:00
Rob Speer	758e37af07	make an example clearer with wordlist='large' Former-commit-id: `ed32b278cc`	2016-03-30 15:08:32 -04:00
Rob Speer	c82073270b	update wordlists for new builder settings Former-commit-id: `a10c1d7ac0`	2016-03-28 12:26:47 -04:00
Rob Speer	23c5c4adca	Add and document large wordlists Former-commit-id: `d79ee37da9`	2016-01-22 16:23:43 -05:00
Rob Speer	8fea2ca181	Merge branch 'master' into chinese-external-wordlist Conflicts: wordfreq/chinese.py Former-commit-id: `1793c1bb2e`	2015-09-28 14:34:59 -04:00
Rob Speer	3bd1fe2fe6	Fix documentation and clean up, based on Sep 25 code review Former-commit-id: `44b0c4f9ba`	2015-09-28 12:58:46 -04:00
Rob Speer	7c596de98a	describe optional dependencies better in the README Former-commit-id: `b460eef444`	2015-09-24 17:54:52 -04:00
Rob Speer	76c4a8975a	fix README conflict Former-commit-id: `5b918e7bb0`	2015-09-22 14:23:55 -04:00
Rob Speer	7f92557a58	Merge branch 'greek-and-turkish' into chinese-and-more Conflicts: README.md wordfreq_builder/wordfreq_builder/ninja.py Former-commit-id: `3cb3061e06`	2015-09-10 15:27:33 -04:00
Rob Speer	a13f459f88	Lower the frequency of phrases with inferred token boundaries Former-commit-id: `5c8c36f4e3`	2015-09-10 14:16:22 -04:00
Rob Speer	9c08442dc5	fixes based on code review notes Former-commit-id: `354555514f`	2015-09-09 13:10:18 -04:00
Rob Speer	37e5e1009f	fix SUBTLEX citations Former-commit-id: `6502f15e9b`	2015-09-08 17:45:25 -04:00
Rob Speer	0f9497d864	take out OpenSubtitles for Chinese Former-commit-id: `d9c44d5fcc`	2015-09-08 17:25:05 -04:00
Rob Speer	b4100b5bfb	update the README for Chinese Former-commit-id: `d576e3294b`	2015-09-05 03:42:54 -04:00
Rob Speer	e2a3758832	WIP: Traditional Chinese Former-commit-id: `7906a671ea`	2015-09-04 18:52:37 -04:00
Rob Speer	62f5a8eb1e	add Polish and Swedish to README Former-commit-id: `3c3371a9ff`	2015-09-04 17:10:40 -04:00
Rob Speer	138e8aaa3f	add more citations Former-commit-id: `8196643509`	2015-09-04 15:57:40 -04:00
Rob Speer	c08e593234	Use SUBTLEX for German, but OpenSubtitles for Greek. In German and Greek, SUBTLEX and Hermit Dave turn out to have been working from the same source data. I looked at the quality of how they processed the data, and chose SUBTLEX for German, and Dave's wordlist for Greek. Former-commit-id: `77c60c29b0`	2015-09-04 15:52:21 -04:00
Rob Speer	a0997a79a4	update README with additional SUBTLEX support Former-commit-id: `81bbe663fb`	2015-09-04 13:23:33 -04:00
Rob Speer	bf88f97744	expand list of sources and supported languages Former-commit-id: `d9a1c34d00`	2015-09-04 01:03:36 -04:00
Rob Speer	a6ef3224a6	support Turkish and more Greek; document more Former-commit-id: `d94428d454`	2015-09-04 00:57:04 -04:00
Rob Speer	a92c398258	add SUBTLEX to the readme Former-commit-id: `e6a2886a66`	2015-09-03 18:56:56 -04:00
Rob Speer	d883eaeca5	fix heading Former-commit-id: `00a2812907`	2015-08-28 17:49:38 -04:00
Rob Speer	390a431181	fix list formatting Former-commit-id: `93f44683c5`	2015-08-28 17:49:07 -04:00
Rob Speer	43fd15c938	improve README with function documentation and examples Former-commit-id: `2370287539`	2015-08-28 17:45:50 -04:00
Rob Speer	d064fbec7d	update the README Former-commit-id: `573dd1ec79`	2015-08-25 17:44:34 -04:00
Joshua Chin	45799955ab	no use for use Former-commit-id: `b0a9a2980f`	2015-07-17 14:46:40 -04:00
Andrew Lin	8961729401	Document the version of Unicode used to build the regexes. Former-commit-id: `9f8464c2d1`	2015-07-08 18:48:33 -04:00
Rob Speer	51f4e4c826	add installation instructions to the readme Former-commit-id: `0f4ca80026`	2015-05-28 14:02:12 -04:00
Rob Speer	1f41cb083c	update Japanese data; test Japanese and token combining Former-commit-id: `611a6a35de`	2015-05-28 14:01:56 -04:00

49 Commits