wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 09:21:37 +00:00

Author	SHA1	Message	Date
Rob Speer	90b5246a48	commit new data files (Italian changed for some reason)	2018-05-29 17:36:48 -04:00
Rob Speer	cd434b2219	update data to include xc's processing of ParaCrawl	2018-05-25 16:12:35 -04:00
Rob Speer	aa91e1f291	Packaging updates for the new PyPI. I _almost_ got the description and long_description right for 2.0.1. I even checked it on the test server. But I didn't notice that I was handling the first line of README.md specially, and ended up setting the project description to "wordfreq is a Python library for looking up the frequencies of words in many". It'll be right in the next version.	2018-05-01 17:16:53 -04:00
Lance Nathan	968bc3a85a	Merge pull request #56 from LuminosoInsight/japanese-edge-cases: Handle Japanese edge cases in `simple_tokenize`	2018-05-01 14:57:45 -04:00
Rob Speer	0a95d96b20	update CHANGELOG for 2.0.1	2018-05-01 14:47:55 -04:00
Rob Speer	3ec92a8952	Handle Japanese edge cases in simple_tokenize	2018-04-26 15:53:07 -04:00
Lance Nathan	e3a1b470d9	Merge pull request #55 from LuminosoInsight/version2: Version 2, with standalone text pre-processing	2018-03-15 14:26:49 -04:00
Rob Speer	a759f38540	update the changelog	2018-03-14 17:56:29 -04:00
Rob Speer	6f1a9aaff1	remove LAUGHTER_WORDS, which is now unused. This was a fun Twitter test, but we don't do that anymore	2018-03-14 17:33:35 -04:00
Rob Speer	1a761199cd	More explicit error message for a missing wordlist	2018-03-14 15:10:27 -04:00
Rob Speer	b2bdc8a854	Actually use `min_score` in `_language_in_list`. We don't need to set it to any value but 80 now, but we will need to if we try to distinguish three kinds of Chinese (zh-Hans, zh-Hant, and unified zh-Hani).	2018-03-14 15:08:52 -04:00
Rob Speer	bb2096ae04	code review fixes to wordfreq.tokens	2018-03-14 15:07:45 -04:00
Rob Speer	430fb01e53	code review fixes to __init__	2018-03-14 15:04:59 -04:00
Rob Speer	a6bb267f89	fix mention of dependencies in README	2018-03-14 15:01:08 -04:00
Rob Speer	bac3dcb620	Subtle changes to CJK frequencies. This is the result of re-running exquisite-corpus via wordfreq 2. The frequencies for most languages were identical. Small changes that move words by a few places in the list appeared in Chinese, Japanese, and Korean. There are also even smaller changes in Bengali and Hindi. The source of the CJK change is that Roman letters are case-folded _before_ Jieba or MeCab tokenization, which changes their output in a few cases. In Hindi, one word changed frequency in the top 500. In Bengali, none of those words changed frequency, but the data file is still different. I'm not sure I have such a solid explanation here, except that these languages use the regex tokenizer, and we just updated the regex dependency, which could affect some edge cases of these languages.	2018-03-14 11:36:02 -04:00
Rob Speer	e64f409c55	cache the language info (avoids 10x slowdown)	2018-03-09 14:54:03 -05:00
Rob Speer	11e758672e	avoid log spam: only warn about an unsupported language once	2018-03-09 11:50:15 -05:00
Rob Speer	49a603ea63	update the README	2018-03-08 18:16:15 -05:00
Rob Speer	92784d1768	wordlist updates from new exquisite-corpus	2018-03-08 18:16:00 -05:00
Rob Speer	1594ba3ad6	Test that we can leave the wordlist unspecified and get 'large' freqs	2018-03-08 18:09:57 -05:00
Rob Speer	47dac3b0b8	Traditional Chinese should be preserved through tokenization	2018-03-08 18:08:55 -05:00
Rob Speer	5a5acec9ff	reorganize wordlists into 'small', 'large', and 'best'	2018-03-08 17:52:44 -05:00
Rob Speer	67e4475763	fix az-Latn transliteration, and test	2018-03-08 16:47:36 -05:00
Rob Speer	a42cf312ef	setup: update version number and dependencies	2018-03-08 16:26:24 -05:00
Rob Speer	45b9bcdbcb	Separate preprocessing from tokenization	2018-03-08 16:26:17 -05:00
Rob Speer	846606d892	minor fixes to README	2018-02-28 16:14:50 -05:00
Rob Speer	ad677e12fd	Merge pull request #54 from LuminosoInsight/fix-deps: Fix setup.py (version number and msgpack dependency)	2018-02-28 12:46:46 -08:00
Rob Speer	aadb19c9a3	bump version to 1.7.0, belatedly	2018-02-28 15:15:47 -05:00
Rob Speer	db56528fb6	update msgpack-python dependency to msgpack	2018-02-28 15:14:51 -05:00
Rob Speer	843ed92223	update citation to v1.7	2017-09-27 13:36:30 -04:00
Andrew Lin	721a1e9fd9	Merge pull request #51 from LuminosoInsight/version1.7: Version 1.7: update tokenization, update Wikipedia data, add languages	2017-09-08 17:02:05 -04:00
Rob Speer	61b2e4062d	remove unnecessary enumeration from top_n.py	2017-09-08 16:52:06 -04:00
Rob Speer	396b0f78df	update README for 1.7; sort language list in English order	2017-08-25 17:38:31 -04:00
Rob Speer	e3352392cc	v1.7: update tokenization, update data, add `bn` and `mk`	2017-08-25 17:37:48 -04:00
Rob Speer	dcef5813b3	Tokenize by graphemes, not codepoints (#50) * Tokenize by graphemes, not codepoints * Add more documentation to TOKEN_RE * Remove extra line break * Update docstring - Brahmic scripts are no longer an exception * approve using version 2017.07.28 of regex	2017-08-08 11:35:28 -04:00
Andrew Lin	baf6771e97	Merge pull request #49 from LuminosoInsight/restore-langcodes: Use langcodes when tokenizing again	2017-05-10 16:20:06 -04:00
Rob Speer	37b4914970	v1.6.1: depend on langcodes 1.4	2017-05-10 13:26:23 -04:00
Rob Speer	d6cdef6039	Use langcodes when tokenizing again (it no longer connects to a DB)	2017-04-27 15:09:59 -04:00
Rob Speer	97042e6f60	Merge pull request #48 from LuminosoInsight/code-review-notes: Code review notes	2017-02-15 12:29:25 -08:00
Andrew Lin	f28a193015	Clarify the changelog.	2017-02-14 13:09:12 -05:00
Andrew Lin	e21bcc2a58	Correct a case in transliterate.py.	2017-02-14 13:08:23 -05:00
Andrew Lin	21b331e898	Merge pull request #47 from LuminosoInsight/all-1.6-changes: All 1.6 changes	2017-02-01 15:36:38 -05:00
Rob Speer	b5b653f0a1	Remove ninja2dot script, which is no longer used	2017-02-01 14:49:44 -05:00
Rob Speer	391a723662	describe the current problem with 'cyrtranslit' as a dependency	2017-01-31 18:25:52 -05:00
Rob Speer	7fa5e7fc22	Fix some outdated numbers in English examples	2017-01-31 18:25:41 -05:00
Rob Speer	68e4ce16cf	Handle smashing numbers only at the end of tokenize(). This does make the code a lot clearer.	2017-01-11 19:04:19 -05:00
Rob Speer	e6114bf0fa	Update README with new examples and URL	2017-01-09 15:13:19 -05:00
Rob Speer	f03a37e19c	test that number-smashing still happens in freq lookups	2017-01-06 19:20:41 -05:00
Rob Speer	4dfa800cd8	Don't smash numbers in all tokenization, just when looking up freqs. I forgot momentarily that the output of the tokenizer is used by other code.	2017-01-06 19:18:52 -05:00
Rob Speer	d2bb5b78f3	update the README, citing OpenSubtitles 2016	2017-01-06 19:04:40 -05:00

585 Commits