wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Lance Nathan	18f176dbf6	Merge pull request #55 from LuminosoInsight/version2: Version 2, with standalone text pre-processing	2018-03-15 14:26:49 -04:00
Robyn Speer	d9bc4af8cd	update the changelog	2018-03-14 17:56:29 -04:00
Robyn Speer	b2663272a7	remove LAUGHTER_WORDS, which is now unused. This was a fun Twitter test, but we don't do that anymore.	2018-03-14 17:33:35 -04:00
Robyn Speer	65811d587e	More explicit error message for a missing wordlist	2018-03-14 15:10:27 -04:00
Robyn Speer	2ecf31ee81	Actually use `min_score` in `_language_in_list`. We don't need to set it to any value but 80 now, but we will need to if we try to distinguish three kinds of Chinese (zh-Hans, zh-Hant, and unified zh-Hani).	2018-03-14 15:08:52 -04:00
Robyn Speer	c57032d5cb	code review fixes to wordfreq.tokens	2018-03-14 15:07:45 -04:00
Robyn Speer	de81a23b9d	code review fixes to __init__	2018-03-14 15:04:59 -04:00
Robyn Speer	8656688b0b	fix mention of dependencies in README	2018-03-14 15:01:08 -04:00
Robyn Speer	d68d4baad2	Subtle changes to CJK frequencies. This is the result of re-running exquisite-corpus via wordfreq 2. The frequencies for most languages were identical. Small changes that move words by a few places in the list appeared in Chinese, Japanese, and Korean. There are also even smaller changes in Bengali and Hindi. The source of the CJK change is that Roman letters are case-folded _before_ Jieba or MeCab tokenization, which changes their output in a few cases. In Hindi, one word changed frequency in the top 500. In Bengali, none of those words changed frequency, but the data file is still different. I'm not sure I have such a solid explanation here, except that these languages use the regex tokenizer, and we just updated the regex dependency, which could affect some edge cases of these languages.	2018-03-14 11:36:02 -04:00
Robyn Speer	0cb36aa74f	cache the language info (avoids 10x slowdown)	2018-03-09 14:54:03 -05:00
Robyn Speer	b162de353d	avoid log spam: only warn about an unsupported language once	2018-03-09 11:50:15 -05:00
Robyn Speer	c5f64a5de8	update the README	2018-03-08 18:16:15 -05:00
Robyn Speer	d8e3669a73	wordlist updates from new exquisite-corpus	2018-03-08 18:16:00 -05:00
Robyn Speer	53dc0bbb1a	Test that we can leave the wordlist unspecified and get 'large' freqs	2018-03-08 18:09:57 -05:00
Robyn Speer	8e3dff3c1c	Traditional Chinese should be preserved through tokenization	2018-03-08 18:08:55 -05:00
Robyn Speer	45064a292f	reorganize wordlists into 'small', 'large', and 'best'	2018-03-08 17:52:44 -05:00
Robyn Speer	fe85b4e124	fix az-Latn transliteration, and test	2018-03-08 16:47:36 -05:00
Robyn Speer	a4d9614e39	setup: update version number and dependencies	2018-03-08 16:26:24 -05:00
Robyn Speer	5ab5d2ea55	Separate preprocessing from tokenization	2018-03-08 16:26:17 -05:00
Robyn Speer	72646f16a1	minor fixes to README	2018-02-28 16:14:50 -05:00
Robyn Speer	cd7bfc4060	Merge pull request #54 from LuminosoInsight/fix-deps: Fix setup.py (version number and msgpack dependency)	2018-02-28 12:46:46 -08:00
Robyn Speer	208559ae1e	bump version to 1.7.0, belatedly	2018-02-28 15:15:47 -05:00
Robyn Speer	98cb47c774	update msgpack-python dependency to msgpack	2018-02-28 15:14:51 -05:00
Robyn Speer	ec9c94be92	update citation to v1.7	2017-09-27 13:36:30 -04:00
Andrew Lin	95a13ab4ce	Merge pull request #51 from LuminosoInsight/version1.7: Version 1.7: update tokenization, update Wikipedia data, add languages	2017-09-08 17:02:05 -04:00
Robyn Speer	b042f2be9d	remove unnecessary enumeration from top_n.py	2017-09-08 16:52:06 -04:00
Robyn Speer	fb4a7db6f7	update README for 1.7; sort language list in English order	2017-08-25 17:38:31 -04:00
Robyn Speer	46e32fbd36	v1.7: update tokenization, update data, add `bn` and `mk`	2017-08-25 17:37:48 -04:00
Robyn Speer	9dac967ca3	Tokenize by graphemes, not codepoints (#50). Tokenize by graphemes, not codepoints; add more documentation to TOKEN_RE; remove extra line break; update docstring (Brahmic scripts are no longer an exception); approve using version 2017.07.28 of regex.	2017-08-08 11:35:28 -04:00
Andrew Lin	6c118c0b6a	Merge pull request #49 from LuminosoInsight/restore-langcodes: Use langcodes when tokenizing again	2017-05-10 16:20:06 -04:00
Robyn Speer	aa3ed23282	v1.6.1: depend on langcodes 1.4	2017-05-10 13:26:23 -04:00
Robyn Speer	71a0ad6abb	Use langcodes when tokenizing again (it no longer connects to a DB)	2017-04-27 15:09:59 -04:00
Robyn Speer	ae7bc5764b	Merge pull request #48 from LuminosoInsight/code-review-notes: Code review notes	2017-02-15 12:29:25 -08:00
Andrew Lin	c2e1504643	Clarify the changelog.	2017-02-14 13:09:12 -05:00
Andrew Lin	1363f9d2e0	Correct a case in transliterate.py.	2017-02-14 13:08:23 -05:00
Andrew Lin	72e3678e89	Merge pull request #47 from LuminosoInsight/all-1.6-changes: All 1.6 changes	2017-02-01 15:36:38 -05:00
Robyn Speer	a099a5a881	Remove ninja2dot script, which is no longer used	2017-02-01 14:49:44 -05:00
Robyn Speer	7dec335f74	describe the current problem with 'cyrtranslit' as a dependency	2017-01-31 18:25:52 -05:00
Robyn Speer	19b72132e7	Fix some outdated numbers in English examples	2017-01-31 18:25:41 -05:00
Robyn Speer	abd0820a32	Handle smashing numbers only at the end of tokenize(). This does make the code a lot clearer.	2017-01-11 19:04:19 -05:00
Robyn Speer	93306e55a0	Update README with new examples and URL	2017-01-09 15:13:19 -05:00
Robyn Speer	9a6beb0089	test that number-smashing still happens in freq lookups	2017-01-06 19:20:41 -05:00
Robyn Speer	573ecc53d0	Don't smash numbers in all tokenization, just when looking up freqs. I forgot momentarily that the output of the tokenizer is used by other code.	2017-01-06 19:18:52 -05:00
Robyn Speer	3cb3c38f47	update the README, citing OpenSubtitles 2016	2017-01-06 19:04:40 -05:00
Robyn Speer	86f22e8523	Mention that multi-digit numbers are combined together	2017-01-05 19:24:28 -05:00
Robyn Speer	48a5967e9a	mention tokenization change in changelog	2017-01-05 19:19:31 -05:00
Robyn Speer	39e459ac71	Update documentation and bump version to 1.6	2017-01-05 19:18:06 -05:00
Robyn Speer	23c7c8e936	update data from Exquisite Corpus in English and Swedish	2017-01-05 19:17:51 -05:00
Robyn Speer	7dc3f03ebd	import new wordlists from Exquisite Corpus	2017-01-05 17:59:26 -05:00
Robyn Speer	de32a15b4f	Merge branch 'transliterate-serbian' into all-1.6-changes	2017-01-05 17:57:52 -05:00

629 Commits