wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Robyn Speer	08816a21d1	Remove Malayalam; support for it isn't ready. There are Unicode normalization problems with Malayalam -- as best I understand it, Unicode simply neglected to include normalization forms for Malayalam "chillu" characters even though they changed how they're represented in Unicode 5.1 and again in Unicode 9. The result is that words that print the same end up with multiple entries, with different codepoint sequences that don't normalize to each other. I certainly don't know how to resolve this, and it would need to be resolved to have something that we could reasonably call Malayalam word frequencies.	2021-03-30 14:10:58 -04:00
Robyn Speer	90f0e0a88e	Update table, remove Galician (only two sources)	2021-03-30 13:17:36 -04:00
Robyn Speer	8777ad0811	remove Swahili, the data isn't reliable	2021-03-29 18:15:58 -04:00
Robyn Speer	ec48c0a123	update data and tests for 2.5	2021-03-29 16:18:08 -04:00
Robyn Speer	ed23bf3ebe	specifically test that the long sequence underflows to 0	2021-02-18 15:09:31 -05:00
Robyn Speer	75a56b68fb	change math for INFERRED_SPACE_FACTOR to not overflow	2021-02-18 14:44:39 -05:00
Robyn Speer	13ce4606b2	fix regex's inconsistent word breaking around apostrophes	2020-04-28 15:19:56 -04:00
Robyn Speer	86b928f967	include data from xc rebuild	2018-07-15 01:01:35 -04:00
Robyn Speer	65692c3d81	Recognize "@" in gender-neutral word endings as part of the token	2018-07-03 13:22:56 -04:00
Robyn Speer	7a32b56c1c	Round frequencies to 3 significant digits	2018-06-18 15:21:33 -04:00
Robyn Speer	42efcfc1ad	relax the test that assumed the Chinese list has few ASCII words	2018-06-15 16:29:15 -04:00
Robyn Speer	ad0f046f47	fixes to tests, including that 'test.py' wasn't found by pytest	2018-06-15 15:48:41 -04:00
Robyn Speer	a975bcedae	update tests to include new languages. Also, it's easy to say `>=` in pytest	2018-06-12 17:55:44 -04:00
Robyn Speer	b3c42be331	port remaining tests to pytest	2018-06-01 16:40:51 -04:00
Robyn Speer	75b4d62084	port test.py and test_chinese.py to pytest	2018-06-01 16:33:06 -04:00
Robyn Speer	666f7e51fa	Handle Japanese edge cases in simple_tokenize	2018-04-26 15:53:07 -04:00
Robyn Speer	b2663272a7	remove LAUGHTER_WORDS, which is now unused. This was a fun Twitter test, but we don't do that anymore	2018-03-14 17:33:35 -04:00
Robyn Speer	53dc0bbb1a	Test that we can leave the wordlist unspecified and get 'large' freqs	2018-03-08 18:09:57 -05:00
Robyn Speer	8e3dff3c1c	Traditional Chinese should be preserved through tokenization	2018-03-08 18:08:55 -05:00
Robyn Speer	45064a292f	reorganize wordlists into 'small', 'large', and 'best'	2018-03-08 17:52:44 -05:00
Robyn Speer	fe85b4e124	fix az-Latn transliteration, and test	2018-03-08 16:47:36 -05:00
Robyn Speer	5ab5d2ea55	Separate preprocessing from tokenization	2018-03-08 16:26:17 -05:00
Robyn Speer	46e32fbd36	v1.7: update tokenization, update data, add `bn` and `mk`	2017-08-25 17:37:48 -04:00
Robyn Speer	9dac967ca3	Tokenize by graphemes, not codepoints (#50) * Tokenize by graphemes, not codepoints * Add more documentation to TOKEN_RE * Remove extra line break * Update docstring - Brahmic scripts are no longer an exception * approve using version 2017.07.28 of regex	2017-08-08 11:35:28 -04:00
Robyn Speer	71a0ad6abb	Use langcodes when tokenizing again (it no longer connects to a DB)	2017-04-27 15:09:59 -04:00
Robyn Speer	9a6beb0089	test that number-smashing still happens in freq lookups	2017-01-06 19:20:41 -05:00
Robyn Speer	573ecc53d0	Don't smash numbers in all tokenization, just when looking up freqs. I forgot momentarily that the output of the tokenizer is used by other code.	2017-01-06 19:18:52 -05:00
Robyn Speer	7dc3f03ebd	import new wordlists from Exquisite Corpus	2017-01-05 17:59:26 -05:00
Robyn Speer	87b03325db	transliterate: Handle unexpected Russian invasions	2017-01-04 18:51:00 -05:00
Robyn Speer	6211b35fb3	Add transliteration of Cyrillic Serbian	2016-12-29 18:27:17 -05:00
Robyn Speer	a8e2fa5acf	add a test for "aujourd'hui"	2016-12-06 17:39:40 -05:00
Robyn Speer	21a78f5eb9	Bake the 'h special case into the regex. This lets me remove the French-specific code I just put in.	2016-12-06 17:37:35 -05:00
Robyn Speer	4376636316	add a specific test in Catalan	2016-12-05 18:54:51 -05:00
Robyn Speer	ff5a8f2a65	add tests for French apostrophe tokenization	2016-12-05 18:54:51 -05:00
Robyn Speer	68c6d95131	Revise multilingual tests Former-commit-id: `21246f881f`	2016-07-29 12:19:12 -04:00
Robyn Speer	2a41d4dc5e	Add Common Crawl data and more languages (#39) This changes the version from 1.4.2 to 1.5. Things done in this update include: * include Common Crawl; support 11 more languages * new frequency-merging strategy * New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list * Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all. * Add Korean tokenization, and include MeCab files in data * Remove marks from more languages * Deal with commas and cedillas in Turkish and Romanian Former-commit-id: `e6a8f028e3`	2016-07-28 19:23:17 -04:00
Robyn Speer	0a2bfb2710	Tokenization in Korean, plus abjad languages (#38) * Remove marks from more languages * Add Korean tokenization, and include MeCab files in data * add a Hebrew tokenization test * fix terminology in docstrings about abjad scripts * combine Japanese and Korean tokenization into the same function Former-commit-id: `fec6eddcc3`	2016-07-15 15:10:25 -04:00
Robyn Speer	3155cf27e6	Fix tokenization of SE Asian and South Asian scripts (#37) Former-commit-id: `270f6c7ca6`	2016-07-01 18:00:57 -04:00
Robyn Speer	c72326e4c0	fix Arabic test, where 'lol' is no longer common Former-commit-id: `da79dfb247`	2016-05-11 17:01:47 -04:00
Robyn Speer	f25985379c	move Thai test to where it makes more sense Former-commit-id: `4ec6b56faa`	2016-03-10 11:56:15 -05:00
Robyn Speer	51e260b713	Leave Thai segments alone in the default regex. Our regex already has a special case to leave Chinese and Japanese alone when an appropriate tokenizer for the language isn't being used, as Unicode's default segmentation would make every character into its own token. The same thing happens in Thai, and we don't even have an appropriate tokenizer for Thai, so I've added a similar fallback. Former-commit-id: `07f16e6f03`	2016-02-22 14:32:59 -05:00
Robyn Speer	4a4534c466	test_chinese: fix typo in comment Former-commit-id: `2a84a926f5`	2015-09-24 13:41:11 -04:00
Robyn Speer	e15a231401	Merge branch 'master' into chinese-external-wordlist Conflicts: wordfreq/chinese.py Former-commit-id: `cea2a61444`	2015-09-24 13:40:08 -04:00
Andrew Lin	e7d46fb104	Revert a small syntax change introduced by a circular series of changes. Former-commit-id: `09597b7cf3`	2015-09-24 13:24:11 -04:00
Robyn Speer	4d00f17477	don't apply the inferred-space penalty to Japanese Former-commit-id: `db5eda6051`	2015-09-24 12:50:06 -04:00
Robyn Speer	9a007b9948	refactor the tokenizer, add `include_punctuation` option Former-commit-id: `e8e6e0a231`	2015-09-15 13:26:09 -04:00
Robyn Speer	1adbb1aaf1	add `external_wordlist` option to tokenize Former-commit-id: `669bd16c13`	2015-09-10 18:09:41 -04:00
Robyn Speer	f0c7c3a02c	Lower the frequency of phrases with inferred token boundaries Former-commit-id: `5c8c36f4e3`	2015-09-10 14:16:22 -04:00
Robyn Speer	a4554fb87c	tokenize Chinese using jieba and our own frequencies Former-commit-id: `2327f2e4d6`	2015-09-05 03:16:56 -04:00
Robyn Speer	4704131e13	add tests for Turkish Former-commit-id: `fc93c8dc9c`	2015-09-04 17:00:05 -04:00

83 Commits