wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Elia Robyn Speer	cc4f39d8c2	Merge remote-tracking branch 'origin/apostrophe-consistency'	2021-09-02 18:13:53 +00:00
Elia Robyn Speer	dc9585766a	use ftfy's uncurl_quotes in lossy_tokenize	2021-09-02 17:47:47 +00:00
Robyn Speer	f885a60bf0	Remove Malayalam; support for it isn't ready There are Unicode normalization problems with Malayalam -- as best I understand it, Unicode simply neglected to include normalization forms for Malayalam "chillu" characters even though they changed how they're represented in Unicode 5.1 and again in Unicode 9. The result is that words that print the same end up with multiple entries, with different codepoint sequences that don't normalize to each other. I certainly don't know how to resolve this, and it would need to be resolved to have something that we could reasonably call Malayalam word frequencies.	2021-03-30 14:10:58 -04:00
Robyn Speer	08b6cea451	Update table, remove Galician (only two sources)	2021-03-30 13:17:36 -04:00
Robyn Speer	cb78887446	remove Swahili, the data isn't reliable	2021-03-29 18:15:58 -04:00
Robyn Speer	d1949a486a	update data and tests for 2.5	2021-03-29 16:18:08 -04:00
Robyn Speer	6b97d093b6	specifically test that the long sequence underflows to 0	2021-02-18 15:09:31 -05:00
Robyn Speer	bd57b64d00	change math for INFERRED_SPACE_FACTOR to not overflow	2021-02-18 14:44:39 -05:00
Robyn Speer	174ecf580a	update dependencies and test for consistent results	2020-09-08 16:03:33 -04:00
Robyn Speer	96e7792a4a	fix regex's inconsistent word breaking around apostrophes	2020-04-28 15:19:56 -04:00
Rob Speer	d06a6a48c5	include data from xc rebuild	2018-07-15 01:01:35 -04:00
Rob Speer	b2d242e8bf	Recognize "@" in gender-neutral word endings as part of the token	2018-07-03 13:22:56 -04:00
Rob Speer	c3b32b3c4a	Round frequencies to 3 significant digits	2018-06-18 15:21:33 -04:00
Rob Speer	2f6b87c86b	relax the test that assumed the Chinese list has few ASCII words	2018-06-15 16:29:15 -04:00
Rob Speer	57f676f4a6	fixes to tests, including that 'test.py' wasn't found by pytest	2018-06-15 15:48:41 -04:00
Rob Speer	93e3e03c60	update tests to include new languages Also, it's easy to say `>=` in pytest	2018-06-12 17:55:44 -04:00
Rob Speer	96a01b9685	port remaining tests to pytest	2018-06-01 16:40:51 -04:00
Rob Speer	863d5be522	port test.py and test_chinese.py to pytest	2018-06-01 16:33:06 -04:00
Rob Speer	3ec92a8952	Handle Japanese edge cases in simple_tokenize	2018-04-26 15:53:07 -04:00
Rob Speer	6f1a9aaff1	remove LAUGHTER_WORDS, which is now unused This was a fun Twitter test, but we don't do that anymore	2018-03-14 17:33:35 -04:00
Rob Speer	1594ba3ad6	Test that we can leave the wordlist unspecified and get 'large' freqs	2018-03-08 18:09:57 -05:00
Rob Speer	47dac3b0b8	Traditional Chinese should be preserved through tokenization	2018-03-08 18:08:55 -05:00
Rob Speer	5a5acec9ff	reorganize wordlists into 'small', 'large', and 'best'	2018-03-08 17:52:44 -05:00
Rob Speer	67e4475763	fix az-Latn transliteration, and test	2018-03-08 16:47:36 -05:00
Rob Speer	45b9bcdbcb	Separate preprocessing from tokenization	2018-03-08 16:26:17 -05:00
Rob Speer	e3352392cc	v1.7: update tokenization, update data, add `bn` and `mk`	2017-08-25 17:37:48 -04:00
Rob Speer	dcef5813b3	Tokenize by graphemes, not codepoints (#50) * Tokenize by graphemes, not codepoints * Add more documentation to TOKEN_RE * Remove extra line break * Update docstring - Brahmic scripts are no longer an exception * approve using version 2017.07.28 of regex	2017-08-08 11:35:28 -04:00
Rob Speer	d6cdef6039	Use langcodes when tokenizing again (it no longer connects to a DB)	2017-04-27 15:09:59 -04:00
Rob Speer	f03a37e19c	test that number-smashing still happens in freq lookups	2017-01-06 19:20:41 -05:00
Rob Speer	4dfa800cd8	Don't smash numbers in all tokenization, just when looking up freqs I forgot momentarily that the output of the tokenizer is used by other code.	2017-01-06 19:18:52 -05:00
Rob Speer	f671a1db7f	import new wordlists from Exquisite Corpus	2017-01-05 17:59:26 -05:00
Rob Speer	99eac54b31	transliterate: Handle unexpected Russian invasions	2017-01-04 18:51:00 -05:00
Rob Speer	b3e5d1c9e9	Add transliteration of Cyrillic Serbian	2016-12-29 18:27:17 -05:00
Rob Speer	24e26c4c1d	add a test for "aujourd'hui"	2016-12-06 17:39:40 -05:00
Rob Speer	d18b149262	Bake the 'h special case into the regex This lets me remove the French-specific code I just put in.	2016-12-06 17:37:35 -05:00
Rob Speer	f285430c84	add a specific test in Catalan	2016-12-05 18:54:51 -05:00
Rob Speer	02e2430dfb	add tests for French apostrophe tokenization	2016-12-05 18:54:51 -05:00
Rob Speer	99b627a300	Revise multilingual tests Former-commit-id: `21246f881f`	2016-07-29 12:19:12 -04:00
Rob Speer	9758c69ff0	Add Common Crawl data and more languages (#39) This changes the version from 1.4.2 to 1.5. Things done in this update include: * include Common Crawl; support 11 more languages * new frequency-merging strategy * New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list * Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all. * Add Korean tokenization, and include MeCab files in data * Remove marks from more languages * Deal with commas and cedillas in Turkish and Romanian Former-commit-id: `e6a8f028e3`	2016-07-28 19:23:17 -04:00
Rob Speer	a0893af82e	Tokenization in Korean, plus abjad languages (#38) * Remove marks from more languages * Add Korean tokenization, and include MeCab files in data * add a Hebrew tokenization test * fix terminology in docstrings about abjad scripts * combine Japanese and Korean tokenization into the same function Former-commit-id: `fec6eddcc3`	2016-07-15 15:10:25 -04:00
Rob Speer	ac24b8eab4	Fix tokenization of SE Asian and South Asian scripts (#37) Former-commit-id: `270f6c7ca6`	2016-07-01 18:00:57 -04:00
Rob Speer	c3fd3bd734	fix Arabic test, where 'lol' is no longer common Former-commit-id: `da79dfb247`	2016-05-11 17:01:47 -04:00
Rob Speer	c2eab6881e	move Thai test to where it makes more sense Former-commit-id: `4ec6b56faa`	2016-03-10 11:56:15 -05:00
Rob Speer	a32162c04f	Leave Thai segments alone in the default regex Our regex already has a special case to leave Chinese and Japanese alone when an appropriate tokenizer for the language isn't being used, as Unicode's default segmentation would make every character into its own token. The same thing happens in Thai, and we don't even have an appropriate tokenizer for Thai, so I've added a similar fallback. Former-commit-id: `07f16e6f03`	2016-02-22 14:32:59 -05:00
Rob Speer	f89ac5e400	test_chinese: fix typo in comment Former-commit-id: `2a84a926f5`	2015-09-24 13:41:11 -04:00
Rob Speer	faf66e9b08	Merge branch 'master' into chinese-external-wordlist Conflicts: wordfreq/chinese.py Former-commit-id: `cea2a61444`	2015-09-24 13:40:08 -04:00
Andrew Lin	ee6df56514	Revert a small syntax change introduced by a circular series of changes. Former-commit-id: `09597b7cf3`	2015-09-24 13:24:11 -04:00
Rob Speer	1b7117952b	don't apply the inferred-space penalty to Japanese Former-commit-id: `db5eda6051`	2015-09-24 12:50:06 -04:00
Rob Speer	963e0ff785	refactor the tokenizer, add `include_punctuation` option Former-commit-id: `e8e6e0a231`	2015-09-15 13:26:09 -04:00
Rob Speer	e3a79ab8c9	add `external_wordlist` option to tokenize Former-commit-id: `669bd16c13`	2015-09-10 18:09:41 -04:00

86 Commits