wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Elia Robyn Lake	71f2757b8b	packaging updates	2022-03-11 10:43:37 -05:00
Elia Robyn Lake	f893435b75	documentation updates	2022-03-10 19:22:53 -05:00
Elia Robyn Lake	981fab53aa	add py.typed	2022-03-10 19:16:38 -05:00
Elia Robyn Lake	ed7dccbf8b	update version and documentation	2022-03-10 19:12:45 -05:00
Elia Robyn Lake	bf05b1b1dc	estimate the freq distribution of numbers	2022-03-10 18:33:42 -05:00
Elia Robyn Lake	4e373750e8	move notes to self into notes/	2022-03-09 17:22:36 -05:00
Elia Robyn Lake	f800ff9bcc	work on rel. frequencies of numbers, and other features	2022-02-18 11:33:28 -05:00
Elia Robyn Lake	ef4d6fe0df	run black	2022-02-08 18:27:18 -05:00
Elia Robyn Lake	3c4819e7e5	update packaging, try to handle digits better	2022-02-08 18:24:36 -05:00
Elia Robyn Speer	2361606b3a	fix merge conflict markers in setup	2021-09-02 21:49:49 +00:00
Elia Robyn Speer	b60ac1b803	Merge remote-tracking branch 'origin/apostrophe-consistency'	2021-09-02 18:13:53 +00:00
Elia Robyn Speer	c2a9fe03f1	use ftfy's uncurl_quotes in lossy_tokenize	2021-09-02 17:47:47 +00:00
Robyn Speer	6f1f626f1b	update email address	2021-08-23 17:46:34 -04:00
Robyn Speer	c244ff0d10	readme update: web text comes from OSCAR	2021-04-15 14:45:29 -04:00
Sara Jewett	b13d35e503	Merge pull request #91 from LuminosoInsight/data-update-2.5 Version 2.5, incorporating OSCAR data	2021-04-15 14:32:10 -04:00
Robyn Speer	16122083b3	XC was built without Russian Web data; reflect this in the table The Russian sub-corpus of OSCAR is corrupted, so we skipped over it in the exquisite-corpus build.	2021-04-14 14:28:12 -04:00
Robyn Speer	b6614c1a33	Merge branch 'data-update-2.5' of github.com:LuminosoInsight/wordfreq into data-update-2.5	2021-04-14 14:26:54 -04:00
Robyn Speer	08816a21d1	Remove Malayalam; support for it isn't ready There are Unicode normalization problems with Malayalam -- as best I understand it, Unicode simply neglected to include normalization forms for Malayalam "chillu" characters even though they changed how they're represented in Unicode 5.1 and again in Unicode 9. The result is that words that print the same end up with multiple entries, with different codepoint sequences that don't normalize to each other. I certainly don't know how to resolve this, and it would need to be resolved to have something that we could reasonably call Malayalam word frequencies.	2021-03-30 14:10:58 -04:00
Robyn Speer	90f0e0a88e	Update table, remove Galician (only two sources)	2021-03-30 13:17:36 -04:00
Robyn Speer	9bab1024b7	add OSCAR citation	2021-03-30 12:56:10 -04:00
Robyn Speer	fea45fd501	Merge remote-tracking branch 'origin/master' into data-update-2.5	2021-03-30 12:53:09 -04:00
Robyn Speer	8777ad0811	remove Swahili, the data isn't reliable	2021-03-29 18:15:58 -04:00
Robyn Speer	00e60df106	Merge branch 'master' into data-update-2.5	2021-03-29 16:42:24 -04:00
Robyn Speer	fc5c4cdda8	small documentation fixes	2021-03-29 16:41:47 -04:00
Robyn Speer	ec48c0a123	update data and tests for 2.5	2021-03-29 16:18:08 -04:00
Lance Nathan	32093d9efc	Merge pull request #89 from LuminosoInsight/dependencies-and-tokens Rework CJK dependencies and fix a tokenization bug	2021-02-23 15:15:17 -05:00
Robyn Speer	168bb2a6ed	fix version, update instructions and changelog	2021-02-18 18:25:16 -05:00
Robyn Speer	de636a804e	Use Python packages to find dictionaries for MeCab	2021-02-18 18:18:06 -05:00
Robyn Speer	ed23bf3ebe	specifically test that the long sequence underflows to 0	2021-02-18 15:09:31 -05:00
Robyn Speer	75a56b68fb	change math for INFERRED_SPACE_FACTOR to not overflow	2021-02-18 14:44:39 -05:00
Lance Nathan	7318f58df9	Merge pull request #88 from LuminosoInsight/version2.4 work with langcodes 3.0, without language_data	2021-02-09 17:36:09 -05:00
Robyn Speer	ad3a5c533f	work with langcodes 3.0, without language_data	2021-02-09 17:27:22 -05:00
Robyn Speer	53b1ee2fa0	Merge pull request #84 from LuminosoInsight/add-initial-vowels Update the "initial vowels" in French/Catalan	2021-02-03 13:47:30 -05:00
Lance Nathan	a31deec580	Update the "initial vowels" in French/Catalan User LBeaudoux observed (https://github.com/LuminosoInsight/wordfreq/pull/82) that "Œ and œ should be considered as vowels that might appear at the start of a word in French". Further investigation of the French wordfreq list revealed words in the data starting with other vowels (such as d'yvonne, d'åland, l'ïle, d'özil). This PR is a combination of LBeaudoux's PR and the latter fact. (The updated regex is also used for Catalan, but should have no actual effect. To the best of our understanding, "y" appears in Catalan only in the digraph "ny" and in foreign words--the Catalan wordlist contains "york", "by", "city", several English names, and so forth, but no real Catalan words starting with "y"; cf "ioga", "iogurt". The wordlist in fact contained "l'fbi" and "l'nba", but cases of "l'" followed by a vowel like the ones found in French.)	2020-10-08 12:23:22 -04:00
Robyn Speer	c8229a5378	update the changelog	2020-10-01 16:12:41 -04:00
Robyn Speer	fd0ac9a272	update README examples	2020-10-01 16:05:43 -04:00
Robyn Speer	8c00a3c500	updated frequency data	2020-09-30 17:56:12 -04:00
Robyn Speer	ad02d96f1b	update dependencies and test for consistent results	2020-09-08 16:03:33 -04:00
Lance Nathan	ca4681b361	Merge pull request #77 from LuminosoInsight/regex-apostrophe-fix Fix regex's inconsistent word breaking around apostrophes	2020-04-28 16:19:40 -04:00
Robyn Speer	0ff812a711	update version and changelog	2020-04-28 15:24:24 -04:00
Robyn Speer	13ce4606b2	fix regex's inconsistent word breaking around apostrophes	2020-04-28 15:19:56 -04:00
Robyn Speer	86ae2a610f	update CHANGELOG for 2.3.1	2020-04-22 11:12:02 -04:00
Robyn Speer	26b4175f3b	packaging fix: require msgpack >= 1.0	2020-04-22 11:10:03 -04:00
Lance Nathan	7c537134ae	Merge pull request #75 from LuminosoInsight/language-match-update use langcodes 2.0 and deprecate 'match_cutoff'	2020-04-20 14:48:58 -04:00
Robyn Speer	d45bcf97de	update changelog for 2.3	2020-04-16 15:51:20 -04:00
Robyn Speer	bf795e6d6c	use langcodes 2.0 and deprecate 'match_cutoff'	2020-04-16 14:09:30 -04:00
Moss Collum	40443c9a3b	Merge pull request #74 from LuminosoInsight/msgpack-1.0-bugfix Fix code affected by a breaking change in msgpack 1.0	2020-02-28 13:05:37 -05:00
Lance Nathan	45a002c1e1	Fix code affected by a breaking change in msgpack 1.0 The msgpack readme explains: "Default value of strict_map_key is changed to True to avoid hashdos. You need to pass strict_map_key=False if you have data which contain map keys which type is not bytes or str." chinese.py loads SIMPLIFIED_MAP from disk. Since it is a str.translate dictionary, its keys are numbers. And since it's a dictionary we created ourselves, there's no hashdos concern, so we can load it with strict_map_key=False.	2020-02-28 13:02:45 -05:00
Lance Nathan	e043ebb481	Merge pull request #73 from LuminosoInsight/add-mailmap Add a mailmap	2019-12-18 13:59:36 -05:00
Robyn Speer	feab8b77fb	add a mailmap	2019-12-18 13:52:22 -05:00
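
The Malayalam problem described in commit 08816a21d1 can be reproduced with the standard library alone. The sketch below is not taken from wordfreq; the constant and function names are hypothetical, and it covers only one chillu letter and only the Unicode 5.1 change mentioned in the commit message. It compares the atomic chillu code point with the older consonant + virama + zero-width-joiner sequence and checks whether any standard normalization form makes them equal.

```python
import unicodedata

# Two encodings of the Malayalam chillu "n" that are meant to render the same.
# (Names below are illustrative, not part of wordfreq.)
ATOMIC_CHILLU_N = "\u0d7b"              # MALAYALAM LETTER CHILLU N, added in Unicode 5.1
LEGACY_CHILLU_N = "\u0d28\u0d4d\u200d"  # NA + VIRAMA + ZERO WIDTH JOINER

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalized_forms(text):
    """Return the text under each of the four standard normalization forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


def collide_after_normalization(a, b):
    """True if some normalization form maps both strings to the same sequence."""
    forms_a, forms_b = normalized_forms(a), normalized_forms(b)
    return any(forms_a[form] == forms_b[form] for form in FORMS)


if __name__ == "__main__":
    # A precomposed accented letter and its decomposed spelling do unify...
    print(collide_after_normalization("\u00e9", "e\u0301"))
    # ...but the two chillu spellings stay distinct under every form.
    print(collide_after_normalization(ATOMIC_CHILLU_N, LEGACY_CHILLU_N))
```

Because no normalization form unifies the two spellings, a frequency table keyed on normalized strings would still hold separate entries for words that print identically, which is the situation the commit message describes.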

626 Commits