wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-27 02:48:51 +00:00

Author	SHA1	Message	Date
Rob Speer	49bd631632	fix URL expression Former-commit-id: `c4a2594217`	2015-08-26 15:00:46 -04:00
Rob Speer	6286946cc3	correct the simple_tokenize docstring Former-commit-id: `f7babea352`	2015-08-26 13:54:50 -04:00
Rob Speer	232aee9c66	refactor the token expression Former-commit-id: `01b6403ef4`	2015-08-26 13:40:47 -04:00
Rob Speer	40d6b85d67	un-flake wordfreq_builder.tokenizers, and edit docstrings Former-commit-id: `a893823d6e`	2015-08-26 13:03:23 -04:00
Rob Speer	7a757d9ec9	remove regex files that are no longer needed Former-commit-id: `94467a6563`	2015-08-26 11:48:11 -04:00
Rob Speer	1f5c828642	bump to version 1.1 Former-commit-id: `694c28d5e4`	2015-08-25 17:44:52 -04:00
Rob Speer	d064fbec7d	update the README Former-commit-id: `573dd1ec79`	2015-08-25 17:44:34 -04:00
Rob Speer	244735ce4d	updated data Former-commit-id: `353b8045da`	2015-08-25 17:16:03 -04:00
Rob Speer	a3b37f6619	Strip apostrophes from edges of tokens The issue here is that if you had French text with an apostrophe, such as "d'un", it would split it into "d'" and "un", but if "d'" were re-tokenized it would come out as "d". Stripping apostrophes makes the process more idempotent. Former-commit-id: `5a1fc00aaa`	2015-08-25 12:41:48 -04:00
Rob Speer	1042f87efe	exclude 'extenders' from the start of the token Former-commit-id: `a8e7c29068`	2015-08-25 12:33:12 -04:00
Rob Speer	a5b8c5a745	update frequency lists Former-commit-id: `0d600bdf27`	2015-08-25 11:43:59 -04:00
Rob Speer	99a312ce06	Exclude math and modifier symbols as tokens Former-commit-id: `8f3c9f576c`	2015-08-25 11:43:22 -04:00
Rob Speer	6647cf9035	use better regexes in wordfreq_builder tokenizer Former-commit-id: `de73888a76`	2015-08-24 19:05:46 -04:00
Rob Speer	decd7dae60	also NFKC-normalize Japanese input Former-commit-id: `554455699d`	2015-08-24 18:13:03 -04:00
Rob Speer	9178c6de37	only NFKC-normalize in Arabic Former-commit-id: `1d055edc1c`	2015-08-24 17:55:17 -04:00
Rob Speer	6a33b46cfd	remove Hangul fillers that confuse cld2 Former-commit-id: `140ca6c050`	2015-08-24 17:11:18 -04:00
Rob Speer	759a8199fb	remove obsolete gen_regex.py Former-commit-id: `102bc715ae`	2015-08-24 17:11:18 -04:00
Rob Speer	f4cf46ab9c	Use the regex implementation of Unicode segmentation Former-commit-id: `95998205ad`	2015-08-24 17:11:08 -04:00
Rob Speer	0721707d92	Merge pull request #21 from LuminosoInsight/review-notes Review notes Former-commit-id: `2b8089e2b1`	2015-08-03 14:48:15 -04:00
Andrew Lin	10bddfe09f	Document the NFKC-normalized ligature in the Arabic test. Former-commit-id: `41e1dd41d8`	2015-08-03 11:09:44 -04:00
Andrew Lin	581dcbcae5	Stylistic cleanups to word_counts.py. Former-commit-id: `6d40912ef9`	2015-07-31 19:26:18 -04:00
Andrew Lin	a5553676e4	Switch to more explanatory Unicode escapes when testing NFKC normalization. Former-commit-id: `66c69e6fac`	2015-07-31 19:23:42 -04:00
Andrew Lin	f393086253	Remove redundant reference to wikipedia in builder README. Former-commit-id: `53621c34df`	2015-07-31 19:12:59 -04:00
Andrew Lin	be7bc11cad	Merge pull request #20 from LuminosoInsight/cutoff-fix put back the freqs_to_cBpack cutoff; prepare for 1.0 Former-commit-id: `742e2b3374`	2015-07-29 11:43:41 -04:00
Rob Speer	0f0aca8320	Don't use the file-reading cutoff when writing centibels Former-commit-id: `e9f9c94e36`	2015-07-28 18:45:26 -04:00
Rob Speer	3892c36b16	update wordlists with cutoff fix Former-commit-id: `eb4b3cad50`	2015-07-28 18:03:12 -04:00
Rob Speer	4350bc3ed7	put back the freqs_to_cBpack cutoff; prepare for 1.0 Former-commit-id: `c5708b24e4`	2015-07-28 18:01:12 -04:00
Rob Speer	b537f4ecfb	Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17 Code review fixes 2015 07 17 Former-commit-id: `32102ba3c2`	2015-07-22 15:09:00 -04:00
Joshua Chin	8004ecb790	updated read_freqs docs Former-commit-id: `93cd902899`	2015-07-22 10:06:16 -04:00
Joshua Chin	0d8bf35fab	fixed style Former-commit-id: `4fe9d110e1`	2015-07-22 10:05:11 -04:00
Joshua Chin	78324e74eb	reordered command line args Former-commit-id: `6453d864c4`	2015-07-22 10:04:14 -04:00
Joshua Chin	dcde7c8a28	added updated wordfreq data Former-commit-id: `be29243cec`	2015-07-21 10:32:53 -04:00
Joshua Chin	6f47f76458	bugfix Former-commit-id: `8081145922`	2015-07-21 10:12:56 -04:00
Joshua Chin	0a2f2877af	fixed rules.ninja Former-commit-id: `c5f82ecac1`	2015-07-20 17:20:29 -04:00
Joshua Chin	c1f56f5c96	fixed build bug Former-commit-id: `643571c69c`	2015-07-20 16:51:25 -04:00
Joshua Chin	423b2d8443	ensure removal of tatweels (hopefully) Former-commit-id: `173278fdd3`	2015-07-20 16:48:36 -04:00
Joshua Chin	efe7bc3720	unhoisted if statement Former-commit-id: `298d3c1d24`	2015-07-20 11:10:41 -04:00
Joshua Chin	b5a358012b	ninja.py is now pep8 compliant Former-commit-id: `accb7e398c`	2015-07-20 11:06:58 -04:00
Joshua Chin	40ba602c10	made single line docstring single line Former-commit-id: `c70ddf00ea`	2015-07-20 10:29:02 -04:00
Joshua Chin	b787d9104e	updated word_frequency docstring for Chinese Former-commit-id: `01b286e801`	2015-07-20 10:28:11 -04:00
Joshua Chin	8be9d15a80	updated datafiles Former-commit-id: `465afb854c`	2015-07-20 10:05:27 -04:00
Joshua Chin	a3880608b9	fixed build Former-commit-id: `221acf7921`	2015-07-17 17:44:01 -04:00
Rob Speer	176223bd5d	mention the Wikipedia data, and credit Hermit Dave Former-commit-id: `2d1020daac`	2015-07-17 17:09:36 -04:00
Joshua Chin	c3a14a8a09	fixed tokenize_twitter Former-commit-id: `f31f9a1bcd`	2015-07-17 16:37:47 -04:00
Joshua Chin	af73f813be	added cld2 tokenizer comments Former-commit-id: `a44927e98e`	2015-07-17 16:03:33 -04:00
Joshua Chin	5c7e0dd0dd	fix arabic tokens Former-commit-id: `11a1c51321`	2015-07-17 15:52:12 -04:00
Joshua Chin	a868c99839	fixed syntax Former-commit-id: `c75c735d8d`	2015-07-17 15:43:24 -04:00
Joshua Chin	8006f80a99	factored out fixing arabic Former-commit-id: `4e3a5263c3`	2015-07-17 15:39:12 -04:00
Joshua Chin	f2546d8d33	renamed tokenize file to tokenize twitter Former-commit-id: `303bd88ba2`	2015-07-17 15:27:26 -04:00
Joshua Chin	d3a5191fb0	created last_tab flag Former-commit-id: `d6519cf736`	2015-07-17 15:19:09 -04:00
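
The message on commit a3b37f6619 describes an idempotence problem: "d'un" is split into "d'" and "un", but "d'" tokenized on its own comes out as "d". The sketch below reproduces that behavior with a deliberately simplified regular expression. The pattern and the function names `naive_tokenize` and `stripping_tokenize` are hypothetical illustrations, not wordfreq's actual token expression or API.

```python
import re

# Hypothetical, simplified pattern (NOT wordfreq's real expression):
# a word keeps a trailing apostrophe only when another word character follows.
TOKEN_RE = re.compile(r"\w+'(?=\w)|\w+")


def naive_tokenize(text):
    """Split text into tokens, leaving any edge apostrophes in place."""
    return TOKEN_RE.findall(text)


def stripping_tokenize(text):
    """Split text into tokens, then strip apostrophes from token edges."""
    stripped = (token.strip("'") for token in TOKEN_RE.findall(text))
    return [token for token in stripped if token]


def retokenize(tokenizer, tokens):
    """Run each token back through the tokenizer and flatten the result."""
    return [piece for token in tokens for piece in tokenizer(token)]


if __name__ == "__main__":
    first_pass = naive_tokenize("d'un")
    # The naive version changes its own output on a second pass.
    print(first_pass, retokenize(naive_tokenize, first_pass))

    first_pass = stripping_tokenize("d'un")
    # The stripping version returns the same tokens on a second pass.
    print(first_pass, retokenize(stripping_tokenize, first_pass))
```

In this sketch the naive tokenizer is not stable under re-tokenization, while the stripping variant is, which is the property the commit message is after.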

622 Commits