wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Robyn Speer	44c655d9a6	improve README with function documentation and examples Former-commit-id: `2370287539`	2015-08-28 17:45:50 -04:00
Andrew Lin	9fedede771	Merge pull request #22 from LuminosoInsight/standard-tokenizer Use a more standard Unicode tokenizer Former-commit-id: `e6d9b36203`	2015-08-27 11:56:19 -04:00
Robyn Speer	4edfab23ef	update data files Former-commit-id: `b952676679`	2015-08-27 03:58:54 -04:00
Robyn Speer	2c688b8238	copyedit regex comments Former-commit-id: `d5fcf4407e`	2015-08-26 17:04:56 -04:00
Robyn Speer	0b5d2cdca9	fix typo in docstring Former-commit-id: `34375958ef`	2015-08-26 16:24:35 -04:00
Robyn Speer	af29fc4f88	fix URL expression Former-commit-id: `c4a2594217`	2015-08-26 15:00:46 -04:00
Robyn Speer	e463397edf	correct the simple_tokenize docstring Former-commit-id: `f7babea352`	2015-08-26 13:54:50 -04:00
Robyn Speer	7fa449729b	refactor the token expression Former-commit-id: `01b6403ef4`	2015-08-26 13:40:47 -04:00
Robyn Speer	3a140ee02f	un-flake wordfreq_builder.tokenizers, and edit docstrings Former-commit-id: `a893823d6e`	2015-08-26 13:03:23 -04:00
Robyn Speer	769d8c627c	remove regex files that are no longer needed Former-commit-id: `94467a6563`	2015-08-26 11:48:11 -04:00
Robyn Speer	6f10e71d29	bump to version 1.1 Former-commit-id: `694c28d5e4`	2015-08-25 17:44:52 -04:00
Robyn Speer	a3a3180bb9	update the README Former-commit-id: `573dd1ec79`	2015-08-25 17:44:34 -04:00
Robyn Speer	e3658e0e42	updated data Former-commit-id: `353b8045da`	2015-08-25 17:16:03 -04:00
Robyn Speer	b22a4b0f02	Strip apostrophes from edges of tokens The issue here is that if you had French text with an apostrophe, such as "d'un", it would split it into "d'" and "un", but if "d'" were re-tokenized it would come out as "d". Stripping apostrophes makes the process more idempotent. Former-commit-id: `5a1fc00aaa`	2015-08-25 12:41:48 -04:00
Robyn Speer	0b282c5055	exclude 'extenders' from the start of the token Former-commit-id: `a8e7c29068`	2015-08-25 12:33:12 -04:00
Robyn Speer	4801b0d876	update frequency lists Former-commit-id: `0d600bdf27`	2015-08-25 11:43:59 -04:00
Robyn Speer	070c89c00c	Exclude math and modifier symbols as tokens Former-commit-id: `8f3c9f576c`	2015-08-25 11:43:22 -04:00
Robyn Speer	8637aaef9e	use better regexes in wordfreq_builder tokenizer Former-commit-id: `de73888a76`	2015-08-24 19:05:46 -04:00
Robyn Speer	7bdfb74720	also NFKC-normalize Japanese input Former-commit-id: `554455699d`	2015-08-24 18:13:03 -04:00
Robyn Speer	13096b26bd	only NFKC-normalize in Arabic Former-commit-id: `1d055edc1c`	2015-08-24 17:55:17 -04:00
Robyn Speer	4ec128adae	remove Hangul fillers that confuse cld2 Former-commit-id: `140ca6c050`	2015-08-24 17:11:18 -04:00
Robyn Speer	3674d35501	remove obsolete gen_regex.py Former-commit-id: `102bc715ae`	2015-08-24 17:11:18 -04:00
Robyn Speer	8795525372	Use the regex implementation of Unicode segmentation Former-commit-id: `95998205ad`	2015-08-24 17:11:08 -04:00
Robyn Speer	e15fc14b8e	Merge pull request #21 from LuminosoInsight/review-notes Review notes Former-commit-id: `2b8089e2b1`	2015-08-03 14:48:15 -04:00
Andrew Lin	e88cf3fdaf	Document the NFKC-normalized ligature in the Arabic test. Former-commit-id: `41e1dd41d8`	2015-08-03 11:09:44 -04:00
Andrew Lin	77610f57e1	Stylistic cleanups to word_counts.py. Former-commit-id: `6d40912ef9`	2015-07-31 19:26:18 -04:00
Andrew Lin	b0fac15f98	Switch to more explanatory Unicode escapes when testing NFKC normalization. Former-commit-id: `66c69e6fac`	2015-07-31 19:23:42 -04:00
Andrew Lin	0711fb3c43	Remove redundant reference to wikipedia in builder README. Former-commit-id: `53621c34df`	2015-07-31 19:12:59 -04:00
Andrew Lin	ba565ee838	Merge pull request #20 from LuminosoInsight/cutoff-fix put back the freqs_to_cBpack cutoff; prepare for 1.0 Former-commit-id: `742e2b3374`	2015-07-29 11:43:41 -04:00
Robyn Speer	e9dd253f1d	Don't use the file-reading cutoff when writing centibels Former-commit-id: `e9f9c94e36`	2015-07-28 18:45:26 -04:00
Robyn Speer	0a032dfa97	update wordlists with cutoff fix Former-commit-id: `eb4b3cad50`	2015-07-28 18:03:12 -04:00
Robyn Speer	3ff0f30218	put back the freqs_to_cBpack cutoff; prepare for 1.0 Former-commit-id: `c5708b24e4`	2015-07-28 18:01:12 -04:00
Robyn Speer	33e0493fd5	Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17 Code review fixes 2015 07 17 Former-commit-id: `32102ba3c2`	2015-07-22 15:09:00 -04:00
Joshua Chin	292fc96142	updated read_freqs docs Former-commit-id: `93cd902899`	2015-07-22 10:06:16 -04:00
Joshua Chin	d629e8b6cc	fixed style Former-commit-id: `4fe9d110e1`	2015-07-22 10:05:11 -04:00
Joshua Chin	f9742c94ca	reordered command line args Former-commit-id: `6453d864c4`	2015-07-22 10:04:14 -04:00
Joshua Chin	b19bba38ad	added updated wordfreq data Former-commit-id: `be29243cec`	2015-07-21 10:32:53 -04:00
Joshua Chin	474ae0da35	bugfix Former-commit-id: `8081145922`	2015-07-21 10:12:56 -04:00
Joshua Chin	34504eed80	fixed rules.ninja Former-commit-id: `c5f82ecac1`	2015-07-20 17:20:29 -04:00
Joshua Chin	61a03b87bc	fixed build bug Former-commit-id: `643571c69c`	2015-07-20 16:51:25 -04:00
Joshua Chin	af8050f1b8	ensure removal of tatweels (hopefully) Former-commit-id: `173278fdd3`	2015-07-20 16:48:36 -04:00
Joshua Chin	675a02ac11	unhoisted if statement Former-commit-id: `298d3c1d24`	2015-07-20 11:10:41 -04:00
Joshua Chin	98cbef4ecf	ninja.py is now pep8 compliant Former-commit-id: `accb7e398c`	2015-07-20 11:06:58 -04:00
Joshua Chin	3b6b8d3ab1	made single line docstring single line Former-commit-id: `c70ddf00ea`	2015-07-20 10:29:02 -04:00
Joshua Chin	532b953839	updated word_frequency docstring for Chinese Former-commit-id: `01b286e801`	2015-07-20 10:28:11 -04:00
Joshua Chin	360f66bbaf	updated datafiles Former-commit-id: `465afb854c`	2015-07-20 10:05:27 -04:00
Joshua Chin	44669bd3a9	fixed build Former-commit-id: `221acf7921`	2015-07-17 17:44:01 -04:00
Robyn Speer	ea2c6adbc4	mention the Wikipedia data, and credit Hermit Dave Former-commit-id: `2d1020daac`	2015-07-17 17:09:36 -04:00
Joshua Chin	ec871bb6ca	fixed tokenize_twitter Former-commit-id: `f31f9a1bcd`	2015-07-17 16:37:47 -04:00
Joshua Chin	71ff0c62d6	added cld2 tokenizer comments Former-commit-id: `a44927e98e`	2015-07-17 16:03:33 -04:00

427 Commits