wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-26 02:28:50 +00:00

Author	SHA1	Message	Date
Rob Speer	c11e3b7a9d	update the data	2015-09-04 02:07:50 -04:00
Rob Speer	531db64288	Note on next languages to support	2015-09-04 01:50:15 -04:00
Rob Speer	d9a1c34d00	expand list of sources and supported languages	2015-09-04 01:03:36 -04:00
Rob Speer	d94428d454	support Turkish and more Greek; document more	2015-09-04 00:57:04 -04:00
Rob Speer	45d871a815	Merge branch 'add-subtlex' into greek-and-turkish	2015-09-03 23:26:14 -04:00
Rob Speer	40d82541ba	refer to merge_freqs command correctly	2015-09-03 23:25:46 -04:00
Rob Speer	a3daba81eb	expand Greek and enable Turkish in config	2015-09-03 23:23:31 -04:00
Rob Speer	e6a2886a66	add SUBTLEX to the readme	2015-09-03 18:56:56 -04:00
Rob Speer	2d58ba94f2	Add SUBTLEX as a source of English and Chinese data. Meanwhile, fix up the dependency graph thingy. It's actually kind of legible now.	2015-09-03 18:13:13 -04:00
Andrew Lin	e43b5ebf7b	Merge pull request #23 from LuminosoInsight/readme: Put documentation and examples in the README	2015-08-28 17:59:17 -04:00
Rob Speer	00a2812907	fix heading	2015-08-28 17:49:38 -04:00
Rob Speer	93f44683c5	fix list formatting	2015-08-28 17:49:07 -04:00
Rob Speer	5def3a7897	update the build diagram and its script	2015-08-28 17:47:04 -04:00
Rob Speer	2370287539	improve README with function documentation and examples	2015-08-28 17:45:50 -04:00
Andrew Lin	e6d9b36203	Merge pull request #22 from LuminosoInsight/standard-tokenizer: Use a more standard Unicode tokenizer	2015-08-27 11:56:19 -04:00
Rob Speer	b952676679	update data files	2015-08-27 03:58:54 -04:00
Rob Speer	d5fcf4407e	copyedit regex comments	2015-08-26 17:04:56 -04:00
Rob Speer	34375958ef	fix typo in docstring	2015-08-26 16:24:35 -04:00
Rob Speer	c4a2594217	fix URL expression	2015-08-26 15:00:46 -04:00
Rob Speer	f7babea352	correct the simple_tokenize docstring	2015-08-26 13:54:50 -04:00
Rob Speer	01b6403ef4	refactor the token expression	2015-08-26 13:40:47 -04:00
Rob Speer	a893823d6e	un-flake wordfreq_builder.tokenizers, and edit docstrings	2015-08-26 13:03:23 -04:00
Rob Speer	94467a6563	remove regex files that are no longer needed	2015-08-26 11:48:11 -04:00
Rob Speer	694c28d5e4	bump to version 1.1	2015-08-25 17:44:52 -04:00
Rob Speer	573dd1ec79	update the README	2015-08-25 17:44:34 -04:00
Rob Speer	353b8045da	updated data	2015-08-25 17:16:03 -04:00
Rob Speer	5a1fc00aaa	Strip apostrophes from edges of tokens. The issue here is that if you had French text with an apostrophe, such as "d'un", it would split it into "d'" and "un", but if "d'" were re-tokenized it would come out as "d". Stripping apostrophes makes the process more idempotent.	2015-08-25 12:41:48 -04:00
Rob Speer	a8e7c29068	exclude 'extenders' from the start of the token	2015-08-25 12:33:12 -04:00
Rob Speer	0d600bdf27	update frequency lists	2015-08-25 11:43:59 -04:00
Rob Speer	8f3c9f576c	Exclude math and modifier symbols as tokens	2015-08-25 11:43:22 -04:00
Rob Speer	de73888a76	use better regexes in wordfreq_builder tokenizer	2015-08-24 19:05:46 -04:00
Rob Speer	554455699d	also NFKC-normalize Japanese input	2015-08-24 18:13:03 -04:00
Rob Speer	1d055edc1c	only NFKC-normalize in Arabic	2015-08-24 17:55:17 -04:00
Rob Speer	140ca6c050	remove Hangul fillers that confuse cld2	2015-08-24 17:11:18 -04:00
Rob Speer	102bc715ae	remove obsolete gen_regex.py	2015-08-24 17:11:18 -04:00
Rob Speer	95998205ad	Use the regex implementation of Unicode segmentation	2015-08-24 17:11:08 -04:00
Rob Speer	2b8089e2b1	Merge pull request #21 from LuminosoInsight/review-notes: Review notes	2015-08-03 14:48:15 -04:00
Andrew Lin	41e1dd41d8	Document the NFKC-normalized ligature in the Arabic test.	2015-08-03 11:09:44 -04:00
Andrew Lin	6d40912ef9	Stylistic cleanups to word_counts.py.	2015-07-31 19:26:18 -04:00
Andrew Lin	66c69e6fac	Switch to more explanatory Unicode escapes when testing NFKC normalization.	2015-07-31 19:23:42 -04:00
Andrew Lin	53621c34df	Remove redundant reference to wikipedia in builder README.	2015-07-31 19:12:59 -04:00
Andrew Lin	742e2b3374	Merge pull request #20 from LuminosoInsight/cutoff-fix: put back the freqs_to_cBpack cutoff; prepare for 1.0	2015-07-29 11:43:41 -04:00
Rob Speer	e9f9c94e36	Don't use the file-reading cutoff when writing centibels	2015-07-28 18:45:26 -04:00
Rob Speer	eb4b3cad50	update wordlists with cutoff fix	2015-07-28 18:03:12 -04:00
Rob Speer	c5708b24e4	put back the freqs_to_cBpack cutoff; prepare for 1.0	2015-07-28 18:01:12 -04:00
Rob Speer	32102ba3c2	Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17: Code review fixes 2015 07 17	2015-07-22 15:09:00 -04:00
Joshua Chin	93cd902899	updated read_freqs docs	2015-07-22 10:06:16 -04:00
Joshua Chin	4fe9d110e1	fixed style	2015-07-22 10:05:11 -04:00
Joshua Chin	6453d864c4	reordered command line args	2015-07-22 10:04:14 -04:00
Joshua Chin	be29243cec	added updated wordfreq data	2015-07-21 10:32:53 -04:00

340 Commits