wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-24 01:41:39 +00:00

Author	SHA1	Message	Date
Rob Speer	fc93c8dc9c	add tests for Turkish	2015-09-04 17:00:05 -04:00
Rob Speer	5c7a7ea83e	We can put the cutoff back now. I took it out when a step in the English SUBTLEX process was outputting frequencies instead of counts, but I've fixed that now.	2015-09-04 16:16:52 -04:00
Rob Speer	56318a3ca3	remove subtlex-gr from README	2015-09-04 16:11:46 -04:00
Rob Speer	8196643509	add more citations	2015-09-04 15:57:40 -04:00
Rob Speer	77c60c29b0	Use SUBTLEX for German, but OpenSubtitles for Greek. In German and Greek, SUBTLEX and Hermit Dave turn out to have been working from the same source data. I looked at the quality of how they processed the data, and chose SUBTLEX for German, and Dave's wordlist for Greek.	2015-09-04 15:52:21 -04:00
Rob Speer	a47497c908	update data files (without the CLD2 fix yet)	2015-09-04 14:58:20 -04:00
Rob Speer	0d3ee869c1	Exclude angle brackets from CLD2 detection	2015-09-04 14:56:06 -04:00
Rob Speer	81bbe663fb	update README with additional SUBTLEX support	2015-09-04 13:23:33 -04:00
Rob Speer	34474939f2	add more SUBTLEX and fix its build rules	2015-09-04 12:37:35 -04:00
Rob Speer	c11e3b7a9d	update the data	2015-09-04 02:07:50 -04:00
Rob Speer	531db64288	Note on next languages to support	2015-09-04 01:50:15 -04:00
Rob Speer	d9a1c34d00	expand list of sources and supported languages	2015-09-04 01:03:36 -04:00
Rob Speer	d94428d454	support Turkish and more Greek; document more	2015-09-04 00:57:04 -04:00
Rob Speer	45d871a815	Merge branch 'add-subtlex' into greek-and-turkish	2015-09-03 23:26:14 -04:00
Rob Speer	40d82541ba	refer to merge_freqs command correctly	2015-09-03 23:25:46 -04:00
Rob Speer	a3daba81eb	expand Greek and enable Turkish in config	2015-09-03 23:23:31 -04:00
Rob Speer	e6a2886a66	add SUBTLEX to the readme	2015-09-03 18:56:56 -04:00
Rob Speer	2d58ba94f2	Add SUBTLEX as a source of English and Chinese data. Meanwhile, fix up the dependency graph thingy. It's actually kind of legible now.	2015-09-03 18:13:13 -04:00
Andrew Lin	e43b5ebf7b	Merge pull request #23 from LuminosoInsight/readme: Put documentation and examples in the README	2015-08-28 17:59:17 -04:00
Rob Speer	00a2812907	fix heading	2015-08-28 17:49:38 -04:00
Rob Speer	93f44683c5	fix list formatting	2015-08-28 17:49:07 -04:00
Rob Speer	5def3a7897	update the build diagram and its script	2015-08-28 17:47:04 -04:00
Rob Speer	2370287539	improve README with function documentation and examples	2015-08-28 17:45:50 -04:00
Andrew Lin	e6d9b36203	Merge pull request #22 from LuminosoInsight/standard-tokenizer: Use a more standard Unicode tokenizer	2015-08-27 11:56:19 -04:00
Rob Speer	b952676679	update data files	2015-08-27 03:58:54 -04:00
Rob Speer	d5fcf4407e	copyedit regex comments	2015-08-26 17:04:56 -04:00
Rob Speer	34375958ef	fix typo in docstring	2015-08-26 16:24:35 -04:00
Rob Speer	c4a2594217	fix URL expression	2015-08-26 15:00:46 -04:00
Rob Speer	f7babea352	correct the simple_tokenize docstring	2015-08-26 13:54:50 -04:00
Rob Speer	01b6403ef4	refactor the token expression	2015-08-26 13:40:47 -04:00
Rob Speer	a893823d6e	un-flake wordfreq_builder.tokenizers, and edit docstrings	2015-08-26 13:03:23 -04:00
Rob Speer	94467a6563	remove regex files that are no longer needed	2015-08-26 11:48:11 -04:00
Rob Speer	694c28d5e4	bump to version 1.1	2015-08-25 17:44:52 -04:00
Rob Speer	573dd1ec79	update the README	2015-08-25 17:44:34 -04:00
Rob Speer	353b8045da	updated data	2015-08-25 17:16:03 -04:00
Rob Speer	5a1fc00aaa	Strip apostrophes from edges of tokens. The issue here is that if you had French text with an apostrophe, such as "d'un", it would split it into "d'" and "un", but if "d'" were re-tokenized it would come out as "d". Stripping apostrophes makes the process more idempotent.	2015-08-25 12:41:48 -04:00
Rob Speer	a8e7c29068	exclude 'extenders' from the start of the token	2015-08-25 12:33:12 -04:00
Rob Speer	0d600bdf27	update frequency lists	2015-08-25 11:43:59 -04:00
Rob Speer	8f3c9f576c	Exclude math and modifier symbols as tokens	2015-08-25 11:43:22 -04:00
Rob Speer	de73888a76	use better regexes in wordfreq_builder tokenizer	2015-08-24 19:05:46 -04:00
Rob Speer	554455699d	also NFKC-normalize Japanese input	2015-08-24 18:13:03 -04:00
Rob Speer	1d055edc1c	only NFKC-normalize in Arabic	2015-08-24 17:55:17 -04:00
Rob Speer	140ca6c050	remove Hangul fillers that confuse cld2	2015-08-24 17:11:18 -04:00
Rob Speer	102bc715ae	remove obsolete gen_regex.py	2015-08-24 17:11:18 -04:00
Rob Speer	95998205ad	Use the regex implementation of Unicode segmentation	2015-08-24 17:11:08 -04:00
Rob Speer	2b8089e2b1	Merge pull request #21 from LuminosoInsight/review-notes: Review notes	2015-08-03 14:48:15 -04:00
Andrew Lin	41e1dd41d8	Document the NFKC-normalized ligature in the Arabic test.	2015-08-03 11:09:44 -04:00
Andrew Lin	6d40912ef9	Stylistic cleanups to word_counts.py.	2015-07-31 19:26:18 -04:00
Andrew Lin	66c69e6fac	Switch to more explanatory Unicode escapes when testing NFKC normalization.	2015-07-31 19:23:42 -04:00
Andrew Lin	53621c34df	Remove redundant reference to wikipedia in builder README.	2015-07-31 19:12:59 -04:00

349 Commits