wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-24 09:51:38 +00:00

Author	SHA1	Message	Date
Rob Speer	bc323eccaf	update comments in wordfreq_builder.config; remove unused 'version'	2015-09-08 16:15:29 -04:00
Rob Speer	0ab23f8a28	sort Jieba wordlists consistently; update data files	2015-09-08 16:09:53 -04:00
Rob Speer	bc8ebd23e9	don't do language-specific tokenization in freqs_to_cBpack. Tokenizing in the 'merge' step is sufficient.	2015-09-08 14:46:04 -04:00
Rob Speer	715361ca0d	actually fix logic of apostrophe-fixing	2015-09-08 13:50:34 -04:00
Rob Speer	c4c1af8213	fix logic of apostrophe-fixing	2015-09-08 13:47:58 -04:00
Rob Speer	912171f8e7	fix '--language' option definition	2015-09-08 13:27:20 -04:00
Rob Speer	77a9b5c55b	Avoid Chinese tokenizer when building	2015-09-08 12:59:03 -04:00
Rob Speer	9071defb33	language-specific frequency reading; fix 't in English	2015-09-08 12:49:21 -04:00
Rob Speer	20f2828d0a	Merge branch 'apostrophe-fix' into chinese-scripts. Conflicts: wordfreq_builder/wordfreq_builder/word_counts.py	2015-09-08 12:29:00 -04:00
Rob Speer	e39d345c4b	WIP: fix apostrophe trimming	2015-09-08 12:28:28 -04:00
Rob Speer	2327f2e4d6	tokenize Chinese using jieba and our own frequencies	2015-09-05 03:16:56 -04:00
Rob Speer	7906a671ea	WIP: Traditional Chinese	2015-09-04 18:52:37 -04:00
Rob Speer	447d7e5134	add Polish and Swedish, which have sufficient data	2015-09-04 17:10:40 -04:00
Rob Speer	5c7a7ea83e	We can put the cutoff back now. I took it out when a step in the English SUBTLEX process was outputting frequencies instead of counts, but I've fixed that now.	2015-09-04 16:16:52 -04:00
Rob Speer	56318a3ca3	remove subtlex-gr from README	2015-09-04 16:11:46 -04:00
Rob Speer	77c60c29b0	Use SUBTLEX for German, but OpenSubtitles for Greek. In German and Greek, SUBTLEX and Hermit Dave turn out to have been working from the same source data. I looked at the quality of how they processed the data, and chose SUBTLEX for German, and Dave's wordlist for Greek.	2015-09-04 15:52:21 -04:00
Rob Speer	0d3ee869c1	Exclude angle brackets from CLD2 detection	2015-09-04 14:56:06 -04:00
Rob Speer	34474939f2	add more SUBTLEX and fix its build rules	2015-09-04 12:37:35 -04:00
Rob Speer	531db64288	Note on next languages to support	2015-09-04 01:50:15 -04:00
Rob Speer	d94428d454	support Turkish and more Greek; document more	2015-09-04 00:57:04 -04:00
Rob Speer	45d871a815	Merge branch 'add-subtlex' into greek-and-turkish	2015-09-03 23:26:14 -04:00
Rob Speer	40d82541ba	refer to merge_freqs command correctly	2015-09-03 23:25:46 -04:00
Rob Speer	a3daba81eb	expand Greek and enable Turkish in config	2015-09-03 23:23:31 -04:00
Rob Speer	2d58ba94f2	Add SUBTLEX as a source of English and Chinese data. Meanwhile, fix up the dependency graph thingy. It's actually kind of legible now.	2015-09-03 18:13:13 -04:00
Rob Speer	5def3a7897	update the build diagram and its script	2015-08-28 17:47:04 -04:00
Rob Speer	c4a2594217	fix URL expression	2015-08-26 15:00:46 -04:00
Rob Speer	a893823d6e	un-flake wordfreq_builder.tokenizers, and edit docstrings	2015-08-26 13:03:23 -04:00
Rob Speer	5a1fc00aaa	Strip apostrophes from edges of tokens. The issue here is that if you had French text with an apostrophe, such as "d'un", it would split it into "d'" and "un", but if "d'" were re-tokenized it would come out as "d". Stripping apostrophes makes the process more idempotent.	2015-08-25 12:41:48 -04:00
Rob Speer	de73888a76	use better regexes in wordfreq_builder tokenizer	2015-08-24 19:05:46 -04:00
Rob Speer	140ca6c050	remove Hangul fillers that confuse cld2	2015-08-24 17:11:18 -04:00
Andrew Lin	6d40912ef9	Stylistic cleanups to word_counts.py.	2015-07-31 19:26:18 -04:00
Andrew Lin	53621c34df	Remove redundant reference to wikipedia in builder README.	2015-07-31 19:12:59 -04:00
Rob Speer	e9f9c94e36	Don't use the file-reading cutoff when writing centibels	2015-07-28 18:45:26 -04:00
Rob Speer	c5708b24e4	put back the freqs_to_cBpack cutoff; prepare for 1.0	2015-07-28 18:01:12 -04:00
Rob Speer	32102ba3c2	Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17: Code review fixes 2015 07 17	2015-07-22 15:09:00 -04:00
Joshua Chin	93cd902899	updated read_freqs docs	2015-07-22 10:06:16 -04:00
Joshua Chin	4fe9d110e1	fixed style	2015-07-22 10:05:11 -04:00
Joshua Chin	6453d864c4	reordered command line args	2015-07-22 10:04:14 -04:00
Joshua Chin	8081145922	bugfix	2015-07-21 10:12:56 -04:00
Joshua Chin	c5f82ecac1	fixed rules.ninja	2015-07-20 17:20:29 -04:00
Joshua Chin	643571c69c	fixed build bug	2015-07-20 16:51:25 -04:00
Joshua Chin	173278fdd3	ensure removal of tatweels (hopefully)	2015-07-20 16:48:36 -04:00
Joshua Chin	298d3c1d24	unhoisted if statement	2015-07-20 11:10:41 -04:00
Joshua Chin	accb7e398c	ninja.py is now pep8 compliant	2015-07-20 11:06:58 -04:00
Joshua Chin	221acf7921	fixed build	2015-07-17 17:44:01 -04:00
Rob Speer	2d1020daac	mention the Wikipedia data, and credit Hermit Dave	2015-07-17 17:09:36 -04:00
Joshua Chin	f31f9a1bcd	fixed tokenize_twitter	2015-07-17 16:37:47 -04:00
Joshua Chin	a44927e98e	added cld2 tokenizer comments	2015-07-17 16:03:33 -04:00
Joshua Chin	11a1c51321	fix arabic tokens	2015-07-17 15:52:12 -04:00
Joshua Chin	c75c735d8d	fixed syntax	2015-07-17 15:43:24 -04:00

138 Commits