wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-24 01:41:39 +00:00

Author	SHA1	Message	Date
Robyn Speer	64b0b76ee1	actually fix logic of apostrophe-fixing Former-commit-id: `715361ca0d`	2015-09-08 13:50:34 -04:00
Robyn Speer	d6d2eac920	fix logic of apostrophe-fixing Former-commit-id: `c4c1af8213`	2015-09-08 13:47:58 -04:00
Robyn Speer	523806d6db	fix '--language' option definition Former-commit-id: `912171f8e7`	2015-09-08 13:27:20 -04:00
Robyn Speer	099d90b700	Avoid Chinese tokenizer when building Former-commit-id: `77a9b5c55b`	2015-09-08 12:59:03 -04:00
Robyn Speer	3fa14ded28	language-specific frequency reading; fix 't in English Former-commit-id: `9071defb33`	2015-09-08 12:49:21 -04:00
Robyn Speer	1b35ff6b4c	Merge branch 'apostrophe-fix' into chinese-scripts. Conflicts: wordfreq_builder/wordfreq_builder/word_counts.py Former-commit-id: `20f2828d0a`	2015-09-08 12:29:00 -04:00
Robyn Speer	319c3abaab	WIP: fix apostrophe trimming Former-commit-id: `e39d345c4b`	2015-09-08 12:28:28 -04:00
Robyn Speer	a4554fb87c	tokenize Chinese using jieba and our own frequencies Former-commit-id: `2327f2e4d6`	2015-09-05 03:16:56 -04:00
Robyn Speer	7d1c2e72e4	WIP: Traditional Chinese Former-commit-id: `7906a671ea`	2015-09-04 18:52:37 -04:00
Robyn Speer	5b9b2d2d02	add Polish and Swedish, which have sufficient data Former-commit-id: `447d7e5134`	2015-09-04 17:10:40 -04:00
Robyn Speer	a75a95658b	We can put the cutoff back now. I took it out when a step in the English SUBTLEX process was outputting frequencies instead of counts, but I've fixed that now. Former-commit-id: `5c7a7ea83e`	2015-09-04 16:16:52 -04:00
Robyn Speer	f330d6d130	remove subtlex-gr from README Former-commit-id: `56318a3ca3`	2015-09-04 16:11:46 -04:00
Robyn Speer	8277b34571	Use SUBTLEX for German, but OpenSubtitles for Greek. In German and Greek, SUBTLEX and Hermit Dave turn out to have been working from the same source data. I looked at the quality of how they processed the data, and chose SUBTLEX for German, and Dave's wordlist for Greek. Former-commit-id: `77c60c29b0`	2015-09-04 15:52:21 -04:00
Robyn Speer	a69b66b210	Exclude angle brackets from CLD2 detection Former-commit-id: `0d3ee869c1`	2015-09-04 14:56:06 -04:00
Robyn Speer	d0ada70355	add more SUBTLEX and fix its build rules Former-commit-id: `34474939f2`	2015-09-04 12:37:35 -04:00
Robyn Speer	14136d2a01	Note on next languages to support Former-commit-id: `531db64288`	2015-09-04 01:50:15 -04:00
Robyn Speer	574c383202	support Turkish and more Greek; document more Former-commit-id: `d94428d454`	2015-09-04 00:57:04 -04:00
Robyn Speer	f168c37417	Merge branch 'add-subtlex' into greek-and-turkish Former-commit-id: `45d871a815`	2015-09-03 23:26:14 -04:00
Robyn Speer	76c751652e	refer to merge_freqs command correctly Former-commit-id: `40d82541ba`	2015-09-03 23:25:46 -04:00
Robyn Speer	3446a393c5	expand Greek and enable Turkish in config Former-commit-id: `a3daba81eb`	2015-09-03 23:23:31 -04:00
Robyn Speer	f66d03b1b9	Add SUBTLEX as a source of English and Chinese data. Meanwhile, fix up the dependency graph thingy. It's actually kind of legible now. Former-commit-id: `2d58ba94f2`	2015-09-03 18:13:13 -04:00
Robyn Speer	247d7c6579	update the build diagram and its script Former-commit-id: `5def3a7897`	2015-08-28 17:47:04 -04:00
Robyn Speer	af29fc4f88	fix URL expression Former-commit-id: `c4a2594217`	2015-08-26 15:00:46 -04:00
Robyn Speer	3a140ee02f	un-flake wordfreq_builder.tokenizers, and edit docstrings Former-commit-id: `a893823d6e`	2015-08-26 13:03:23 -04:00
Robyn Speer	b22a4b0f02	Strip apostrophes from edges of tokens. The issue here is that if you had French text with an apostrophe, such as "d'un", it would split it into "d'" and "un", but if "d'" were re-tokenized it would come out as "d". Stripping apostrophes makes the process more idempotent. Former-commit-id: `5a1fc00aaa`	2015-08-25 12:41:48 -04:00
Robyn Speer	8637aaef9e	use better regexes in wordfreq_builder tokenizer Former-commit-id: `de73888a76`	2015-08-24 19:05:46 -04:00
Robyn Speer	4ec128adae	remove Hangul fillers that confuse cld2 Former-commit-id: `140ca6c050`	2015-08-24 17:11:18 -04:00
Andrew Lin	77610f57e1	Stylistic cleanups to word_counts.py. Former-commit-id: `6d40912ef9`	2015-07-31 19:26:18 -04:00
Andrew Lin	0711fb3c43	Remove redundant reference to wikipedia in builder README. Former-commit-id: `53621c34df`	2015-07-31 19:12:59 -04:00
Robyn Speer	e9dd253f1d	Don't use the file-reading cutoff when writing centibels Former-commit-id: `e9f9c94e36`	2015-07-28 18:45:26 -04:00
Robyn Speer	3ff0f30218	put back the freqs_to_cBpack cutoff; prepare for 1.0 Former-commit-id: `c5708b24e4`	2015-07-28 18:01:12 -04:00
Robyn Speer	33e0493fd5	Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17 Code review fixes 2015 07 17 Former-commit-id: `32102ba3c2`	2015-07-22 15:09:00 -04:00
Joshua Chin	292fc96142	updated read_freqs docs Former-commit-id: `93cd902899`	2015-07-22 10:06:16 -04:00
Joshua Chin	d629e8b6cc	fixed style Former-commit-id: `4fe9d110e1`	2015-07-22 10:05:11 -04:00
Joshua Chin	f9742c94ca	reordered command line args Former-commit-id: `6453d864c4`	2015-07-22 10:04:14 -04:00
Joshua Chin	474ae0da35	bugfix Former-commit-id: `8081145922`	2015-07-21 10:12:56 -04:00
Joshua Chin	34504eed80	fixed rules.ninja Former-commit-id: `c5f82ecac1`	2015-07-20 17:20:29 -04:00
Joshua Chin	61a03b87bc	fixed build bug Former-commit-id: `643571c69c`	2015-07-20 16:51:25 -04:00
Joshua Chin	af8050f1b8	ensure removal of tatweels (hopefully) Former-commit-id: `173278fdd3`	2015-07-20 16:48:36 -04:00
Joshua Chin	675a02ac11	unhoisted if statement Former-commit-id: `298d3c1d24`	2015-07-20 11:10:41 -04:00
Joshua Chin	98cbef4ecf	ninja.py is now pep8 compliant Former-commit-id: `accb7e398c`	2015-07-20 11:06:58 -04:00
Joshua Chin	44669bd3a9	fixed build Former-commit-id: `221acf7921`	2015-07-17 17:44:01 -04:00
Robyn Speer	ea2c6adbc4	mention the Wikipedia data, and credit Hermit Dave Former-commit-id: `2d1020daac`	2015-07-17 17:09:36 -04:00
Joshua Chin	ec871bb6ca	fixed tokenize_twitter Former-commit-id: `f31f9a1bcd`	2015-07-17 16:37:47 -04:00
Joshua Chin	71ff0c62d6	added cld2 tokenizer comments Former-commit-id: `a44927e98e`	2015-07-17 16:03:33 -04:00
Joshua Chin	c2f3928433	fix arabic tokens Former-commit-id: `11a1c51321`	2015-07-17 15:52:12 -04:00
Joshua Chin	d283183743	fixed syntax Former-commit-id: `c75c735d8d`	2015-07-17 15:43:24 -04:00
Joshua Chin	3962b475c1	renamed tokenize file to tokenize twitter Former-commit-id: `303bd88ba2`	2015-07-17 15:27:26 -04:00
Joshua Chin	2f73cc535c	created last_tab flag Former-commit-id: `d6519cf736`	2015-07-17 15:19:09 -04:00
Joshua Chin	4117480a0e	removed unnecessary if statement Former-commit-id: `620becb7e8`	2015-07-17 15:14:06 -04:00

135 Commits
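
The message for commit b22a4b0f02 describes a tokenizer that is not idempotent: "d'un" splits into "d'" and "un", but re-tokenizing "d'" yields "d". The sketch below illustrates that behavior with a toy regex tokenizer. It is not wordfreq's actual code: `naive_tokenize`, `stripped_tokenize`, `retokenize`, and the regex are hypothetical names and patterns chosen only to reproduce the described symptom and the apostrophe-stripping fix.

```python
import re

# Hypothetical pattern: keep a trailing apostrophe only when a letter follows it.
# This reproduces the symptom described in the commit message; it is NOT the
# regex used by wordfreq_builder.
TOKEN_RE = re.compile(r"\w+'(?=\w)|\w+")


def naive_tokenize(text):
    """Toy tokenizer that can leave an apostrophe on the edge of a token."""
    return TOKEN_RE.findall(text.lower())


def stripped_tokenize(text):
    """Same toy tokenizer, but with apostrophes stripped from token edges."""
    tokens = (tok.strip("'") for tok in naive_tokenize(text))
    return [tok for tok in tokens if tok]


def retokenize(tokenizer, tokens):
    """Run the tokenizer again on each token it produced."""
    return [piece for tok in tokens for piece in tokenizer(tok)]


first_pass = naive_tokenize("d'un")
# The second pass changes the tokens, so the naive version is not idempotent.
naive_is_idempotent = retokenize(naive_tokenize, first_pass) == first_pass

fixed_pass = stripped_tokenize("d'un")
# After stripping edge apostrophes, a second pass leaves the tokens unchanged.
fixed_is_idempotent = retokenize(stripped_tokenize, fixed_pass) == fixed_pass
```

Under these assumptions, the naive tokenizer's output changes on a second pass, while the stripped version is stable, which is the property the commit message calls "more idempotent".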