wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-25 18:18:53 +00:00

Author	SHA1	Message	Date
Rob Speer	0441a81bbe	We can put the cutoff back now. I took it out when a step in the English SUBTLEX process was outputting frequencies instead of counts, but I've fixed that now. Former-commit-id: `5c7a7ea83e`	2015-09-04 16:16:52 -04:00
Rob Speer	917ce398a2	remove subtlex-gr from README Former-commit-id: `56318a3ca3`	2015-09-04 16:11:46 -04:00
Rob Speer	c08e593234	Use SUBTLEX for German, but OpenSubtitles for Greek. In German and Greek, SUBTLEX and Hermit Dave turn out to have been working from the same source data. I looked at the quality of how they processed the data, and chose SUBTLEX for German, and Dave's wordlist for Greek. Former-commit-id: `77c60c29b0`	2015-09-04 15:52:21 -04:00
Rob Speer	3a8b2c2c81	Exclude angle brackets from CLD2 detection Former-commit-id: `0d3ee869c1`	2015-09-04 14:56:06 -04:00
Rob Speer	b1d158ab41	add more SUBTLEX and fix its build rules Former-commit-id: `34474939f2`	2015-09-04 12:37:35 -04:00
Rob Speer	25e24f9c32	Note on next languages to support Former-commit-id: `531db64288`	2015-09-04 01:50:15 -04:00
Rob Speer	a6ef3224a6	support Turkish and more Greek; document more Former-commit-id: `d94428d454`	2015-09-04 00:57:04 -04:00
Rob Speer	89763679de	Merge branch 'add-subtlex' into greek-and-turkish Former-commit-id: `45d871a815`	2015-09-03 23:26:14 -04:00
Rob Speer	ad4b12bee9	refer to merge_freqs command correctly Former-commit-id: `40d82541ba`	2015-09-03 23:25:46 -04:00
Rob Speer	7a2f2035ab	expand Greek and enable Turkish in config Former-commit-id: `a3daba81eb`	2015-09-03 23:23:31 -04:00
Rob Speer	cb5b696ffa	Add SUBTLEX as a source of English and Chinese data. Meanwhile, fix up the dependency graph thingy. It's actually kind of legible now. Former-commit-id: `2d58ba94f2`	2015-09-03 18:13:13 -04:00
Rob Speer	4aac7bdd65	update the build diagram and its script Former-commit-id: `5def3a7897`	2015-08-28 17:47:04 -04:00
Rob Speer	49bd631632	fix URL expression Former-commit-id: `c4a2594217`	2015-08-26 15:00:46 -04:00
Rob Speer	40d6b85d67	un-flake wordfreq_builder.tokenizers, and edit docstrings Former-commit-id: `a893823d6e`	2015-08-26 13:03:23 -04:00
Rob Speer	a3b37f6619	Strip apostrophes from edges of tokens. The issue here is that if you had French text with an apostrophe, such as "d'un", it would split it into "d'" and "un", but if "d'" were re-tokenized it would come out as "d". Stripping apostrophes makes the process more idempotent. Former-commit-id: `5a1fc00aaa`	2015-08-25 12:41:48 -04:00
Rob Speer	6647cf9035	use better regexes in wordfreq_builder tokenizer Former-commit-id: `de73888a76`	2015-08-24 19:05:46 -04:00
Rob Speer	6a33b46cfd	remove Hangul fillers that confuse cld2 Former-commit-id: `140ca6c050`	2015-08-24 17:11:18 -04:00
Andrew Lin	581dcbcae5	Stylistic cleanups to word_counts.py. Former-commit-id: `6d40912ef9`	2015-07-31 19:26:18 -04:00
Andrew Lin	f393086253	Remove redundant reference to wikipedia in builder README. Former-commit-id: `53621c34df`	2015-07-31 19:12:59 -04:00
Rob Speer	0f0aca8320	Don't use the file-reading cutoff when writing centibels Former-commit-id: `e9f9c94e36`	2015-07-28 18:45:26 -04:00
Rob Speer	4350bc3ed7	put back the freqs_to_cBpack cutoff; prepare for 1.0 Former-commit-id: `c5708b24e4`	2015-07-28 18:01:12 -04:00
Rob Speer	b537f4ecfb	Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17: Code review fixes 2015 07 17 Former-commit-id: `32102ba3c2`	2015-07-22 15:09:00 -04:00
Joshua Chin	8004ecb790	updated read_freqs docs Former-commit-id: `93cd902899`	2015-07-22 10:06:16 -04:00
Joshua Chin	0d8bf35fab	fixed style Former-commit-id: `4fe9d110e1`	2015-07-22 10:05:11 -04:00
Joshua Chin	78324e74eb	reordered command line args Former-commit-id: `6453d864c4`	2015-07-22 10:04:14 -04:00
Joshua Chin	6f47f76458	bugfix Former-commit-id: `8081145922`	2015-07-21 10:12:56 -04:00
Joshua Chin	0a2f2877af	fixed rules.ninja Former-commit-id: `c5f82ecac1`	2015-07-20 17:20:29 -04:00
Joshua Chin	c1f56f5c96	fixed build bug Former-commit-id: `643571c69c`	2015-07-20 16:51:25 -04:00
Joshua Chin	423b2d8443	ensure removal of tatweels (hopefully) Former-commit-id: `173278fdd3`	2015-07-20 16:48:36 -04:00
Joshua Chin	efe7bc3720	unhoisted if statement Former-commit-id: `298d3c1d24`	2015-07-20 11:10:41 -04:00
Joshua Chin	b5a358012b	ninja.py is now pep8 compliant Former-commit-id: `accb7e398c`	2015-07-20 11:06:58 -04:00
Joshua Chin	a3880608b9	fixed build Former-commit-id: `221acf7921`	2015-07-17 17:44:01 -04:00
Rob Speer	176223bd5d	mention the Wikipedia data, and credit Hermit Dave Former-commit-id: `2d1020daac`	2015-07-17 17:09:36 -04:00
Joshua Chin	c3a14a8a09	fixed tokenize_twitter Former-commit-id: `f31f9a1bcd`	2015-07-17 16:37:47 -04:00
Joshua Chin	af73f813be	added cld2 tokenizer comments Former-commit-id: `a44927e98e`	2015-07-17 16:03:33 -04:00
Joshua Chin	5c7e0dd0dd	fix arabic tokens Former-commit-id: `11a1c51321`	2015-07-17 15:52:12 -04:00
Joshua Chin	a868c99839	fixed syntax Former-commit-id: `c75c735d8d`	2015-07-17 15:43:24 -04:00
Joshua Chin	f2546d8d33	renamed tokenize file to tokenize twitter Former-commit-id: `303bd88ba2`	2015-07-17 15:27:26 -04:00
Joshua Chin	d3a5191fb0	created last_tab flag Former-commit-id: `d6519cf736`	2015-07-17 15:19:09 -04:00
Joshua Chin	4b81f8c938	removed unnecessary if statement Former-commit-id: `620becb7e8`	2015-07-17 15:14:06 -04:00
Joshua Chin	9812a2a08c	generated freq dict in place Former-commit-id: `d988b1b42e`	2015-07-17 15:13:25 -04:00
Joshua Chin	53dd1e91c5	corrected docstring Former-commit-id: `e37c689031`	2015-07-17 15:12:23 -04:00
Joshua Chin	bb706b65f4	removed unnecessary strip Former-commit-id: `002351bace`	2015-07-17 15:11:28 -04:00
Joshua Chin	919f2f5912	moved last_tab to tokenize_twitter Former-commit-id: `7fc23666a9`	2015-07-17 15:10:17 -04:00
Joshua Chin	4e87458242	removed unused function Former-commit-id: `528285a982`	2015-07-17 15:03:14 -04:00
Joshua Chin	8dd4ffee8a	fixed spacing Former-commit-id: `59d3c72758`	2015-07-17 15:02:34 -04:00
Joshua Chin	09dff0186c	removed unnecessary format Former-commit-id: `10028be212`	2015-07-17 15:01:25 -04:00
Joshua Chin	117e06d5a4	cleaned up BAD_CHAR_RANGE Former-commit-id: `3b368b66dd`	2015-07-17 15:00:59 -04:00
Joshua Chin	cc2f748b05	moved test tokenizers Former-commit-id: `c2d1cdcb31`	2015-07-17 14:58:58 -04:00
Joshua Chin	2180f71296	added docstring and moved to scripts Former-commit-id: `5d26c9f57f`	2015-07-17 14:56:18 -04:00
Joshua Chin	2335369f86	style changes Former-commit-id: `bdc791af8f`	2015-07-17 14:54:32 -04:00
Joshua Chin	6083219fe5	removed bad comment Former-commit-id: `4d5ec57144`	2015-07-17 14:54:09 -04:00
Joshua Chin	4fa4060036	removed unused scripts Former-commit-id: `39f01b0485`	2015-07-17 14:53:18 -04:00
Joshua Chin	631a5f1b71	removed mkdir -p for many cases Former-commit-id: `98a7a8093b`	2015-07-17 14:45:22 -04:00
Joshua Chin	bc4cedf85a	removed TOKENIZE_TWITTER Former-commit-id: `449a656edd`	2015-07-17 14:43:14 -04:00
Joshua Chin	c80943c677	removed TOKENIZE_TWITTER option Former-commit-id: `00e18b7d4b`	2015-07-17 14:40:49 -04:00
Joshua Chin	753d241b6a	more README fixes Former-commit-id: `772c0cddd1`	2015-07-17 14:40:33 -04:00
Joshua Chin	0f92367e3d	fixed README Former-commit-id: `0a085132f4`	2015-07-17 14:35:43 -04:00
Rob Speer	7f9b7bb5d0	update the wordfreq_builder README Former-commit-id: `8633e8c2a9`	2015-07-13 11:58:48 -04:00
Rob Speer	41dba74da2	add docstrings and remove some brackets	2015-07-07 18:22:51 -04:00
Joshua Chin	b0f759d322	Removes mention of Rosette from README	2015-07-07 10:32:16 -04:00
Rob Speer	10c04d116f	add 'twitter' as a final build, and a new build dir. The `data/dist` directory is now a convenient place to find the final built files that can be copied into wordfreq.	2015-07-01 17:45:39 -04:00
Rob Speer	37375383e8	cope with occasional Unicode errors in the input	2015-06-30 17:05:40 -04:00
Rob Speer	4771c12814	remove wiki2tokens and tokenize_wikipedia. These components are no longer necessary. Wikipedia output can and should be tokenized with the standard tokenizer, instead of the almost-equivalent one in the Nim code.	2015-06-30 15:28:01 -04:00
Rob Speer	9a2855394d	fix comment and whitespace involving tokenize_twitter	2015-06-30 15:18:37 -04:00
Rob Speer	f305679caf	Switch to a centibel scale, add a header to the data	2015-06-22 17:38:13 -04:00
Rob Speer	d16683f2b9	Merge pull request #2 from LuminosoInsight/review-refactor: Adds a number of bugfixes and improvements to wordfreq_builder	2015-06-19 15:29:52 -04:00
Rob Speer	5bc1f0c097	restore missing Russian OpenSubtitles data	2015-06-19 12:36:08 -04:00
Joshua Chin	3746af1350	updated freqs_to_dBpack docstring	2015-06-18 10:32:53 -04:00
Joshua Chin	59ce14cdd0	revised read_freqs docstring	2015-06-18 10:28:22 -04:00
Joshua Chin	04bf6aadcc	updated monolingual_tokenize_file docstring, and removed unused argument	2015-06-18 10:20:54 -04:00
Joshua Chin	91dd73a2b5	tokenize_file should ignore lines with unknown languages	2015-06-18 10:18:57 -04:00
Joshua Chin	ffc01c75a0	Fixed CLD2_BAD_CHAR regex	2015-06-18 10:18:00 -04:00
Joshua Chin	8277de2c7f	changed tokenize_file: cld2 return 'un' instead of None if it cannot recognize the language	2015-06-17 14:19:28 -04:00
Joshua Chin	b24f31d30a	tokenize_file: don't join tokens if language is None	2015-06-17 14:18:18 -04:00
Joshua Chin	99d97956e6	automatically closes input file in tokenize_file	2015-06-17 11:42:34 -04:00
Joshua Chin	e50c0c6917	updated test to check number parsing	2015-06-17 11:30:25 -04:00
Joshua Chin	c71e93611b	fixed build process	2015-06-17 11:25:07 -04:00
Joshua Chin	8317ea6d51	updated directory of twitter output	2015-06-16 17:32:58 -04:00
Joshua Chin	da93bc89c2	removed intermediate twitter file rules	2015-06-16 17:28:09 -04:00
Joshua Chin	87f08780c8	improved tokenize_file and updated docstring	2015-06-16 17:27:27 -04:00
Joshua Chin	bea8963a79	renamed pretokenize_twitter to tokenize twitter, and deleted format_twitter	2015-06-16 17:26:52 -04:00
Joshua Chin	aeedb408b7	fixed bugs and removed unused code	2015-06-16 17:25:06 -04:00
Joshua Chin	64644d8ede	changed tokenizer to only strip t.co urls	2015-06-16 16:11:31 -04:00
Joshua Chin	b649d45e61	Added codepoints U+10FFFE and U+10FFFF to CLD2_BAD_CHAR_RANGE	2015-06-16 16:03:58 -04:00
Joshua Chin	a200a0a689	added tests for the tokenizer and language recognizer	2015-06-16 16:00:14 -04:00
Joshua Chin	1cf7e3d2b9	added pycld2 dependency	2015-06-16 15:06:22 -04:00
Joshua Chin	297d981e20	Replaced Rosette with cld2 language recognizer and wordfreq tokenizer	2015-06-16 14:45:49 -04:00
Rob Speer	b78d8ca3ee	ninja2dot: make a graph of the build process	2015-06-15 13:14:32 -04:00
Rob Speer	56d447a825	Reorganize and document some functions	2015-06-15 12:40:31 -04:00
Rob Speer	3d28491f4d	okay, apparently you can't mix code blocks and bullets	2015-06-01 11:39:42 -04:00
Rob Speer	d202474763	is this indented enough for you, markdown	2015-06-01 11:38:10 -04:00
Rob Speer	9927a8c414	add a README	2015-06-01 11:37:19 -04:00
Rob Speer	cbe3513e08	Tokenize Japanese consistently with MeCab	2015-05-27 17:44:58 -04:00
Rob Speer	536c15fbdb	give mecab a larger buffer	2015-05-26 19:34:46 -04:00
Rob Speer	5de81c7111	fix build rules for Japanese Wikipedia	2015-05-26 18:08:57 -04:00
Rob Speer	3d5b3d47e8	fix version in config.py	2015-05-26 18:08:46 -04:00
Rob Speer	ffd352f148	correct a Leeds bug; add some comments to rules.ninja	2015-05-26 18:08:04 -04:00
Rob Speer	50ff85ce19	add Google Books data for English	2015-05-11 18:44:28 -04:00
Rob Speer	c707b32345	move some functions to the wordfreq package	2015-05-11 17:02:52 -04:00
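
Commit a3b37f6619 above describes a tokenizer that is not idempotent on French elisions such as "d'un". The following is a minimal sketch of that problem and of the apostrophe-stripping fix. It is not the wordfreq_builder code: the regex `TOKEN_RE` and the functions `tokenize_naive`, `tokenize_stripped`, and `retokenize` are hypothetical names invented for illustration, and the regex is only an assumed stand-in that reproduces the behavior the commit message reports.

```python
import re

# Hypothetical, simplified token pattern (an assumption, not the project's regex):
# a run of letters keeps a trailing apostrophe only when another letter follows.
TOKEN_RE = re.compile(r"[^\W\d_]+'(?=[^\W\d_])|[^\W\d_]+")


def tokenize_naive(text):
    """Tokenize without post-processing; "d'un" yields "d'" and "un"."""
    return TOKEN_RE.findall(text.lower())


def tokenize_stripped(text):
    """Tokenize, then strip apostrophes from the edges of each token."""
    stripped = (token.strip("'") for token in tokenize_naive(text))
    return [token for token in stripped if token]


def retokenize(tokens, tokenizer):
    """Run the tokenizer again over each token it already produced."""
    return [piece for token in tokens for piece in tokenizer(token)]


first_pass = tokenize_naive("d'un")
second_pass = retokenize(first_pass, tokenize_naive)
print(first_pass, second_pass)  # the two passes differ: "d'" becomes "d"

first_pass = tokenize_stripped("d'un")
second_pass = retokenize(first_pass, tokenize_stripped)
print(first_pass, second_pass)  # the two passes agree
```

With edge apostrophes stripped, running the tokenizer over its own output leaves the tokens unchanged, which is the property the commit message calls "more idempotent".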

175 Commits