Commit Graph

162 Commits

Author | SHA1 | Message | Date
Rob Speer
a893823d6e un-flake wordfreq_builder.tokenizers, and edit docstrings 2015-08-26 13:03:23 -04:00
Rob Speer
5a1fc00aaa Strip apostrophes from edges of tokens
The issue here is that if you had French text with an apostrophe,
such as "d'un", it would split it into "d'" and "un", but if "d'"
were re-tokenized it would come out as "d". Stripping apostrophes
makes the process more idempotent.
2015-08-25 12:41:48 -04:00
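The commit body above describes stripping apostrophes from token edges so that re-tokenizing is idempotent. A minimal sketch of that idea (the function name and regex are assumptions for illustration, not the project's actual code):

```python
import re

# Strip apostrophes only from the *edges* of a token, not the interior,
# so "d'" becomes "d" while "don't" is left alone. Hypothetical helper;
# the real wordfreq_builder tokenizer may differ.
EDGE_APOSTROPHES = re.compile(r"^'+|'+$")

def strip_edge_apostrophes(token):
    return EDGE_APOSTROPHES.sub('', token)
```

Because only edge apostrophes are removed, applying the function twice gives the same result as applying it once, which is the idempotence the commit is after.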
Rob Speer
de73888a76 use better regexes in wordfreq_builder tokenizer 2015-08-24 19:05:46 -04:00
Rob Speer
140ca6c050 remove Hangul fillers that confuse cld2 2015-08-24 17:11:18 -04:00
Andrew Lin
6d40912ef9 Stylistic cleanups to word_counts.py. 2015-07-31 19:26:18 -04:00
Andrew Lin
53621c34df Remove redundant reference to wikipedia in builder README. 2015-07-31 19:12:59 -04:00
Rob Speer
e9f9c94e36 Don't use the file-reading cutoff when writing centibels 2015-07-28 18:45:26 -04:00
Rob Speer
c5708b24e4 put back the freqs_to_cBpack cutoff; prepare for 1.0 2015-07-28 18:01:12 -04:00
Rob Speer
32102ba3c2 Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17
Code review fixes 2015 07 17
2015-07-22 15:09:00 -04:00
Joshua Chin
93cd902899 updated read_freqs docs 2015-07-22 10:06:16 -04:00
Joshua Chin
4fe9d110e1 fixed style 2015-07-22 10:05:11 -04:00
Joshua Chin
6453d864c4 reordered command line args 2015-07-22 10:04:14 -04:00
Joshua Chin
8081145922 bugfix 2015-07-21 10:12:56 -04:00
Joshua Chin
c5f82ecac1 fixed rules.ninja 2015-07-20 17:20:29 -04:00
Joshua Chin
643571c69c fixed build bug 2015-07-20 16:51:25 -04:00
Joshua Chin
173278fdd3 ensure removal of tatweels (hopefully) 2015-07-20 16:48:36 -04:00
Joshua Chin
298d3c1d24 unhoisted if statement 2015-07-20 11:10:41 -04:00
Joshua Chin
accb7e398c ninja.py is now pep8 compliant 2015-07-20 11:06:58 -04:00
Joshua Chin
221acf7921 fixed build 2015-07-17 17:44:01 -04:00
Rob Speer
2d1020daac mention the Wikipedia data, and credit Hermit Dave 2015-07-17 17:09:36 -04:00
Joshua Chin
f31f9a1bcd fixed tokenize_twitter 2015-07-17 16:37:47 -04:00
Joshua Chin
a44927e98e added cld2 tokenizer comments 2015-07-17 16:03:33 -04:00
Joshua Chin
11a1c51321 fix arabic tokens 2015-07-17 15:52:12 -04:00
Joshua Chin
c75c735d8d fixed syntax 2015-07-17 15:43:24 -04:00
Joshua Chin
303bd88ba2 renamed tokenize file to tokenize twitter 2015-07-17 15:27:26 -04:00
Joshua Chin
d6519cf736 created last_tab flag 2015-07-17 15:19:09 -04:00
Joshua Chin
620becb7e8 removed unnecessary if statement 2015-07-17 15:14:06 -04:00

Joshua Chin
d988b1b42e generated freq dict in place 2015-07-17 15:13:25 -04:00
Joshua Chin
e37c689031 corrected docstring 2015-07-17 15:12:23 -04:00
Joshua Chin
002351bace removed unnecessary strip 2015-07-17 15:11:28 -04:00
Joshua Chin
7fc23666a9 moved last_tab to tokenize_twitter 2015-07-17 15:10:17 -04:00
Joshua Chin
528285a982 removed unused function 2015-07-17 15:03:14 -04:00
Joshua Chin
59d3c72758 fixed spacing 2015-07-17 15:02:34 -04:00
Joshua Chin
10028be212 removed unnecessary format 2015-07-17 15:01:25 -04:00
Joshua Chin
3b368b66dd cleaned up BAD_CHAR_RANGE 2015-07-17 15:00:59 -04:00
Joshua Chin
c2d1cdcb31 moved test tokenizers 2015-07-17 14:58:58 -04:00
Joshua Chin
5d26c9f57f added docstring and moved to scripts 2015-07-17 14:56:18 -04:00
Joshua Chin
bdc791af8f style changes 2015-07-17 14:54:32 -04:00
Joshua Chin
4d5ec57144 removed bad comment 2015-07-17 14:54:09 -04:00
Joshua Chin
39f01b0485 removed unused scripts 2015-07-17 14:53:18 -04:00
Joshua Chin
98a7a8093b removed mkdir -p for many cases 2015-07-17 14:45:22 -04:00
Joshua Chin
449a656edd removed TOKENIZE_TWITTER 2015-07-17 14:43:14 -04:00
Joshua Chin
00e18b7d4b removed TOKENIZE_TWITTER option 2015-07-17 14:40:49 -04:00
Joshua Chin
772c0cddd1 more README fixes 2015-07-17 14:40:33 -04:00
Joshua Chin
0a085132f4 fixed README 2015-07-17 14:35:43 -04:00
Rob Speer
8633e8c2a9 update the wordfreq_builder README 2015-07-13 11:58:48 -04:00
Rob Speer
41dba74da2 add docstrings and remove some brackets 2015-07-07 18:22:51 -04:00
Joshua Chin
b0f759d322 Removes mention of Rosette from README 2015-07-07 10:32:16 -04:00
Rob Speer
10c04d116f add 'twitter' as a final build, and a new build dir
The `data/dist` directory is now a convenient place to find the final
built files that can be copied into wordfreq.
2015-07-01 17:45:39 -04:00
Rob Speer
37375383e8 cope with occasional Unicode errors in the input 2015-06-30 17:05:40 -04:00
Rob Speer
4771c12814 remove wiki2tokens and tokenize_wikipedia
These components are no longer necessary. Wikipedia output can and
should be tokenized with the standard tokenizer, instead of the
almost-equivalent one in the Nim code.
2015-06-30 15:28:01 -04:00
Rob Speer
9a2855394d fix comment and whitespace involving tokenize_twitter 2015-06-30 15:18:37 -04:00
Rob Speer
f305679caf Switch to a centibel scale, add a header to the data 2015-06-22 17:38:13 -04:00
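A centibel is a hundredth of a bel, so the scale this commit introduces stores word frequencies as 100 · log10(frequency). A sketch of the conversion, assuming simple rounding (the builder's actual rounding and cutoff handling may differ):

```python
import math

def freq_to_centibels(freq):
    """Convert a word frequency (0 < freq <= 1) to the centibel scale:
    100 * log10(freq). Frequencies below 1 come out negative; e.g. a
    frequency of 0.01 is -200 cB. Illustrative sketch only."""
    return round(100 * math.log10(freq))
```

Integer centibels compress frequencies spanning many orders of magnitude into small numbers, which is what makes the packed "cBpack" data format in the later commits compact.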
Rob Speer
d16683f2b9 Merge pull request #2 from LuminosoInsight/review-refactor
Adds a number of bugfixes and improvements to wordfreq_builder
2015-06-19 15:29:52 -04:00
Rob Speer
5bc1f0c097 restore missing Russian OpenSubtitles data 2015-06-19 12:36:08 -04:00
Joshua Chin
3746af1350 updated freqs_to_dBpack docstring 2015-06-18 10:32:53 -04:00
Joshua Chin
59ce14cdd0 revised read_freqs docstring 2015-06-18 10:28:22 -04:00
Joshua Chin
04bf6aadcc updated monolingual_tokenize_file docstring, and removed unused argument 2015-06-18 10:20:54 -04:00
Joshua Chin
91dd73a2b5 tokenize_file should ignore lines with unknown languages 2015-06-18 10:18:57 -04:00
Joshua Chin
ffc01c75a0 Fixed CLD2_BAD_CHAR regex 2015-06-18 10:18:00 -04:00
Joshua Chin
8277de2c7f changed tokenize_file: cld2 return 'un' instead of None if it cannot recognize the language 2015-06-17 14:19:28 -04:00
Joshua Chin
b24f31d30a tokenize_file: don't join tokens if language is None 2015-06-17 14:18:18 -04:00
Joshua Chin
99d97956e6 automatically closes input file in tokenize_file 2015-06-17 11:42:34 -04:00
Joshua Chin
e50c0c6917 updated test to check number parsing 2015-06-17 11:30:25 -04:00
Joshua Chin
c71e93611b fixed build process 2015-06-17 11:25:07 -04:00
Joshua Chin
8317ea6d51 updated directory of twitter output 2015-06-16 17:32:58 -04:00
Joshua Chin
da93bc89c2 removed intermediate twitter file rules 2015-06-16 17:28:09 -04:00
Joshua Chin
87f08780c8 improved tokenize_file and updated docstring 2015-06-16 17:27:27 -04:00
Joshua Chin
bea8963a79 renamed pretokenize_twitter to tokenize twitter, and deleted format_twitter 2015-06-16 17:26:52 -04:00
Joshua Chin
aeedb408b7 fixed bugs and removed unused code 2015-06-16 17:25:06 -04:00
Joshua Chin
64644d8ede changed tokenizer to only strip t.co urls 2015-06-16 16:11:31 -04:00
Joshua Chin
b649d45e61 Added codepoints U+10FFFE and U+10FFFF to CLD2_BAD_CHAR_RANGE 2015-06-16 16:03:58 -04:00
Joshua Chin
a200a0a689 added tests for the tokenizer and language recognizer 2015-06-16 16:00:14 -04:00
Joshua Chin
1cf7e3d2b9 added pycld2 dependency 2015-06-16 15:06:22 -04:00
Joshua Chin
297d981e20 Replaced Rosette with cld2 language recognizer and wordfreq tokenizer 2015-06-16 14:45:49 -04:00
Rob Speer
b78d8ca3ee ninja2dot: make a graph of the build process 2015-06-15 13:14:32 -04:00
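The ninja2dot script turns the build file into a Graphviz graph. A hypothetical minimal version of that transformation, handling only plain `build <outputs>: <rule> <inputs>` statements (the real script presumably copes with more of the ninja syntax, such as `|` implicit and `||` order-only dependencies):

```python
def ninja_to_dot(ninja_text):
    """Emit Graphviz 'dot' source with an edge from each input file to
    each output file, labeled with the build rule. Sketch only."""
    lines = ['digraph build {']
    for line in ninja_text.splitlines():
        line = line.strip()
        if not line.startswith('build '):
            continue
        targets, _, rest = line[len('build '):].partition(':')
        parts = rest.split()
        if not parts:
            continue
        rule, inputs = parts[0], parts[1:]
        for out in targets.split():
            for inp in inputs:
                lines.append('  "%s" -> "%s" [label="%s"];' % (inp, out, rule))
    lines.append('}')
    return '\n'.join(lines)
```

Piping the result through `dot -Tpng` would render the dependency graph of the whole build.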
Rob Speer
56d447a825 Reorganize and document some functions 2015-06-15 12:40:31 -04:00
Rob Speer
3d28491f4d okay, apparently you can't mix code blocks and bullets 2015-06-01 11:39:42 -04:00
Rob Speer
d202474763 is this indented enough for you, markdown 2015-06-01 11:38:10 -04:00
Rob Speer
9927a8c414 add a README 2015-06-01 11:37:19 -04:00
Rob Speer
cbe3513e08 Tokenize Japanese consistently with MeCab 2015-05-27 17:44:58 -04:00
Rob Speer
536c15fbdb give mecab a larger buffer 2015-05-26 19:34:46 -04:00
Rob Speer
5de81c7111 fix build rules for Japanese Wikipedia 2015-05-26 18:08:57 -04:00
Rob Speer
3d5b3d47e8 fix version in config.py 2015-05-26 18:08:46 -04:00
Rob Speer
ffd352f148 correct a Leeds bug; add some comments to rules.ninja 2015-05-26 18:08:04 -04:00
Rob Speer
50ff85ce19 add Google Books data for English 2015-05-11 18:44:28 -04:00
Rob Speer
c707b32345 move some functions to the wordfreq package 2015-05-11 17:02:52 -04:00
Rob Speer
d0d777ed91 use a more general-purpose tokenizer, not 'retokenize' 2015-05-08 12:40:14 -04:00
Rob Speer
35128a94ca build.ninja knows about its own dependencies 2015-05-08 12:40:06 -04:00
Rob Speer
d6cc90792f Makefile should only be needed for bootstrapping Ninja 2015-05-08 12:39:31 -04:00
Rob Speer
2f14417bcf limit final builds to languages with >= 2 sources 2015-05-07 23:59:04 -04:00
Rob Speer
1b7a2b9d0b fix dependency 2015-05-07 23:55:57 -04:00
Rob Speer
abb0e059c8 a reasonably complete build process 2015-05-07 19:38:33 -04:00
Rob Speer
02d8b32119 process leeds and opensubtitles 2015-05-07 17:07:33 -04:00
Rob Speer
7e238cf547 abstract how we define build rules a bit 2015-05-07 16:59:28 -04:00
Rob Speer
d2f9c60776 WIP on more build steps 2015-05-07 16:49:53 -04:00
Rob Speer
16928ed182 add rules to count wikipedia tokens 2015-05-05 15:21:24 -04:00
Rob Speer
bd579e2319 fix the 'count' ninja rule 2015-05-05 14:06:13 -04:00
Rob Speer
5787b6bb73 add and adjust some build steps
- more build steps for Wikipedia
- rename 'tokenize_twitter' to 'pretokenize_twitter' to indicate that
  the results are preliminary
2015-05-05 13:59:21 -04:00
Rob Speer
61b9440e3d add wiki-parsing process 2015-05-04 13:25:01 -04:00