wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-24 01:41:39 +00:00

Author	SHA1	Message	Date
Robyn Speer	a75a95658b	We can put the cutoff back now. I took it out when a step in the English SUBTLEX process was outputting frequencies instead of counts, but I've fixed that now. Former-commit-id: `5c7a7ea83e`	2015-09-04 16:16:52 -04:00
Robyn Speer	f330d6d130	remove subtlex-gr from README Former-commit-id: `56318a3ca3`	2015-09-04 16:11:46 -04:00
Robyn Speer	8277b34571	Use SUBTLEX for German, but OpenSubtitles for Greek. In German and Greek, SUBTLEX and Hermit Dave turn out to have been working from the same source data. I looked at the quality of how they processed the data, and chose SUBTLEX for German, and Dave's wordlist for Greek. Former-commit-id: `77c60c29b0`	2015-09-04 15:52:21 -04:00
Robyn Speer	a69b66b210	Exclude angle brackets from CLD2 detection Former-commit-id: `0d3ee869c1`	2015-09-04 14:56:06 -04:00
Robyn Speer	d0ada70355	add more SUBTLEX and fix its build rules Former-commit-id: `34474939f2`	2015-09-04 12:37:35 -04:00
Robyn Speer	14136d2a01	Note on next languages to support Former-commit-id: `531db64288`	2015-09-04 01:50:15 -04:00
Robyn Speer	574c383202	support Turkish and more Greek; document more Former-commit-id: `d94428d454`	2015-09-04 00:57:04 -04:00
Robyn Speer	f168c37417	Merge branch 'add-subtlex' into greek-and-turkish Former-commit-id: `45d871a815`	2015-09-03 23:26:14 -04:00
Robyn Speer	76c751652e	refer to merge_freqs command correctly Former-commit-id: `40d82541ba`	2015-09-03 23:25:46 -04:00
Robyn Speer	3446a393c5	expand Greek and enable Turkish in config Former-commit-id: `a3daba81eb`	2015-09-03 23:23:31 -04:00
Robyn Speer	f66d03b1b9	Add SUBTLEX as a source of English and Chinese data. Meanwhile, fix up the dependency graph thingy. It's actually kind of legible now. Former-commit-id: `2d58ba94f2`	2015-09-03 18:13:13 -04:00
Robyn Speer	247d7c6579	update the build diagram and its script Former-commit-id: `5def3a7897`	2015-08-28 17:47:04 -04:00
Robyn Speer	af29fc4f88	fix URL expression Former-commit-id: `c4a2594217`	2015-08-26 15:00:46 -04:00
Robyn Speer	3a140ee02f	un-flake wordfreq_builder.tokenizers, and edit docstrings Former-commit-id: `a893823d6e`	2015-08-26 13:03:23 -04:00
Robyn Speer	b22a4b0f02	Strip apostrophes from edges of tokens. The issue here is that if you had French text with an apostrophe, such as "d'un", it would split it into "d'" and "un", but if "d'" were re-tokenized it would come out as "d". Stripping apostrophes makes the process more idempotent. Former-commit-id: `5a1fc00aaa`	2015-08-25 12:41:48 -04:00
Robyn Speer	8637aaef9e	use better regexes in wordfreq_builder tokenizer Former-commit-id: `de73888a76`	2015-08-24 19:05:46 -04:00
Robyn Speer	4ec128adae	remove Hangul fillers that confuse cld2 Former-commit-id: `140ca6c050`	2015-08-24 17:11:18 -04:00
Andrew Lin	77610f57e1	Stylistic cleanups to word_counts.py. Former-commit-id: `6d40912ef9`	2015-07-31 19:26:18 -04:00
Andrew Lin	0711fb3c43	Remove redundant reference to wikipedia in builder README. Former-commit-id: `53621c34df`	2015-07-31 19:12:59 -04:00
Robyn Speer	e9dd253f1d	Don't use the file-reading cutoff when writing centibels Former-commit-id: `e9f9c94e36`	2015-07-28 18:45:26 -04:00
Robyn Speer	3ff0f30218	put back the freqs_to_cBpack cutoff; prepare for 1.0 Former-commit-id: `c5708b24e4`	2015-07-28 18:01:12 -04:00
Robyn Speer	33e0493fd5	Merge pull request #19 from LuminosoInsight/code-review-fixes-2015-07-17: Code review fixes 2015 07 17 Former-commit-id: `32102ba3c2`	2015-07-22 15:09:00 -04:00
Joshua Chin	292fc96142	updated read_freqs docs Former-commit-id: `93cd902899`	2015-07-22 10:06:16 -04:00
Joshua Chin	d629e8b6cc	fixed style Former-commit-id: `4fe9d110e1`	2015-07-22 10:05:11 -04:00
Joshua Chin	f9742c94ca	reordered command line args Former-commit-id: `6453d864c4`	2015-07-22 10:04:14 -04:00
Joshua Chin	474ae0da35	bugfix Former-commit-id: `8081145922`	2015-07-21 10:12:56 -04:00
Joshua Chin	34504eed80	fixed rules.ninja Former-commit-id: `c5f82ecac1`	2015-07-20 17:20:29 -04:00
Joshua Chin	61a03b87bc	fixed build bug Former-commit-id: `643571c69c`	2015-07-20 16:51:25 -04:00
Joshua Chin	af8050f1b8	ensure removal of tatweels (hopefully) Former-commit-id: `173278fdd3`	2015-07-20 16:48:36 -04:00
Joshua Chin	675a02ac11	unhoisted if statement Former-commit-id: `298d3c1d24`	2015-07-20 11:10:41 -04:00
Joshua Chin	98cbef4ecf	ninja.py is now pep8 compliant Former-commit-id: `accb7e398c`	2015-07-20 11:06:58 -04:00
Joshua Chin	44669bd3a9	fixed build Former-commit-id: `221acf7921`	2015-07-17 17:44:01 -04:00
Robyn Speer	ea2c6adbc4	mention the Wikipedia data, and credit Hermit Dave Former-commit-id: `2d1020daac`	2015-07-17 17:09:36 -04:00
Joshua Chin	ec871bb6ca	fixed tokenize_twitter Former-commit-id: `f31f9a1bcd`	2015-07-17 16:37:47 -04:00
Joshua Chin	71ff0c62d6	added cld2 tokenizer comments Former-commit-id: `a44927e98e`	2015-07-17 16:03:33 -04:00
Joshua Chin	c2f3928433	fix arabic tokens Former-commit-id: `11a1c51321`	2015-07-17 15:52:12 -04:00
Joshua Chin	d283183743	fixed syntax Former-commit-id: `c75c735d8d`	2015-07-17 15:43:24 -04:00
Joshua Chin	3962b475c1	renamed tokenize file to tokenize twitter Former-commit-id: `303bd88ba2`	2015-07-17 15:27:26 -04:00
Joshua Chin	2f73cc535c	created last_tab flag Former-commit-id: `d6519cf736`	2015-07-17 15:19:09 -04:00
Joshua Chin	4117480a0e	removed unnecessary if statement Former-commit-id: `620becb7e8`	2015-07-17 15:14:06 -04:00
Joshua Chin	b6d03324b9	generated freq dict in place Former-commit-id: `d988b1b42e`	2015-07-17 15:13:25 -04:00
Joshua Chin	c0650a6893	corrected docstring Former-commit-id: `e37c689031`	2015-07-17 15:12:23 -04:00
Joshua Chin	be8921869a	removed unnecessary strip Former-commit-id: `002351bace`	2015-07-17 15:11:28 -04:00
Joshua Chin	d7feab1c28	moved last_tab to tokenize_twitter Former-commit-id: `7fc23666a9`	2015-07-17 15:10:17 -04:00
Joshua Chin	200c271083	removed unused function Former-commit-id: `528285a982`	2015-07-17 15:03:14 -04:00
Joshua Chin	c84ac8d62a	fixed spacing Former-commit-id: `59d3c72758`	2015-07-17 15:02:34 -04:00
Joshua Chin	2258c4e55b	removed unnecessary format Former-commit-id: `10028be212`	2015-07-17 15:01:25 -04:00
Joshua Chin	368e4f3cca	cleaned up BAD_CHAR_RANGE Former-commit-id: `3b368b66dd`	2015-07-17 15:00:59 -04:00
Joshua Chin	78e9cf5d8f	moved test tokenizers Former-commit-id: `c2d1cdcb31`	2015-07-17 14:58:58 -04:00
Joshua Chin	4bfdd263b7	added docstring and moved to scripts Former-commit-id: `5d26c9f57f`	2015-07-17 14:56:18 -04:00
Joshua Chin	09ccb862ba	style changes Former-commit-id: `bdc791af8f`	2015-07-17 14:54:32 -04:00
Joshua Chin	85fe540a06	removed bad comment Former-commit-id: `4d5ec57144`	2015-07-17 14:54:09 -04:00
Joshua Chin	eb9add9d71	removed unused scripts Former-commit-id: `39f01b0485`	2015-07-17 14:53:18 -04:00
Joshua Chin	a340a15870	removed mkdir -p for many cases Former-commit-id: `98a7a8093b`	2015-07-17 14:45:22 -04:00
Joshua Chin	354f09ec24	removed TOKENIZE_TWITTER Former-commit-id: `449a656edd`	2015-07-17 14:43:14 -04:00
Joshua Chin	d0df4cc9a4	removed TOKENIZE_TWITTER option Former-commit-id: `00e18b7d4b`	2015-07-17 14:40:49 -04:00
Joshua Chin	46b2730601	more README fixes Former-commit-id: `772c0cddd1`	2015-07-17 14:40:33 -04:00
Joshua Chin	3e4643f9c4	fixed README Former-commit-id: `0a085132f4`	2015-07-17 14:35:43 -04:00
Robyn Speer	73bacc659d	update the wordfreq_builder README Former-commit-id: `8633e8c2a9`	2015-07-13 11:58:48 -04:00
Robyn Speer	e9d88bf35e	add docstrings and remove some brackets	2015-07-07 18:22:51 -04:00
Joshua Chin	1c365e6a50	Removes mention of Rosette from README	2015-07-07 10:32:16 -04:00
Robyn Speer	3eb3e7c388	add 'twitter' as a final build, and a new build dir. The `data/dist` directory is now a convenient place to find the final built files that can be copied into wordfreq.	2015-07-01 17:45:39 -04:00
Robyn Speer	58c8bda21b	cope with occasional Unicode errors in the input	2015-06-30 17:05:40 -04:00
Robyn Speer	deed2f767c	remove wiki2tokens and tokenize_wikipedia. These components are no longer necessary. Wikipedia output can and should be tokenized with the standard tokenizer, instead of the almost-equivalent one in the Nim code.	2015-06-30 15:28:01 -04:00
Robyn Speer	f17a04aa84	fix comment and whitespace involving tokenize_twitter	2015-06-30 15:18:37 -04:00
Robyn Speer	91d6edd55b	Switch to a centibel scale, add a header to the data	2015-06-22 17:38:13 -04:00
Robyn Speer	3108d24d76	Merge pull request #2 from LuminosoInsight/review-refactor: Adds a number of bugfixes and improvements to wordfreq_builder	2015-06-19 15:29:52 -04:00
Robyn Speer	a83cf82adb	restore missing Russian OpenSubtitles data	2015-06-19 12:36:08 -04:00
Joshua Chin	1385b735cf	updated freqs_to_dBpack docstring	2015-06-18 10:32:53 -04:00
Joshua Chin	3596434f7f	revised read_freqs docstring	2015-06-18 10:28:22 -04:00
Joshua Chin	18b53f6071	updated monolingual_tokenize_file docstring, and removed unused argument	2015-06-18 10:20:54 -04:00
Joshua Chin	34e9512517	tokenize_file should ignore lines with unknown languages	2015-06-18 10:18:57 -04:00
Joshua Chin	2f4fe92c90	Fixed CLD2_BAD_CHAR regex	2015-06-18 10:18:00 -04:00
Joshua Chin	87285b8b90	changed tokenize_file: cld2 return 'un' instead of None if it cannot recognize the language	2015-06-17 14:19:28 -04:00
Joshua Chin	b5bc39c893	tokenize_file: don't join tokens if language is None	2015-06-17 14:18:18 -04:00
Joshua Chin	7fc0ba9092	automatically closes input file in tokenize_file	2015-06-17 11:42:34 -04:00
Joshua Chin	2039b18b71	updated test to check number parsing	2015-06-17 11:30:25 -04:00
Joshua Chin	dad23c117a	fixed build process	2015-06-17 11:25:07 -04:00
Joshua Chin	a495de9f65	updated directory of twitter output	2015-06-16 17:32:58 -04:00
Joshua Chin	6f0a082007	removed intermediate twitter file rules	2015-06-16 17:28:09 -04:00
Joshua Chin	42ca1f2523	improved tokenize_file and updated docstring	2015-06-16 17:27:27 -04:00
Joshua Chin	80afc5dc45	renamed pretokenize_twitter to tokenize twitter, and deleted format_twitter	2015-06-16 17:26:52 -04:00
Joshua Chin	20bc34f224	fixed bugs and removed unused code	2015-06-16 17:25:06 -04:00
Joshua Chin	aa0bef3fb7	changed tokenizer to only strip t.co urls	2015-06-16 16:11:31 -04:00
Joshua Chin	8dd17fded4	Added codepoints U+10FFFE and U+10FFFF to CLD2_BAD_CHAR_RANGE	2015-06-16 16:03:58 -04:00
Joshua Chin	308cdbb4c4	added tests for the tokenizer and language recognizer	2015-06-16 16:00:14 -04:00
Joshua Chin	e57a88b548	added pycld2 dependency	2015-06-16 15:06:22 -04:00
Joshua Chin	7a3cd8068c	Replaced Rosette with cld2 language recognizer and wordfreq tokenizer	2015-06-16 14:45:49 -04:00
Robyn Speer	6cd6ab33bc	ninja2dot: make a graph of the build process	2015-06-15 13:14:32 -04:00
Robyn Speer	26b03392fe	Reorganize and document some functions	2015-06-15 12:40:31 -04:00
Robyn Speer	04ad6720cc	okay, apparently you can't mix code blocks and bullets	2015-06-01 11:39:42 -04:00
Robyn Speer	69d9e89bb8	is this indented enough for you, markdown	2015-06-01 11:38:10 -04:00
Robyn Speer	dcc1e87728	add a README	2015-06-01 11:37:19 -04:00
Robyn Speer	296901b93f	Tokenize Japanese consistently with MeCab	2015-05-27 17:44:58 -04:00
Robyn Speer	a5954d14df	give mecab a larger buffer	2015-05-26 19:34:46 -04:00
Robyn Speer	b9a5e05f87	fix build rules for Japanese Wikipedia	2015-05-26 18:08:57 -04:00
Robyn Speer	353533bba4	fix version in config.py	2015-05-26 18:08:46 -04:00
Robyn Speer	4f738ad78c	correct a Leeds bug; add some comments to rules.ninja	2015-05-26 18:08:04 -04:00
Robyn Speer	4513fed60c	add Google Books data for English	2015-05-11 18:44:28 -04:00
Robyn Speer	ed4f79b90e	move some functions to the wordfreq package	2015-05-11 17:02:52 -04:00

175 Commits