wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-24 01:41:39 +00:00

Author	SHA1	Message	Date
Rob Speer	df8caaff7d	build a bigger wordlist that we can optionally use	2016-01-12 14:05:57 -05:00
Rob Speer	8d9668d8ab	fix usage text: one comment, not one tweet	2016-01-12 13:05:38 -05:00
Rob Speer	115c74583e	Separate tokens with spaces, not line breaks, in intermediate files	2016-01-12 12:59:18 -05:00
Rob Speer	973caca253	builder: Use an optional cutoff when merging counts. This allows the Reddit-merging step to not use such a ludicrous amount of memory.	2015-12-15 14:44:34 -05:00
Rob Speer	9a5d9d66bb	gzip the intermediate step of Reddit word counting	2015-12-09 13:30:08 -05:00
Rob Speer	95f53e295b	no Thai because we can't tokenize it	2015-12-02 12:38:03 -05:00
Rob Speer	8f6cd0e57b	forgot about Italian	2015-11-30 18:18:24 -05:00
Rob Speer	5ef807117d	add tokenizer for Reddit	2015-11-30 18:16:54 -05:00
Rob Speer	b2d7546d2d	add word frequencies from the Reddit 2007-2015 corpus	2015-11-30 16:38:11 -05:00
Rob Speer	9b1c4d66cd	fix missing word in rules.ninja comment	2015-09-24 17:56:06 -04:00
Rob Speer	f224b8dbba	describe the use of `lang` in `read_values`	2015-09-22 17:22:38 -04:00
Rob Speer	7c12f2aca1	Make the jieba_deps comment make sense	2015-09-22 17:19:00 -04:00
Rob Speer	3cb3061e06	Merge branch 'greek-and-turkish' into chinese-and-more. Conflicts: README.md, wordfreq_builder/wordfreq_builder/ninja.py	2015-09-10 15:27:33 -04:00
Rob Speer	a4f8d11427	In ninja deps, remove 'startrow' as a variable	2015-09-10 13:46:19 -04:00
Rob Speer	2277ad3116	fix spelling of Marc	2015-09-09 13:35:02 -04:00
Rob Speer	354555514f	fixes based on code review notes	2015-09-09 13:10:18 -04:00
Rob Speer	d9c44d5fcc	take out OpenSubtitles for Chinese	2015-09-08 17:25:05 -04:00
Rob Speer	bc323eccaf	update comments in wordfreq_builder.config; remove unused 'version'	2015-09-08 16:15:29 -04:00
Rob Speer	0ab23f8a28	sort Jieba wordlists consistently; update data files	2015-09-08 16:09:53 -04:00
Rob Speer	bc8ebd23e9	don't do language-specific tokenization in freqs_to_cBpack. Tokenizing in the 'merge' step is sufficient.	2015-09-08 14:46:04 -04:00
Rob Speer	715361ca0d	actually fix logic of apostrophe-fixing	2015-09-08 13:50:34 -04:00
Rob Speer	c4c1af8213	fix logic of apostrophe-fixing	2015-09-08 13:47:58 -04:00
Rob Speer	912171f8e7	fix '--language' option definition	2015-09-08 13:27:20 -04:00
Rob Speer	77a9b5c55b	Avoid Chinese tokenizer when building	2015-09-08 12:59:03 -04:00
Rob Speer	9071defb33	language-specific frequency reading; fix 't in English	2015-09-08 12:49:21 -04:00
Rob Speer	20f2828d0a	Merge branch 'apostrophe-fix' into chinese-scripts. Conflicts: wordfreq_builder/wordfreq_builder/word_counts.py	2015-09-08 12:29:00 -04:00
Rob Speer	e39d345c4b	WIP: fix apostrophe trimming	2015-09-08 12:28:28 -04:00
Rob Speer	2327f2e4d6	tokenize Chinese using jieba and our own frequencies	2015-09-05 03:16:56 -04:00
Rob Speer	7906a671ea	WIP: Traditional Chinese	2015-09-04 18:52:37 -04:00
Rob Speer	447d7e5134	add Polish and Swedish, which have sufficient data	2015-09-04 17:10:40 -04:00
Rob Speer	5c7a7ea83e	We can put the cutoff back now. I took it out when a step in the English SUBTLEX process was outputting frequencies instead of counts, but I've fixed that now.	2015-09-04 16:16:52 -04:00
Rob Speer	56318a3ca3	remove subtlex-gr from README	2015-09-04 16:11:46 -04:00
Rob Speer	77c60c29b0	Use SUBTLEX for German, but OpenSubtitles for Greek. In German and Greek, SUBTLEX and Hermit Dave turn out to have been working from the same source data. I looked at the quality of how they processed the data, and chose SUBTLEX for German, and Dave's wordlist for Greek.	2015-09-04 15:52:21 -04:00
Rob Speer	0d3ee869c1	Exclude angle brackets from CLD2 detection	2015-09-04 14:56:06 -04:00
Rob Speer	34474939f2	add more SUBTLEX and fix its build rules	2015-09-04 12:37:35 -04:00
Rob Speer	531db64288	Note on next languages to support	2015-09-04 01:50:15 -04:00
Rob Speer	d94428d454	support Turkish and more Greek; document more	2015-09-04 00:57:04 -04:00
Rob Speer	45d871a815	Merge branch 'add-subtlex' into greek-and-turkish	2015-09-03 23:26:14 -04:00
Rob Speer	40d82541ba	refer to merge_freqs command correctly	2015-09-03 23:25:46 -04:00
Rob Speer	a3daba81eb	expand Greek and enable Turkish in config	2015-09-03 23:23:31 -04:00
Rob Speer	2d58ba94f2	Add SUBTLEX as a source of English and Chinese data. Meanwhile, fix up the dependency graph thingy. It's actually kind of legible now.	2015-09-03 18:13:13 -04:00
Rob Speer	5def3a7897	update the build diagram and its script	2015-08-28 17:47:04 -04:00
Rob Speer	c4a2594217	fix URL expression	2015-08-26 15:00:46 -04:00
Rob Speer	a893823d6e	un-flake wordfreq_builder.tokenizers, and edit docstrings	2015-08-26 13:03:23 -04:00
Rob Speer	5a1fc00aaa	Strip apostrophes from edges of tokens. The issue here is that if you had French text with an apostrophe, such as "d'un", it would split it into "d'" and "un", but if "d'" were re-tokenized it would come out as "d". Stripping apostrophes makes the process more idempotent.	2015-08-25 12:41:48 -04:00
Rob Speer	de73888a76	use better regexes in wordfreq_builder tokenizer	2015-08-24 19:05:46 -04:00
Rob Speer	140ca6c050	remove Hangul fillers that confuse cld2	2015-08-24 17:11:18 -04:00
Andrew Lin	6d40912ef9	Stylistic cleanups to word_counts.py.	2015-07-31 19:26:18 -04:00
Andrew Lin	53621c34df	Remove redundant reference to wikipedia in builder README.	2015-07-31 19:12:59 -04:00
Rob Speer	e9f9c94e36	Don't use the file-reading cutoff when writing centibels	2015-07-28 18:45:26 -04:00


155 Commits