wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-26 18:38:51 +00:00

Author	SHA1	Message	Date
Rob Speer	0ab23f8a28	sort Jieba wordlists consistently; update data files	2015-09-08 16:09:53 -04:00
Rob Speer	bc8ebd23e9	don't do language-specific tokenization in freqs_to_cBpack. Tokenizing in the 'merge' step is sufficient.	2015-09-08 14:46:04 -04:00
Rob Speer	715361ca0d	actually fix logic of apostrophe-fixing	2015-09-08 13:50:34 -04:00
Rob Speer	c4c1af8213	fix logic of apostrophe-fixing	2015-09-08 13:47:58 -04:00
Rob Speer	912171f8e7	fix '--language' option definition	2015-09-08 13:27:20 -04:00
Rob Speer	77a9b5c55b	Avoid Chinese tokenizer when building	2015-09-08 12:59:03 -04:00
Rob Speer	9071defb33	language-specific frequency reading; fix 't in English	2015-09-08 12:49:21 -04:00
Rob Speer	20f2828d0a	Merge branch 'apostrophe-fix' into chinese-scripts. Conflicts: wordfreq_builder/wordfreq_builder/word_counts.py	2015-09-08 12:29:00 -04:00
Rob Speer	e39d345c4b	WIP: fix apostrophe trimming	2015-09-08 12:28:28 -04:00
Rob Speer	d576e3294b	update the README for Chinese	2015-09-05 03:42:54 -04:00
Rob Speer	2327f2e4d6	tokenize Chinese using jieba and our own frequencies	2015-09-05 03:16:56 -04:00
Rob Speer	7906a671ea	WIP: Traditional Chinese	2015-09-04 18:52:37 -04:00
Rob Speer	3c3371a9ff	add Polish and Swedish to README	2015-09-04 17:10:40 -04:00
Rob Speer	447d7e5134	add Polish and Swedish, which have sufficient data	2015-09-04 17:10:40 -04:00
Rob Speer	25edaad962	update data files	2015-09-04 17:00:55 -04:00
Rob Speer	fc93c8dc9c	add tests for Turkish	2015-09-04 17:00:05 -04:00
Rob Speer	5c7a7ea83e	We can put the cutoff back now. I took it out when a step in the English SUBTLEX process was outputting frequencies instead of counts, but I've fixed that now.	2015-09-04 16:16:52 -04:00
Rob Speer	56318a3ca3	remove subtlex-gr from README	2015-09-04 16:11:46 -04:00
Rob Speer	8196643509	add more citations	2015-09-04 15:57:40 -04:00
Rob Speer	77c60c29b0	Use SUBTLEX for German, but OpenSubtitles for Greek. In German and Greek, SUBTLEX and Hermit Dave turn out to have been working from the same source data. I looked at the quality of how they processed the data, and chose SUBTLEX for German, and Dave's wordlist for Greek.	2015-09-04 15:52:21 -04:00
Rob Speer	a47497c908	update data files (without the CLD2 fix yet)	2015-09-04 14:58:20 -04:00
Rob Speer	0d3ee869c1	Exclude angle brackets from CLD2 detection	2015-09-04 14:56:06 -04:00
Rob Speer	81bbe663fb	update README with additional SUBTLEX support	2015-09-04 13:23:33 -04:00
Rob Speer	34474939f2	add more SUBTLEX and fix its build rules	2015-09-04 12:37:35 -04:00
Rob Speer	c11e3b7a9d	update the data	2015-09-04 02:07:50 -04:00
Rob Speer	531db64288	Note on next languages to support	2015-09-04 01:50:15 -04:00
Rob Speer	d9a1c34d00	expand list of sources and supported languages	2015-09-04 01:03:36 -04:00
Rob Speer	d94428d454	support Turkish and more Greek; document more	2015-09-04 00:57:04 -04:00
Rob Speer	45d871a815	Merge branch 'add-subtlex' into greek-and-turkish	2015-09-03 23:26:14 -04:00
Rob Speer	40d82541ba	refer to merge_freqs command correctly	2015-09-03 23:25:46 -04:00
Rob Speer	a3daba81eb	expand Greek and enable Turkish in config	2015-09-03 23:23:31 -04:00
Rob Speer	e6a2886a66	add SUBTLEX to the readme	2015-09-03 18:56:56 -04:00
Rob Speer	2d58ba94f2	Add SUBTLEX as a source of English and Chinese data. Meanwhile, fix up the dependency graph thingy. It's actually kind of legible now.	2015-09-03 18:13:13 -04:00
Rob Speer	07228fdf1d	Merge pull request #24 from LuminosoInsight/manifest: Remove the no-longer-existent .txt files from the MANIFEST.	2015-09-02 14:34:17 -04:00
Andrew Lin	db41bc7902	Remove the no-longer-existent .txt files from the MANIFEST.	2015-09-02 14:27:15 -04:00
Andrew Lin	e43b5ebf7b	Merge pull request #23 from LuminosoInsight/readme: Put documentation and examples in the README	2015-08-28 17:59:17 -04:00
Rob Speer	00a2812907	fix heading	2015-08-28 17:49:38 -04:00
Rob Speer	93f44683c5	fix list formatting	2015-08-28 17:49:07 -04:00
Rob Speer	5def3a7897	update the build diagram and its script	2015-08-28 17:47:04 -04:00
Rob Speer	2370287539	improve README with function documentation and examples	2015-08-28 17:45:50 -04:00
Andrew Lin	e6d9b36203	Merge pull request #22 from LuminosoInsight/standard-tokenizer: Use a more standard Unicode tokenizer	2015-08-27 11:56:19 -04:00
Rob Speer	b952676679	update data files	2015-08-27 03:58:54 -04:00
Rob Speer	d5fcf4407e	copyedit regex comments	2015-08-26 17:04:56 -04:00
Rob Speer	34375958ef	fix typo in docstring	2015-08-26 16:24:35 -04:00
Rob Speer	c4a2594217	fix URL expression	2015-08-26 15:00:46 -04:00
Rob Speer	f7babea352	correct the simple_tokenize docstring	2015-08-26 13:54:50 -04:00
Rob Speer	01b6403ef4	refactor the token expression	2015-08-26 13:40:47 -04:00
Rob Speer	a893823d6e	un-flake wordfreq_builder.tokenizers, and edit docstrings	2015-08-26 13:03:23 -04:00
Rob Speer	94467a6563	remove regex files that are no longer needed	2015-08-26 11:48:11 -04:00
Rob Speer	694c28d5e4	bump to version 1.1	2015-08-25 17:44:52 -04:00


416 Commits