wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 17:31:41 +00:00

Author	SHA1	Message	Date
Robyn Speer	9dac967ca3	Tokenize by graphemes, not codepoints (#50) * Tokenize by graphemes, not codepoints * Add more documentation to TOKEN_RE * Remove extra line break * Update docstring - Brahmic scripts are no longer an exception * approve using version 2017.07.28 of regex	2017-08-08 11:35:28 -04:00
Andrew Lin	6c118c0b6a	Merge pull request #49 from LuminosoInsight/restore-langcodes: Use langcodes when tokenizing again	2017-05-10 16:20:06 -04:00
Robyn Speer	aa3ed23282	v1.6.1: depend on langcodes 1.4	2017-05-10 13:26:23 -04:00
Robyn Speer	71a0ad6abb	Use langcodes when tokenizing again (it no longer connects to a DB)	2017-04-27 15:09:59 -04:00
Robyn Speer	ae7bc5764b	Merge pull request #48 from LuminosoInsight/code-review-notes: Code review notes	2017-02-15 12:29:25 -08:00
Andrew Lin	c2e1504643	Clarify the changelog.	2017-02-14 13:09:12 -05:00
Andrew Lin	1363f9d2e0	Correct a case in transliterate.py.	2017-02-14 13:08:23 -05:00
Andrew Lin	72e3678e89	Merge pull request #47 from LuminosoInsight/all-1.6-changes: All 1.6 changes	2017-02-01 15:36:38 -05:00
Robyn Speer	a099a5a881	Remove ninja2dot script, which is no longer used	2017-02-01 14:49:44 -05:00
Robyn Speer	7dec335f74	describe the current problem with 'cyrtranslit' as a dependency	2017-01-31 18:25:52 -05:00
Robyn Speer	19b72132e7	Fix some outdated numbers in English examples	2017-01-31 18:25:41 -05:00
Robyn Speer	abd0820a32	Handle smashing numbers only at the end of tokenize(). This does make the code a lot clearer.	2017-01-11 19:04:19 -05:00
Robyn Speer	93306e55a0	Update README with new examples and URL	2017-01-09 15:13:19 -05:00
Robyn Speer	9a6beb0089	test that number-smashing still happens in freq lookups	2017-01-06 19:20:41 -05:00
Robyn Speer	573ecc53d0	Don't smash numbers in all tokenization, just when looking up freqs. I forgot momentarily that the output of the tokenizer is used by other code.	2017-01-06 19:18:52 -05:00
Robyn Speer	3cb3c38f47	update the README, citing OpenSubtitles 2016	2017-01-06 19:04:40 -05:00
Robyn Speer	86f22e8523	Mention that multi-digit numbers are combined together	2017-01-05 19:24:28 -05:00
Robyn Speer	48a5967e9a	mention tokenization change in changelog	2017-01-05 19:19:31 -05:00
Robyn Speer	39e459ac71	Update documentation and bump version to 1.6	2017-01-05 19:18:06 -05:00
Robyn Speer	23c7c8e936	update data from Exquisite Corpus in English and Swedish	2017-01-05 19:17:51 -05:00
Robyn Speer	7dc3f03ebd	import new wordlists from Exquisite Corpus	2017-01-05 17:59:26 -05:00
Robyn Speer	de32a15b4f	Merge branch 'transliterate-serbian' into all-1.6-changes	2017-01-05 17:57:52 -05:00
Robyn Speer	d66d04210f	transliterate: organize the 'borrowed letters' better	2017-01-05 13:23:20 -05:00
Robyn Speer	87b03325db	transliterate: Handle unexpected Russian invasions	2017-01-04 18:51:00 -05:00
Robyn Speer	c27e7f9b76	remove wordfreq_builder (obsoleted by exquisite-corpus)	2017-01-04 17:45:53 -05:00
Robyn Speer	6211b35fb3	Add transliteration of Cyrillic Serbian	2016-12-29 18:27:17 -05:00
Robyn Speer	0aa7ad46ae	fixes to tokenization	2016-12-13 14:43:29 -05:00
Robyn Speer	d6d528de74	Replace multi-digit sequences with zeroes	2016-12-09 15:55:08 -05:00
Robyn Speer	a8e2fa5acf	add a test for "aujourd'hui"	2016-12-06 17:39:40 -05:00
Robyn Speer	21a78f5eb9	Bake the 'h special case into the regex. This lets me remove the French-specific code I just put in.	2016-12-06 17:37:35 -05:00
Robyn Speer	82eba05f2d	eh, this is still version 1.5.2, not 1.6	2016-12-05 18:58:33 -05:00
Robyn Speer	4376636316	add a specific test in Catalan	2016-12-05 18:54:51 -05:00
Robyn Speer	ff5a8f2a65	add tests for French apostrophe tokenization	2016-12-05 18:54:51 -05:00
Robyn Speer	596368ac6e	fix tokenization of words like "l'heure"	2016-12-05 18:54:51 -05:00
Lance Nathan	7f26270644	Merge pull request #45 from LuminosoInsight/citation: Describe how to cite wordfreq	2016-09-12 18:34:55 -04:00
Robyn Speer	7fabbfef31	Describe how to cite wordfreq. This citation was generated from our GitHub repository by Zenodo. Their defaults indicate that anyone who's ever accepted a PR for the code should go on the author line, and that sounds fine to me.	2016-09-12 18:24:55 -04:00
Robyn Speer	c0fbd844f6	Add a changelog	2016-08-22 12:41:39 -04:00
Andrew Lin	976c8df0fd	Merge pull request #44 from LuminosoInsight/mecab-loading-fix: Allow MeCab to work in Japanese or Korean without the other	2016-08-19 11:59:44 -04:00
Robyn Speer	aa880bcd84	bump version to 1.5.1	2016-08-19 11:42:29 -04:00
Robyn Speer	e1d6e7d96f	Allow MeCab to work in Japanese or Korean without the other	2016-08-19 11:41:35 -04:00
Andrew Lin	e4b32afa18	Merge pull request #42 from LuminosoInsight/mecab-finder: Look for MeCab dictionaries in various places besides this package Former-commit-id: 6f97d5ac099e9bb1a13ccfe114e0f4a79fb9b628	2016-08-08 16:00:39 -04:00
Robyn Speer	88c93f6204	Remove unnecessary variable from make_mecab_analyzer Former-commit-id: `548162c563`	2016-08-04 15:17:02 -04:00
Robyn Speer	6440d81676	consolidate logic about MeCab path length Former-commit-id: `2b984937be`	2016-08-04 15:16:20 -04:00
Robyn Speer	c11998e506	Getting a newer mecab-ko-dic changed the Korean frequencies Former-commit-id: `894a96ba7e`	2016-08-02 16:10:41 -04:00
Robyn Speer	bc1cfc35c8	update find_mecab_dictionary docstring Former-commit-id: `8a5d1b298d`	2016-08-02 12:53:46 -04:00
Robyn Speer	9e55f8fed1	remove my ad-hoc names for dictionary packages Former-commit-id: `3dffb18557`	2016-08-01 17:39:35 -04:00
Robyn Speer	2787bfd647	stop including MeCab dictionaries in the package Former-commit-id: `b3dd8479ab`	2016-08-01 17:37:41 -04:00
Robyn Speer	875dd5669f	fix MeCab error message Former-commit-id: `fcf2445c3e`	2016-07-29 17:30:02 -04:00
Robyn Speer	94712c8312	Look for MeCab dictionaries in various places besides this package Former-commit-id: `afe6537994`	2016-07-29 17:27:15 -04:00
Robyn Speer	ce5a91d732	Make the almost-median deterministic when it rounds down to 0 Former-commit-id: `74892a0ac9`	2016-07-29 12:34:56 -04:00


551 Commits