Commit Graph

622 Commits

Author SHA1 Message Date
Robyn Speer
bb1bd50c44 ignore the 'scripts' dir when collecting tests 2019-02-20 17:21:07 -05:00
Moss Collum
a17587dcbb Merge pull request #69 from LuminosoInsight/revert-68-pytest-jenkins
Revert "Build with Pytest on Jenkins"
2019-02-13 18:11:57 -05:00
Moss Collum
26cbb5a7c8 Revert "Build with Pytest on Jenkins" 2019-02-13 18:11:44 -05:00
Lance Nathan
53ec5d87d2 Merge pull request #68 from LuminosoInsight/pytest-jenkins
Build with Pytest on Jenkins
2019-02-13 17:57:16 -05:00
Moss Collum
92c3ca0a66 Build with Pytest on Jenkins 2019-02-13 17:56:20 -05:00
Robyn Speer
0931f1297d update changelog for v2.2.1 2019-02-05 15:58:10 -05:00
Lance Nathan
1442ee044d Merge pull request #66 from LuminosoInsight/update-msgpack-call
Update msgpack parameter
2019-02-05 11:17:07 -05:00
Robyn Speer
36fd42ca08 update msgpack call in scripts/make_chinese_mapping 2019-02-05 11:16:22 -05:00
Robyn Speer
c7a14cd4ab update encoding='utf-8' to raw=False 2019-02-04 14:57:38 -05:00
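The `encoding='utf-8'` → `raw=False` change tracks the msgpack-python API migration, in which the `encoding` parameter of `unpackb` was deprecated and then removed in favor of `raw`. A minimal sketch of the equivalent calls (the record below is made up, not wordfreq's data):

```python
import msgpack

# Round-trip a small record; use_bin_type=True writes str values as UTF-8 text.
packed = msgpack.packb({"word": "café", "count": 3}, use_bin_type=True)

# Deprecated style: msgpack.unpackb(packed, encoding='utf-8')
# Current style: raw=False decodes msgpack strings to str instead of bytes.
unpacked = msgpack.unpackb(packed, raw=False)
# unpacked == {"word": "café", "count": 3}
```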
Moss Collum
0b69118558 Add Jenkinsfile to drive internal build scripts 2019-02-01 19:05:35 -05:00
Robyn Speer
4cd7b4bada Allow a wider range of 'regex' versions
The behavior of segmentation shouldn't change within this range, and it
includes the version currently used by SpaCy.
2018-10-25 11:07:55 -04:00
Lance Nathan
fa8be1962b Merge pull request #62 from LuminosoInsight/name-update
Update my name and the Zenodo citation
2018-10-03 17:30:47 -04:00
Robyn Speer
51ca052b62 Update my name and the Zenodo citation 2018-10-03 17:27:10 -04:00
Lance Nathan
bc12599010 Merge pull request #60 from LuminosoInsight/gender-neutral-at
Recognize "@" in gender-neutral word endings as part of the token
2018-07-24 18:16:31 -04:00
Rob Speer
d9fc6ec42c update the changelog for version 2.2 2018-07-23 16:38:39 -04:00
Rob Speer
0644c8920a Update README to describe @ tokenization 2018-07-23 11:21:44 -04:00
Rob Speer
d06a6a48c5 include data from xc rebuild 2018-07-15 01:01:35 -04:00
Rob Speer
b2d242e8bf Recognize "@" in gender-neutral word endings as part of the token 2018-07-03 13:22:56 -04:00
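A rough sketch of the idea (a simplified pattern, not wordfreq's actual tokenizer regex): allow `@` inside a word ending, so Spanish and Portuguese gender-neutral forms stay in one token.

```python
import re

# Hypothetical, simplified pattern: a run of word characters, optionally
# followed by '@' plus more word characters (e.g. "amig@s", "l@s").
TOKEN_RE = re.compile(r"\w+(?:@\w*)?")

print(TOKEN_RE.findall("tod@s l@s amig@s"))  # ['tod@s', 'l@s', 'amig@s']
```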
Rob Speer
ca9cf7d90f update the CHANGELOG for MeCab fix 2018-06-26 11:31:03 -04:00
Lance Nathan
3961a28973 Merge pull request #59 from LuminosoInsight/korean-install-fixes
Korean install fixes
2018-06-26 11:08:06 -04:00
Lance Nathan
a619ba6457 Merge pull request #58 from LuminosoInsight/significant-figures
Round wordfreq output to 3 sig. figs, and update documentation
2018-06-25 18:53:39 -04:00
Rob Speer
676686fda1 Fix instructions and search path for mecab-ko-dic
I'm starting a new Python environment on a new Ubuntu installation. You
never know when a huge yak will show up and demand to be shaved.

I tried following the directions in the README, and found that a couple
of steps were missing. I've added those.

When you follow those steps, it appears to install the MeCab Korean
dictionary in `/usr/lib/x86_64-linux-gnu/mecab/dic`, which was not one
of the paths we were checking, so I've added that as a search path.
2018-06-21 15:56:54 -04:00
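The search-path fix can be sketched like this (the candidate list and function name are illustrative, not wordfreq's exact code):

```python
import os

# Candidate locations for mecab-ko-dic; the x86_64-linux-gnu path is the
# one this commit adds for newer Ubuntu installs.
CANDIDATE_DIC_PATHS = [
    "/usr/lib/x86_64-linux-gnu/mecab/dic",
    "/usr/lib/mecab/dic",
    "/usr/local/lib/mecab/dic",
]

def find_mecab_dictionary(candidates=CANDIDATE_DIC_PATHS):
    """Return the first existing dictionary directory, or None."""
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None
```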
Rob Speer
5e05c942ac doctest the README 2018-06-18 17:11:42 -04:00
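"Doctest the README" means running the interactive examples embedded in the Markdown file as tests, which the standard library's `doctest.testfile` supports directly. A self-contained sketch, using a throwaway file in place of the real README:

```python
import doctest
import os
import tempfile

# A stand-in for README.md containing one interactive example.
readme_text = ">>> 2 + 2\n4\n"

with tempfile.NamedTemporaryFile("w", suffix=".md", delete=False) as f:
    f.write(readme_text)
    path = f.name

# module_relative=False lets testfile accept an absolute path.
results = doctest.testfile(path, module_relative=False)
os.unlink(path)
# results.failed == 0 when every example's output matches
```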
Rob Speer
1dc763c9c5 update README and CHANGELOG 2018-06-18 15:21:43 -04:00
Rob Speer
c3b32b3c4a Round frequencies to 3 significant digits 2018-06-18 15:21:33 -04:00
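Rounding to three significant digits (as opposed to three decimal places) can be done with a log10-based helper; this is an illustrative sketch, not wordfreq's exact implementation:

```python
import math

def round_to_sig_figs(x: float, figs: int = 3) -> float:
    """Round x to `figs` significant figures."""
    if x == 0:
        return 0.0
    # Number of decimal places needed so `figs` leading digits survive.
    decimals = figs - 1 - math.floor(math.log10(abs(x)))
    return round(x, decimals)

print(round_to_sig_figs(0.0123456))  # 0.0123
print(round_to_sig_figs(1234.5))    # 1230.0
```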
Lance Nathan
0911e90ba0 Merge pull request #57 from LuminosoInsight/version2.1
Version 2.1
2018-06-18 12:06:47 -04:00
Rob Speer
2b85a1cef2 update table in README: Dutch has 5 sources 2018-06-18 11:43:52 -04:00
Rob Speer
52aae3459d fix typo in previous changelog entry 2018-06-18 10:52:28 -04:00
Rob Speer
2f6b87c86b relax the test that assumed the Chinese list has few ASCII words 2018-06-15 16:29:15 -04:00
Rob Speer
57f676f4a6 fixes to tests, including that 'test.py' wasn't found by pytest 2018-06-15 15:48:41 -04:00
Rob Speer
93e3e03c60 update tests to include new languages
Also, it's easy to say `>=` in pytest
2018-06-12 17:55:44 -04:00
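The `>=` remark refers to pytest accepting plain assert statements with comparison operators, where nose-style tests needed named helpers. A hypothetical example (the list and count are made up):

```python
# Old style (nose): assert_greater_equal(len(supported), 5)
# pytest style: a bare assert, with introspected failure output.
def test_enough_languages():
    supported = ["en", "de", "fr", "cs", "lv"]  # made-up list
    assert len(supported) >= 5

test_enough_languages()
```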
Rob Speer
93ddc192d8 bump version to 2.1; add test requirement for pytest 2018-06-12 17:48:24 -04:00
Rob Speer
ff4f7bf3f6 Merge remote-tracking branch 'origin/pytest' into version2.1 2018-06-12 17:46:48 -04:00
Rob Speer
db43e0e25c New data import from exquisite-corpus
Significant changes in this data include:

- Added ParaCrawl, a multilingual Web crawl, as a data source.
  This supplements the Leeds Web crawl with more modern data.

  ParaCrawl seems to provide a more balanced sample of Web pages than
  Common Crawl. We once considered adding Common Crawl, but found that
  its data heavily overrepresented TripAdvisor and Urban Dictionary in
  a way that was very apparent in the word frequencies.

  ParaCrawl has a fairly subtle impact on the top terms, mostly boosting
  the frequencies of numbers and months.

- Fixes to inconsistencies where words from different sources were going
  through different processing steps. As a result of these
  inconsistencies, some word lists contained words that couldn't
  actually be looked up because they would be normalized to something
  else.

  All words should now go through the aggressive normalization of
  `lossy_tokenize`.

- Fixes to inconsistencies regarding what counts as a word.
  Non-punctuation, non-emoji symbols such as `=` were slipping through
  in some cases but not others.

- As a result of the new data, Latvian becomes a supported language and
  Czech gets promoted to a 'large' language.
2018-06-12 17:22:43 -04:00
Rob Speer
96a01b9685 port remaining tests to pytest 2018-06-01 16:40:51 -04:00
Rob Speer
863d5be522 port test.py and test_chinese.py to pytest 2018-06-01 16:33:06 -04:00
Rob Speer
8fcae9978e Use data from fixed XC build - mostly changes Chinese 2018-05-30 13:09:20 -04:00
Rob Speer
90b5246a48 commit new data files (Italian changed for some reason) 2018-05-29 17:36:48 -04:00
Rob Speer
cd434b2219 update data to include xc's processing of ParaCrawl 2018-05-25 16:12:35 -04:00
Rob Speer
aa91e1f291 Packaging updates for the new PyPI
I _almost_ got the description and long_description right for 2.0.1. I
even checked it on the test server. But I didn't notice that I was
handling the first line of README.md specially, and ended up setting the
project description to "wordfreq is a Python library for looking up the
frequencies of words in many".

It'll be right in the next version.
2018-05-01 17:16:53 -04:00
Lance Nathan
968bc3a85a Merge pull request #56 from LuminosoInsight/japanese-edge-cases
Handle Japanese edge cases in `simple_tokenize`
2018-05-01 14:57:45 -04:00
Rob Speer
0a95d96b20 update CHANGELOG for 2.0.1 2018-05-01 14:47:55 -04:00
Rob Speer
3ec92a8952 Handle Japanese edge cases in simple_tokenize 2018-04-26 15:53:07 -04:00
Lance Nathan
e3a1b470d9 Merge pull request #55 from LuminosoInsight/version2
Version 2, with standalone text pre-processing
2018-03-15 14:26:49 -04:00
Rob Speer
a759f38540 update the changelog 2018-03-14 17:56:29 -04:00
Rob Speer
6f1a9aaff1 remove LAUGHTER_WORDS, which is now unused
This was a fun Twitter test, but we don't do that anymore
2018-03-14 17:33:35 -04:00
Rob Speer
1a761199cd More explicit error message for a missing wordlist 2018-03-14 15:10:27 -04:00
Rob Speer
b2bdc8a854 Actually use min_score in _language_in_list
We don't need to set it to any value but 80 now, but we will need to if
we try to distinguish three kinds of Chinese (zh-Hans, zh-Hant, and
unified zh-Hani).
2018-03-14 15:08:52 -04:00
Rob Speer
bb2096ae04 code review fixes to wordfreq.tokens 2018-03-14 15:07:45 -04:00
Rob Speer
430fb01e53 code review fixes to __init__ 2018-03-14 15:04:59 -04:00