wordfreq

mirror of https://github.com/rspeer/wordfreq.git synced 2024-12-23 09:21:37 +00:00

Author	SHA1	Message	Date
Robyn Speer	a3834180c9	update changelog for v2.2.1	2019-02-05 15:58:10 -05:00
Lance Nathan	96b9808550	Merge pull request #66 from LuminosoInsight/update-msgpack-call Update msgpack parameter	2019-02-05 11:17:07 -05:00
Robyn Speer	dd72051929	update msgpack call in scripts/make_chinese_mapping	2019-02-05 11:16:22 -05:00
Robyn Speer	61a1604b38	update encoding='utf-8' to raw=False	2019-02-04 14:57:38 -05:00
Moss Collum	65a6a89993	Add Jenkinsfile to drive internal build scripts	2019-02-01 19:05:35 -05:00
Robyn Speer	d30183a7d7	Allow a wider range of 'regex' versions The behavior of segmentation shouldn't change within this range, and it includes the version currently used by SpaCy.	2018-10-25 11:07:55 -04:00
Lance Nathan	c1fe37bab5	Merge pull request #62 from LuminosoInsight/name-update Update my name and the Zenodo citation	2018-10-03 17:30:47 -04:00
Robyn Speer	563e8f7444	Update my name and the Zenodo citation	2018-10-03 17:27:10 -04:00
Lance Nathan	2f8600e975	Merge pull request #60 from LuminosoInsight/gender-neutral-at Recognize "@" in gender-neutral word endings as part of the token	2018-07-24 18:16:31 -04:00
Robyn Speer	287df17a71	update the changelog for version 2.2	2018-07-23 16:38:39 -04:00
Robyn Speer	f73406c69a	Update README to describe @ tokenization	2018-07-23 11:21:44 -04:00
Robyn Speer	86b928f967	include data from xc rebuild	2018-07-15 01:01:35 -04:00
Robyn Speer	65692c3d81	Recognize "@" in gender-neutral word endings as part of the token	2018-07-03 13:22:56 -04:00
Robyn Speer	7bf69595bb	update the CHANGELOG for MeCab fix	2018-06-26 11:31:03 -04:00
Lance Nathan	0149e9ec7f	Merge pull request #59 from LuminosoInsight/korean-install-fixes Korean install fixes	2018-06-26 11:08:06 -04:00
Lance Nathan	79caa526c3	Merge pull request #58 from LuminosoInsight/significant-figures Round wordfreq output to 3 sig. figs, and update documentation	2018-06-25 18:53:39 -04:00
Robyn Speer	830157d8e4	Fix instructions and search path for mecab-ko-dic I'm starting a new Python environment on a new Ubuntu installation. You never know when a huge yak will show up and demand to be shaved. I tried following the directions in the README, and found that a couple of steps were missing. I've added those. When you follow those steps, it appears to install the MeCab Korean dictionary in `/usr/lib/x86_64-linux-gnu/mecab/dic`, which was none of the paths we were checking, so I've added that as a search path.	2018-06-21 15:56:54 -04:00
Robyn Speer	fdf064b234	doctest the README	2018-06-18 17:11:42 -04:00
Robyn Speer	c6552f923f	update README and CHANGELOG	2018-06-18 15:21:43 -04:00
Robyn Speer	7a32b56c1c	Round frequencies to 3 significant digits	2018-06-18 15:21:33 -04:00
Lance Nathan	a95b360563	Merge pull request #57 from LuminosoInsight/version2.1 Version 2.1	2018-06-18 12:06:47 -04:00
Robyn Speer	39a1308770	update table in README: Dutch has 5 sources	2018-06-18 11:43:52 -04:00
Robyn Speer	0280f82496	fix typo in previous changelog entry	2018-06-18 10:52:28 -04:00
Robyn Speer	42efcfc1ad	relax the test that assumed the Chinese list has few ASCII words	2018-06-15 16:29:15 -04:00
Robyn Speer	ad0f046f47	fixes to tests, including that 'test.py' wasn't found by pytest	2018-06-15 15:48:41 -04:00
Robyn Speer	a975bcedae	update tests to include new languages Also, it's easy to say `>=` in pytest	2018-06-12 17:55:44 -04:00
Robyn Speer	4b7e3d9655	bump version to 2.1; add test requirement for pytest	2018-06-12 17:48:24 -04:00
Robyn Speer	3259c4a375	Merge remote-tracking branch 'origin/pytest' into version2.1	2018-06-12 17:46:48 -04:00
Robyn Speer	d5f7335d90	New data import from exquisite-corpus Significant changes in this data include: - Added ParaCrawl, a multilingual Web crawl, as a data source. This supplements the Leeds Web crawl with more modern data. ParaCrawl seems to provide a more balanced sample of Web pages than Common Crawl, which we once considered adding, but found that its data heavily overrepresented TripAdvisor and Urban Dictionary in a way that was very apparent in the word frequencies. ParaCrawl has a fairly subtle impact on the top terms, mostly boosting the frequencies of numbers and months. - Fixes to inconsistencies where words from different sources were going through different processing steps. As a result of these inconsistencies, some word lists contained words that couldn't actually be looked up because they would be normalized to something else. All words should now go through the aggressive normalization of `lossy_tokenize`. - Fixes to inconsistencies regarding what counts as a word. Non-punctuation, non-emoji symbols such as `=` were slipping through in some cases but not others. - As a result of the new data, Latvian becomes a supported language and Czech gets promoted to a 'large' language.	2018-06-12 17:22:43 -04:00
Robyn Speer	b3c42be331	port remaining tests to pytest	2018-06-01 16:40:51 -04:00
Robyn Speer	75b4d62084	port test.py and test_chinese.py to pytest	2018-06-01 16:33:06 -04:00
Robyn Speer	6235d88869	Use data from fixed XC build - mostly changes Chinese	2018-05-30 13:09:20 -04:00
Robyn Speer	5762508e7c	commit new data files (Italian changed for some reason)	2018-05-29 17:36:48 -04:00
Robyn Speer	e4cb9a23b6	update data to include xc's processing of ParaCrawl	2018-05-25 16:12:35 -04:00
Robyn Speer	8907423147	Packaging updates for the new PyPI I _almost_ got the description and long_description right for 2.0.1. I even checked it on the test server. But I didn't notice that I was handling the first line of README.md specially, and ended up setting the project description to "wordfreq is a Python library for looking up the frequencies of words in many". It'll be right in the next version.	2018-05-01 17:16:53 -04:00
Lance Nathan	316670a234	Merge pull request #56 from LuminosoInsight/japanese-edge-cases Handle Japanese edge cases in `simple_tokenize`	2018-05-01 14:57:45 -04:00
Robyn Speer	e0da20b0c4	update CHANGELOG for 2.0.1	2018-05-01 14:47:55 -04:00
Robyn Speer	666f7e51fa	Handle Japanese edge cases in simple_tokenize	2018-04-26 15:53:07 -04:00
Lance Nathan	18f176dbf6	Merge pull request #55 from LuminosoInsight/version2 Version 2, with standalone text pre-processing	2018-03-15 14:26:49 -04:00
Robyn Speer	d9bc4af8cd	update the changelog	2018-03-14 17:56:29 -04:00
Robyn Speer	b2663272a7	remove LAUGHTER_WORDS, which is now unused This was a fun Twitter test, but we don't do that anymore	2018-03-14 17:33:35 -04:00
Robyn Speer	65811d587e	More explicit error message for a missing wordlist	2018-03-14 15:10:27 -04:00
Robyn Speer	2ecf31ee81	Actually use `min_score` in `_language_in_list` We don't need to set it to any value but 80 now, but we will need to if we try to distinguish three kinds of Chinese (zh-Hans, zh-Hant, and unified zh-Hani).	2018-03-14 15:08:52 -04:00
Robyn Speer	c57032d5cb	code review fixes to wordfreq.tokens	2018-03-14 15:07:45 -04:00
Robyn Speer	de81a23b9d	code review fixes to __init__	2018-03-14 15:04:59 -04:00
Robyn Speer	8656688b0b	fix mention of dependencies in README	2018-03-14 15:01:08 -04:00
Robyn Speer	d68d4baad2	Subtle changes to CJK frequencies This is the result of re-running exquisite-corpus via wordfreq 2. The frequencies for most languages were identical. Small changes that move words by a few places in the list appeared in Chinese, Japanese, and Korean. There are also even smaller changes in Bengali and Hindi. The source of the CJK change is that Roman letters are case-folded _before_ Jieba or MeCab tokenization, which changes their output in a few cases. In Hindi, one word changed frequency in the top 500. In Bengali, none of those words changed frequency, but the data file is still different. I'm not sure I have such a solid explanation here, except that these languages use the regex tokenizer, and we just updated the regex dependency, which could affect some edge cases of these languages.	2018-03-14 11:36:02 -04:00
Robyn Speer	0cb36aa74f	cache the language info (avoids 10x slowdown)	2018-03-09 14:54:03 -05:00
Robyn Speer	b162de353d	avoid log spam: only warn about an unsupported language once	2018-03-09 11:50:15 -05:00
Robyn Speer	c5f64a5de8	update the README	2018-03-08 18:16:15 -05:00

617 Commits
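
Several commits above (7a32b56c1c, "Round frequencies to 3 significant digits", and the pull request merged in 79caa526c3) describe rounding wordfreq output to three significant figures. The log does not show how the library does this, so the following is only a minimal sketch of the general technique. The function name `round_sig` and its signature are hypothetical and are not taken from wordfreq.

```python
import math


def round_sig(value: float, digits: int = 3) -> float:
    """Round `value` to `digits` significant figures.

    Hypothetical helper for illustration; not wordfreq's actual code.
    """
    if value == 0:
        # log10(0) is undefined, so handle zero separately.
        return 0.0
    # Position of the leading digit, e.g. -4 for 0.000123456.
    exponent = math.floor(math.log10(abs(value)))
    # Keep `digits` significant figures, counted from the leading digit.
    return round(value, digits - 1 - exponent)


if __name__ == "__main__":
    print(round_sig(0.000123456))
    print(round_sig(1234.5))
```

Rounding relative to the leading digit, rather than to a fixed number of decimal places, keeps the same relative precision for very rare and very common words.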