Commit Graph

57 Commits

Author SHA1 Message Date
Robyn Speer
00e60df106 Merge branch 'master' into data-update-2.5 2021-03-29 16:42:24 -04:00
Robyn Speer
fc5c4cdda8 small documentation fixes 2021-03-29 16:41:47 -04:00
Robyn Speer
ec48c0a123 update data and tests for 2.5 2021-03-29 16:18:08 -04:00
Robyn Speer
168bb2a6ed fix version, update instructions and changelog 2021-02-18 18:25:16 -05:00
Robyn Speer
de636a804e Use Python packages to find dictionaries for MeCab 2021-02-18 18:18:06 -05:00
Robyn Speer
ad3a5c533f work with langcodes 3.0, without language_data 2021-02-09 17:27:22 -05:00
Robyn Speer
c8229a5378 update the changelog 2020-10-01 16:12:41 -04:00
Robyn Speer
0ff812a711 update version and changelog 2020-04-28 15:24:24 -04:00
Robyn Speer
26b4175f3b packaging fix: require msgpack >= 1.0 2020-04-22 11:10:03 -04:00
Robyn Speer
bf795e6d6c use langcodes 2.0 and deprecate 'match_cutoff' 2020-04-16 14:09:30 -04:00
Lance Nathan
45a002c1e1 Fix code affected by a breaking change in msgpack 1.0
The msgpack readme explains: "Default value of strict_map_key is changed to
True to avoid hashdos. You need to pass strict_map_key=False if you have data
which contain map keys which type is not bytes or str."

chinese.py loads SIMPLIFIED_MAP from disk.  Since it is a str.translate
dictionary, its keys are integers (Unicode codepoints).  And since it's a
dictionary we created ourselves, there's no hashdos concern, so we can load
it with strict_map_key=False.
2020-02-28 13:02:45 -05:00
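
(For illustration: a minimal sketch of the fix this commit describes. The helper name and file path are hypothetical; only the strict_map_key=False argument comes from the commit message itself.)

```python
import msgpack

# Hypothetical loader. SIMPLIFIED_MAP is a str.translate table, so its
# keys are integer codepoints; msgpack >= 1.0 rejects non-str/bytes map
# keys unless strict_map_key=False. The file is built by this project
# itself, so disabling the hashdos guard is safe here.
def load_simplified_map(path):
    with open(path, 'rb') as f:
        return msgpack.unpackb(f.read(), strict_map_key=False)
```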
Robyn Speer
d30183a7d7 Allow a wider range of 'regex' versions
The behavior of segmentation shouldn't change within this range, and it
includes the version currently used by spaCy.
2018-10-25 11:07:55 -04:00
Robyn Speer
563e8f7444 Update my name and the Zenodo citation 2018-10-03 17:27:10 -04:00
Robyn Speer
f73406c69a Update README to describe @ tokenization 2018-07-23 11:21:44 -04:00
Robyn Speer
4b7e3d9655 bump version to 2.1; add test requirement for pytest 2018-06-12 17:48:24 -04:00
Robyn Speer
8907423147 Packaging updates for the new PyPI
I _almost_ got the description and long_description right for 2.0.1. I
even checked it on the test server. But I didn't notice that I was
handling the first line of README.md specially, and ended up setting the
project description to "wordfreq is a Python library for looking up the
frequencies of words in many".

It'll be right in the next version.
2018-05-01 17:16:53 -04:00
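
(For illustration: a sketch of the packaging pattern that avoids the truncation described above. The names and version number are hypothetical, not the project's actual setup.py.)

```python
from setuptools import setup

# Hypothetical: pass the whole README as long_description and keep a
# hand-written one-line summary as description, instead of slicing the
# first line out of README.md.
with open('README.md', encoding='utf-8') as f:
    readme = f.read()

setup(
    name='wordfreq',
    version='2.0.2',  # hypothetical next version
    description='A library for looking up the frequencies of words '
                'in many languages.',
    long_description=readme,
    long_description_content_type='text/markdown',
)
```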
Robyn Speer
666f7e51fa Handle Japanese edge cases in simple_tokenize 2018-04-26 15:53:07 -04:00
Robyn Speer
a4d9614e39 setup: update version number and dependencies 2018-03-08 16:26:24 -05:00
Robyn Speer
208559ae1e bump version to 1.7.0, belatedly 2018-02-28 15:15:47 -05:00
Robyn Speer
98cb47c774 update msgpack-python dependency to msgpack 2018-02-28 15:14:51 -05:00
Robyn Speer
9dac967ca3 Tokenize by graphemes, not codepoints (#50)
* Tokenize by graphemes, not codepoints

* Add more documentation to TOKEN_RE

* Remove extra line break

* Update docstring - Brahmic scripts are no longer an exception

* approve using version 2017.07.28 of regex
2017-08-08 11:35:28 -04:00
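
(For illustration: a standalone example of the grapheme-vs-codepoint distinction this PR addresses, using the regex module's \X. This is not the actual TOKEN_RE.)

```python
import re
import regex
import unicodedata

# 'ngá' in NFD form stores the accented vowel as a base letter plus a
# combining mark: four codepoints, but three user-perceived characters.
word = unicodedata.normalize('NFD', 'ngá')
print(re.findall(r'.', word))      # 4 matches: the combining mark splits off
print(regex.findall(r'\X', word))  # 3 matches: \X spans grapheme clusters
```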
Robyn Speer
aa3ed23282 v1.6.1: depend on langcodes 1.4 2017-05-10 13:26:23 -04:00
Robyn Speer
39e459ac71 Update documentation and bump version to 1.6 2017-01-05 19:18:06 -05:00
Robyn Speer
82eba05f2d eh, this is still version 1.5.2, not 1.6 2016-12-05 18:58:33 -05:00
Robyn Speer
596368ac6e fix tokenization of words like "l'heure" 2016-12-05 18:54:51 -05:00
Robyn Speer
aa880bcd84 bump version to 1.5.1 2016-08-19 11:42:29 -04:00
Robyn Speer
2a41d4dc5e Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5.  Things done in this update include:

* include Common Crawl; support 11 more languages

* new frequency-merging strategy

* New sources: Chinese from Wikipedia (mostly Traditional characters), and a big Dutch wordlist

* Remove some low-quality sources, namely Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is dropped as an available language, and Greek no longer qualifies as a 'large' language.

* Add Korean tokenization, and include MeCab files in data

* Remove marks from more languages

* Deal with commas and cedillas in Turkish and Romanian
2016-07-28 19:23:17 -04:00
Robyn Speer
0a2bfb2710 Tokenization in Korean, plus abjad languages (#38)
* Remove marks from more languages

* Add Korean tokenization, and include MeCab files in data

* add a Hebrew tokenization test

* fix terminology in docstrings about abjad scripts

* combine Japanese and Korean tokenization into the same function
2016-07-15 15:10:25 -04:00
Robyn Speer
3155cf27e6 Fix tokenization of SE Asian and South Asian scripts (#37)
2016-07-01 18:00:57 -04:00
Robyn Speer
460fbb84fd bump version to 1.4
2016-03-24 16:29:29 -04:00
slibs63
927d4f45a4 Merge pull request #30 from LuminosoInsight/add-reddit
Add English data from Reddit corpus
2016-01-14 15:52:39 -05:00
Sara Jewett
42d209cbe2 Specify encoding when dealing with files
2015-12-23 15:49:13 -05:00
Robyn Speer
23949a4512 rebuild data files
2015-11-30 17:06:39 -05:00
Robyn Speer
a4554fb87c tokenize Chinese using jieba and our own frequencies
2015-09-05 03:16:56 -04:00
Robyn Speer
6f10e71d29 bump to version 1.1
2015-08-25 17:44:52 -04:00
Robyn Speer
8795525372 Use the regex implementation of Unicode segmentation
2015-08-24 17:11:08 -04:00
Robyn Speer
3ff0f30218 put back the freqs_to_cBpack cutoff; prepare for 1.0
2015-07-28 18:01:12 -04:00
Robyn Speer
090cfa7088 declare 'mecab' as an extra
2015-07-02 17:11:51 -04:00
Robyn Speer
83939020d0 declare that tests require mecab-python3
2015-07-02 11:29:11 -04:00
Robyn Speer
215eafc50b add Twitter-specific wordlists
2015-07-01 17:49:33 -04:00
Robyn Speer
4c2b766f46 bump version number
2015-06-30 14:54:13 -04:00
Robyn Speer
2dc3d82a98 clearer error on py2
2015-05-28 14:05:11 -04:00
Robyn Speer
a3cc8d403c add installation instructions to the readme
2015-05-28 14:02:12 -04:00
Robyn Speer
7c6cf84749 update README, another setup fix
2015-05-13 04:09:34 -04:00
Robyn Speer
c1edefa419 update dependencies
2015-05-12 12:30:01 -04:00
Robyn Speer
fd4df8d1eb restore missing line in setup.py
2015-05-12 12:24:18 -04:00
Robyn Speer
aa0e844b81 add new data files from wordfreq_builder
2015-05-11 18:45:47 -04:00
Robyn Speer
f92598b13d WIP: burn stuff down
2015-05-08 15:28:52 -04:00
Robyn Speer
cb6b2a8002 v0.7: make a proper Dutch 'surfaces' list
2015-04-30 13:01:24 -04:00
Robyn Speer
351378e318 Don't download the DB if the right version is already there
2013-10-31 14:12:04 -04:00