51 Commits

Author SHA1 Message Date
Robyn Speer
174ecf580a update dependencies and test for consistent results 2020-09-08 16:03:33 -04:00
Robyn Speer
becf94f767 update version and changelog 2020-04-28 15:24:24 -04:00
Robyn Speer
59f4a08920 packaging fix: require msgpack >= 1.0 2020-04-22 11:10:03 -04:00
Robyn Speer
3aeeeb64c7 use langcodes 2.0 and deprecate 'match_cutoff' 2020-04-16 14:09:30 -04:00
Lance Nathan
86e988b838 Fix code affected by a breaking change in msgpack 1.0
The msgpack readme explains: "Default value of strict_map_key is changed to
True to avoid hashdos. You need to pass strict_map_key=False if you have data
which contain map keys which type is not bytes or str."

chinese.py loads SIMPLIFIED_MAP from disk.  Since it is a str.translate
dictionary, its keys are numbers.  And since it's a dictionary we created
ourselves, there's no hashdos concern, so we can load it with
strict_map_key=False.
2020-02-28 13:02:45 -05:00
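
Note on the commit above: a minimal sketch of why a str.translate table trips msgpack's strict_map_key check. The table contents and the name `example_map` are illustrative assumptions, not the real SIMPLIFIED_MAP from chinese.py. The sketch uses only the standard library; the msgpack call itself appears only in a comment.

```python
# Hypothetical two-entry table; the real SIMPLIFIED_MAP is loaded from disk.
example_map = str.maketrans({"體": "体", "語": "语"})

# str.maketrans converts single-character keys to integer code points,
# so every key in a translate table is an int, not a str or bytes.
assert all(isinstance(key, int) for key in example_map)

# Applying the table replaces each mapped character.
simplified = "體語".translate(example_map)
print(simplified)

# Because the keys are ints, unpacking such a map with msgpack >= 1.0
# needs strict_map_key=False, e.g. (not executed here):
#   msgpack.unpackb(packed_bytes, strict_map_key=False)
```

Since the table is data the project generates itself, relaxing the key check here does not expose it to untrusted input, which is the reasoning the commit message gives.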
Robyn Speer
4cd7b4bada Allow a wider range of 'regex' versions
The behavior of segmentation shouldn't change within this range, and it
includes the version currently used by SpaCy.
2018-10-25 11:07:55 -04:00
Robyn Speer
51ca052b62 Update my name and the Zenodo citation 2018-10-03 17:27:10 -04:00
Rob Speer
0644c8920a Update README to describe @ tokenization 2018-07-23 11:21:44 -04:00
Rob Speer
93ddc192d8 bump version to 2.1; add test requirement for pytest 2018-06-12 17:48:24 -04:00
Rob Speer
aa91e1f291 Packaging updates for the new PyPI
I _almost_ got the description and long_description right for 2.0.1. I
even checked it on the test server. But I didn't notice that I was
handling the first line of README.md specially, and ended up setting the
project description to "wordfreq is a Python library for looking up the
frequencies of words in many".

It'll be right in the next version.
2018-05-01 17:16:53 -04:00
Rob Speer
3ec92a8952 Handle Japanese edge cases in simple_tokenize 2018-04-26 15:53:07 -04:00
Rob Speer
a42cf312ef setup: update version number and dependencies 2018-03-08 16:26:24 -05:00
Rob Speer
aadb19c9a3 bump version to 1.7.0, belatedly 2018-02-28 15:15:47 -05:00
Rob Speer
db56528fb6 update msgpack-python dependency to msgpack 2018-02-28 15:14:51 -05:00
Rob Speer
dcef5813b3 Tokenize by graphemes, not codepoints (#50)
* Tokenize by graphemes, not codepoints

* Add more documentation to TOKEN_RE

* Remove extra line break

* Update docstring - Brahmic scripts are no longer an exception

* approve using version 2017.07.28 of regex
2017-08-08 11:35:28 -04:00
Rob Speer
37b4914970 v1.6.1: depend on langcodes 1.4 2017-05-10 13:26:23 -04:00
Rob Speer
803ebc25bb Update documentation and bump version to 1.6 2017-01-05 19:18:06 -05:00
Rob Speer
752c90c8a5 eh, this is still version 1.5.2, not 1.6 2016-12-05 18:58:33 -05:00
Rob Speer
a92c805a82 fix tokenization of words like "l'heure" 2016-12-05 18:54:51 -05:00
Rob Speer
fb5a55de7e bump version to 1.5.1 2016-08-19 11:42:29 -04:00
Rob Speer
9758c69ff0 Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5.  Things done in this update include:

* include Common Crawl; support 11 more languages

* new frequency-merging strategy

* New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list

* Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This drops Ukrainian as an available language, and means Greek is not a 'large' language after all.

* Add Korean tokenization, and include MeCab files in data

* Remove marks from more languages

* Deal with commas and cedillas in Turkish and Romanian



Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
Rob Speer
a0893af82e Tokenization in Korean, plus abjad languages (#38)
* Remove marks from more languages

* Add Korean tokenization, and include MeCab files in data

* add a Hebrew tokenization test

* fix terminology in docstrings about abjad scripts

* combine Japanese and Korean tokenization into the same function


Former-commit-id: fec6eddcc3
2016-07-15 15:10:25 -04:00
Rob Speer
ac24b8eab4 Fix tokenization of SE Asian and South Asian scripts (#37)
Former-commit-id: 270f6c7ca6
2016-07-01 18:00:57 -04:00
Rob Speer
28028115c2 bump version to 1.4
Former-commit-id: 1df97a579e
2016-03-24 16:29:29 -04:00
slibs63
258f5088e9 Merge pull request #30 from LuminosoInsight/add-reddit
Add English data from Reddit corpus

Former-commit-id: d18fee3d78
2016-01-14 15:52:39 -05:00
Sara Jewett
7b6f88b059 Specify encoding when dealing with files
Former-commit-id: 37f9e12b93
2015-12-23 15:49:13 -05:00
Rob Speer
9a1b00ba0c rebuild data files
Former-commit-id: 2dcf368481
2015-11-30 17:06:39 -05:00
Rob Speer
91cc82f76d tokenize Chinese using jieba and our own frequencies
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Rob Speer
1f5c828642 bump to version 1.1
Former-commit-id: 694c28d5e4
2015-08-25 17:44:52 -04:00
Rob Speer
f4cf46ab9c Use the regex implementation of Unicode segmentation
Former-commit-id: 95998205ad
2015-08-24 17:11:08 -04:00
Rob Speer
4350bc3ed7 put back the freqs_to_cBpack cutoff; prepare for 1.0
Former-commit-id: c5708b24e4
2015-07-28 18:01:12 -04:00
Rob Speer
19e74e91c6 declare 'mecab' as an extra
Former-commit-id: a69ea5ad52
2015-07-02 17:11:51 -04:00
Rob Speer
5d0d5f7cd2 declare that tests require mecab-python3
Former-commit-id: 7b4ebd1805
2015-07-02 11:29:11 -04:00
Rob Speer
66ad6f882e add Twitter-specific wordlists
Former-commit-id: 7e3066d3fc
2015-07-01 17:49:33 -04:00
Rob Speer
c9c7e49465 bump version number
Former-commit-id: 053f372ebc
2015-06-30 14:54:13 -04:00
Rob Speer
9a46b80028 clearer error on py2
Former-commit-id: ed19d79c5a
2015-05-28 14:05:11 -04:00
Rob Speer
51f4e4c826 add installation instructions to the readme
Former-commit-id: 0f4ca80026
2015-05-28 14:02:12 -04:00
Rob Speer
c953fc1626 update README, another setup fix
Former-commit-id: dd41e61c57
2015-05-13 04:09:34 -04:00
Rob Speer
5cbc0d0f94 update dependencies
Former-commit-id: f13cca4d81
2015-05-12 12:30:01 -04:00
Rob Speer
6f61cac4cb restore missing line in setup.py
Former-commit-id: bb18f741e2
2015-05-12 12:24:18 -04:00
Rob Speer
1c65cb9f14 add new data files from wordfreq_builder
Former-commit-id: 35aec061de
2015-05-11 18:45:47 -04:00
Rob Speer
9cd6f7c5c5 WIP: burn stuff down
Former-commit-id: 9b63e54471
2015-05-08 15:28:52 -04:00
Rob Speer
732c932ac7 v0.7: make a proper Dutch 'surfaces' list
Former-commit-id: 873ace87db
2015-04-30 13:01:24 -04:00
Rob Speer
63b465c767 Don't download the DB if the right version is already there
Former-commit-id: e931062b5a
2013-10-31 14:12:04 -04:00
Rob Speer
8c3e8f9eb4 try being really nonspecific about functools32 versions
Former-commit-id: c1564908f2
2013-10-31 14:06:06 -04:00
Rob Speer
676cba640f be less specific about the functools32 version
Former-commit-id: 2542cf9e35
2013-10-31 14:02:40 -04:00
Rob Speer
40102a3f63 Normalize words when storing them or looking them up. 2013-10-30 14:59:57 -04:00
Lance
ce07c881c5 Another Py3 change, this one for functools32 2013-10-30 12:06:41 -04:00
Rob Speer
ca5b3e2f5d Implement the data uploady downloady stuff in setup. 2013-10-29 16:44:13 -04:00
Rob Speer
bc00bb3a8b prepare to write custom commands in setup.py 2013-10-29 12:43:41 -04:00