Robyn Speer
a4d9614e39
setup: update version number and dependencies
2018-03-08 16:26:24 -05:00
Robyn Speer
208559ae1e
bump version to 1.7.0, belatedly
2018-02-28 15:15:47 -05:00
Robyn Speer
98cb47c774
update msgpack-python dependency to msgpack
2018-02-28 15:14:51 -05:00
Robyn Speer
9dac967ca3
Tokenize by graphemes, not codepoints (#50)
* Tokenize by graphemes, not codepoints
* Add more documentation to TOKEN_RE
* Remove extra line break
* Update docstring - Brahmic scripts are no longer an exception
* approve using version 2017.07.28 of regex
2017-08-08 11:35:28 -04:00
Robyn Speer
aa3ed23282
v1.6.1: depend on langcodes 1.4
2017-05-10 13:26:23 -04:00
Robyn Speer
39e459ac71
Update documentation and bump version to 1.6
2017-01-05 19:18:06 -05:00
Robyn Speer
82eba05f2d
eh, this is still version 1.5.2, not 1.6
2016-12-05 18:58:33 -05:00
Robyn Speer
596368ac6e
fix tokenization of words like "l'heure"
2016-12-05 18:54:51 -05:00
Robyn Speer
aa880bcd84
bump version to 1.5.1
2016-08-19 11:42:29 -04:00
Robyn Speer
2a41d4dc5e
Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5. Things done in this update include:
* include Common Crawl; support 11 more languages
* new frequency-merging strategy
* New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list
* Remove low-quality sources, namely Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is no longer an available language, and Greek is no longer a 'large' language.
* Add Korean tokenization, and include MeCab files in data
* Remove marks from more languages
* Deal with commas and cedillas in Turkish and Romanian
Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
Robyn Speer
0a2bfb2710
Tokenization in Korean, plus abjad languages (#38)
* Remove marks from more languages
* Add Korean tokenization, and include MeCab files in data
* add a Hebrew tokenization test
* fix terminology in docstrings about abjad scripts
* combine Japanese and Korean tokenization into the same function
Former-commit-id: fec6eddcc3
2016-07-15 15:10:25 -04:00
Robyn Speer
3155cf27e6
Fix tokenization of SE Asian and South Asian scripts (#37)
Former-commit-id: 270f6c7ca6
2016-07-01 18:00:57 -04:00
Robyn Speer
460fbb84fd
bump version to 1.4
Former-commit-id: 1df97a579e
2016-03-24 16:29:29 -04:00
slibs63
927d4f45a4
Merge pull request #30 from LuminosoInsight/add-reddit
Add English data from Reddit corpus
Former-commit-id: d18fee3d78
2016-01-14 15:52:39 -05:00
Sara Jewett
42d209cbe2
Specify encoding when dealing with files
Former-commit-id: 37f9e12b93
2015-12-23 15:49:13 -05:00
Robyn Speer
23949a4512
rebuild data files
Former-commit-id: 2dcf368481
2015-11-30 17:06:39 -05:00
Robyn Speer
a4554fb87c
tokenize Chinese using jieba and our own frequencies
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Robyn Speer
6f10e71d29
bump to version 1.1
Former-commit-id: 694c28d5e4
2015-08-25 17:44:52 -04:00
Robyn Speer
8795525372
Use the regex implementation of Unicode segmentation
Former-commit-id: 95998205ad
2015-08-24 17:11:08 -04:00
Robyn Speer
3ff0f30218
put back the freqs_to_cBpack cutoff; prepare for 1.0
Former-commit-id: c5708b24e4
2015-07-28 18:01:12 -04:00
Robyn Speer
090cfa7088
declare 'mecab' as an extra
Former-commit-id: a69ea5ad52
2015-07-02 17:11:51 -04:00
Robyn Speer
83939020d0
declare that tests require mecab-python3
Former-commit-id: 7b4ebd1805
2015-07-02 11:29:11 -04:00
Robyn Speer
215eafc50b
add Twitter-specific wordlists
Former-commit-id: 7e3066d3fc
2015-07-01 17:49:33 -04:00
Robyn Speer
4c2b766f46
bump version number
Former-commit-id: 053f372ebc
2015-06-30 14:54:13 -04:00
Robyn Speer
2dc3d82a98
clearer error on py2
Former-commit-id: ed19d79c5a
2015-05-28 14:05:11 -04:00
Robyn Speer
a3cc8d403c
add installation instructions to the readme
Former-commit-id: 0f4ca80026
2015-05-28 14:02:12 -04:00
Robyn Speer
7c6cf84749
update README, another setup fix
Former-commit-id: dd41e61c57
2015-05-13 04:09:34 -04:00
Robyn Speer
c1edefa419
update dependencies
Former-commit-id: f13cca4d81
2015-05-12 12:30:01 -04:00
Robyn Speer
fd4df8d1eb
restore missing line in setup.py
Former-commit-id: bb18f741e2
2015-05-12 12:24:18 -04:00
Robyn Speer
aa0e844b81
add new data files from wordfreq_builder
Former-commit-id: 35aec061de
2015-05-11 18:45:47 -04:00
Robyn Speer
f92598b13d
WIP: burn stuff down
Former-commit-id: 9b63e54471
2015-05-08 15:28:52 -04:00
Robyn Speer
cb6b2a8002
v0.7: make a proper Dutch 'surfaces' list
Former-commit-id: 873ace87db
2015-04-30 13:01:24 -04:00
Robyn Speer
351378e318
Don't download the DB if the right version is already there
Former-commit-id: e931062b5a
2013-10-31 14:12:04 -04:00
Robyn Speer
16bc844841
try being really nonspecific about functools32 versions
Former-commit-id: c1564908f2
2013-10-31 14:06:06 -04:00
Robyn Speer
8690ac3f57
be less specific about the functools32 version
Former-commit-id: 2542cf9e35
2013-10-31 14:02:40 -04:00
Robyn Speer
8f00846117
Normalize words when storing them or looking them up.
2013-10-30 14:59:57 -04:00
Lance
74cfb69f5a
Another Py3 change, this one for functools32
2013-10-30 12:06:41 -04:00
Robyn Speer
a95d88d1b9
Implement the data uploady downloady stuff in setup.
2013-10-29 16:44:13 -04:00
Robyn Speer
36344d3737
prepare to write custom commands in setup.py
2013-10-29 12:43:41 -04:00
Robyn Speer
e8273e47a1
Initial version.
Noticeably missing: data files or any way to get them.
2013-10-28 19:26:44 -04:00