Rob Speer
a92c805a82
fix tokenization of words like "l'heure"
2016-12-05 18:54:51 -05:00
Rob Speer
fb5a55de7e
bump version to 1.5.1
2016-08-19 11:42:29 -04:00
Rob Speer
9758c69ff0
Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5. Things done in this update include:
* include Common Crawl; support 11 more languages
* new frequency-merging strategy
* New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list
* Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all.
* Add Korean tokenization, and include MeCab files in data
* Remove marks from more languages
* Deal with commas and cedillas in Turkish and Romanian
Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
Rob Speer
a0893af82e
Tokenization in Korean, plus abjad languages (#38)
* Remove marks from more languages
* Add Korean tokenization, and include MeCab files in data
* add a Hebrew tokenization test
* fix terminology in docstrings about abjad scripts
* combine Japanese and Korean tokenization into the same function
Former-commit-id: fec6eddcc3
2016-07-15 15:10:25 -04:00
Rob Speer
ac24b8eab4
Fix tokenization of SE Asian and South Asian scripts (#37)
Former-commit-id: 270f6c7ca6
2016-07-01 18:00:57 -04:00
Rob Speer
28028115c2
bump version to 1.4
Former-commit-id: 1df97a579e
2016-03-24 16:29:29 -04:00
slibs63
258f5088e9
Merge pull request #30 from LuminosoInsight/add-reddit
Add English data from Reddit corpus
Former-commit-id: d18fee3d78
2016-01-14 15:52:39 -05:00
Sara Jewett
7b6f88b059
Specify encoding when dealing with files
Former-commit-id: 37f9e12b93
2015-12-23 15:49:13 -05:00
Rob Speer
9a1b00ba0c
rebuild data files
Former-commit-id: 2dcf368481
2015-11-30 17:06:39 -05:00
Rob Speer
91cc82f76d
tokenize Chinese using jieba and our own frequencies
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Rob Speer
1f5c828642
bump to version 1.1
Former-commit-id: 694c28d5e4
2015-08-25 17:44:52 -04:00
Rob Speer
f4cf46ab9c
Use the regex implementation of Unicode segmentation
Former-commit-id: 95998205ad
2015-08-24 17:11:08 -04:00
Rob Speer
4350bc3ed7
put back the freqs_to_cBpack cutoff; prepare for 1.0
Former-commit-id: c5708b24e4
2015-07-28 18:01:12 -04:00
Rob Speer
19e74e91c6
declare 'mecab' as an extra
Former-commit-id: a69ea5ad52
2015-07-02 17:11:51 -04:00
Rob Speer
5d0d5f7cd2
declare that tests require mecab-python3
Former-commit-id: 7b4ebd1805
2015-07-02 11:29:11 -04:00
Rob Speer
66ad6f882e
add Twitter-specific wordlists
Former-commit-id: 7e3066d3fc
2015-07-01 17:49:33 -04:00
Rob Speer
c9c7e49465
bump version number
Former-commit-id: 053f372ebc
2015-06-30 14:54:13 -04:00
Rob Speer
9a46b80028
clearer error on py2
Former-commit-id: ed19d79c5a
2015-05-28 14:05:11 -04:00
Rob Speer
51f4e4c826
add installation instructions to the readme
Former-commit-id: 0f4ca80026
2015-05-28 14:02:12 -04:00
Rob Speer
c953fc1626
update README, another setup fix
Former-commit-id: dd41e61c57
2015-05-13 04:09:34 -04:00
Rob Speer
5cbc0d0f94
update dependencies
Former-commit-id: f13cca4d81
2015-05-12 12:30:01 -04:00
Rob Speer
6f61cac4cb
restore missing line in setup.py
Former-commit-id: bb18f741e2
2015-05-12 12:24:18 -04:00
Rob Speer
1c65cb9f14
add new data files from wordfreq_builder
Former-commit-id: 35aec061de
2015-05-11 18:45:47 -04:00
Rob Speer
9cd6f7c5c5
WIP: burn stuff down
Former-commit-id: 9b63e54471
2015-05-08 15:28:52 -04:00
Rob Speer
732c932ac7
v0.7: make a proper Dutch 'surfaces' list
Former-commit-id: 873ace87db
2015-04-30 13:01:24 -04:00
Rob Speer
63b465c767
Don't download the DB if the right version is already there
Former-commit-id: e931062b5a
2013-10-31 14:12:04 -04:00
Rob Speer
8c3e8f9eb4
try being really nonspecific about functools32 versions
Former-commit-id: c1564908f2
2013-10-31 14:06:06 -04:00
Rob Speer
676cba640f
be less specific about the functools32 version
Former-commit-id: 2542cf9e35
2013-10-31 14:02:40 -04:00
Rob Speer
40102a3f63
Normalize words when storing them or looking them up.
2013-10-30 14:59:57 -04:00
Lance
ce07c881c5
Another Py3 change, this one for functools32
2013-10-30 12:06:41 -04:00
Rob Speer
ca5b3e2f5d
Implement the data uploady downloady stuff in setup.
2013-10-29 16:44:13 -04:00
Rob Speer
bc00bb3a8b
prepare to write custom commands in setup.py
2013-10-29 12:43:41 -04:00
Rob Speer
709ca6be66
Initial version.
Noticeably missing: data files or any way to get them.
2013-10-28 19:26:44 -04:00