7 Commits

Author SHA1 Message Date
Joshua Adelman
dab4c8da2a Include license file in source distribution
2021-10-19 15:30:59 -04:00
Rob Speer
a0893af82e Tokenization in Korean, plus abjad languages (#38)
* Remove marks from more languages

* Add Korean tokenization, and include MeCab files in data

* add a Hebrew tokenization test

* fix terminology in docstrings about abjad scripts

* combine Japanese and Korean tokenization into the same function


Former-commit-id: fec6eddcc3
2016-07-15 15:10:25 -04:00
Andrew Lin
c53bb06988 Revert "Remove the no-longer-existent .txt files from the MANIFEST."
This reverts commit 65d6645e81 [formerly db41bc7902].


Former-commit-id: cd0797e1c8
2015-09-24 13:31:34 -04:00
Andrew Lin
65d6645e81 Remove the no-longer-existent .txt files from the MANIFEST.
Former-commit-id: db41bc7902
2015-09-02 14:27:15 -04:00
Joshua Chin
b510e4144d removes combining marks from arabic words instead of treating them as punctuation
Former-commit-id: cebca52ea3
2015-06-25 12:36:41 -04:00
Joshua Chin
0a30164358 added non_punct to MANIFEST.in and moved it into data
Former-commit-id: b198f4b0c2
2015-06-24 17:30:01 -04:00
Rob Speer
1c65cb9f14 add new data files from wordfreq_builder
Former-commit-id: 35aec061de
2015-05-11 18:45:47 -04:00