551 Commits

Author SHA1 Message Date
Robyn Speer
9dac967ca3 Tokenize by graphemes, not codepoints (#50)
* Tokenize by graphemes, not codepoints

* Add more documentation to TOKEN_RE

* Remove extra line break

* Update docstring - Brahmic scripts are no longer an exception

* approve using version 2017.07.28 of regex
2017-08-08 11:35:28 -04:00
Andrew Lin
6c118c0b6a Merge pull request #49 from LuminosoInsight/restore-langcodes
Use langcodes when tokenizing again
2017-05-10 16:20:06 -04:00
Robyn Speer
aa3ed23282 v1.6.1: depend on langcodes 1.4 2017-05-10 13:26:23 -04:00
Robyn Speer
71a0ad6abb Use langcodes when tokenizing again (it no longer connects to a DB) 2017-04-27 15:09:59 -04:00
Robyn Speer
ae7bc5764b Merge pull request #48 from LuminosoInsight/code-review-notes
Code review notes
2017-02-15 12:29:25 -08:00
Andrew Lin
c2e1504643 Clarify the changelog. 2017-02-14 13:09:12 -05:00
Andrew Lin
1363f9d2e0 Correct a case in transliterate.py. 2017-02-14 13:08:23 -05:00
Andrew Lin
72e3678e89 Merge pull request #47 from LuminosoInsight/all-1.6-changes
All 1.6 changes
2017-02-01 15:36:38 -05:00
Robyn Speer
a099a5a881 Remove ninja2dot script, which is no longer used 2017-02-01 14:49:44 -05:00
Robyn Speer
7dec335f74 describe the current problem with 'cyrtranslit' as a dependency 2017-01-31 18:25:52 -05:00
Robyn Speer
19b72132e7 Fix some outdated numbers in English examples 2017-01-31 18:25:41 -05:00
Robyn Speer
abd0820a32 Handle smashing numbers only at the end of tokenize().
This does make the code a lot clearer.
2017-01-11 19:04:19 -05:00
Robyn Speer
93306e55a0 Update README with new examples and URL 2017-01-09 15:13:19 -05:00
Robyn Speer
9a6beb0089 test that number-smashing still happens in freq lookups 2017-01-06 19:20:41 -05:00
Robyn Speer
573ecc53d0 Don't smash numbers in *all* tokenization, just when looking up freqs
I forgot momentarily that the output of the tokenizer is used by other
code.
2017-01-06 19:18:52 -05:00
Robyn Speer
3cb3c38f47 update the README, citing OpenSubtitles 2016 2017-01-06 19:04:40 -05:00
Robyn Speer
86f22e8523 Mention that multi-digit numbers are combined together 2017-01-05 19:24:28 -05:00
Robyn Speer
48a5967e9a mention tokenization change in changelog 2017-01-05 19:19:31 -05:00
Robyn Speer
39e459ac71 Update documentation and bump version to 1.6 2017-01-05 19:18:06 -05:00
Robyn Speer
23c7c8e936 update data from Exquisite Corpus in English and Swedish 2017-01-05 19:17:51 -05:00
Robyn Speer
7dc3f03ebd import new wordlists from Exquisite Corpus 2017-01-05 17:59:26 -05:00
Robyn Speer
de32a15b4f Merge branch 'transliterate-serbian' into all-1.6-changes 2017-01-05 17:57:52 -05:00
Robyn Speer
d66d04210f transliterate: organize the 'borrowed letters' better 2017-01-05 13:23:20 -05:00
Robyn Speer
87b03325db transliterate: Handle unexpected Russian invasions 2017-01-04 18:51:00 -05:00
Robyn Speer
c27e7f9b76 remove wordfreq_builder (obsoleted by exquisite-corpus) 2017-01-04 17:45:53 -05:00
Robyn Speer
6211b35fb3 Add transliteration of Cyrillic Serbian 2016-12-29 18:27:17 -05:00
Robyn Speer
0aa7ad46ae fixes to tokenization 2016-12-13 14:43:29 -05:00
Robyn Speer
d6d528de74 Replace multi-digit sequences with zeroes 2016-12-09 15:55:08 -05:00
Robyn Speer
a8e2fa5acf add a test for "aujourd'hui" 2016-12-06 17:39:40 -05:00
Robyn Speer
21a78f5eb9 Bake the 'h special case into the regex
This lets me remove the French-specific code I just put in.
2016-12-06 17:37:35 -05:00
Robyn Speer
82eba05f2d eh, this is still version 1.5.2, not 1.6 2016-12-05 18:58:33 -05:00
Robyn Speer
4376636316 add a specific test in Catalan 2016-12-05 18:54:51 -05:00
Robyn Speer
ff5a8f2a65 add tests for French apostrophe tokenization 2016-12-05 18:54:51 -05:00
Robyn Speer
596368ac6e fix tokenization of words like "l'heure" 2016-12-05 18:54:51 -05:00
Lance Nathan
7f26270644 Merge pull request #45 from LuminosoInsight/citation
Describe how to cite wordfreq
2016-09-12 18:34:55 -04:00
Robyn Speer
7fabbfef31 Describe how to cite wordfreq
This citation was generated from our GitHub repository by Zenodo. Their
defaults indicate that anyone who's ever accepted a PR for the code
should go on the author line, and that sounds fine to me.
2016-09-12 18:24:55 -04:00
Robyn Speer
c0fbd844f6 Add a changelog 2016-08-22 12:41:39 -04:00
Andrew Lin
976c8df0fd Merge pull request #44 from LuminosoInsight/mecab-loading-fix
Allow MeCab to work in Japanese or Korean without the other
2016-08-19 11:59:44 -04:00
Robyn Speer
aa880bcd84 bump version to 1.5.1 2016-08-19 11:42:29 -04:00
Robyn Speer
e1d6e7d96f Allow MeCab to work in Japanese or Korean without the other 2016-08-19 11:41:35 -04:00
Andrew Lin
e4b32afa18 Merge pull request #42 from LuminosoInsight/mecab-finder
Look for MeCab dictionaries in various places besides this package

Former-commit-id: 6f97d5ac099e9bb1a13ccfe114e0f4a79fb9b628
2016-08-08 16:00:39 -04:00
Robyn Speer
88c93f6204 Remove unnecessary variable from make_mecab_analyzer
Former-commit-id: 548162c563
2016-08-04 15:17:02 -04:00
Robyn Speer
6440d81676 consolidate logic about MeCab path length
Former-commit-id: 2b984937be
2016-08-04 15:16:20 -04:00
Robyn Speer
c11998e506 Getting a newer mecab-ko-dic changed the Korean frequencies
Former-commit-id: 894a96ba7e
2016-08-02 16:10:41 -04:00
Robyn Speer
bc1cfc35c8 update find_mecab_dictionary docstring
Former-commit-id: 8a5d1b298d
2016-08-02 12:53:46 -04:00
Robyn Speer
9e55f8fed1 remove my ad-hoc names for dictionary packages
Former-commit-id: 3dffb18557
2016-08-01 17:39:35 -04:00
Robyn Speer
2787bfd647 stop including MeCab dictionaries in the package
Former-commit-id: b3dd8479ab
2016-08-01 17:37:41 -04:00
Robyn Speer
875dd5669f fix MeCab error message
Former-commit-id: fcf2445c3e
2016-07-29 17:30:02 -04:00
Robyn Speer
94712c8312 Look for MeCab dictionaries in various places besides this package
Former-commit-id: afe6537994
2016-07-29 17:27:15 -04:00
Robyn Speer
ce5a91d732 Make the almost-median deterministic when it rounds down to 0
Former-commit-id: 74892a0ac9
2016-07-29 12:34:56 -04:00