Rob Speer
db56528fb6
update msgpack-python dependency to msgpack
2018-02-28 15:14:51 -05:00
Rob Speer
843ed92223
update citation to v1.7
2017-09-27 13:36:30 -04:00
Andrew Lin
721a1e9fd9
Merge pull request #51 from LuminosoInsight/version1.7
...
Version 1.7: update tokenization, update Wikipedia data, add languages
2017-09-08 17:02:05 -04:00
Rob Speer
61b2e4062d
remove unnecessary enumeration from top_n.py
2017-09-08 16:52:06 -04:00
Rob Speer
396b0f78df
update README for 1.7; sort language list in English order
2017-08-25 17:38:31 -04:00
Rob Speer
e3352392cc
v1.7: update tokenization, update data, add bn and mk
2017-08-25 17:37:48 -04:00
Rob Speer
dcef5813b3
Tokenize by graphemes, not codepoints (#50)
...
* Tokenize by graphemes, not codepoints
* Add more documentation to TOKEN_RE
* Remove extra line break
* Update docstring - Brahmic scripts are no longer an exception
* approve using version 2017.07.28 of regex
2017-08-08 11:35:28 -04:00
Andrew Lin
baf6771e97
Merge pull request #49 from LuminosoInsight/restore-langcodes
...
Use langcodes when tokenizing again
2017-05-10 16:20:06 -04:00
Rob Speer
37b4914970
v1.6.1: depend on langcodes 1.4
2017-05-10 13:26:23 -04:00
Rob Speer
d6cdef6039
Use langcodes when tokenizing again (it no longer connects to a DB)
2017-04-27 15:09:59 -04:00
Rob Speer
97042e6f60
Merge pull request #48 from LuminosoInsight/code-review-notes
...
Code review notes
2017-02-15 12:29:25 -08:00
Andrew Lin
f28a193015
Clarify the changelog.
2017-02-14 13:09:12 -05:00
Andrew Lin
e21bcc2a58
Correct a case in transliterate.py.
2017-02-14 13:08:23 -05:00
Andrew Lin
21b331e898
Merge pull request #47 from LuminosoInsight/all-1.6-changes
...
All 1.6 changes
2017-02-01 15:36:38 -05:00
Rob Speer
b5b653f0a1
Remove ninja2dot script, which is no longer used
2017-02-01 14:49:44 -05:00
Rob Speer
391a723662
describe the current problem with 'cyrtranslit' as a dependency
2017-01-31 18:25:52 -05:00
Rob Speer
7fa5e7fc22
Fix some outdated numbers in English examples
2017-01-31 18:25:41 -05:00
Rob Speer
68e4ce16cf
Handle smashing numbers only at the end of tokenize().
...
This does make the code a lot clearer.
2017-01-11 19:04:19 -05:00
Rob Speer
e6114bf0fa
Update README with new examples and URL
2017-01-09 15:13:19 -05:00
Rob Speer
f03a37e19c
test that number-smashing still happens in freq lookups
2017-01-06 19:20:41 -05:00
Rob Speer
4dfa800cd8
Don't smash numbers in *all* tokenization, just when looking up freqs
...
I forgot momentarily that the output of the tokenizer is used by other
code.
2017-01-06 19:18:52 -05:00
Rob Speer
d2bb5b78f3
update the README, citing OpenSubtitles 2016
2017-01-06 19:04:40 -05:00
Rob Speer
3f9c8449ff
Mention that multi-digit numbers are combined together
2017-01-05 19:24:28 -05:00
Rob Speer
a05a1c8d5c
mention tokenization change in changelog
2017-01-05 19:19:31 -05:00
Rob Speer
803ebc25bb
Update documentation and bump version to 1.6
2017-01-05 19:18:06 -05:00
Rob Speer
f9238ac30f
update data from Exquisite Corpus in English and Swedish
2017-01-05 19:17:51 -05:00
Rob Speer
f671a1db7f
import new wordlists from Exquisite Corpus
2017-01-05 17:59:26 -05:00
Rob Speer
847b85c5b8
Merge branch 'transliterate-serbian' into all-1.6-changes
2017-01-05 17:57:52 -05:00
Rob Speer
e4f40a0ce9
transliterate: organize the 'borrowed letters' better
2017-01-05 13:23:20 -05:00
Rob Speer
99eac54b31
transliterate: Handle unexpected Russian invasions
2017-01-04 18:51:00 -05:00
Rob Speer
6171b3d066
remove wordfreq_builder (obsoleted by exquisite-corpus)
2017-01-04 17:45:53 -05:00
Rob Speer
b3e5d1c9e9
Add transliteration of Cyrillic Serbian
2016-12-29 18:27:17 -05:00
Rob Speer
d376f4e2e2
fixes to tokenization
2016-12-13 14:43:29 -05:00
Rob Speer
bb5df3b074
Replace multi-digit sequences with zeroes
2016-12-09 15:55:08 -05:00
Rob Speer
24e26c4c1d
add a test for "aujourd'hui"
2016-12-06 17:39:40 -05:00
Rob Speer
d18b149262
Bake the 'h special case into the regex
...
This lets me remove the French-specific code I just put in.
2016-12-06 17:37:35 -05:00
Rob Speer
752c90c8a5
eh, this is still version 1.5.2, not 1.6
2016-12-05 18:58:33 -05:00
Rob Speer
f285430c84
add a specific test in Catalan
2016-12-05 18:54:51 -05:00
Rob Speer
02e2430dfb
add tests for French apostrophe tokenization
2016-12-05 18:54:51 -05:00
Rob Speer
a92c805a82
fix tokenization of words like "l'heure"
2016-12-05 18:54:51 -05:00
Lance Nathan
f6f0914e81
Merge pull request #45 from LuminosoInsight/citation
...
Describe how to cite wordfreq
2016-09-12 18:34:55 -04:00
Rob Speer
872eeb8848
Describe how to cite wordfreq
...
This citation was generated from our GitHub repository by Zenodo. Their
defaults indicate that anyone who's ever accepted a PR for the code
should go on the author line, and that sounds fine to me.
2016-09-12 18:24:55 -04:00
Rob Speer
0ba563c99c
Add a changelog
2016-08-22 12:41:39 -04:00
Andrew Lin
91f7ef37eb
Merge pull request #44 from LuminosoInsight/mecab-loading-fix
...
Allow MeCab to work in Japanese or Korean without the other
2016-08-19 11:59:44 -04:00
Rob Speer
fb5a55de7e
bump version to 1.5.1
2016-08-19 11:42:29 -04:00
Rob Speer
31be4fd309
Allow MeCab to work in Japanese or Korean without the other
2016-08-19 11:41:35 -04:00
Andrew Lin
0250547c7a
Merge pull request #42 from LuminosoInsight/mecab-finder
...
Look for MeCab dictionaries in various places besides this package
Former-commit-id: 6f97d5ac099e9bb1a13ccfe114e0f4a79fb9b628
2016-08-08 16:00:39 -04:00
Rob Speer
8c79465d28
Remove unnecessary variable from make_mecab_analyzer
...
Former-commit-id: 548162c563
2016-08-04 15:17:02 -04:00
Rob Speer
0a5e6bd87a
consolidate logic about MeCab path length
...
Former-commit-id: 2b984937be
2016-08-04 15:16:20 -04:00
Rob Speer
09a904c0fe
Getting a newer mecab-ko-dic changed the Korean frequencies
...
Former-commit-id: 894a96ba7e
2016-08-02 16:10:41 -04:00