Rob Speer
d6cdef6039
Use langcodes when tokenizing again (it no longer connects to a DB)
2017-04-27 15:09:59 -04:00
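The commit above switches tokenization back to the `langcodes` package for normalizing language tags, now that it works purely in memory instead of querying a database. A minimal stdlib-only sketch of the underlying idea follows; the alias table and function name are hypothetical stand-ins, and the real package handles vastly more of BCP 47:

```python
# Illustrative sketch of language-tag normalization before choosing a
# tokenizer. The real work is done by the `langcodes` package; this tiny
# alias table is a hypothetical subset for demonstration only.
LANGUAGE_ALIASES = {
    'iw': 'he',   # retired ISO code for Hebrew
    'in': 'id',   # retired ISO code for Indonesian
}

def primary_language(tag):
    """Reduce a tag like 'zh_TW' or 'pt-BR' to its primary language subtag."""
    subtag = tag.replace('_', '-').split('-')[0].lower()
    return LANGUAGE_ALIASES.get(subtag, subtag)
```

The point of the in-memory approach is that a lookup like this is a pure function of the tag, with no connection setup or I/O on the tokenization path.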
Rob Speer
97042e6f60
Merge pull request #48 from LuminosoInsight/code-review-notes
...
Code review notes
2017-02-15 12:29:25 -08:00
Andrew Lin
f28a193015
Clarify the changelog.
2017-02-14 13:09:12 -05:00
Andrew Lin
e21bcc2a58
Correct a case in transliterate.py.
2017-02-14 13:08:23 -05:00
Andrew Lin
21b331e898
Merge pull request #47 from LuminosoInsight/all-1.6-changes
...
All 1.6 changes
2017-02-01 15:36:38 -05:00
Rob Speer
b5b653f0a1
Remove ninja2dot script, which is no longer used
2017-02-01 14:49:44 -05:00
Rob Speer
391a723662
describe the current problem with 'cyrtranslit' as a dependency
2017-01-31 18:25:52 -05:00
Rob Speer
7fa5e7fc22
Fix some outdated numbers in English examples
2017-01-31 18:25:41 -05:00
Rob Speer
68e4ce16cf
Handle smashing numbers only at the end of tokenize().
...
This does make the code a lot clearer.
2017-01-11 19:04:19 -05:00
Rob Speer
e6114bf0fa
Update README with new examples and URL
2017-01-09 15:13:19 -05:00
Rob Speer
f03a37e19c
test that number-smashing still happens in freq lookups
2017-01-06 19:20:41 -05:00
Rob Speer
4dfa800cd8
Don't smash numbers in *all* tokenization, just when looking up freqs
...
I forgot momentarily that the output of the tokenizer is used by other
code.
2017-01-06 19:18:52 -05:00
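The "number smashing" behavior discussed in the commits above (applied only in frequency lookups, not in general tokenization) can be sketched as follows. The function and regex names are illustrative of the described behavior, not necessarily wordfreq's exact implementation:

```python
import re

# Sketch of "number smashing": runs of 2+ digits are collapsed to zeroes
# so that, e.g., every four-digit year shares one frequency bin. Per the
# commits above, this applies only when looking up frequencies, because
# other code consumes the tokenizer's output directly.
MULTI_DIGIT_RE = re.compile(r'\d\d+')

def smash_numbers(token):
    """Replace each run of two or more digits with an equal-length run of '0'."""
    return MULTI_DIGIT_RE.sub(lambda m: '0' * len(m.group()), token)
```

Single digits pass through unchanged, which is why the regex requires at least two digits in a row.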
Rob Speer
d2bb5b78f3
update the README, citing OpenSubtitles 2016
2017-01-06 19:04:40 -05:00
Rob Speer
3f9c8449ff
Mention that multi-digit numbers are combined together
2017-01-05 19:24:28 -05:00
Rob Speer
a05a1c8d5c
mention tokenization change in changelog
2017-01-05 19:19:31 -05:00
Rob Speer
803ebc25bb
Update documentation and bump version to 1.6
2017-01-05 19:18:06 -05:00
Rob Speer
f9238ac30f
update data from Exquisite Corpus in English and Swedish
2017-01-05 19:17:51 -05:00
Rob Speer
f671a1db7f
import new wordlists from Exquisite Corpus
2017-01-05 17:59:26 -05:00
Rob Speer
847b85c5b8
Merge branch 'transliterate-serbian' into all-1.6-changes
2017-01-05 17:57:52 -05:00
Rob Speer
e4f40a0ce9
transliterate: organize the 'borrowed letters' better
2017-01-05 13:23:20 -05:00
Rob Speer
99eac54b31
transliterate: Handle unexpected Russian invasions
2017-01-04 18:51:00 -05:00
Rob Speer
6171b3d066
remove wordfreq_builder (obsoleted by exquisite-corpus)
2017-01-04 17:45:53 -05:00
Rob Speer
b3e5d1c9e9
Add transliteration of Cyrillic Serbian
2016-12-29 18:27:17 -05:00
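The Serbian transliteration added above maps Cyrillic to the Latin orthography, including letters that expand to Latin digraphs. A minimal sketch, assuming a hand-written mapping table (this one covers only part of the alphabet; the real table is larger and, per a later commit, also copes with Russian letters leaking into Serbian text):

```python
# Minimal sketch of Serbian Cyrillic -> Latin transliteration. Note that
# some Cyrillic letters map to two Latin letters (lj, nj, dž), so the
# mapping is per-character but the output can grow.
CYR_TO_LAT = {
    'љ': 'lj', 'њ': 'nj', 'џ': 'dž', 'ђ': 'đ', 'ћ': 'ć',
    'ж': 'ž', 'ч': 'č', 'ш': 'š',
    'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'e',
    'з': 'z', 'и': 'i', 'ј': 'j', 'к': 'k', 'л': 'l', 'м': 'm',
    'н': 'n', 'о': 'o', 'п': 'p', 'р': 'r', 'с': 's', 'т': 't',
    'у': 'u', 'ф': 'f', 'х': 'h', 'ц': 'c',
}

def serbian_to_latin(text):
    """Transliterate lowercase Serbian Cyrillic; unknown characters pass through."""
    return ''.join(CYR_TO_LAT.get(ch, ch) for ch in text.lower())
```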
Rob Speer
d376f4e2e2
fixes to tokenization
2016-12-13 14:43:29 -05:00
Rob Speer
bb5df3b074
Replace multi-digit sequences with zeroes
2016-12-09 15:55:08 -05:00
Rob Speer
24e26c4c1d
add a test for "aujourd'hui"
2016-12-06 17:39:40 -05:00
Rob Speer
d18b149262
Bake the 'h special case into the regex
...
This lets me remove the French-specific code I just put in.
2016-12-06 17:37:35 -05:00
Rob Speer
752c90c8a5
eh, this is still version 1.5.2, not 1.6
2016-12-05 18:58:33 -05:00
Rob Speer
f285430c84
add a specific test in Catalan
2016-12-05 18:54:51 -05:00
Rob Speer
02e2430dfb
add tests for French apostrophe tokenization
2016-12-05 18:54:51 -05:00
Rob Speer
a92c805a82
fix tokenization of words like "l'heure"
2016-12-05 18:54:51 -05:00
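The French elision fix above (and the "aujourd'hui" test and 'h handling later in the log) can be sketched with a regex that splits a short list of elided prefixes off the following word. The prefix list here is an illustrative subset, not wordfreq's actual rule:

```python
import re

# Hedged sketch of French elision splitting: "l'heure" becomes
# ["l'", "heure"], while words like "aujourd'hui", whose apostrophe is
# not an elided article/pronoun, stay whole because they don't start
# with a listed prefix.
ELISION_RE = re.compile(r"^(l|d|j|m|t|s|n|c|qu)['’](?=\w)", re.IGNORECASE)

def split_elision(token):
    match = ELISION_RE.match(token)
    if match:
        return [match.group(0), token[match.end():]]
    return [token]
```

Anchoring the pattern at the start of the token is what keeps "aujourd'hui" intact: its apostrophe is word-internal, not a recognized prefix.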
Lance Nathan
f6f0914e81
Merge pull request #45 from LuminosoInsight/citation
...
Describe how to cite wordfreq
2016-09-12 18:34:55 -04:00
Rob Speer
872eeb8848
Describe how to cite wordfreq
...
This citation was generated from our GitHub repository by Zenodo. Their
defaults indicate that anyone who's ever accepted a PR for the code
should go on the author line, and that sounds fine to me.
2016-09-12 18:24:55 -04:00
Rob Speer
0ba563c99c
Add a changelog
2016-08-22 12:41:39 -04:00
Andrew Lin
91f7ef37eb
Merge pull request #44 from LuminosoInsight/mecab-loading-fix
...
Allow MeCab to work in Japanese or Korean without the other
2016-08-19 11:59:44 -04:00
Rob Speer
fb5a55de7e
bump version to 1.5.1
2016-08-19 11:42:29 -04:00
Rob Speer
31be4fd309
Allow MeCab to work in Japanese or Korean without the other
2016-08-19 11:41:35 -04:00
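The fix above makes each language's MeCab analyzer independent of the other's, which suggests a lazy, per-language construction pattern: build an analyzer only on first use, so a missing Korean dictionary can't break Japanese and vice versa. A sketch under that assumption, with `build_analyzer` standing in for the real constructor:

```python
# Sketch of per-language lazy loading. Each analyzer is constructed on
# first request and cached; requesting 'ja' never touches the Korean
# dictionary, and vice versa. `build_analyzer` is a hypothetical
# stand-in for the real MeCab analyzer constructor.
_analyzers = {}

def get_analyzer(lang, build_analyzer):
    if lang not in _analyzers:
        _analyzers[lang] = build_analyzer(lang)
    return _analyzers[lang]
```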
Andrew Lin
0250547c7a
Merge pull request #42 from LuminosoInsight/mecab-finder
...
Look for MeCab dictionaries in various places besides this package
Former-commit-id: 6f97d5ac099e9bb1a13ccfe114e0f4a79fb9b628
2016-08-08 16:00:39 -04:00
Rob Speer
8c79465d28
Remove unnecessary variable from make_mecab_analyzer
...
Former-commit-id: 548162c563
2016-08-04 15:17:02 -04:00
Rob Speer
0a5e6bd87a
consolidate logic about MeCab path length
...
Former-commit-id: 2b984937be
2016-08-04 15:16:20 -04:00
Rob Speer
09a904c0fe
Getting a newer mecab-ko-dic changed the Korean frequencies
...
Former-commit-id: 894a96ba7e
2016-08-02 16:10:41 -04:00
Rob Speer
c6c44939e6
update find_mecab_dictionary docstring
...
Former-commit-id: 8a5d1b298d
2016-08-02 12:53:46 -04:00
Rob Speer
188654396a
remove my ad-hoc names for dictionary packages
...
Former-commit-id: 3dffb18557
2016-08-01 17:39:35 -04:00
Rob Speer
1519df503c
stop including MeCab dictionaries in the package
...
Former-commit-id: b3dd8479ab
2016-08-01 17:37:41 -04:00
Rob Speer
410e8c255b
fix MeCab error message
...
Former-commit-id: fcf2445c3e
2016-07-29 17:30:02 -04:00
Rob Speer
c1927732d3
Look for MeCab dictionaries in various places besides this package
...
Former-commit-id: afe6537994
2016-07-29 17:27:15 -04:00
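The dictionary-finding change above amounts to probing a list of known install locations instead of bundling the dictionaries. A sketch of that search; the candidate paths below are hypothetical examples of common MeCab install locations, not an authoritative list:

```python
import os

# Illustrative search for a MeCab dictionary outside the package itself.
# These paths are hypothetical examples; real installs vary by platform
# and package manager.
CANDIDATE_PATHS = [
    '/usr/lib/mecab/dic/ipadic',
    '/usr/local/lib/mecab/dic/ipadic',
    '/var/lib/mecab/dic/ipadic-utf8',
]

def find_mecab_dictionary(candidates=CANDIDATE_PATHS):
    """Return the first candidate that exists as a directory, or None."""
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None
```

Returning None (rather than raising) lets the caller produce the clearer error message mentioned in the "fix MeCab error message" commit above.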
Rob Speer
1aa63bca6c
Make the almost-median deterministic when it rounds down to 0
...
Former-commit-id: 74892a0ac9
2016-07-29 12:34:56 -04:00
Rob Speer
fcbdf560c2
Code review fixes: avoid repeatedly constructing sets
...
Former-commit-id: 1a16b0f84c
2016-07-29 12:32:26 -04:00
Rob Speer
99b627a300
Revise multilingual tests
...
Former-commit-id: 21246f881f
2016-07-29 12:19:12 -04:00
Rob Speer
9758c69ff0
Add Common Crawl data and more languages ( #39 )
...
This changes the version from 1.4.2 to 1.5. Things done in this update include:
* include Common Crawl; support 11 more languages
* new frequency-merging strategy
* New sources: Chinese from Wikipedia (mostly Traditional) and a large Dutch wordlist
* Remove lower-quality sources, namely Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is dropped as an available language, and Greek no longer qualifies as a 'large' language.
* Add Korean tokenization, and include MeCab files in data
* Remove marks from more languages
* Deal with commas and cedillas in Turkish and Romanian
Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
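The "new frequency-merging strategy" mentioned in the 1.5 release above combines per-source word frequencies into one list. A deliberately simplified sketch using a plain median over the sources that contain each word; the real strategy (see the "almost-median" commits above) also handles missing sources and renormalization differently:

```python
from statistics import median

# Hedged sketch of merging word frequencies from several sources: for
# each word, take the median of the frequencies reported by the sources
# that contain it. This is a simplification of the strategy the release
# notes describe, for illustration only.
def merge_frequencies(freq_dicts):
    merged = {}
    all_words = set().union(*(d.keys() for d in freq_dicts))
    for word in all_words:
        values = [d[word] for d in freq_dicts if word in d]
        merged[word] = median(values)
    return merged
```

A median is more robust than a mean here: one source with a wildly inflated count (e.g. a word over-represented on Twitter) can't dominate the merged estimate.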