534 Commits

Author SHA1 Message Date
Rob Speer
a05a1c8d5c mention tokenization change in changelog 2017-01-05 19:19:31 -05:00
Rob Speer
803ebc25bb Update documentation and bump version to 1.6 2017-01-05 19:18:06 -05:00
Rob Speer
f9238ac30f update data from Exquisite Corpus in English and Swedish 2017-01-05 19:17:51 -05:00
Rob Speer
f671a1db7f import new wordlists from Exquisite Corpus 2017-01-05 17:59:26 -05:00
Rob Speer
847b85c5b8 Merge branch 'transliterate-serbian' into all-1.6-changes 2017-01-05 17:57:52 -05:00
Rob Speer
e4f40a0ce9 transliterate: organize the 'borrowed letters' better 2017-01-05 13:23:20 -05:00
Rob Speer
99eac54b31 transliterate: Handle unexpected Russian invasions 2017-01-04 18:51:00 -05:00
Rob Speer
6171b3d066 remove wordfreq_builder (obsoleted by exquisite-corpus) 2017-01-04 17:45:53 -05:00
Rob Speer
b3e5d1c9e9 Add transliteration of Cyrillic Serbian 2016-12-29 18:27:17 -05:00
Rob Speer
d376f4e2e2 fixes to tokenization 2016-12-13 14:43:29 -05:00
Rob Speer
bb5df3b074 Replace multi-digit sequences with zeroes 2016-12-09 15:55:08 -05:00
Rob Speer
24e26c4c1d add a test for "aujourd'hui" 2016-12-06 17:39:40 -05:00
Rob Speer
d18b149262 Bake the 'h special case into the regex
This lets me remove the French-specific code I just put in.
2016-12-06 17:37:35 -05:00
Rob Speer
752c90c8a5 eh, this is still version 1.5.2, not 1.6 2016-12-05 18:58:33 -05:00
Rob Speer
f285430c84 add a specific test in Catalan 2016-12-05 18:54:51 -05:00
Rob Speer
02e2430dfb add tests for French apostrophe tokenization 2016-12-05 18:54:51 -05:00
Rob Speer
a92c805a82 fix tokenization of words like "l'heure" 2016-12-05 18:54:51 -05:00
Lance Nathan
f6f0914e81 Merge pull request #45 from LuminosoInsight/citation
Describe how to cite wordfreq
2016-09-12 18:34:55 -04:00
Rob Speer
872eeb8848 Describe how to cite wordfreq
This citation was generated from our GitHub repository by Zenodo. Their
defaults indicate that anyone who's ever accepted a PR for the code
should go on the author line, and that sounds fine to me.
2016-09-12 18:24:55 -04:00
Rob Speer
0ba563c99c Add a changelog 2016-08-22 12:41:39 -04:00
Andrew Lin
91f7ef37eb Merge pull request #44 from LuminosoInsight/mecab-loading-fix
Allow MeCab to work in Japanese or Korean without the other
2016-08-19 11:59:44 -04:00
Rob Speer
fb5a55de7e bump version to 1.5.1 2016-08-19 11:42:29 -04:00
Rob Speer
31be4fd309 Allow MeCab to work in Japanese or Korean without the other 2016-08-19 11:41:35 -04:00
Andrew Lin
0250547c7a Merge pull request #42 from LuminosoInsight/mecab-finder
Look for MeCab dictionaries in various places besides this package

Former-commit-id: 6f97d5ac099e9bb1a13ccfe114e0f4a79fb9b628
2016-08-08 16:00:39 -04:00
Rob Speer
8c79465d28 Remove unnecessary variable from make_mecab_analyzer
Former-commit-id: 548162c563
2016-08-04 15:17:02 -04:00
Rob Speer
0a5e6bd87a consolidate logic about MeCab path length
Former-commit-id: 2b984937be
2016-08-04 15:16:20 -04:00
Rob Speer
09a904c0fe Getting a newer mecab-ko-dic changed the Korean frequencies
Former-commit-id: 894a96ba7e
2016-08-02 16:10:41 -04:00
Rob Speer
c6c44939e6 update find_mecab_dictionary docstring
Former-commit-id: 8a5d1b298d
2016-08-02 12:53:46 -04:00
Rob Speer
188654396a remove my ad-hoc names for dictionary packages
Former-commit-id: 3dffb18557
2016-08-01 17:39:35 -04:00
Rob Speer
1519df503c stop including MeCab dictionaries in the package
Former-commit-id: b3dd8479ab
2016-08-01 17:37:41 -04:00
Rob Speer
410e8c255b fix MeCab error message
Former-commit-id: fcf2445c3e
2016-07-29 17:30:02 -04:00
Rob Speer
c1927732d3 Look for MeCab dictionaries in various places besides this package
Former-commit-id: afe6537994
2016-07-29 17:27:15 -04:00
Rob Speer
1aa63bca6c Make the almost-median deterministic when it rounds down to 0
Former-commit-id: 74892a0ac9
2016-07-29 12:34:56 -04:00
Rob Speer
fcbdf560c2 Code review fixes: avoid repeatedly constructing sets
Former-commit-id: 1a16b0f84c
2016-07-29 12:32:26 -04:00
Rob Speer
99b627a300 Revise multilingual tests
Former-commit-id: 21246f881f
2016-07-29 12:19:12 -04:00
Rob Speer
9758c69ff0 Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5.  Things done in this update include:

* include Common Crawl; support 11 more languages

* new frequency-merging strategy

* New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list

* Remove kinda bad sources, namely Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and in Greek not being a 'large' language after all.

* Add Korean tokenization, and include MeCab files in data

* Remove marks from more languages

* Deal with commas and cedillas in Turkish and Romanian

Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
Rob Speer
a0893af82e Tokenization in Korean, plus abjad languages (#38)
* Remove marks from more languages

* Add Korean tokenization, and include MeCab files in data

* add a Hebrew tokenization test

* fix terminology in docstrings about abjad scripts

* combine Japanese and Korean tokenization into the same function


Former-commit-id: fec6eddcc3
2016-07-15 15:10:25 -04:00
Rob Speer
ac24b8eab4 Fix tokenization of SE Asian and South Asian scripts (#37)
Former-commit-id: 270f6c7ca6
2016-07-01 18:00:57 -04:00
Rob Speer
f539eecdd6 wordfreq_builder: Document the extract_reddit pipeline
Former-commit-id: 88626aafee
2016-06-02 15:19:25 -04:00
Andrew Lin
6eaae696fe Merge pull request #35 from LuminosoInsight/big-list-test-fix
fix Arabic test, where 'lol' is no longer common

Former-commit-id: 3a6d985203
2016-05-11 17:20:01 -04:00
Rob Speer
c3fd3bd734 fix Arabic test, where 'lol' is no longer common
Former-commit-id: da79dfb247
2016-05-11 17:01:47 -04:00
Andrew Lin
3c2a621743 Merge pull request #34 from LuminosoInsight/big-list
wordfreq 1.4: some bigger wordlists, better use of language detection

Former-commit-id: e7b34fb655
2016-05-11 16:27:51 -04:00
Rob Speer
4e4c77e7d7 fix to README: we're only using Reddit in English
Former-commit-id: dcb77a552b
2016-05-11 15:38:29 -04:00
Rob Speer
c5bdc3c6bd limit Reddit data to just English
Former-commit-id: 2276d97368
2016-04-15 17:01:21 -04:00
Rob Speer
6f11256ed1 remove reddit_base_filename function
Former-commit-id: ced15d6eff
2016-03-31 13:39:13 -04:00
Rob Speer
d924c8e2a5 use path.stem to make the Reddit filename prefix
Former-commit-id: ff1f0e4678
2016-03-31 13:13:52 -04:00
Rob Speer
9adc5b92f8 rename max_size to max_words consistently
Former-commit-id: 16059d3b9a
2016-03-31 12:55:18 -04:00
Rob Speer
f4aa2cad7b fix table showing marginal Korean support
Former-commit-id: 697842b3f9
2016-03-30 15:11:13 -04:00
Rob Speer
758e37af07 make an example clearer with wordlist='large'
Former-commit-id: ed32b278cc
2016-03-30 15:08:32 -04:00
Rob Speer
c82073270b update wordlists for new builder settings
Former-commit-id: a10c1d7ac0
2016-03-28 12:26:47 -04:00