Robyn Speer
573ecc53d0
Don't smash numbers in *all* tokenization, just when looking up freqs
I forgot momentarily that the output of the tokenizer is used by other
code.
2017-01-06 19:18:52 -05:00
Robyn Speer
3cb3c38f47
update the README, citing OpenSubtitles 2016
2017-01-06 19:04:40 -05:00
Robyn Speer
86f22e8523
Mention that multi-digit numbers are combined together
2017-01-05 19:24:28 -05:00
Robyn Speer
48a5967e9a
mention tokenization change in changelog
2017-01-05 19:19:31 -05:00
Robyn Speer
39e459ac71
Update documentation and bump version to 1.6
2017-01-05 19:18:06 -05:00
Robyn Speer
23c7c8e936
update data from Exquisite Corpus in English and Swedish
2017-01-05 19:17:51 -05:00
Robyn Speer
7dc3f03ebd
import new wordlists from Exquisite Corpus
2017-01-05 17:59:26 -05:00
Robyn Speer
de32a15b4f
Merge branch 'transliterate-serbian' into all-1.6-changes
2017-01-05 17:57:52 -05:00
Robyn Speer
d66d04210f
transliterate: organize the 'borrowed letters' better
2017-01-05 13:23:20 -05:00
Robyn Speer
87b03325db
transliterate: Handle unexpected Russian invasions
2017-01-04 18:51:00 -05:00
Robyn Speer
c27e7f9b76
remove wordfreq_builder (obsoleted by exquisite-corpus)
2017-01-04 17:45:53 -05:00
Robyn Speer
6211b35fb3
Add transliteration of Cyrillic Serbian
2016-12-29 18:27:17 -05:00
Robyn Speer
0aa7ad46ae
fixes to tokenization
2016-12-13 14:43:29 -05:00
Robyn Speer
d6d528de74
Replace multi-digit sequences with zeroes
2016-12-09 15:55:08 -05:00
Robyn Speer
a8e2fa5acf
add a test for "aujourd'hui"
2016-12-06 17:39:40 -05:00
Robyn Speer
21a78f5eb9
Bake the 'h special case into the regex
This lets me remove the French-specific code I just put in.
2016-12-06 17:37:35 -05:00
Robyn Speer
82eba05f2d
eh, this is still version 1.5.2, not 1.6
2016-12-05 18:58:33 -05:00
Robyn Speer
4376636316
add a specific test in Catalan
2016-12-05 18:54:51 -05:00
Robyn Speer
ff5a8f2a65
add tests for French apostrophe tokenization
2016-12-05 18:54:51 -05:00
Robyn Speer
596368ac6e
fix tokenization of words like "l'heure"
2016-12-05 18:54:51 -05:00
Lance Nathan
7f26270644
Merge pull request #45 from LuminosoInsight/citation
Describe how to cite wordfreq
2016-09-12 18:34:55 -04:00
Robyn Speer
7fabbfef31
Describe how to cite wordfreq
This citation was generated from our GitHub repository by Zenodo. Their
defaults indicate that anyone who's ever accepted a PR for the code
should go on the author line, and that sounds fine to me.
2016-09-12 18:24:55 -04:00
Robyn Speer
c0fbd844f6
Add a changelog
2016-08-22 12:41:39 -04:00
Andrew Lin
976c8df0fd
Merge pull request #44 from LuminosoInsight/mecab-loading-fix
Allow MeCab to work in Japanese or Korean without the other
2016-08-19 11:59:44 -04:00
Robyn Speer
aa880bcd84
bump version to 1.5.1
2016-08-19 11:42:29 -04:00
Robyn Speer
e1d6e7d96f
Allow MeCab to work in Japanese or Korean without the other
2016-08-19 11:41:35 -04:00
Andrew Lin
e4b32afa18
Merge pull request #42 from LuminosoInsight/mecab-finder
Look for MeCab dictionaries in various places besides this package
Former-commit-id: 6f97d5ac099e9bb1a13ccfe114e0f4a79fb9b628
2016-08-08 16:00:39 -04:00
Robyn Speer
88c93f6204
Remove unnecessary variable from make_mecab_analyzer
Former-commit-id: 548162c563
2016-08-04 15:17:02 -04:00
Robyn Speer
6440d81676
consolidate logic about MeCab path length
Former-commit-id: 2b984937be
2016-08-04 15:16:20 -04:00
Robyn Speer
c11998e506
Getting a newer mecab-ko-dic changed the Korean frequencies
Former-commit-id: 894a96ba7e
2016-08-02 16:10:41 -04:00
Robyn Speer
bc1cfc35c8
update find_mecab_dictionary docstring
Former-commit-id: 8a5d1b298d
2016-08-02 12:53:46 -04:00
Robyn Speer
9e55f8fed1
remove my ad-hoc names for dictionary packages
Former-commit-id: 3dffb18557
2016-08-01 17:39:35 -04:00
Robyn Speer
2787bfd647
stop including MeCab dictionaries in the package
Former-commit-id: b3dd8479ab
2016-08-01 17:37:41 -04:00
Robyn Speer
875dd5669f
fix MeCab error message
Former-commit-id: fcf2445c3e
2016-07-29 17:30:02 -04:00
Robyn Speer
94712c8312
Look for MeCab dictionaries in various places besides this package
Former-commit-id: afe6537994
2016-07-29 17:27:15 -04:00
Robyn Speer
ce5a91d732
Make the almost-median deterministic when it rounds down to 0
Former-commit-id: 74892a0ac9
2016-07-29 12:34:56 -04:00
Robyn Speer
15667ea023
Code review fixes: avoid repeatedly constructing sets
Former-commit-id: 1a16b0f84c
2016-07-29 12:32:26 -04:00
Robyn Speer
68c6d95131
Revise multilingual tests
Former-commit-id: 21246f881f
2016-07-29 12:19:12 -04:00
Robyn Speer
2a41d4dc5e
Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5. Things done in this update include:
* include Common Crawl; support 11 more languages
* new frequency-merging strategy
* New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list
* Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all.
* Add Korean tokenization, and include MeCab files in data
* Remove marks from more languages
* Deal with commas and cedillas in Turkish and Romanian
Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
Robyn Speer
0a2bfb2710
Tokenization in Korean, plus abjad languages (#38)
* Remove marks from more languages
* Add Korean tokenization, and include MeCab files in data
* add a Hebrew tokenization test
* fix terminology in docstrings about abjad scripts
* combine Japanese and Korean tokenization into the same function
Former-commit-id: fec6eddcc3
2016-07-15 15:10:25 -04:00
Robyn Speer
3155cf27e6
Fix tokenization of SE Asian and South Asian scripts (#37)
Former-commit-id: 270f6c7ca6
2016-07-01 18:00:57 -04:00
Robyn Speer
8d09b68d37
wordfreq_builder: Document the extract_reddit pipeline
Former-commit-id: 88626aafee
2016-06-02 15:19:25 -04:00
Andrew Lin
046ca4cda3
Merge pull request #35 from LuminosoInsight/big-list-test-fix
fix Arabic test, where 'lol' is no longer common
Former-commit-id: 3a6d985203
2016-05-11 17:20:01 -04:00
Robyn Speer
c72326e4c0
fix Arabic test, where 'lol' is no longer common
Former-commit-id: da79dfb247
2016-05-11 17:01:47 -04:00
Andrew Lin
7a55e0ed86
Merge pull request #34 from LuminosoInsight/big-list
wordfreq 1.4: some bigger wordlists, better use of language detection
Former-commit-id: e7b34fb655
2016-05-11 16:27:51 -04:00
Robyn Speer
1ac6795709
fix to README: we're only using Reddit in English
Former-commit-id: dcb77a552b
2016-05-11 15:38:29 -04:00
Robyn Speer
a0d93e0ce8
limit Reddit data to just English
Former-commit-id: 2276d97368
2016-04-15 17:01:21 -04:00
Robyn Speer
5a37cc22c7
remove reddit_base_filename function
Former-commit-id: ced15d6eff
2016-03-31 13:39:13 -04:00
Robyn Speer
797895047a
use path.stem to make the Reddit filename prefix
Former-commit-id: ff1f0e4678
2016-03-31 13:13:52 -04:00
Robyn Speer
a2bc90e430
rename max_size to max_words consistently
Former-commit-id: 16059d3b9a
2016-03-31 12:55:18 -04:00