Commit Graph

475 Commits

Author SHA1 Message Date
Robyn Speer
0aa7ad46ae fixes to tokenization 2016-12-13 14:43:29 -05:00
Robyn Speer
d6d528de74 Replace multi-digit sequences with zeroes 2016-12-09 15:55:08 -05:00
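The digit-smashing change above collapses numbers so that rare numerals don't fragment the frequency list. A minimal sketch of the idea (not the actual wordfreq code; the real implementation may differ in details):

```python
import re

def smash_numbers(text):
    # Replace every multi-digit sequence with an equal-length run of '0',
    # so e.g. "1984" and "2016" collapse into the same token "0000".
    # Single digits are left alone.
    return re.sub(
        r'\d+',
        lambda m: m.group(0) if len(m.group(0)) == 1 else '0' * len(m.group(0)),
        text,
    )

print(smash_numbers("born in 1984, room 7"))  # born in 0000, room 7
```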
Robyn Speer
a8e2fa5acf add a test for "aujourd'hui" 2016-12-06 17:39:40 -05:00
Robyn Speer
21a78f5eb9 Bake the 'h special case into the regex
This lets me remove the French-specific code I just put in.
2016-12-06 17:37:35 -05:00
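The commits above deal with French elision: prefixes like l' and d' should be split off before a vowel or mute h (so "l'heure" tokenizes as two pieces), while lexicalized words like "aujourd'hui" stay whole. A hedged sketch of that behavior — the prefix list, the exception handling, and the regex here are illustrative assumptions, not wordfreq's actual pattern:

```python
import re

# Split a leading elision prefix (l', d', j', qu', ...) off a French word
# when the apostrophe is followed by a vowel or 'h'.
ELISION = re.compile(r"^([ldjtnmcsLDJTNMCS]|qu|Qu)'(?=[aeiouyhAEIOUYH])")

def split_elision(word):
    if word.lower() == "aujourd'hui":   # lexicalized exception: keep whole
        return [word]
    m = ELISION.match(word)
    if m:
        return [m.group(0), word[m.end():]]
    return [word]

print(split_elision("l'heure"))      # ["l'", 'heure']
print(split_elision("aujourd'hui"))  # ["aujourd'hui"]
```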
Robyn Speer
82eba05f2d eh, this is still version 1.5.2, not 1.6 2016-12-05 18:58:33 -05:00
Robyn Speer
4376636316 add a specific test in Catalan 2016-12-05 18:54:51 -05:00
Robyn Speer
ff5a8f2a65 add tests for French apostrophe tokenization 2016-12-05 18:54:51 -05:00
Robyn Speer
596368ac6e fix tokenization of words like "l'heure" 2016-12-05 18:54:51 -05:00
Lance Nathan
7f26270644 Merge pull request #45 from LuminosoInsight/citation
Describe how to cite wordfreq
2016-09-12 18:34:55 -04:00
Robyn Speer
7fabbfef31 Describe how to cite wordfreq
This citation was generated from our GitHub repository by Zenodo. Their
defaults indicate that anyone who's ever accepted a PR for the code
should go on the author line, and that sounds fine to me.
2016-09-12 18:24:55 -04:00
Robyn Speer
c0fbd844f6 Add a changelog 2016-08-22 12:41:39 -04:00
Andrew Lin
976c8df0fd Merge pull request #44 from LuminosoInsight/mecab-loading-fix
Allow MeCab to work in Japanese or Korean without the other
2016-08-19 11:59:44 -04:00
Robyn Speer
aa880bcd84 bump version to 1.5.1 2016-08-19 11:42:29 -04:00
Robyn Speer
e1d6e7d96f Allow MeCab to work in Japanese or Korean without the other 2016-08-19 11:41:35 -04:00
Andrew Lin
e4b32afa18 Merge pull request #42 from LuminosoInsight/mecab-finder
Look for MeCab dictionaries in various places besides this package

Former-commit-id: 6f97d5ac099e9bb1a13ccfe114e0f4a79fb9b628
2016-08-08 16:00:39 -04:00
Robyn Speer
88c93f6204 Remove unnecessary variable from make_mecab_analyzer
Former-commit-id: 548162c563
2016-08-04 15:17:02 -04:00
Robyn Speer
6440d81676 consolidate logic about MeCab path length
Former-commit-id: 2b984937be
2016-08-04 15:16:20 -04:00
Robyn Speer
c11998e506 Getting a newer mecab-ko-dic changed the Korean frequencies
Former-commit-id: 894a96ba7e
2016-08-02 16:10:41 -04:00
Robyn Speer
bc1cfc35c8 update find_mecab_dictionary docstring
Former-commit-id: 8a5d1b298d
2016-08-02 12:53:46 -04:00
Robyn Speer
9e55f8fed1 remove my ad-hoc names for dictionary packages
Former-commit-id: 3dffb18557
2016-08-01 17:39:35 -04:00
Robyn Speer
2787bfd647 stop including MeCab dictionaries in the package
Former-commit-id: b3dd8479ab
2016-08-01 17:37:41 -04:00
Robyn Speer
875dd5669f fix MeCab error message
Former-commit-id: fcf2445c3e
2016-07-29 17:30:02 -04:00
Robyn Speer
94712c8312 Look for MeCab dictionaries in various places besides this package
Former-commit-id: afe6537994
2016-07-29 17:27:15 -04:00
Robyn Speer
ce5a91d732 Make the almost-median deterministic when it rounds down to 0
Former-commit-id: 74892a0ac9
2016-07-29 12:34:56 -04:00
Robyn Speer
15667ea023 Code review fixes: avoid repeatedly constructing sets
Former-commit-id: 1a16b0f84c
2016-07-29 12:32:26 -04:00
Robyn Speer
68c6d95131 Revise multilingual tests
Former-commit-id: 21246f881f
2016-07-29 12:19:12 -04:00
Robyn Speer
2a41d4dc5e Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5.  Things done in this update include:

* include Common Crawl; support 11 more languages

* new frequency-merging strategy

* New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list

* Remove lower-quality sources: Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. This means dropping Ukrainian as an available language, and Greek turns out not to be a 'large' language after all.

* Add Korean tokenization, and include MeCab files in data

* Remove marks from more languages

* Deal with commas and cedillas in Turkish and Romanian



Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
Robyn Speer
0a2bfb2710 Tokenization in Korean, plus abjad languages (#38)
* Remove marks from more languages

* Add Korean tokenization, and include MeCab files in data

* add a Hebrew tokenization test

* fix terminology in docstrings about abjad scripts

* combine Japanese and Korean tokenization into the same function


Former-commit-id: fec6eddcc3
2016-07-15 15:10:25 -04:00
Robyn Speer
3155cf27e6 Fix tokenization of SE Asian and South Asian scripts (#37)
Former-commit-id: 270f6c7ca6
2016-07-01 18:00:57 -04:00
Robyn Speer
8d09b68d37 wordfreq_builder: Document the extract_reddit pipeline
Former-commit-id: 88626aafee
2016-06-02 15:19:25 -04:00
Andrew Lin
046ca4cda3 Merge pull request #35 from LuminosoInsight/big-list-test-fix
fix Arabic test, where 'lol' is no longer common

Former-commit-id: 3a6d985203
2016-05-11 17:20:01 -04:00
Robyn Speer
c72326e4c0 fix Arabic test, where 'lol' is no longer common
Former-commit-id: da79dfb247
2016-05-11 17:01:47 -04:00
Andrew Lin
7a55e0ed86 Merge pull request #34 from LuminosoInsight/big-list
wordfreq 1.4: some bigger wordlists, better use of language detection

Former-commit-id: e7b34fb655
2016-05-11 16:27:51 -04:00
Robyn Speer
1ac6795709 fix to README: we're only using Reddit in English
Former-commit-id: dcb77a552b
2016-05-11 15:38:29 -04:00
Robyn Speer
a0d93e0ce8 limit Reddit data to just English
Former-commit-id: 2276d97368
2016-04-15 17:01:21 -04:00
Robyn Speer
5a37cc22c7 remove reddit_base_filename function
Former-commit-id: ced15d6eff
2016-03-31 13:39:13 -04:00
Robyn Speer
797895047a use path.stem to make the Reddit filename prefix
Former-commit-id: ff1f0e4678
2016-03-31 13:13:52 -04:00
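`Path.stem` returns a filename without its final suffix, which makes a convenient prefix for intermediate output files. A quick illustration, assuming a Reddit dump named like `RC_2015-01.bz2` (the path is hypothetical):

```python
from pathlib import Path

# .stem strips only the *last* extension from the filename.
print(Path('/data/reddit/RC_2015-01.bz2').stem)  # RC_2015-01
```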
Robyn Speer
a2bc90e430 rename max_size to max_words consistently
Former-commit-id: 16059d3b9a
2016-03-31 12:55:18 -04:00
Robyn Speer
a9a4483ca3 fix table showing marginal Korean support
Former-commit-id: 697842b3f9
2016-03-30 15:11:13 -04:00
Robyn Speer
36885b5479 make an example clearer with wordlist='large'
Former-commit-id: ed32b278cc
2016-03-30 15:08:32 -04:00
Robyn Speer
cecf852040 update wordlists for new builder settings
Former-commit-id: a10c1d7ac0
2016-03-28 12:26:47 -04:00
Robyn Speer
0c7527140c Discard text detected as an uncommon language; add large German list
Former-commit-id: abbc295538
2016-03-28 12:26:02 -04:00
Robyn Speer
aa7802b552 oh look, more spam
Former-commit-id: 08130908c7
2016-03-24 18:42:47 -04:00
Robyn Speer
2840ca55aa filter out downvoted Reddit posts
Former-commit-id: 5b98794b86
2016-03-24 18:05:13 -04:00
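Filtering downvoted posts amounts to keeping only comments whose score is positive. A sketch under stated assumptions — the `score` and `body` field names match the public Reddit comment dumps, but the threshold and structure of the actual builder code are guesses:

```python
import json

def keep_upvoted(lines, min_score=1):
    # Yield the text of each JSON-encoded comment whose score meets the
    # threshold, silently dropping downvoted ones.
    for line in lines:
        post = json.loads(line)
        if post.get('score', 0) >= min_score:
            yield post['body']
```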
Robyn Speer
16841d4b0c disregard Arabic Reddit spam
Former-commit-id: cfe68893fa
2016-03-24 17:44:30 -04:00
Robyn Speer
034d8f540b fix extraneous dot in intermediate filenames
Former-commit-id: 6feae99381
2016-03-24 16:52:44 -04:00
Robyn Speer
460fbb84fd bump version to 1.4
Former-commit-id: 1df97a579e
2016-03-24 16:29:29 -04:00
Robyn Speer
969a024dea actually use the results of language-detection on Reddit
Former-commit-id: 75a4a92110
2016-03-24 16:27:24 -04:00
Robyn Speer
fbc19995ab Merge remote-tracking branch 'origin/master' into big-list
Conflicts:
	wordfreq_builder/wordfreq_builder/cli/merge_counts.py

Former-commit-id: 164a5b1a05
2016-03-24 14:11:44 -04:00
Robyn Speer
f493d0eec4 make max-words a real, documented parameter
Former-commit-id: 178a8b1494
2016-03-24 14:10:02 -04:00