35 Commits

Author SHA1 Message Date
Robyn Speer
2787bfd647 stop including MeCab dictionaries in the package
Former-commit-id: b3dd8479ab
2016-08-01 17:37:41 -04:00
Robyn Speer
94712c8312 Look for MeCab dictionaries in various places besides this package
Former-commit-id: afe6537994
2016-07-29 17:27:15 -04:00
Robyn Speer
2a41d4dc5e Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5. Things done in this update include:

* include Common Crawl; support 11 more languages
* new frequency-merging strategy
* New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list
* Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all.
* Add Korean tokenization, and include MeCab files in data
* Remove marks from more languages
* Deal with commas and cedillas in Turkish and Romanian
Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
Robyn Speer
0a2bfb2710 Tokenization in Korean, plus abjad languages (#38)
* Remove marks from more languages
* Add Korean tokenization, and include MeCab files in data
* add a Hebrew tokenization test
* fix terminology in docstrings about abjad scripts
* combine Japanese and Korean tokenization into the same function
Former-commit-id: fec6eddcc3
2016-07-15 15:10:25 -04:00
Robyn Speer
1ac6795709 fix to README: we're only using Reddit in English
Former-commit-id: dcb77a552b
2016-05-11 15:38:29 -04:00
Robyn Speer
a9a4483ca3 fix table showing marginal Korean support
Former-commit-id: 697842b3f9
2016-03-30 15:11:13 -04:00
Robyn Speer
36885b5479 make an example clearer with wordlist='large'
Former-commit-id: ed32b278cc
2016-03-30 15:08:32 -04:00
Robyn Speer
cecf852040 update wordlists for new builder settings
Former-commit-id: a10c1d7ac0
2016-03-28 12:26:47 -04:00
Robyn Speer
6344b38194 Add and document large wordlists
Former-commit-id: d79ee37da9
2016-01-22 16:23:43 -05:00
Robyn Speer
c9693c9502 Merge branch 'master' into chinese-external-wordlist
Conflicts:
	wordfreq/chinese.py

Former-commit-id: 1793c1bb2e
2015-09-28 14:34:59 -04:00
Robyn Speer
f3f66508bd Fix documentation and clean up, based on Sep 25 code review
Former-commit-id: 44b0c4f9ba
2015-09-28 12:58:46 -04:00
Robyn Speer
8e963dc312 describe optional dependencies better in the README
Former-commit-id: b460eef444
2015-09-24 17:54:52 -04:00
Robyn Speer
6802a4f89d fix README conflict
Former-commit-id: 5b918e7bb0
2015-09-22 14:23:55 -04:00
Robyn Speer
f2be213933 Merge branch 'greek-and-turkish' into chinese-and-more
Conflicts:
	README.md
	wordfreq_builder/wordfreq_builder/ninja.py

Former-commit-id: 3cb3061e06
2015-09-10 15:27:33 -04:00
Robyn Speer
f0c7c3a02c Lower the frequency of phrases with inferred token boundaries
Former-commit-id: 5c8c36f4e3
2015-09-10 14:16:22 -04:00
Robyn Speer
872556f7bb fixes based on code review notes
Former-commit-id: 354555514f
2015-09-09 13:10:18 -04:00
Robyn Speer
3dd70ed1c2 fix SUBTLEX citations
Former-commit-id: 6502f15e9b
2015-09-08 17:45:25 -04:00
Robyn Speer
1d3521dfda take out OpenSubtitles for Chinese
Former-commit-id: d9c44d5fcc
2015-09-08 17:25:05 -04:00
Robyn Speer
c1f27d3095 update the README for Chinese
Former-commit-id: d576e3294b
2015-09-05 03:42:54 -04:00
Robyn Speer
7d1c2e72e4 WIP: Traditional Chinese
Former-commit-id: 7906a671ea
2015-09-04 18:52:37 -04:00
Robyn Speer
e77c2dbca8 add Polish and Swedish to README
Former-commit-id: 3c3371a9ff
2015-09-04 17:10:40 -04:00
Robyn Speer
032fea27c3 add more citations
Former-commit-id: 8196643509
2015-09-04 15:57:40 -04:00
Robyn Speer
8277b34571 Use SUBTLEX for German, but OpenSubtitles for Greek
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.
Former-commit-id: 77c60c29b0
2015-09-04 15:52:21 -04:00
Robyn Speer
37e510345d update README with additional SUBTLEX support
Former-commit-id: 81bbe663fb
2015-09-04 13:23:33 -04:00
Robyn Speer
3cb4dd777e expand list of sources and supported languages
Former-commit-id: d9a1c34d00
2015-09-04 01:03:36 -04:00
Robyn Speer
574c383202 support Turkish and more Greek; document more
Former-commit-id: d94428d454
2015-09-04 00:57:04 -04:00
Robyn Speer
d267e0967c add SUBTLEX to the readme
Former-commit-id: e6a2886a66
2015-09-03 18:56:56 -04:00
Robyn Speer
942761d2f6 fix heading
Former-commit-id: 00a2812907
2015-08-28 17:49:38 -04:00
Robyn Speer
7bdffaae5c fix list formatting
Former-commit-id: 93f44683c5
2015-08-28 17:49:07 -04:00
Robyn Speer
44c655d9a6 improve README with function documentation and examples
Former-commit-id: 2370287539
2015-08-28 17:45:50 -04:00
Robyn Speer
a3a3180bb9 update the README
Former-commit-id: 573dd1ec79
2015-08-25 17:44:34 -04:00
Joshua Chin
4c7910246e no use for use
Former-commit-id: b0a9a2980f
2015-07-17 14:46:40 -04:00
Andrew Lin
383963f8a9 Document the version of Unicode used to build the regexes.
Former-commit-id: 9f8464c2d1
2015-07-08 18:48:33 -04:00
Robyn Speer
a3cc8d403c add installation instructions to the readme
Former-commit-id: 0f4ca80026
2015-05-28 14:02:12 -04:00
Robyn Speer
860e929bf8 update Japanese data; test Japanese and token combining
Former-commit-id: 611a6a35de
2015-05-28 14:01:56 -04:00