Rob Speer
1519df503c
stop including MeCab dictionaries in the package
...
Former-commit-id: b3dd8479ab
2016-08-01 17:37:41 -04:00
Rob Speer
c1927732d3
Look for MeCab dictionaries in various places besides this package
...
Former-commit-id: afe6537994
2016-07-29 17:27:15 -04:00
Rob Speer
9758c69ff0
Add Common Crawl data and more languages (#39)
...
This changes the version from 1.4.2 to 1.5. Things done in this update include:
* Include Common Crawl; support 11 more languages
* New frequency-merging strategy
* New sources: Chinese from Wikipedia (mostly Traditional), Dutch big list
* Remove low-quality sources: Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is dropped as an available language, and Greek is no longer a 'large' language.
* Add Korean tokenization, and include MeCab files in data
* Remove marks from more languages
* Deal with commas and cedillas in Turkish and Romanian
Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
Rob Speer
a0893af82e
Tokenization in Korean, plus abjad languages (#38)
...
* Remove marks from more languages
* Add Korean tokenization, and include MeCab files in data
* Add a Hebrew tokenization test
* Fix terminology in docstrings about abjad scripts
* Combine Japanese and Korean tokenization into the same function
Former-commit-id: fec6eddcc3
2016-07-15 15:10:25 -04:00
Rob Speer
4e4c77e7d7
fix to README: we're only using Reddit in English
...
Former-commit-id: dcb77a552b
2016-05-11 15:38:29 -04:00
Rob Speer
f4aa2cad7b
fix table showing marginal Korean support
...
Former-commit-id: 697842b3f9
2016-03-30 15:11:13 -04:00
Rob Speer
758e37af07
make an example clearer with wordlist='large'
...
Former-commit-id: ed32b278cc
2016-03-30 15:08:32 -04:00
Rob Speer
c82073270b
update wordlists for new builder settings
...
Former-commit-id: a10c1d7ac0
2016-03-28 12:26:47 -04:00
Rob Speer
23c5c4adca
Add and document large wordlists
...
Former-commit-id: d79ee37da9
2016-01-22 16:23:43 -05:00
Rob Speer
8fea2ca181
Merge branch 'master' into chinese-external-wordlist
...
Conflicts:
wordfreq/chinese.py
Former-commit-id: 1793c1bb2e
2015-09-28 14:34:59 -04:00
Rob Speer
3bd1fe2fe6
Fix documentation and clean up, based on Sep 25 code review
...
Former-commit-id: 44b0c4f9ba
2015-09-28 12:58:46 -04:00
Rob Speer
7c596de98a
describe optional dependencies better in the README
...
Former-commit-id: b460eef444
2015-09-24 17:54:52 -04:00
Rob Speer
76c4a8975a
fix README conflict
...
Former-commit-id: 5b918e7bb0
2015-09-22 14:23:55 -04:00
Rob Speer
7f92557a58
Merge branch 'greek-and-turkish' into chinese-and-more
...
Conflicts:
README.md
wordfreq_builder/wordfreq_builder/ninja.py
Former-commit-id: 3cb3061e06
2015-09-10 15:27:33 -04:00
Rob Speer
a13f459f88
Lower the frequency of phrases with inferred token boundaries
...
Former-commit-id: 5c8c36f4e3
2015-09-10 14:16:22 -04:00
Rob Speer
9c08442dc5
fixes based on code review notes
...
Former-commit-id: 354555514f
2015-09-09 13:10:18 -04:00
Rob Speer
37e5e1009f
fix SUBTLEX citations
...
Former-commit-id: 6502f15e9b
2015-09-08 17:45:25 -04:00
Rob Speer
0f9497d864
take out OpenSubtitles for Chinese
...
Former-commit-id: d9c44d5fcc
2015-09-08 17:25:05 -04:00
Rob Speer
b4100b5bfb
update the README for Chinese
...
Former-commit-id: d576e3294b
2015-09-05 03:42:54 -04:00
Rob Speer
e2a3758832
WIP: Traditional Chinese
...
Former-commit-id: 7906a671ea
2015-09-04 18:52:37 -04:00
Rob Speer
62f5a8eb1e
add Polish and Swedish to README
...
Former-commit-id: 3c3371a9ff
2015-09-04 17:10:40 -04:00
Rob Speer
138e8aaa3f
add more citations
...
Former-commit-id: 8196643509
2015-09-04 15:57:40 -04:00
Rob Speer
c08e593234
Use SUBTLEX for German, but OpenSubtitles for Greek
...
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.
Former-commit-id: 77c60c29b0
2015-09-04 15:52:21 -04:00
Rob Speer
a0997a79a4
update README with additional SUBTLEX support
...
Former-commit-id: 81bbe663fb
2015-09-04 13:23:33 -04:00
Rob Speer
bf88f97744
expand list of sources and supported languages
...
Former-commit-id: d9a1c34d00
2015-09-04 01:03:36 -04:00
Rob Speer
a6ef3224a6
support Turkish and more Greek; document more
...
Former-commit-id: d94428d454
2015-09-04 00:57:04 -04:00
Rob Speer
a92c398258
add SUBTLEX to the readme
...
Former-commit-id: e6a2886a66
2015-09-03 18:56:56 -04:00
Rob Speer
d883eaeca5
fix heading
...
Former-commit-id: 00a2812907
2015-08-28 17:49:38 -04:00
Rob Speer
390a431181
fix list formatting
...
Former-commit-id: 93f44683c5
2015-08-28 17:49:07 -04:00
Rob Speer
43fd15c938
improve README with function documentation and examples
...
Former-commit-id: 2370287539
2015-08-28 17:45:50 -04:00
Rob Speer
d064fbec7d
update the README
...
Former-commit-id: 573dd1ec79
2015-08-25 17:44:34 -04:00
Joshua Chin
45799955ab
no use for use
...
Former-commit-id: b0a9a2980f
2015-07-17 14:46:40 -04:00
Andrew Lin
8961729401
Document the version of Unicode used to build the regexes.
...
Former-commit-id: 9f8464c2d1
2015-07-08 18:48:33 -04:00
Rob Speer
51f4e4c826
add installation instructions to the readme
...
Former-commit-id: 0f4ca80026
2015-05-28 14:02:12 -04:00
Rob Speer
1f41cb083c
update Japanese data; test Japanese and token combining
...
Former-commit-id: 611a6a35de
2015-05-28 14:01:56 -04:00