Commit Graph

49 Commits

Author SHA1 Message Date
Rob Speer
cd434b2219 update data to include xc's processing of ParaCrawl 2018-05-25 16:12:35 -04:00
Rob Speer
aa91e1f291 Packaging updates for the new PyPI
I _almost_ got the description and long_description right for 2.0.1. I
even checked it on the test server. But I didn't notice that I was
handling the first line of README.md specially, and ended up setting the
project description to "wordfreq is a Python library for looking up the
frequencies of words in many".

It'll be right in the next version.
2018-05-01 17:16:53 -04:00
Rob Speer
3ec92a8952 Handle Japanese edge cases in simple_tokenize 2018-04-26 15:53:07 -04:00
Rob Speer
a6bb267f89 fix mention of dependencies in README 2018-03-14 15:01:08 -04:00
Rob Speer
bac3dcb620 Subtle changes to CJK frequencies
This is the result of re-running exquisite-corpus via wordfreq 2.  The
frequencies for most languages were identical. Small changes that move
words by a few places in the list appeared in Chinese, Japanese, and
Korean. There are also even smaller changes in Bengali and Hindi.

The source of the CJK change is that Roman letters are case-folded
_before_ Jieba or MeCab tokenization, which changes their output in a
few cases.

In Hindi, one word changed frequency in the top 500. In Bengali, none of
those words changed frequency, but the data file is still different.
I'm not sure I have such a solid explanation here, except that these
languages use the regex tokenizer, and we just updated the regex
dependency, which could affect some edge cases of these languages.
2018-03-14 11:36:02 -04:00
Rob Speer
49a603ea63 update the README 2018-03-08 18:16:15 -05:00
Rob Speer
846606d892 minor fixes to README 2018-02-28 16:14:50 -05:00
Rob Speer
843ed92223 update citation to v1.7 2017-09-27 13:36:30 -04:00
Rob Speer
396b0f78df update README for 1.7; sort language list in English order 2017-08-25 17:38:31 -04:00
Rob Speer
7fa5e7fc22 Fix some outdated numbers in English examples 2017-01-31 18:25:41 -05:00
Rob Speer
e6114bf0fa Update README with new examples and URL 2017-01-09 15:13:19 -05:00
Rob Speer
d2bb5b78f3 update the README, citing OpenSubtitles 2016 2017-01-06 19:04:40 -05:00
Rob Speer
803ebc25bb Update documentation and bump version to 1.6 2017-01-05 19:18:06 -05:00
Rob Speer
872eeb8848 Describe how to cite wordfreq
This citation was generated from our GitHub repository by Zenodo. Their
defaults indicate that anyone who's ever accepted a PR for the code
should go on the author line, and that sounds fine to me.
2016-09-12 18:24:55 -04:00
Rob Speer
1519df503c stop including MeCab dictionaries in the package
Former-commit-id: b3dd8479ab
2016-08-01 17:37:41 -04:00
Rob Speer
c1927732d3 Look for MeCab dictionaries in various places besides this package
Former-commit-id: afe6537994
2016-07-29 17:27:15 -04:00
Rob Speer
9758c69ff0 Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5.  Things done in this update include:

* include Common Crawl; support 11 more languages

* new frequency-merging strategy

* New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list

* Remove low-quality sources, namely Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is dropped as an available language, and Greek is no longer a 'large' language.

* Add Korean tokenization, and include MeCab files in data

* Remove marks from more languages

* Deal with commas and cedillas in Turkish and Romanian

Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
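The "commas and cedillas" item in the commit above most likely refers to the look-alike letters ş/ţ (cedilla) and ș/ț (comma below), which Unicode encodes separately. A minimal sketch of one way to unify them follows, assuming the standard orthographic convention that Romanian uses comma below and Turkish uses cedilla. The function names are hypothetical and this is not taken from wordfreq's source.

```python
# Map cedilla forms to comma-below forms (Romanian convention).
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș
    "\u015e": "\u0218",  # Ş -> Ș
    "\u0163": "\u021b",  # ţ -> ț
    "\u0162": "\u021a",  # Ţ -> Ț
})

# Map comma-below S to cedilla S (Turkish convention; Turkish has no ţ).
COMMA_TO_CEDILLA = str.maketrans({
    "\u0219": "\u015f",  # ș -> ş
    "\u0218": "\u015e",  # Ș -> Ş
})


def normalize_romanian(text):
    """Hypothetical helper: use comma-below letters throughout."""
    return text.translate(CEDILLA_TO_COMMA)


def normalize_turkish(text):
    """Hypothetical helper: use cedilla letters throughout."""
    return text.translate(COMMA_TO_CEDILLA)
```

Normalizing in one direction per language means that two encodings of the same word are counted as a single entry rather than splitting its frequency.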
Rob Speer
a0893af82e Tokenization in Korean, plus abjad languages (#38)
* Remove marks from more languages

* Add Korean tokenization, and include MeCab files in data

* add a Hebrew tokenization test

* fix terminology in docstrings about abjad scripts

* combine Japanese and Korean tokenization into the same function

Former-commit-id: fec6eddcc3
2016-07-15 15:10:25 -04:00
Rob Speer
4e4c77e7d7 fix to README: we're only using Reddit in English
Former-commit-id: dcb77a552b
2016-05-11 15:38:29 -04:00
Rob Speer
f4aa2cad7b fix table showing marginal Korean support
Former-commit-id: 697842b3f9
2016-03-30 15:11:13 -04:00
Rob Speer
758e37af07 make an example clearer with wordlist='large'
Former-commit-id: ed32b278cc
2016-03-30 15:08:32 -04:00
Rob Speer
c82073270b update wordlists for new builder settings
Former-commit-id: a10c1d7ac0
2016-03-28 12:26:47 -04:00
Rob Speer
23c5c4adca Add and document large wordlists
Former-commit-id: d79ee37da9
2016-01-22 16:23:43 -05:00
Rob Speer
8fea2ca181 Merge branch 'master' into chinese-external-wordlist
Conflicts:
	wordfreq/chinese.py

Former-commit-id: 1793c1bb2e
2015-09-28 14:34:59 -04:00
Rob Speer
3bd1fe2fe6 Fix documentation and clean up, based on Sep 25 code review
Former-commit-id: 44b0c4f9ba
2015-09-28 12:58:46 -04:00
Rob Speer
7c596de98a describe optional dependencies better in the README
Former-commit-id: b460eef444
2015-09-24 17:54:52 -04:00
Rob Speer
76c4a8975a fix README conflict
Former-commit-id: 5b918e7bb0
2015-09-22 14:23:55 -04:00
Rob Speer
7f92557a58 Merge branch 'greek-and-turkish' into chinese-and-more
Conflicts:
	README.md
	wordfreq_builder/wordfreq_builder/ninja.py

Former-commit-id: 3cb3061e06
2015-09-10 15:27:33 -04:00
Rob Speer
a13f459f88 Lower the frequency of phrases with inferred token boundaries
Former-commit-id: 5c8c36f4e3
2015-09-10 14:16:22 -04:00
Rob Speer
9c08442dc5 fixes based on code review notes
Former-commit-id: 354555514f
2015-09-09 13:10:18 -04:00
Rob Speer
37e5e1009f fix SUBTLEX citations
Former-commit-id: 6502f15e9b
2015-09-08 17:45:25 -04:00
Rob Speer
0f9497d864 take out OpenSubtitles for Chinese
Former-commit-id: d9c44d5fcc
2015-09-08 17:25:05 -04:00
Rob Speer
b4100b5bfb update the README for Chinese
Former-commit-id: d576e3294b
2015-09-05 03:42:54 -04:00
Rob Speer
e2a3758832 WIP: Traditional Chinese
Former-commit-id: 7906a671ea
2015-09-04 18:52:37 -04:00
Rob Speer
62f5a8eb1e add Polish and Swedish to README
Former-commit-id: 3c3371a9ff
2015-09-04 17:10:40 -04:00
Rob Speer
138e8aaa3f add more citations
Former-commit-id: 8196643509
2015-09-04 15:57:40 -04:00
Rob Speer
c08e593234 Use SUBTLEX for German, but OpenSubtitles for Greek
In German and Greek, SUBTLEX and Hermit Dave turn out to have been
working from the same source data. I looked at the quality of how they
processed the data, and chose SUBTLEX for German, and Dave's wordlist
for Greek.

Former-commit-id: 77c60c29b0
2015-09-04 15:52:21 -04:00
Rob Speer
a0997a79a4 update README with additional SUBTLEX support
Former-commit-id: 81bbe663fb
2015-09-04 13:23:33 -04:00
Rob Speer
bf88f97744 expand list of sources and supported languages
Former-commit-id: d9a1c34d00
2015-09-04 01:03:36 -04:00
Rob Speer
a6ef3224a6 support Turkish and more Greek; document more
Former-commit-id: d94428d454
2015-09-04 00:57:04 -04:00
Rob Speer
a92c398258 add SUBTLEX to the readme
Former-commit-id: e6a2886a66
2015-09-03 18:56:56 -04:00
Rob Speer
d883eaeca5 fix heading
Former-commit-id: 00a2812907
2015-08-28 17:49:38 -04:00
Rob Speer
390a431181 fix list formatting
Former-commit-id: 93f44683c5
2015-08-28 17:49:07 -04:00
Rob Speer
43fd15c938 improve README with function documentation and examples
Former-commit-id: 2370287539
2015-08-28 17:45:50 -04:00
Rob Speer
d064fbec7d update the README
Former-commit-id: 573dd1ec79
2015-08-25 17:44:34 -04:00
Joshua Chin
45799955ab no use for use
Former-commit-id: b0a9a2980f
2015-07-17 14:46:40 -04:00
Andrew Lin
8961729401 Document the version of Unicode used to build the regexes.
Former-commit-id: 9f8464c2d1
2015-07-08 18:48:33 -04:00
Rob Speer
51f4e4c826 add installation instructions to the readme
Former-commit-id: 0f4ca80026
2015-05-28 14:02:12 -04:00
Rob Speer
1f41cb083c update Japanese data; test Japanese and token combining
Former-commit-id: 611a6a35de
2015-05-28 14:01:56 -04:00