Commit Graph

458 Commits

Author SHA1 Message Date
Rob Speer
894a96ba7e Getting a newer mecab-ko-dic changed the Korean frequencies 2016-08-02 16:10:41 -04:00
Rob Speer
8a5d1b298d update find_mecab_dictionary docstring 2016-08-02 12:53:46 -04:00
Rob Speer
3dffb18557 remove my ad-hoc names for dictionary packages 2016-08-01 17:39:35 -04:00
Rob Speer
b3dd8479ab stop including MeCab dictionaries in the package 2016-08-01 17:37:41 -04:00
Rob Speer
fcf2445c3e fix MeCab error message 2016-07-29 17:30:02 -04:00
Rob Speer
afe6537994 Look for MeCab dictionaries in various places besides this package 2016-07-29 17:27:15 -04:00
Rob Speer
74892a0ac9 Make the almost-median deterministic when it rounds down to 0 2016-07-29 12:34:56 -04:00
Rob Speer
1a16b0f84c Code review fixes: avoid repeatedly constructing sets 2016-07-29 12:32:26 -04:00
Rob Speer
21246f881f Revise multilingual tests 2016-07-29 12:19:12 -04:00
Rob Speer
e6a8f028e3 Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5.  This update includes:

* include Common Crawl; support 11 more languages

* new frequency-merging strategy

* New sources: Chinese from Wikipedia (mostly Traditional characters), a big Dutch wordlist

* Remove some lower-quality sources, namely Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is dropped as an available language, and Greek is no longer a 'large' language after all.

* Add Korean tokenization, and include MeCab files in data

* Remove marks from more languages

* Deal with commas and cedillas in Turkish and Romanian
2016-07-28 19:23:17 -04:00
Rob Speer
fec6eddcc3 Tokenization in Korean, plus abjad languages (#38)
* Remove marks from more languages

* Add Korean tokenization, and include MeCab files in data

* add a Hebrew tokenization test

* fix terminology in docstrings about abjad scripts

* combine Japanese and Korean tokenization into the same function
2016-07-15 15:10:25 -04:00
Rob Speer
270f6c7ca6 Fix tokenization of SE Asian and South Asian scripts (#37) 2016-07-01 18:00:57 -04:00
Rob Speer
88626aafee wordfreq_builder: Document the extract_reddit pipeline 2016-06-02 15:19:25 -04:00
Andrew Lin
3a6d985203 Merge pull request #35 from LuminosoInsight/big-list-test-fix
fix Arabic test, where 'lol' is no longer common
2016-05-11 17:20:01 -04:00
Rob Speer
da79dfb247 fix Arabic test, where 'lol' is no longer common 2016-05-11 17:01:47 -04:00
Andrew Lin
e7b34fb655 Merge pull request #34 from LuminosoInsight/big-list
wordfreq 1.4: some bigger wordlists, better use of language detection
2016-05-11 16:27:51 -04:00
Rob Speer
dcb77a552b fix to README: we're only using Reddit in English 2016-05-11 15:38:29 -04:00
Rob Speer
2276d97368 limit Reddit data to just English 2016-04-15 17:01:21 -04:00
Rob Speer
ced15d6eff remove reddit_base_filename function 2016-03-31 13:39:13 -04:00
Rob Speer
ff1f0e4678 use path.stem to make the Reddit filename prefix 2016-03-31 13:13:52 -04:00
Rob Speer
16059d3b9a rename max_size to max_words consistently 2016-03-31 12:55:18 -04:00
Rob Speer
697842b3f9 fix table showing marginal Korean support 2016-03-30 15:11:13 -04:00
Rob Speer
ed32b278cc make an example clearer with wordlist='large' 2016-03-30 15:08:32 -04:00
Rob Speer
a10c1d7ac0 update wordlists for new builder settings 2016-03-28 12:26:47 -04:00
Rob Speer
abbc295538 Discard text detected as an uncommon language; add large German list 2016-03-28 12:26:02 -04:00
Rob Speer
08130908c7 oh look, more spam 2016-03-24 18:42:47 -04:00
Rob Speer
5b98794b86 filter out downvoted Reddit posts 2016-03-24 18:05:13 -04:00
Rob Speer
cfe68893fa disregard Arabic Reddit spam 2016-03-24 17:44:30 -04:00
Rob Speer
6feae99381 fix extraneous dot in intermediate filenames 2016-03-24 16:52:44 -04:00
Rob Speer
1df97a579e bump version to 1.4 2016-03-24 16:29:29 -04:00
Rob Speer
75a4a92110 actually use the results of language-detection on Reddit 2016-03-24 16:27:24 -04:00
Rob Speer
164a5b1a05 Merge remote-tracking branch 'origin/master' into big-list
Conflicts:
	wordfreq_builder/wordfreq_builder/cli/merge_counts.py
2016-03-24 14:11:44 -04:00
Rob Speer
178a8b1494 make max-words a real, documented parameter 2016-03-24 14:10:02 -04:00
Rob Speer
7b539f9057 Merge pull request #33 from LuminosoInsight/bugfix
Restore a missing comma.
2016-03-24 13:59:50 -04:00
Andrew Lin
38016cf62b Restore a missing comma. 2016-03-24 13:57:18 -04:00
Andrew Lin
84497429e1 Merge pull request #32 from LuminosoInsight/thai-fix
Leave Thai segments alone in the default regex
2016-03-10 11:57:44 -05:00
Rob Speer
4ec6b56faa move Thai test to where it makes more sense 2016-03-10 11:56:15 -05:00
Rob Speer
07f16e6f03 Leave Thai segments alone in the default regex
Our regex already has a special case to leave Chinese and Japanese alone
when an appropriate tokenizer for the language isn't being used, as
Unicode's default segmentation would make every character into its own
token.

The same thing happens in Thai, and we don't even *have* an appropriate
tokenizer for Thai, so I've added a similar fallback.
2016-02-22 14:32:59 -05:00
Rob Speer
d79ee37da9 Add and document large wordlists 2016-01-22 16:23:43 -05:00
Rob Speer
c1a12cebec configuration that builds some larger lists 2016-01-22 14:20:12 -05:00
Rob Speer
9907948d11 add Zipf scale 2016-01-21 14:07:01 -05:00
slibs63
d18fee3d78 Merge pull request #30 from LuminosoInsight/add-reddit
Add English data from Reddit corpus
2016-01-14 15:52:39 -05:00
Rob Speer
8ddc19a5ca fix documentation in wordfreq_builder.tokenizers 2016-01-13 15:18:12 -05:00
Rob Speer
511fcb6f91 reformat some argparse argument definitions 2016-01-13 12:05:07 -05:00
Rob Speer
df8caaff7d build a bigger wordlist that we can optionally use 2016-01-12 14:05:57 -05:00
Rob Speer
8d9668d8ab fix usage text: one comment, not one tweet 2016-01-12 13:05:38 -05:00
Rob Speer
115c74583e Separate tokens with spaces, not line breaks, in intermediate files 2016-01-12 12:59:18 -05:00
Andrew Lin
f30efebba0 Merge pull request #31 from LuminosoInsight/use_encoding
Specify encoding when dealing with files
2015-12-23 16:13:47 -05:00
Sara Jewett
37f9e12b93 Specify encoding when dealing with files 2015-12-23 15:49:13 -05:00
Rob Speer
973caca253 builder: Use an optional cutoff when merging counts
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.
2015-12-15 14:44:34 -05:00
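
For context on that last commit, here is a minimal sketch of what merging token counts with an optional cutoff can look like. This is plain illustrative Python, not the actual wordfreq_builder merge_counts code; the function name and signature are assumptions, and the real pipeline presumably applies the cutoff while streaming each file so that peak memory stays low, whereas this sketch only shows the effect of the cutoff on the merged table.

```python
from collections import Counter

def merge_counts(count_dicts, cutoff=0):
    """Merge several {token: count} mappings, dropping rare tokens (illustrative sketch)."""
    # Sum the counts from every input mapping.
    merged = Counter()
    for counts in count_dicts:
        merged.update(counts)
    # Drop tokens whose merged count falls below the cutoff, so the
    # resulting table stays small.
    if cutoff > 0:
        merged = Counter({token: n for token, n in merged.items() if n >= cutoff})
    return merged

# 'lol' appears only once across the inputs, so a cutoff of 2 discards it.
print(merge_counts([{'the': 5, 'lol': 1}, {'the': 3}], cutoff=2))
# Counter({'the': 8})
```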