Rob Speer
74892a0ac9
Make the almost-median deterministic when it rounds down to 0
2016-07-29 12:34:56 -04:00
Rob Speer
1a16b0f84c
Code review fixes: avoid repeatedly constructing sets
2016-07-29 12:32:26 -04:00
Rob Speer
21246f881f
Revise multilingual tests
2016-07-29 12:19:12 -04:00
Rob Speer
e6a8f028e3
Add Common Crawl data and more languages (#39)
...
This changes the version from 1.4.2 to 1.5. Things done in this update include:
* include Common Crawl; support 11 more languages
* new frequency-merging strategy
* New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list
* Remove unreliable sources: Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is dropped as an available language, and Greek no longer qualifies as a 'large' language.
* Add Korean tokenization, and include MeCab files in data
* Remove marks from more languages
* Deal with commas and cedillas in Turkish and Romanian
2016-07-28 19:23:17 -04:00
Rob Speer
fec6eddcc3
Tokenization in Korean, plus abjad languages (#38)
...
* Remove marks from more languages
* Add Korean tokenization, and include MeCab files in data
* add a Hebrew tokenization test
* fix terminology in docstrings about abjad scripts
* combine Japanese and Korean tokenization into the same function
2016-07-15 15:10:25 -04:00
Rob Speer
270f6c7ca6
Fix tokenization of SE Asian and South Asian scripts (#37)
2016-07-01 18:00:57 -04:00
Rob Speer
88626aafee
wordfreq_builder: Document the extract_reddit pipeline
2016-06-02 15:19:25 -04:00
Andrew Lin
3a6d985203
Merge pull request #35 from LuminosoInsight/big-list-test-fix
...
fix Arabic test, where 'lol' is no longer common
2016-05-11 17:20:01 -04:00
Rob Speer
da79dfb247
fix Arabic test, where 'lol' is no longer common
2016-05-11 17:01:47 -04:00
Andrew Lin
e7b34fb655
Merge pull request #34 from LuminosoInsight/big-list
...
wordfreq 1.4: some bigger wordlists, better use of language detection
2016-05-11 16:27:51 -04:00
Rob Speer
dcb77a552b
fix to README: we're only using Reddit in English
2016-05-11 15:38:29 -04:00
Rob Speer
2276d97368
limit Reddit data to just English
2016-04-15 17:01:21 -04:00
Rob Speer
ced15d6eff
remove reddit_base_filename function
2016-03-31 13:39:13 -04:00
Rob Speer
ff1f0e4678
use path.stem to make the Reddit filename prefix
2016-03-31 13:13:52 -04:00
Rob Speer
16059d3b9a
rename max_size to max_words consistently
2016-03-31 12:55:18 -04:00
Rob Speer
697842b3f9
fix table showing marginal Korean support
2016-03-30 15:11:13 -04:00
Rob Speer
ed32b278cc
make an example clearer with wordlist='large'
2016-03-30 15:08:32 -04:00
Rob Speer
a10c1d7ac0
update wordlists for new builder settings
2016-03-28 12:26:47 -04:00
Rob Speer
abbc295538
Discard text detected as an uncommon language; add large German list
2016-03-28 12:26:02 -04:00
Rob Speer
08130908c7
oh look, more spam
2016-03-24 18:42:47 -04:00
Rob Speer
5b98794b86
filter out downvoted Reddit posts
2016-03-24 18:05:13 -04:00
Rob Speer
cfe68893fa
disregard Arabic Reddit spam
2016-03-24 17:44:30 -04:00
Rob Speer
6feae99381
fix extraneous dot in intermediate filenames
2016-03-24 16:52:44 -04:00
Rob Speer
1df97a579e
bump version to 1.4
2016-03-24 16:29:29 -04:00
Rob Speer
75a4a92110
actually use the results of language-detection on Reddit
2016-03-24 16:27:24 -04:00
Rob Speer
164a5b1a05
Merge remote-tracking branch 'origin/master' into big-list
...
Conflicts:
wordfreq_builder/wordfreq_builder/cli/merge_counts.py
2016-03-24 14:11:44 -04:00
Rob Speer
178a8b1494
make max-words a real, documented parameter
2016-03-24 14:10:02 -04:00
Rob Speer
7b539f9057
Merge pull request #33 from LuminosoInsight/bugfix
...
Restore a missing comma.
2016-03-24 13:59:50 -04:00
Andrew Lin
38016cf62b
Restore a missing comma.
2016-03-24 13:57:18 -04:00
Andrew Lin
84497429e1
Merge pull request #32 from LuminosoInsight/thai-fix
...
Leave Thai segments alone in the default regex
2016-03-10 11:57:44 -05:00
Rob Speer
4ec6b56faa
move Thai test to where it makes more sense
2016-03-10 11:56:15 -05:00
Rob Speer
07f16e6f03
Leave Thai segments alone in the default regex
...
Our regex already has a special case to leave Chinese and Japanese alone
when an appropriate tokenizer for the language isn't being used, as
Unicode's default segmentation would make every character into its own
token.
The same thing happens in Thai, and we don't even *have* an appropriate
tokenizer for Thai, so I've added a similar fallback.
2016-02-22 14:32:59 -05:00
Rob Speer
d79ee37da9
Add and document large wordlists
2016-01-22 16:23:43 -05:00
Rob Speer
c1a12cebec
configuration that builds some larger lists
2016-01-22 14:20:12 -05:00
Rob Speer
9907948d11
add Zipf scale
2016-01-21 14:07:01 -05:00
slibs63
d18fee3d78
Merge pull request #30 from LuminosoInsight/add-reddit
...
Add English data from Reddit corpus
2016-01-14 15:52:39 -05:00
Rob Speer
8ddc19a5ca
fix documentation in wordfreq_builder.tokenizers
2016-01-13 15:18:12 -05:00
Rob Speer
511fcb6f91
reformat some argparse argument definitions
2016-01-13 12:05:07 -05:00
Rob Speer
df8caaff7d
build a bigger wordlist that we can optionally use
2016-01-12 14:05:57 -05:00
Rob Speer
8d9668d8ab
fix usage text: one comment, not one tweet
2016-01-12 13:05:38 -05:00
Rob Speer
115c74583e
Separate tokens with spaces, not line breaks, in intermediate files
2016-01-12 12:59:18 -05:00
Andrew Lin
f30efebba0
Merge pull request #31 from LuminosoInsight/use_encoding
...
Specify encoding when dealing with files
2015-12-23 16:13:47 -05:00
Sara Jewett
37f9e12b93
Specify encoding when dealing with files
2015-12-23 15:49:13 -05:00
Rob Speer
973caca253
builder: Use an optional cutoff when merging counts
...
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.
2015-12-15 14:44:34 -05:00
Rob Speer
9a5d9d66bb
gzip the intermediate step of Reddit word counting
2015-12-09 13:30:08 -05:00
Rob Speer
95f53e295b
no Thai because we can't tokenize it
2015-12-02 12:38:03 -05:00
Rob Speer
8f6cd0e57b
forgot about Italian
2015-11-30 18:18:24 -05:00
Rob Speer
5ef807117d
add tokenizer for Reddit
2015-11-30 18:16:54 -05:00
Rob Speer
2dcf368481
rebuild data files
2015-11-30 17:06:39 -05:00
Rob Speer
b2d7546d2d
add word frequencies from the Reddit 2007-2015 corpus
2015-11-30 16:38:11 -05:00