Rob Speer
1aa63bca6c
Make the almost-median deterministic when it rounds down to 0
...
Former-commit-id: 74892a0ac9
2016-07-29 12:34:56 -04:00
Rob Speer
9758c69ff0
Add Common Crawl data and more languages (#39)
...
This changes the version from 1.4.2 to 1.5. Things done in this update include:
* include Common Crawl; support 11 more languages
* new frequency-merging strategy
* New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list
* Remove low-quality sources, namely Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is dropped as an available language, and Greek is not a 'large' language after all.
* Add Korean tokenization, and include MeCab files in data
* Remove marks from more languages
* Deal with commas and cedillas in Turkish and Romanian
Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
Rob Speer
f539eecdd6
wordfreq_builder: Document the extract_reddit pipeline
...
Former-commit-id: 88626aafee
2016-06-02 15:19:25 -04:00
Rob Speer
c5bdc3c6bd
limit Reddit data to just English
...
Former-commit-id: 2276d97368
2016-04-15 17:01:21 -04:00
Rob Speer
6f11256ed1
remove reddit_base_filename function
...
Former-commit-id: ced15d6eff
2016-03-31 13:39:13 -04:00
Rob Speer
d924c8e2a5
use `path.stem` to make the Reddit filename prefix
...
Former-commit-id: ff1f0e4678
2016-03-31 13:13:52 -04:00
Rob Speer
9adc5b92f8
rename max_size to max_words consistently
...
Former-commit-id: 16059d3b9a
2016-03-31 12:55:18 -04:00
Rob Speer
3e34dbdd38
Discard text detected as an uncommon language; add large German list
...
Former-commit-id: abbc295538
2016-03-28 12:26:02 -04:00
Rob Speer
1c4a2077a4
oh look, more spam
...
Former-commit-id: 08130908c7
2016-03-24 18:42:47 -04:00
Rob Speer
cebf99f7ba
filter out downvoted Reddit posts
...
Former-commit-id: 5b98794b86
2016-03-24 18:05:13 -04:00
Rob Speer
fe6d8fea85
disregard Arabic Reddit spam
...
Former-commit-id: cfe68893fa
2016-03-24 17:44:30 -04:00
Rob Speer
d2cc42936f
fix extraneous dot in intermediate filenames
...
Former-commit-id: 6feae99381
2016-03-24 16:52:44 -04:00
Rob Speer
c3364ef821
actually use the results of language-detection on Reddit
...
Former-commit-id: 75a4a92110
2016-03-24 16:27:24 -04:00
Rob Speer
a5fcfd100d
Merge remote-tracking branch 'origin/master' into big-list
...
Conflicts:
wordfreq_builder/wordfreq_builder/cli/merge_counts.py
Former-commit-id: 164a5b1a05
2016-03-24 14:11:44 -04:00
Rob Speer
670ab12f54
make max-words a real, documented parameter
...
Former-commit-id: 178a8b1494
2016-03-24 14:10:02 -04:00
Andrew Lin
c85146e156
Restore a missing comma.
...
Former-commit-id: 38016cf62b
2016-03-24 13:57:18 -04:00
Rob Speer
23c5c4adca
Add and document large wordlists
...
Former-commit-id: d79ee37da9
2016-01-22 16:23:43 -05:00
Rob Speer
3b95d349e0
configuration that builds some larger lists
...
Former-commit-id: c1a12cebec
2016-01-22 14:20:12 -05:00
Rob Speer
ee8cfb5a50
fix documentation in wordfreq_builder.tokenizers
...
Former-commit-id: 8ddc19a5ca
2016-01-13 15:18:12 -05:00
Rob Speer
56f830d678
reformat some argparse argument definitions
...
Former-commit-id: 511fcb6f91
2016-01-13 12:05:07 -05:00
Rob Speer
f4761029d0
build a bigger wordlist that we can optionally use
...
Former-commit-id: df8caaff7d
2016-01-12 14:05:57 -05:00
Rob Speer
83bd019efe
fix usage text: one comment, not one tweet
...
Former-commit-id: 8d9668d8ab
2016-01-12 13:05:38 -05:00
Rob Speer
1d3485c855
Separate tokens with spaces, not line breaks, in intermediate files
...
Former-commit-id: 115c74583e
2016-01-12 12:59:18 -05:00
Rob Speer
6d62a8ff51
builder: Use an optional cutoff when merging counts
...
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.
Former-commit-id: 973caca253
2015-12-15 14:44:34 -05:00
Rob Speer
4e985e3bca
gzip the intermediate step of Reddit word counting
...
Former-commit-id: 9a5d9d66bb
2015-12-09 13:30:08 -05:00
Rob Speer
dc94222d7d
no Thai because we can't tokenize it
...
Former-commit-id: 95f53e295b
2015-12-02 12:38:03 -05:00
Rob Speer
237fabb4c5
forgot about Italian
...
Former-commit-id: 8f6cd0e57b
2015-11-30 18:18:24 -05:00
Rob Speer
6caa9ca443
add tokenizer for Reddit
...
Former-commit-id: 5ef807117d
2015-11-30 18:16:54 -05:00
Rob Speer
d1b667909d
add word frequencies from the Reddit 2007-2015 corpus
...
Former-commit-id: b2d7546d2d
2015-11-30 16:38:11 -05:00
Rob Speer
7435c8f57a
fix missing word in rules.ninja comment
...
Former-commit-id: 9b1c4d66cd
2015-09-24 17:56:06 -04:00
Rob Speer
88deef24f6
describe the use of `lang` in `read_values`
...
Former-commit-id: f224b8dbba
2015-09-22 17:22:38 -04:00
Rob Speer
7cb310b28e
Make the jieba_deps comment make sense
...
Former-commit-id: 7c12f2aca1
2015-09-22 17:19:00 -04:00
Rob Speer
7f92557a58
Merge branch 'greek-and-turkish' into chinese-and-more
...
Conflicts:
README.md
wordfreq_builder/wordfreq_builder/ninja.py
Former-commit-id: 3cb3061e06
2015-09-10 15:27:33 -04:00
Rob Speer
e3cc8eaea9
In ninja deps, remove 'startrow' as a variable
...
Former-commit-id: a4f8d11427
2015-09-10 13:46:19 -04:00
Rob Speer
5701c1165d
fix spelling of Marc
...
Former-commit-id: 2277ad3116
2015-09-09 13:35:02 -04:00
Rob Speer
9c08442dc5
fixes based on code review notes
...
Former-commit-id: 354555514f
2015-09-09 13:10:18 -04:00
Rob Speer
0f9497d864
take out OpenSubtitles for Chinese
...
Former-commit-id: d9c44d5fcc
2015-09-08 17:25:05 -04:00
Rob Speer
5e86394c4c
update comments in wordfreq_builder.config; remove unused 'version'
...
Former-commit-id: bc323eccaf
2015-09-08 16:15:29 -04:00
Rob Speer
2dfaf7798d
sort Jieba wordlists consistently; update data files
...
Former-commit-id: 0ab23f8a28
2015-09-08 16:09:53 -04:00
Rob Speer
01332f1ed5
don't do language-specific tokenization in freqs_to_cBpack
...
Tokenizing in the 'merge' step is sufficient.
Former-commit-id: bc8ebd23e9
2015-09-08 14:46:04 -04:00
Rob Speer
86475d6b5f
actually fix logic of apostrophe-fixing
...
Former-commit-id: 715361ca0d
2015-09-08 13:50:34 -04:00
Rob Speer
6bd0979ad2
fix logic of apostrophe-fixing
...
Former-commit-id: c4c1af8213
2015-09-08 13:47:58 -04:00
Rob Speer
8c3fb9f716
fix '--language' option definition
...
Former-commit-id: 912171f8e7
2015-09-08 13:27:20 -04:00
Rob Speer
67bb55988e
Avoid Chinese tokenizer when building
...
Former-commit-id: 77a9b5c55b
2015-09-08 12:59:03 -04:00
Rob Speer
11202ad7f5
language-specific frequency reading; fix 't in English
...
Former-commit-id: 9071defb33
2015-09-08 12:49:21 -04:00
Rob Speer
30237cf73d
Merge branch 'apostrophe-fix' into chinese-scripts
...
Conflicts:
wordfreq_builder/wordfreq_builder/word_counts.py
Former-commit-id: 20f2828d0a
2015-09-08 12:29:00 -04:00
Rob Speer
854247bf8b
WIP: fix apostrophe trimming
...
Former-commit-id: e39d345c4b
2015-09-08 12:28:28 -04:00
Rob Speer
91cc82f76d
tokenize Chinese using jieba and our own frequencies
...
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Rob Speer
e2a3758832
WIP: Traditional Chinese
...
Former-commit-id: 7906a671ea
2015-09-04 18:52:37 -04:00
Rob Speer
a555e5dc13
add Polish and Swedish, which have sufficient data
...
Former-commit-id: 447d7e5134
2015-09-04 17:10:40 -04:00