Commit Graph

174 Commits

Robyn Speer
2a41d4dc5e Add Common Crawl data and more languages (#39)
This bumps the version from 1.4.2 to 1.5. This update includes:

* Include Common Crawl; support 11 more languages

* New frequency-merging strategy

* New sources: Chinese from Wikipedia (mostly Traditional), a large Dutch list

* Remove lower-quality sources, namely Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. This drops Ukrainian as an available language, and means Greek is no longer a 'large' language.

* Add Korean tokenization, and include MeCab files in data

* Remove marks from more languages

* Deal with commas and cedillas in Turkish and Romanian



Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
Robyn Speer
8d09b68d37 wordfreq_builder: Document the extract_reddit pipeline
Former-commit-id: 88626aafee
2016-06-02 15:19:25 -04:00
Robyn Speer
a0d93e0ce8 limit Reddit data to just English
Former-commit-id: 2276d97368
2016-04-15 17:01:21 -04:00
Robyn Speer
5a37cc22c7 remove reddit_base_filename function
Former-commit-id: ced15d6eff
2016-03-31 13:39:13 -04:00
Robyn Speer
797895047a use path.stem to make the Reddit filename prefix
Former-commit-id: ff1f0e4678
2016-03-31 13:13:52 -04:00
Robyn Speer
a2bc90e430 rename max_size to max_words consistently
Former-commit-id: 16059d3b9a
2016-03-31 12:55:18 -04:00
Robyn Speer
0c7527140c Discard text detected as an uncommon language; add large German list
Former-commit-id: abbc295538
2016-03-28 12:26:02 -04:00
Robyn Speer
aa7802b552 oh look, more spam
Former-commit-id: 08130908c7
2016-03-24 18:42:47 -04:00
Robyn Speer
2840ca55aa filter out downvoted Reddit posts
Former-commit-id: 5b98794b86
2016-03-24 18:05:13 -04:00
Robyn Speer
16841d4b0c disregard Arabic Reddit spam
Former-commit-id: cfe68893fa
2016-03-24 17:44:30 -04:00
Robyn Speer
034d8f540b fix extraneous dot in intermediate filenames
Former-commit-id: 6feae99381
2016-03-24 16:52:44 -04:00
Robyn Speer
969a024dea actually use the results of language-detection on Reddit
Former-commit-id: 75a4a92110
2016-03-24 16:27:24 -04:00
Robyn Speer
fbc19995ab Merge remote-tracking branch 'origin/master' into big-list
Conflicts:
	wordfreq_builder/wordfreq_builder/cli/merge_counts.py

Former-commit-id: 164a5b1a05
2016-03-24 14:11:44 -04:00
Robyn Speer
f493d0eec4 make max-words a real, documented parameter
Former-commit-id: 178a8b1494
2016-03-24 14:10:02 -04:00
Andrew Lin
1942bc690f Restore a missing comma.
Former-commit-id: 38016cf62b
2016-03-24 13:57:18 -04:00
Robyn Speer
6344b38194 Add and document large wordlists
Former-commit-id: d79ee37da9
2016-01-22 16:23:43 -05:00
Robyn Speer
12e779fc79 configuration that builds some larger lists
Former-commit-id: c1a12cebec
2016-01-22 14:20:12 -05:00
Robyn Speer
6eca3cff5a fix documentation in wordfreq_builder.tokenizers
Former-commit-id: 8ddc19a5ca
2016-01-13 15:18:12 -05:00
Robyn Speer
95cdf41fe8 reformat some argparse argument definitions
Former-commit-id: 511fcb6f91
2016-01-13 12:05:07 -05:00
Robyn Speer
738243e244 build a bigger wordlist that we can optionally use
Former-commit-id: df8caaff7d
2016-01-12 14:05:57 -05:00
Robyn Speer
2069e30c89 fix usage text: one comment, not one tweet
Former-commit-id: 8d9668d8ab
2016-01-12 13:05:38 -05:00
Robyn Speer
883aa5baeb Separate tokens with spaces, not line breaks, in intermediate files
Former-commit-id: 115c74583e
2016-01-12 12:59:18 -05:00
Robyn Speer
7d1719cfb4 builder: Use an optional cutoff when merging counts
This keeps the Reddit-merging step from using a ludicrous amount
of memory.


Former-commit-id: 973caca253
2015-12-15 14:44:34 -05:00
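The cutoff described in the commit above can be sketched as follows: merge per-file word counts, then drop entries below the cutoff so that huge corpora like Reddit don't accumulate millions of rare entries in memory. The function name and signature are illustrative, not wordfreq_builder's actual API:

```python
from collections import Counter

def merge_counts(count_dicts, cutoff=0):
    """Merge word-count mappings; optionally discard counts below `cutoff`."""
    merged = Counter()
    for counts in count_dicts:
        merged.update(counts)  # Counter.update adds counts, not replaces
    if cutoff > 0:
        merged = Counter({word: n for word, n in merged.items() if n >= cutoff})
    return merged
```

For example, `merge_counts([{'the': 5, 'rare': 1}, {'the': 3}], cutoff=2)` yields `Counter({'the': 8})`, with the rare entry discarded.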
Robyn Speer
f5e09f3f3d gzip the intermediate step of Reddit word counting
Former-commit-id: 9a5d9d66bb
2015-12-09 13:30:08 -05:00
Robyn Speer
682e08fee2 no Thai because we can't tokenize it
Former-commit-id: 95f53e295b
2015-12-02 12:38:03 -05:00
Robyn Speer
064ee22a33 forgot about Italian
Former-commit-id: 8f6cd0e57b
2015-11-30 18:18:24 -05:00
Robyn Speer
ab8c2e2331 add tokenizer for Reddit
Former-commit-id: 5ef807117d
2015-11-30 18:16:54 -05:00
Robyn Speer
6d2709f064 add word frequencies from the Reddit 2007-2015 corpus
Former-commit-id: b2d7546d2d
2015-11-30 16:38:11 -05:00
Robyn Speer
7494ae27a7 fix missing word in rules.ninja comment
Former-commit-id: 9b1c4d66cd
2015-09-24 17:56:06 -04:00
Robyn Speer
d215f79ea3 describe the use of lang in read_values
Former-commit-id: f224b8dbba
2015-09-22 17:22:38 -04:00
Robyn Speer
e6e29a1c03 Make the jieba_deps comment make sense
Former-commit-id: 7c12f2aca1
2015-09-22 17:19:00 -04:00
Robyn Speer
f2be213933 Merge branch 'greek-and-turkish' into chinese-and-more
Conflicts:
	README.md
	wordfreq_builder/wordfreq_builder/ninja.py

Former-commit-id: 3cb3061e06
2015-09-10 15:27:33 -04:00
Robyn Speer
c5d5b0b1fe In ninja deps, remove 'startrow' as a variable
Former-commit-id: a4f8d11427
2015-09-10 13:46:19 -04:00
Robyn Speer
acddc3ca05 fix spelling of Marc
Former-commit-id: 2277ad3116
2015-09-09 13:35:02 -04:00
Robyn Speer
872556f7bb fixes based on code review notes
Former-commit-id: 354555514f
2015-09-09 13:10:18 -04:00
Robyn Speer
1d3521dfda take out OpenSubtitles for Chinese
Former-commit-id: d9c44d5fcc
2015-09-08 17:25:05 -04:00
Robyn Speer
59363c8c44 update comments in wordfreq_builder.config; remove unused 'version'
Former-commit-id: bc323eccaf
2015-09-08 16:15:29 -04:00
Robyn Speer
48f9d4520c sort Jieba wordlists consistently; update data files
Former-commit-id: 0ab23f8a28
2015-09-08 16:09:53 -04:00
Robyn Speer
4aef1dc338 don't do language-specific tokenization in freqs_to_cBpack
Tokenizing in the 'merge' step is sufficient.


Former-commit-id: bc8ebd23e9
2015-09-08 14:46:04 -04:00
Robyn Speer
64b0b76ee1 actually fix logic of apostrophe-fixing
Former-commit-id: 715361ca0d
2015-09-08 13:50:34 -04:00
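The apostrophe-fixing in the commits above concerns tokens that carry quote-like apostrophes on their edges while words such as "don't" keep theirs internally. A minimal sketch of edge-trimming under that assumption (an illustration, not wordfreq's exact rule; the repeated fix commits suggest the real logic handles trickier English cases):

```python
# Strip straight and curly apostrophes only from the edges of a token,
# preserving word-internal apostrophes as in "don't" or "o'clock".
APOSTROPHES = "'\u2019"

def trim_apostrophes(token: str) -> str:
    return token.strip(APOSTROPHES)
```

For example, `trim_apostrophes("'hello'")` gives `"hello"`, while `"don't"` is left untouched.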
Robyn Speer
d6d2eac920 fix logic of apostrophe-fixing
Former-commit-id: c4c1af8213
2015-09-08 13:47:58 -04:00
Robyn Speer
523806d6db fix '--language' option definition
Former-commit-id: 912171f8e7
2015-09-08 13:27:20 -04:00
Robyn Speer
099d90b700 Avoid Chinese tokenizer when building
Former-commit-id: 77a9b5c55b
2015-09-08 12:59:03 -04:00
Robyn Speer
3fa14ded28 language-specific frequency reading; fix 't in English
Former-commit-id: 9071defb33
2015-09-08 12:49:21 -04:00
Robyn Speer
1b35ff6b4c Merge branch 'apostrophe-fix' into chinese-scripts
Conflicts:
	wordfreq_builder/wordfreq_builder/word_counts.py

Former-commit-id: 20f2828d0a
2015-09-08 12:29:00 -04:00
Robyn Speer
319c3abaab WIP: fix apostrophe trimming
Former-commit-id: e39d345c4b
2015-09-08 12:28:28 -04:00
Robyn Speer
a4554fb87c tokenize Chinese using jieba and our own frequencies
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Robyn Speer
7d1c2e72e4 WIP: Traditional Chinese
Former-commit-id: 7906a671ea
2015-09-04 18:52:37 -04:00
Robyn Speer
5b9b2d2d02 add Polish and Swedish, which have sufficient data
Former-commit-id: 447d7e5134
2015-09-04 17:10:40 -04:00
Robyn Speer
a75a95658b We can put the cutoff back now
I took it out when a step in the English SUBTLEX process was outputting
frequencies instead of counts, but I've fixed that now.


Former-commit-id: 5c7a7ea83e
2015-09-04 16:16:52 -04:00