Rob Speer
9758c69ff0
Add Common Crawl data and more languages ( #39 )
...
This changes the version from 1.4.2 to 1.5. Things done in this update include:
* Include Common Crawl; support 11 more languages
* New frequency-merging strategy
* New sources: Chinese from Wikipedia (mostly Traditional), Dutch big list
* Remove low-quality sources: Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is no longer an available language, and Greek is no longer a 'large' language.
* Add Korean tokenization, and include MeCab files in data
* Remove marks from more languages
* Deal with commas and cedillas in Turkish and Romanian
Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
Rob Speer
a0893af82e
Tokenization in Korean, plus abjad languages ( #38 )
...
* Remove marks from more languages
* Add Korean tokenization, and include MeCab files in data
* add a Hebrew tokenization test
* fix terminology in docstrings about abjad scripts
* combine Japanese and Korean tokenization into the same function
Former-commit-id: fec6eddcc3
2016-07-15 15:10:25 -04:00
Rob Speer
ac24b8eab4
Fix tokenization of SE Asian and South Asian scripts ( #37 )
...
Former-commit-id: 270f6c7ca6
2016-07-01 18:00:57 -04:00
Rob Speer
f539eecdd6
wordfreq_builder: Document the extract_reddit pipeline
...
Former-commit-id: 88626aafee
2016-06-02 15:19:25 -04:00
Andrew Lin
6eaae696fe
Merge pull request #35 from LuminosoInsight/big-list-test-fix
...
fix Arabic test, where 'lol' is no longer common
Former-commit-id: 3a6d985203
2016-05-11 17:20:01 -04:00
Rob Speer
c3fd3bd734
fix Arabic test, where 'lol' is no longer common
...
Former-commit-id: da79dfb247
2016-05-11 17:01:47 -04:00
Andrew Lin
3c2a621743
Merge pull request #34 from LuminosoInsight/big-list
...
wordfreq 1.4: some bigger wordlists, better use of language detection
Former-commit-id: e7b34fb655
2016-05-11 16:27:51 -04:00
Rob Speer
4e4c77e7d7
fix to README: we're only using Reddit in English
...
Former-commit-id: dcb77a552b
2016-05-11 15:38:29 -04:00
Rob Speer
c5bdc3c6bd
limit Reddit data to just English
...
Former-commit-id: 2276d97368
2016-04-15 17:01:21 -04:00
Rob Speer
6f11256ed1
remove reddit_base_filename function
...
Former-commit-id: ced15d6eff
2016-03-31 13:39:13 -04:00
Rob Speer
d924c8e2a5
use path.stem to make the Reddit filename prefix
...
Former-commit-id: ff1f0e4678
2016-03-31 13:13:52 -04:00
Rob Speer
9adc5b92f8
rename max_size to max_words consistently
...
Former-commit-id: 16059d3b9a
2016-03-31 12:55:18 -04:00
Rob Speer
f4aa2cad7b
fix table showing marginal Korean support
...
Former-commit-id: 697842b3f9
2016-03-30 15:11:13 -04:00
Rob Speer
758e37af07
make an example clearer with wordlist='large'
...
Former-commit-id: ed32b278cc
2016-03-30 15:08:32 -04:00
Rob Speer
c82073270b
update wordlists for new builder settings
...
Former-commit-id: a10c1d7ac0
2016-03-28 12:26:47 -04:00
Rob Speer
3e34dbdd38
Discard text detected as an uncommon language; add large German list
...
Former-commit-id: abbc295538
2016-03-28 12:26:02 -04:00
Rob Speer
1c4a2077a4
oh look, more spam
...
Former-commit-id: 08130908c7
2016-03-24 18:42:47 -04:00
Rob Speer
cebf99f7ba
filter out downvoted Reddit posts
...
Former-commit-id: 5b98794b86
2016-03-24 18:05:13 -04:00
Rob Speer
fe6d8fea85
disregard Arabic Reddit spam
...
Former-commit-id: cfe68893fa
2016-03-24 17:44:30 -04:00
Rob Speer
d2cc42936f
fix extraneous dot in intermediate filenames
...
Former-commit-id: 6feae99381
2016-03-24 16:52:44 -04:00
Rob Speer
28028115c2
bump version to 1.4
...
Former-commit-id: 1df97a579e
2016-03-24 16:29:29 -04:00
Rob Speer
c3364ef821
actually use the results of language-detection on Reddit
...
Former-commit-id: 75a4a92110
2016-03-24 16:27:24 -04:00
Rob Speer
a5fcfd100d
Merge remote-tracking branch 'origin/master' into big-list
...
Conflicts:
wordfreq_builder/wordfreq_builder/cli/merge_counts.py
Former-commit-id: 164a5b1a05
2016-03-24 14:11:44 -04:00
Rob Speer
670ab12f54
make max-words a real, documented parameter
...
Former-commit-id: 178a8b1494
2016-03-24 14:10:02 -04:00
Rob Speer
384cd6a9fc
Merge pull request #33 from LuminosoInsight/bugfix
...
Restore a missing comma.
Former-commit-id: 7b539f9057
2016-03-24 13:59:50 -04:00
Andrew Lin
c85146e156
Restore a missing comma.
...
Former-commit-id: 38016cf62b
2016-03-24 13:57:18 -04:00
Andrew Lin
241956ed7c
Merge pull request #32 from LuminosoInsight/thai-fix
...
Leave Thai segments alone in the default regex
Former-commit-id: 84497429e1
2016-03-10 11:57:44 -05:00
Rob Speer
c2eab6881e
move Thai test to where it makes more sense
...
Former-commit-id: 4ec6b56faa
2016-03-10 11:56:15 -05:00
Rob Speer
a32162c04f
Leave Thai segments alone in the default regex
...
Our regex already has a special case to leave Chinese and Japanese alone
when an appropriate tokenizer for the language isn't being used, as
Unicode's default segmentation would make every character into its own
token.
The same thing happens in Thai, and we don't even *have* an appropriate
tokenizer for Thai, so I've added a similar fallback.
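The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the regex from this commit: it assumes the standard `re` module, approximates each script with a hard-coded code-point range, and uses a hypothetical function name, `fallback_tokenize`. A run of characters from a script written without spaces is kept as one token instead of being split character by character.

```python
import re

# Assumption: rough code-point ranges standing in for Thai, Japanese kana,
# and common CJK ideographs. The real regex may cover these scripts differently.
SPACELESS_RANGES = "\u0e00-\u0e7f\u3040-\u30ff\u4e00-\u9fff"

# First alternative: an unbroken run of "spaceless" script characters.
# Second alternative: a run of ordinary word characters from other scripts.
TOKEN_RE = re.compile(f"[{SPACELESS_RANGES}]+|[^\\W{SPACELESS_RANGES}]+")


def fallback_tokenize(text):
    """Hypothetical helper: split text into tokens, leaving runs of
    Thai/Chinese/Japanese characters intact."""
    return TOKEN_RE.findall(text)
```

With this sketch, `fallback_tokenize("hello ภาษาไทย world")` keeps the Thai run as a single token between the two English words.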
Former-commit-id: 07f16e6f03
2016-02-22 14:32:59 -05:00
Rob Speer
23c5c4adca
Add and document large wordlists
...
Former-commit-id: d79ee37da9
2016-01-22 16:23:43 -05:00
Rob Speer
3b95d349e0
configuration that builds some larger lists
...
Former-commit-id: c1a12cebec
2016-01-22 14:20:12 -05:00
Rob Speer
35ee23591e
add Zipf scale
...
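For reference, the Zipf scale is commonly defined as the base-10 logarithm of a word's frequency per billion words. A minimal sketch of the conversion, assuming frequencies are stored as proportions between 0 and 1 (the function names here are hypothetical, not taken from this commit):

```python
import math


def freq_to_zipf(freq):
    """Convert a proportion (0 < freq <= 1) to the Zipf scale.

    log10(freq * 1e9) is the same as log10(freq) + 9.
    """
    return math.log10(freq) + 9


def zipf_to_freq(zipf):
    """Convert a Zipf-scale value back to a proportion."""
    return 10 ** (zipf - 9)
```

Under this definition, a word that occurs once per million words has a Zipf value of about 3, and one that occurs once per thousand words has a Zipf value of about 6.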
Former-commit-id: 9907948d11
2016-01-21 14:07:01 -05:00
slibs63
258f5088e9
Merge pull request #30 from LuminosoInsight/add-reddit
...
Add English data from Reddit corpus
Former-commit-id: d18fee3d78
2016-01-14 15:52:39 -05:00
Rob Speer
ee8cfb5a50
fix documentation in wordfreq_builder.tokenizers
...
Former-commit-id: 8ddc19a5ca
2016-01-13 15:18:12 -05:00
Rob Speer
56f830d678
reformat some argparse argument definitions
...
Former-commit-id: 511fcb6f91
2016-01-13 12:05:07 -05:00
Rob Speer
f4761029d0
build a bigger wordlist that we can optionally use
...
Former-commit-id: df8caaff7d
2016-01-12 14:05:57 -05:00
Rob Speer
83bd019efe
fix usage text: one comment, not one tweet
...
Former-commit-id: 8d9668d8ab
2016-01-12 13:05:38 -05:00
Rob Speer
1d3485c855
Separate tokens with spaces, not line breaks, in intermediate files
...
Former-commit-id: 115c74583e
2016-01-12 12:59:18 -05:00
Andrew Lin
c9f679a7a3
Merge pull request #31 from LuminosoInsight/use_encoding
...
Specify encoding when dealing with files
Former-commit-id: f30efebba0
2015-12-23 16:13:47 -05:00
Sara Jewett
7b6f88b059
Specify encoding when dealing with files
...
Former-commit-id: 37f9e12b93
2015-12-23 15:49:13 -05:00
Rob Speer
6d62a8ff51
builder: Use an optional cutoff when merging counts
...
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.
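One way such a cutoff could work is sketched below. This is an illustration under stated assumptions, not the builder's actual code: the function name `merge_counts_sketch`, its `cutoff` parameter, and the input shape (iterables of `(word, count)` pairs) are all hypothetical. Entries below the cutoff are skipped before they reach the merged table, so the long tail of rare words never occupies memory.

```python
from collections import Counter


def merge_counts_sketch(sources, cutoff=0):
    """Hypothetical helper: sum word counts across several sources.

    Each source is an iterable of (word, count) pairs. Pairs whose count
    is below `cutoff` are dropped before merging.
    """
    merged = Counter()
    for source in sources:
        for word, count in source:
            if count < cutoff:
                continue  # rare entry: never stored in the merged table
            merged[word] += count
    return merged
```

The trade-off in this sketch is that a word falling below the cutoff in every individual source is lost even if its combined count would have been above it.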
Former-commit-id: 973caca253
2015-12-15 14:44:34 -05:00
Rob Speer
4e985e3bca
gzip the intermediate step of Reddit word counting
...
Former-commit-id: 9a5d9d66bb
2015-12-09 13:30:08 -05:00
Rob Speer
dc94222d7d
no Thai because we can't tokenize it
...
Former-commit-id: 95f53e295b
2015-12-02 12:38:03 -05:00
Rob Speer
237fabb4c5
forgot about Italian
...
Former-commit-id: 8f6cd0e57b
2015-11-30 18:18:24 -05:00
Rob Speer
6caa9ca443
add tokenizer for Reddit
...
Former-commit-id: 5ef807117d
2015-11-30 18:16:54 -05:00
Rob Speer
9a1b00ba0c
rebuild data files
...
Former-commit-id: 2dcf368481
2015-11-30 17:06:39 -05:00
Rob Speer
d1b667909d
add word frequencies from the Reddit 2007-2015 corpus
...
Former-commit-id: b2d7546d2d
2015-11-30 16:38:11 -05:00
Rob Speer
49b8ba4be9
add docstrings to chinese_ and japanese_tokenize
...
Former-commit-id: e1f7a1ccf3
2015-10-27 13:23:56 -04:00
Lance Nathan
f47249064f
Merge pull request #28 from LuminosoInsight/chinese-external-wordlist
...
Add some tokenizer options
Former-commit-id: ca00dfa1d9
2015-10-19 18:21:52 -04:00
Rob Speer
668a985969
Define globals in relevant places
...
Former-commit-id: a6b6aa07e7
2015-10-19 18:15:54 -04:00