Commit Graph

449 Commits

Author SHA1 Message Date
Rob Speer
9758c69ff0 Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5.  Things done in this update include:

* include Common Crawl; support 11 more languages

* new frequency-merging strategy

* New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list

* Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all.

* Add Korean tokenization, and include MeCab files in data

* Remove marks from more languages

* Deal with commas and cedillas in Turkish and Romanian



Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
Rob Speer
a0893af82e Tokenization in Korean, plus abjad languages (#38)
* Remove marks from more languages

* Add Korean tokenization, and include MeCab files in data

* add a Hebrew tokenization test

* fix terminology in docstrings about abjad scripts

* combine Japanese and Korean tokenization into the same function


Former-commit-id: fec6eddcc3
2016-07-15 15:10:25 -04:00
Rob Speer
ac24b8eab4 Fix tokenization of SE Asian and South Asian scripts (#37)
Former-commit-id: 270f6c7ca6
2016-07-01 18:00:57 -04:00
Rob Speer
f539eecdd6 wordfreq_builder: Document the extract_reddit pipeline
Former-commit-id: 88626aafee
2016-06-02 15:19:25 -04:00
Andrew Lin
6eaae696fe Merge pull request #35 from LuminosoInsight/big-list-test-fix
fix Arabic test, where 'lol' is no longer common

Former-commit-id: 3a6d985203
2016-05-11 17:20:01 -04:00
Rob Speer
c3fd3bd734 fix Arabic test, where 'lol' is no longer common
Former-commit-id: da79dfb247
2016-05-11 17:01:47 -04:00
Andrew Lin
3c2a621743 Merge pull request #34 from LuminosoInsight/big-list
wordfreq 1.4: some bigger wordlists, better use of language detection

Former-commit-id: e7b34fb655
2016-05-11 16:27:51 -04:00
Rob Speer
4e4c77e7d7 fix to README: we're only using Reddit in English
Former-commit-id: dcb77a552b
2016-05-11 15:38:29 -04:00
Rob Speer
c5bdc3c6bd limit Reddit data to just English
Former-commit-id: 2276d97368
2016-04-15 17:01:21 -04:00
Rob Speer
6f11256ed1 remove reddit_base_filename function
Former-commit-id: ced15d6eff
2016-03-31 13:39:13 -04:00
Rob Speer
d924c8e2a5 use path.stem to make the Reddit filename prefix
Former-commit-id: ff1f0e4678
2016-03-31 13:13:52 -04:00
Rob Speer
9adc5b92f8 rename max_size to max_words consistently
Former-commit-id: 16059d3b9a
2016-03-31 12:55:18 -04:00
Rob Speer
f4aa2cad7b fix table showing marginal Korean support
Former-commit-id: 697842b3f9
2016-03-30 15:11:13 -04:00
Rob Speer
758e37af07 make an example clearer with wordlist='large'
Former-commit-id: ed32b278cc
2016-03-30 15:08:32 -04:00
Rob Speer
c82073270b update wordlists for new builder settings
Former-commit-id: a10c1d7ac0
2016-03-28 12:26:47 -04:00
Rob Speer
3e34dbdd38 Discard text detected as an uncommon language; add large German list
Former-commit-id: abbc295538
2016-03-28 12:26:02 -04:00
Rob Speer
1c4a2077a4 oh look, more spam
Former-commit-id: 08130908c7
2016-03-24 18:42:47 -04:00
Rob Speer
cebf99f7ba filter out downvoted Reddit posts
Former-commit-id: 5b98794b86
2016-03-24 18:05:13 -04:00
Rob Speer
fe6d8fea85 disregard Arabic Reddit spam
Former-commit-id: cfe68893fa
2016-03-24 17:44:30 -04:00
Rob Speer
d2cc42936f fix extraneous dot in intermediate filenames
Former-commit-id: 6feae99381
2016-03-24 16:52:44 -04:00
Rob Speer
28028115c2 bump version to 1.4
Former-commit-id: 1df97a579e
2016-03-24 16:29:29 -04:00
Rob Speer
c3364ef821 actually use the results of language-detection on Reddit
Former-commit-id: 75a4a92110
2016-03-24 16:27:24 -04:00
Rob Speer
a5fcfd100d Merge remote-tracking branch 'origin/master' into big-list
Conflicts:
	wordfreq_builder/wordfreq_builder/cli/merge_counts.py

Former-commit-id: 164a5b1a05
2016-03-24 14:11:44 -04:00
Rob Speer
670ab12f54 make max-words a real, documented parameter
Former-commit-id: 178a8b1494
2016-03-24 14:10:02 -04:00
Rob Speer
384cd6a9fc Merge pull request #33 from LuminosoInsight/bugfix
Restore a missing comma.

Former-commit-id: 7b539f9057
2016-03-24 13:59:50 -04:00
Andrew Lin
c85146e156 Restore a missing comma.
Former-commit-id: 38016cf62b
2016-03-24 13:57:18 -04:00
Andrew Lin
241956ed7c Merge pull request #32 from LuminosoInsight/thai-fix
Leave Thai segments alone in the default regex

Former-commit-id: 84497429e1
2016-03-10 11:57:44 -05:00
Rob Speer
c2eab6881e move Thai test to where it makes more sense
Former-commit-id: 4ec6b56faa
2016-03-10 11:56:15 -05:00
Rob Speer
a32162c04f Leave Thai segments alone in the default regex
Our regex already has a special case to leave Chinese and Japanese alone
when an appropriate tokenizer for the language isn't being used, as
Unicode's default segmentation would make every character into its own
token.

The same thing happens in Thai, and we don't even *have* an appropriate
tokenizer for Thai, so I've added a similar fallback.


Former-commit-id: 07f16e6f03
2016-02-22 14:32:59 -05:00
Rob Speer
23c5c4adca Add and document large wordlists
Former-commit-id: d79ee37da9
2016-01-22 16:23:43 -05:00
Rob Speer
3b95d349e0 configuration that builds some larger lists
Former-commit-id: c1a12cebec
2016-01-22 14:20:12 -05:00
Rob Speer
35ee23591e add Zipf scale
Former-commit-id: 9907948d11
2016-01-21 14:07:01 -05:00
slibs63
258f5088e9 Merge pull request #30 from LuminosoInsight/add-reddit
Add English data from Reddit corpus

Former-commit-id: d18fee3d78
2016-01-14 15:52:39 -05:00
Rob Speer
ee8cfb5a50 fix documentation in wordfreq_builder.tokenizers
Former-commit-id: 8ddc19a5ca
2016-01-13 15:18:12 -05:00
Rob Speer
56f830d678 reformat some argparse argument definitions
Former-commit-id: 511fcb6f91
2016-01-13 12:05:07 -05:00
Rob Speer
f4761029d0 build a bigger wordlist that we can optionally use
Former-commit-id: df8caaff7d
2016-01-12 14:05:57 -05:00
Rob Speer
83bd019efe fix usage text: one comment, not one tweet
Former-commit-id: 8d9668d8ab
2016-01-12 13:05:38 -05:00
Rob Speer
1d3485c855 Separate tokens with spaces, not line breaks, in intermediate files
Former-commit-id: 115c74583e
2016-01-12 12:59:18 -05:00
Andrew Lin
c9f679a7a3 Merge pull request #31 from LuminosoInsight/use_encoding
Specify encoding when dealing with files

Former-commit-id: f30efebba0
2015-12-23 16:13:47 -05:00
Sara Jewett
7b6f88b059 Specify encoding when dealing with files
Former-commit-id: 37f9e12b93
2015-12-23 15:49:13 -05:00
Rob Speer
6d62a8ff51 builder: Use an optional cutoff when merging counts
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.


Former-commit-id: 973caca253
2015-12-15 14:44:34 -05:00
Rob Speer
4e985e3bca gzip the intermediate step of Reddit word counting
Former-commit-id: 9a5d9d66bb
2015-12-09 13:30:08 -05:00
Rob Speer
dc94222d7d no Thai because we can't tokenize it
Former-commit-id: 95f53e295b
2015-12-02 12:38:03 -05:00
Rob Speer
237fabb4c5 forgot about Italian
Former-commit-id: 8f6cd0e57b
2015-11-30 18:18:24 -05:00
Rob Speer
6caa9ca443 add tokenizer for Reddit
Former-commit-id: 5ef807117d
2015-11-30 18:16:54 -05:00
Rob Speer
9a1b00ba0c rebuild data files
Former-commit-id: 2dcf368481
2015-11-30 17:06:39 -05:00
Rob Speer
d1b667909d add word frequencies from the Reddit 2007-2015 corpus
Former-commit-id: b2d7546d2d
2015-11-30 16:38:11 -05:00
Rob Speer
49b8ba4be9 add docstrings to chinese_ and japanese_tokenize
Former-commit-id: e1f7a1ccf3
2015-10-27 13:23:56 -04:00
Lance Nathan
f47249064f Merge pull request #28 from LuminosoInsight/chinese-external-wordlist
Add some tokenizer options

Former-commit-id: ca00dfa1d9
2015-10-19 18:21:52 -04:00
Rob Speer
668a985969 Define globals in relevant places
Former-commit-id: a6b6aa07e7
2015-10-19 18:15:54 -04:00