Commit Graph

510 Commits

Author SHA1 Message Date
Sara Jewett
7b6f88b059 Specify encoding when dealing with files
Former-commit-id: 37f9e12b93
2015-12-23 15:49:13 -05:00
Rob Speer
6d62a8ff51 builder: Use an optional cutoff when merging counts
This allows the Reddit-merging step to not use such a ludicrous amount
of memory.

Former-commit-id: 973caca253
2015-12-15 14:44:34 -05:00
Rob Speer
4e985e3bca gzip the intermediate step of Reddit word counting
Former-commit-id: 9a5d9d66bb
2015-12-09 13:30:08 -05:00
Rob Speer
dc94222d7d no Thai because we can't tokenize it
Former-commit-id: 95f53e295b
2015-12-02 12:38:03 -05:00
Rob Speer
237fabb4c5 forgot about Italian
Former-commit-id: 8f6cd0e57b
2015-11-30 18:18:24 -05:00
Rob Speer
6caa9ca443 add tokenizer for Reddit
Former-commit-id: 5ef807117d
2015-11-30 18:16:54 -05:00
Rob Speer
9a1b00ba0c rebuild data files
Former-commit-id: 2dcf368481
2015-11-30 17:06:39 -05:00
Rob Speer
d1b667909d add word frequencies from the Reddit 2007-2015 corpus
Former-commit-id: b2d7546d2d
2015-11-30 16:38:11 -05:00
Rob Speer
49b8ba4be9 add docstrings to chinese_ and japanese_tokenize
Former-commit-id: e1f7a1ccf3
2015-10-27 13:23:56 -04:00
Lance Nathan
f47249064f Merge pull request #28 from LuminosoInsight/chinese-external-wordlist
Add some tokenizer options

Former-commit-id: ca00dfa1d9
2015-10-19 18:21:52 -04:00
Rob Speer
668a985969 Define globals in relevant places
Former-commit-id: a6b6aa07e7
2015-10-19 18:15:54 -04:00
Rob Speer
f255eb5bd8 clarify the tokenize docstring
Former-commit-id: bfc17fea9f
2015-10-19 12:18:12 -04:00
Rob Speer
8fea2ca181 Merge branch 'master' into chinese-external-wordlist
Conflicts:
	wordfreq/chinese.py

Former-commit-id: 1793c1bb2e
2015-09-28 14:34:59 -04:00
Andrew Lin
d8422852f4 Merge pull request #29 from LuminosoInsight/code-review-notes-20150925
Fix documentation and clean up, based on Sep 25 code review

Former-commit-id: 15d99be21b
2015-09-28 13:53:50 -04:00
Rob Speer
3bd1fe2fe6 Fix documentation and clean up, based on Sep 25 code review
Former-commit-id: 44b0c4f9ba
2015-09-28 12:58:46 -04:00
Rob Speer
7435c8f57a fix missing word in rules.ninja comment
Former-commit-id: 9b1c4d66cd
2015-09-24 17:56:06 -04:00
Rob Speer
7c596de98a describe optional dependencies better in the README
Former-commit-id: b460eef444
2015-09-24 17:54:52 -04:00
Rob Speer
28381d5a51 update and clean up the tokenize() docstring
Former-commit-id: 24b16d8a5d
2015-09-24 17:47:16 -04:00
Rob Speer
f89ac5e400 test_chinese: fix typo in comment
Former-commit-id: 2a84a926f5
2015-09-24 13:41:11 -04:00
Rob Speer
faf66e9b08 Merge branch 'master' into chinese-external-wordlist
Conflicts:
	wordfreq/chinese.py

Former-commit-id: cea2a61444
2015-09-24 13:40:08 -04:00
Andrew Lin
c53bb06988 Revert "Remove the no-longer-existent .txt files from the MANIFEST."
This reverts commit 65d6645e81 [formerly db41bc7902].

Former-commit-id: cd0797e1c8
2015-09-24 13:31:34 -04:00
Andrew Lin
566a62abd5 Merge pull request #27 from LuminosoInsight/chinese-and-more
Improve Chinese, Greek, English; add Turkish, Polish, Swedish

Former-commit-id: 710eaabbe1
2015-09-24 13:25:21 -04:00
Andrew Lin
ee6df56514 Revert a small syntax change introduced by a circular series of changes.
Former-commit-id: 09597b7cf3
2015-09-24 13:24:11 -04:00
Rob Speer
1b7117952b don't apply the inferred-space penalty to Japanese
Former-commit-id: db5eda6051
2015-09-24 12:50:06 -04:00
Andrew Lin
4ccfcdc1bd Revert "Remove the no-longer-existent .txt files from the MANIFEST."
This reverts commit 65d6645e81 [formerly db41bc7902].

Former-commit-id: bb70bdba58
2015-09-23 13:02:40 -04:00
Rob Speer
88deef24f6 describe the use of lang in read_values
Former-commit-id: f224b8dbba
2015-09-22 17:22:38 -04:00
Rob Speer
7cb310b28e Make the jieba_deps comment make sense
Former-commit-id: 7c12f2aca1
2015-09-22 17:19:00 -04:00
Rob Speer
d68dd9f568 actually, still delay loading the Jieba tokenizer
Former-commit-id: 48734d1a60
2015-09-22 16:54:39 -04:00
Rob Speer
0e4daa8472 replace the literal 10 with the constant INFERRED_SPACE_FACTOR
Former-commit-id: 7a3ea2bf79
2015-09-22 16:46:07 -04:00
Rob Speer
5929975338 remove unnecessary delayed loads in wordfreq.chinese
Former-commit-id: 4a87890afd
2015-09-22 16:42:13 -04:00
Rob Speer
42ccba4fa6 load the Chinese character mapping from a .msgpack.gz file
Former-commit-id: 6cf4210187
2015-09-22 16:32:33 -04:00
Rob Speer
e12a42f38a document what this file is for
Former-commit-id: 06f8b29971
2015-09-22 15:31:27 -04:00
Rob Speer
76c4a8975a fix README conflict
Former-commit-id: 5b918e7bb0
2015-09-22 14:23:55 -04:00
Rob Speer
963e0ff785 refactor the tokenizer, add include_punctuation option
Former-commit-id: e8e6e0a231
2015-09-15 13:26:09 -04:00
Rob Speer
e3a79ab8c9 add external_wordlist option to tokenize
Former-commit-id: 669bd16c13
2015-09-10 18:09:41 -04:00
Rob Speer
7f92557a58 Merge branch 'greek-and-turkish' into chinese-and-more
Conflicts:
	README.md
	wordfreq_builder/wordfreq_builder/ninja.py

Former-commit-id: 3cb3061e06
2015-09-10 15:27:33 -04:00
Rob Speer
a13f459f88 Lower the frequency of phrases with inferred token boundaries
Former-commit-id: 5c8c36f4e3
2015-09-10 14:16:22 -04:00
Andrew Lin
800039f0f8 Merge pull request #26 from LuminosoInsight/greek-and-turkish
Add SUBTLEX, support Turkish, expand Greek

Former-commit-id: acbb25e6f6
2015-09-10 13:48:33 -04:00
Rob Speer
e3cc8eaea9 In ninja deps, remove 'startrow' as a variable
Former-commit-id: a4f8d11427
2015-09-10 13:46:19 -04:00
Rob Speer
5701c1165d fix spelling of Marc
Former-commit-id: 2277ad3116
2015-09-09 13:35:02 -04:00
Rob Speer
9c08442dc5 fixes based on code review notes
Former-commit-id: 354555514f
2015-09-09 13:10:18 -04:00
Rob Speer
37e5e1009f fix SUBTLEX citations
Former-commit-id: 6502f15e9b
2015-09-08 17:45:25 -04:00
Rob Speer
0f9497d864 take out OpenSubtitles for Chinese
Former-commit-id: d9c44d5fcc
2015-09-08 17:25:05 -04:00
Rob Speer
5e86394c4c update comments in wordfreq_builder.config; remove unused 'version'
Former-commit-id: bc323eccaf
2015-09-08 16:15:29 -04:00
Rob Speer
2dfaf7798d sort Jieba wordlists consistently; update data files
Former-commit-id: 0ab23f8a28
2015-09-08 16:09:53 -04:00
Rob Speer
01332f1ed5 don't do language-specific tokenization in freqs_to_cBpack
Tokenizing in the 'merge' step is sufficient.

Former-commit-id: bc8ebd23e9
2015-09-08 14:46:04 -04:00
Rob Speer
86475d6b5f actually fix logic of apostrophe-fixing
Former-commit-id: 715361ca0d
2015-09-08 13:50:34 -04:00
Rob Speer
6bd0979ad2 fix logic of apostrophe-fixing
Former-commit-id: c4c1af8213
2015-09-08 13:47:58 -04:00
Rob Speer
8c3fb9f716 fix '--language' option definition
Former-commit-id: 912171f8e7
2015-09-08 13:27:20 -04:00
Rob Speer
67bb55988e Avoid Chinese tokenizer when building
Former-commit-id: 77a9b5c55b
2015-09-08 12:59:03 -04:00