Commit Graph

  • 0a5e6bd87a consolidate logic about MeCab path length Rob Speer 2016-08-04 15:16:20 -0400
  • 894a96ba7e Getting a newer mecab-ko-dic changed the Korean frequencies Rob Speer 2016-08-02 16:10:41 -0400
  • c11998e506 Getting a newer mecab-ko-dic changed the Korean frequencies Robyn Speer 2016-08-02 16:10:41 -0400
  • 09a904c0fe Getting a newer mecab-ko-dic changed the Korean frequencies Rob Speer 2016-08-02 16:10:41 -0400
  • 8a5d1b298d update find_mecab_dictionary docstring Rob Speer 2016-08-02 12:53:46 -0400
  • bc1cfc35c8 update find_mecab_dictionary docstring Robyn Speer 2016-08-02 12:53:46 -0400
  • c6c44939e6 update find_mecab_dictionary docstring Rob Speer 2016-08-02 12:53:46 -0400
  • 3dffb18557 remove my ad-hoc names for dictionary packages Rob Speer 2016-08-01 17:39:35 -0400
  • 9e55f8fed1 remove my ad-hoc names for dictionary packages Robyn Speer 2016-08-01 17:39:35 -0400
  • 188654396a remove my ad-hoc names for dictionary packages Rob Speer 2016-08-01 17:39:35 -0400
  • b3dd8479ab stop including MeCab dictionaries in the package Rob Speer 2016-08-01 17:37:41 -0400
  • 2787bfd647 stop including MeCab dictionaries in the package Robyn Speer 2016-08-01 17:37:41 -0400
  • 1519df503c stop including MeCab dictionaries in the package Rob Speer 2016-08-01 17:37:41 -0400
  • fcf2445c3e fix MeCab error message Rob Speer 2016-07-29 17:30:02 -0400
  • 875dd5669f fix MeCab error message Robyn Speer 2016-07-29 17:30:02 -0400
  • 410e8c255b fix MeCab error message Rob Speer 2016-07-29 17:30:02 -0400
  • afe6537994 Look for MeCab dictionaries in various places besides this package Rob Speer 2016-07-29 17:27:15 -0400
  • 94712c8312 Look for MeCab dictionaries in various places besides this package Robyn Speer 2016-07-29 17:27:15 -0400
  • c1927732d3 Look for MeCab dictionaries in various places besides this package Rob Speer 2016-07-29 17:27:15 -0400
  • 74892a0ac9 Make the almost-median deterministic when it rounds down to 0 Rob Speer 2016-07-29 12:34:09 -0400
  • ce5a91d732 Make the almost-median deterministic when it rounds down to 0 Robyn Speer 2016-07-29 12:34:09 -0400
  • 1aa63bca6c Make the almost-median deterministic when it rounds down to 0 Rob Speer 2016-07-29 12:34:09 -0400
  • 1a16b0f84c Code review fixes: avoid repeatedly constructing sets Rob Speer 2016-07-29 12:32:26 -0400
  • 15667ea023 Code review fixes: avoid repeatedly constructing sets Robyn Speer 2016-07-29 12:32:26 -0400
  • fcbdf560c2 Code review fixes: avoid repeatedly constructing sets Rob Speer 2016-07-29 12:32:26 -0400
  • 21246f881f Revise multilingual tests Rob Speer 2016-07-29 12:19:12 -0400
  • 68c6d95131 Revise multilingual tests Robyn Speer 2016-07-29 12:19:12 -0400
  • 99b627a300 Revise multilingual tests Rob Speer 2016-07-29 12:19:12 -0400
  • e6a8f028e3 Add Common Crawl data and more languages (#39) Rob Speer 2016-07-28 19:23:17 -0400
  • 2a41d4dc5e Add Common Crawl data and more languages (#39) Robyn Speer 2016-07-28 19:23:17 -0400
  • 9758c69ff0 Add Common Crawl data and more languages (#39) [staging-20160729, code-review-20160729] Rob Speer 2016-07-28 19:23:17 -0400
  • 542a5085bb Add comment explaining the max of 2 zeros in the median #39 Rob Speer 2016-07-28 11:31:53 -0400
  • 4f965ba1cb large Greek wasn't supposed to be built in the end Rob Speer 2016-07-26 15:27:31 -0400
  • ad3c39be8d The lol-test works more consistently when it uses the 'twitter' list Rob Speer 2016-07-26 14:50:58 -0400
  • dce407a0a1 Frequency-combiner with fewer special cases Rob Speer 2016-07-26 14:42:51 -0400
  • e0ba1a29c2 Update version to 1.5 Rob Speer 2016-07-25 16:49:59 -0400
  • 381a7ead98 Numbers don't need a boost, they already have Wikipedia Rob Speer 2016-07-25 16:46:22 -0400
  • ef729a877b add Korean laughter Rob Speer 2016-07-25 15:42:49 -0400
  • 1dff084308 Fix docstrings in wordfreq_builder.word_counts Rob Speer 2016-07-25 15:22:59 -0400
  • 9b19fe57ac Merge branch 'master' into common-crawl-counts Rob Speer 2016-07-25 15:15:19 -0400
  • 39bff0d5d8 Update data and tests from new build Rob Speer 2016-07-25 15:12:33 -0400
  • 8c069cb77b Remove kinda bad sources, combine with median Rob Speer 2016-07-25 15:06:01 -0400
  • 0b4e0c7a45 Update wordlists, tests, combining function Rob Speer 2016-07-21 15:35:42 -0400
  • e89a4de90b Combine word frequencies as the weighted median Rob Speer 2016-07-20 15:59:08 -0400
  • 029c6a19c1 Deal with commas and cedillas in Turkish and Romanian Rob Speer 2016-07-20 15:56:42 -0400
  • 7f67806951 new frequency-merging strategy Rob Speer 2016-07-19 19:12:02 -0400
  • 53e9e55c9d include Chinese from Wikipedia (mostly Trad.), Dutch big list Rob Speer 2016-07-15 18:29:39 -0400
  • ff855ab3a6 include Common Crawl; support 11 more languages Rob Speer 2016-07-15 16:00:27 -0400
  • 0a2bfb2710 Tokenization in Korean, plus abjad languages (#38) Robyn Speer 2016-07-15 15:10:25 -0400
  • a0893af82e Tokenization in Korean, plus abjad languages (#38) Rob Speer 2016-07-15 15:10:25 -0400
  • fec6eddcc3 Tokenization in Korean, plus abjad languages (#38) Rob Speer 2016-07-15 15:10:25 -0400
  • baecf92081 add a feature for counting tokens from the Common Crawl Rob Speer 2016-07-14 16:08:17 -0400
  • d1bbd7ed1c combine Japanese and Korean tokenization into the same function #38 Rob Speer 2016-07-14 15:28:36 -0400
  • 3dce920a67 fix terminology in docstrings about abjad scripts Rob Speer 2016-07-13 16:30:25 -0400
  • a2d781c526 add a Hebrew tokenization test Rob Speer 2016-07-13 16:24:27 -0400
  • 0dc5ea232f Add Korean tokenization, and include MeCab files in data Rob Speer 2016-07-13 16:19:13 -0400
  • 3e316c9688 Remove marks from more languages Rob Speer 2016-07-12 18:36:02 -0400
  • 3155cf27e6 Fix tokenization of SE Asian and South Asian scripts (#37) Robyn Speer 2016-07-01 18:00:57 -0400
  • ac24b8eab4 Fix tokenization of SE Asian and South Asian scripts (#37) [staging-20160715, code-review-20160715] Rob Speer 2016-07-01 18:00:57 -0400
  • 270f6c7ca6 Fix tokenization of SE Asian and South Asian scripts (#37) Rob Speer 2016-07-01 18:00:57 -0400
  • 965940bc51 Be more specific about the regex Script property #37 Rob Speer 2016-07-01 17:52:31 -0400
  • d352d18528 fix spelling in comment Rob Speer 2016-07-01 17:27:17 -0400
  • ea4a31904c try again, being more specific Rob Speer 2016-07-01 17:24:09 -0400
  • d0de7a55d0 Revise a bit of explanation Rob Speer 2016-07-01 17:21:46 -0400
  • 025c629e87 Fix the tokenization rule to require fewer exceptions Rob Speer 2016-07-01 16:36:50 -0400
  • 5ee2686b94 add another exception for Sindhi script Rob Speer 2016-07-01 15:47:09 -0400
  • 2ee7c79afd Add tests for more scripts Rob Speer 2016-07-01 12:56:41 -0400
  • 8bd41a48c9 Fix tokenization of SE Asian and South Asian scripts Rob Speer 2016-07-01 12:31:15 -0400
  • 8d09b68d37 wordfreq_builder: Document the extract_reddit pipeline Robyn Speer 2016-06-02 15:19:25 -0400
  • f539eecdd6 wordfreq_builder: Document the extract_reddit pipeline [staging-20160603, code-review-20160603] Rob Speer 2016-06-02 15:19:25 -0400
  • 88626aafee wordfreq_builder: Document the extract_reddit pipeline Rob Speer 2016-06-02 15:19:25 -0400
  • 046ca4cda3 Merge pull request #35 from LuminosoInsight/big-list-test-fix Andrew Lin 2016-05-11 17:20:01 -0400
  • 6eaae696fe Merge pull request #35 from LuminosoInsight/big-list-test-fix [staging-20160520, code-review-20160520] Andrew Lin 2016-05-11 17:20:01 -0400
  • 3a6d985203 Merge pull request #35 from LuminosoInsight/big-list-test-fix Andrew Lin 2016-05-11 17:20:01 -0400
  • c72326e4c0 fix Arabic test, where 'lol' is no longer common Robyn Speer 2016-05-11 17:01:47 -0400
  • c3fd3bd734 fix Arabic test, where 'lol' is no longer common Rob Speer 2016-05-11 17:01:47 -0400
  • da79dfb247 fix Arabic test, where 'lol' is no longer common #35 Rob Speer 2016-05-11 17:01:47 -0400
  • 7a55e0ed86 Merge pull request #34 from LuminosoInsight/big-list Andrew Lin 2016-05-11 16:27:51 -0400
  • 3c2a621743 Merge pull request #34 from LuminosoInsight/big-list Andrew Lin 2016-05-11 16:27:51 -0400
  • e7b34fb655 Merge pull request #34 from LuminosoInsight/big-list Andrew Lin 2016-05-11 16:27:51 -0400
  • 1ac6795709 fix to README: we're only using Reddit in English Robyn Speer 2016-05-11 15:38:29 -0400
  • 4e4c77e7d7 fix to README: we're only using Reddit in English Rob Speer 2016-05-11 15:38:29 -0400
  • dcb77a552b fix to README: we're only using Reddit in English #34 Rob Speer 2016-05-11 15:38:29 -0400
  • a0d93e0ce8 limit Reddit data to just English Robyn Speer 2016-04-15 17:01:21 -0400
  • c5bdc3c6bd limit Reddit data to just English Rob Speer 2016-04-15 17:01:21 -0400
  • 2276d97368 limit Reddit data to just English Rob Speer 2016-04-15 17:01:21 -0400
  • 5a37cc22c7 remove reddit_base_filename function Robyn Speer 2016-03-31 13:39:13 -0400
  • 6f11256ed1 remove reddit_base_filename function Rob Speer 2016-03-31 13:39:13 -0400
  • ced15d6eff remove reddit_base_filename function Rob Speer 2016-03-31 13:39:13 -0400
  • 797895047a use path.stem to make the Reddit filename prefix Robyn Speer 2016-03-31 13:13:52 -0400
  • d924c8e2a5 use path.stem to make the Reddit filename prefix Rob Speer 2016-03-31 13:13:52 -0400
  • ff1f0e4678 use path.stem to make the Reddit filename prefix Rob Speer 2016-03-31 13:13:52 -0400
  • a2bc90e430 rename max_size to max_words consistently Robyn Speer 2016-03-31 12:55:18 -0400
  • 9adc5b92f8 rename max_size to max_words consistently Rob Speer 2016-03-31 12:55:18 -0400
  • 16059d3b9a rename max_size to max_words consistently Rob Speer 2016-03-31 12:55:18 -0400
  • a9a4483ca3 fix table showing marginal Korean support Robyn Speer 2016-03-30 15:11:13 -0400
  • f4aa2cad7b fix table showing marginal Korean support Rob Speer 2016-03-30 15:11:13 -0400
  • 697842b3f9 fix table showing marginal Korean support Rob Speer 2016-03-30 15:11:13 -0400
  • 36885b5479 make an example clearer with wordlist='large' Robyn Speer 2016-03-30 15:08:32 -0400
  • 758e37af07 make an example clearer with wordlist='large' Rob Speer 2016-03-30 15:08:32 -0400