Commit Graph

623 Commits

Author SHA1 Message Date
Robyn Speer
de81a23b9d code review fixes to __init__ 2018-03-14 15:04:59 -04:00
Robyn Speer
8656688b0b fix mention of dependencies in README 2018-03-14 15:01:08 -04:00
Robyn Speer
d68d4baad2 Subtle changes to CJK frequencies
This is the result of re-running exquisite-corpus via wordfreq 2.  The
frequencies for most languages were identical. Small changes that move
words by a few places in the list appeared in Chinese, Japanese, and
Korean. There are also even smaller changes in Bengali and Hindi.

The source of the CJK change is that Roman letters are case-folded
_before_ Jieba or MeCab tokenization, which changes their output in a
few cases.

In Hindi, one word changed frequency in the top 500. In Bengali, none of
those words changed frequency, but the data file is still different.
I'm not sure I have such a solid explanation here, except that these
languages use the regex tokenizer, and we just updated the regex
dependency, which could affect some edge cases of these languages.
2018-03-14 11:36:02 -04:00
Robyn Speer
0cb36aa74f cache the language info (avoids 10x slowdown) 2018-03-09 14:54:03 -05:00
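The caching idea in this commit can be sketched with memoization. This is a minimal illustration, not the code from the commit: the function `language_info`, the `LanguageInfo` tuple, and the tokenizer mapping are hypothetical names chosen for the example.

```python
from collections import namedtuple
from functools import lru_cache

# Hypothetical record type; the real language info may hold different fields.
LanguageInfo = namedtuple("LanguageInfo", ["code", "tokenizer"])

# Counts how often the expensive body actually runs (illustration only).
CALLS = {"count": 0}


@lru_cache(maxsize=None)
def language_info(language_code):
    """Hypothetical stand-in for an expensive per-language lookup."""
    CALLS["count"] += 1
    if language_code.startswith("zh"):
        tokenizer = "jieba"
    elif language_code in ("ja", "ko"):
        tokenizer = "mecab"
    else:
        tokenizer = "regex"
    return LanguageInfo(language_code, tokenizer)


# Repeated lookups for the same code hit the cache instead of recomputing.
for _ in range(1000):
    language_info("ja")
```

Returning an immutable tuple keeps callers from accidentally modifying the cached value.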
Robyn Speer
b162de353d avoid log spam: only warn about an unsupported language once 2018-03-09 11:50:15 -05:00
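One common way to warn only once per unsupported language is to remember which codes have already triggered a warning. The sketch below assumes a hypothetical `SUPPORTED` set and `check_language` helper; it is not taken from the commit itself.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical set of supported language codes, for illustration only.
SUPPORTED = {"en", "ja", "zh"}

# Codes that have already produced a warning.
_warned = set()


def check_language(language_code):
    """Return True if supported; warn at most once per unsupported code."""
    if language_code in SUPPORTED:
        return True
    if language_code not in _warned:
        _warned.add(language_code)
        logger.warning("Language %r is not supported.", language_code)
    return False
```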
Robyn Speer
c5f64a5de8 update the README 2018-03-08 18:16:15 -05:00
Robyn Speer
d8e3669a73 wordlist updates from new exquisite-corpus 2018-03-08 18:16:00 -05:00
Robyn Speer
53dc0bbb1a Test that we can leave the wordlist unspecified and get 'large' freqs 2018-03-08 18:09:57 -05:00
Robyn Speer
8e3dff3c1c Traditional Chinese should be preserved through tokenization 2018-03-08 18:08:55 -05:00
Robyn Speer
45064a292f reorganize wordlists into 'small', 'large', and 'best' 2018-03-08 17:52:44 -05:00
Robyn Speer
fe85b4e124 fix az-Latn transliteration, and test 2018-03-08 16:47:36 -05:00
Robyn Speer
a4d9614e39 setup: update version number and dependencies 2018-03-08 16:26:24 -05:00
Robyn Speer
5ab5d2ea55 Separate preprocessing from tokenization 2018-03-08 16:26:17 -05:00
Robyn Speer
72646f16a1 minor fixes to README 2018-02-28 16:14:50 -05:00
Robyn Speer
cd7bfc4060 Merge pull request #54 from LuminosoInsight/fix-deps
Fix setup.py (version number and msgpack dependency)
2018-02-28 12:46:46 -08:00
Robyn Speer
208559ae1e bump version to 1.7.0, belatedly 2018-02-28 15:15:47 -05:00
Robyn Speer
98cb47c774 update msgpack-python dependency to msgpack 2018-02-28 15:14:51 -05:00
Robyn Speer
ec9c94be92 update citation to v1.7 2017-09-27 13:36:30 -04:00
Andrew Lin
95a13ab4ce Merge pull request #51 from LuminosoInsight/version1.7
Version 1.7: update tokenization, update Wikipedia data, add languages
2017-09-08 17:02:05 -04:00
Robyn Speer
b042f2be9d remove unnecessary enumeration from top_n.py 2017-09-08 16:52:06 -04:00
Robyn Speer
fb4a7db6f7 update README for 1.7; sort language list in English order 2017-08-25 17:38:31 -04:00
Robyn Speer
46e32fbd36 v1.7: update tokenization, update data, add bn and mk 2017-08-25 17:37:48 -04:00
Robyn Speer
9dac967ca3 Tokenize by graphemes, not codepoints (#50)
* Tokenize by graphemes, not codepoints

* Add more documentation to TOKEN_RE

* Remove extra line break

* Update docstring - Brahmic scripts are no longer an exception

* approve using version 2017.07.28 of regex
2017-08-08 11:35:28 -04:00
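The difference between codepoints and graphemes can be illustrated with the standard library alone. The sketch below is a rough approximation that attaches every combining mark to the character before it; it does not implement the full Unicode segmentation rules, and it is not the TOKEN_RE change made in this commit. The helper name `approximate_graphemes` is hypothetical.

```python
import unicodedata


def approximate_graphemes(text):
    """Group each base character with the combining marks that follow it.

    Approximation only: any character whose Unicode category starts with
    "M" (Mn, Mc, Me) is attached to the preceding cluster.
    """
    clusters = []
    for char in text:
        if clusters and unicodedata.category(char).startswith("M"):
            clusters[-1] += char
        else:
            clusters.append(char)
    return clusters


# A Hindi word written as six codepoints (escaped to avoid encoding issues).
word = "\u0928\u092e\u0938\u094d\u0924\u0947"
codepoint_count = len(word)
cluster_count = len(approximate_graphemes(word))
```

Splitting on codepoints would separate vowel signs and other marks from their base letters, which is why grapheme-level matching matters for Brahmic scripts.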
Andrew Lin
6c118c0b6a Merge pull request #49 from LuminosoInsight/restore-langcodes
Use langcodes when tokenizing again
2017-05-10 16:20:06 -04:00
Robyn Speer
aa3ed23282 v1.6.1: depend on langcodes 1.4 2017-05-10 13:26:23 -04:00
Robyn Speer
71a0ad6abb Use langcodes when tokenizing again (it no longer connects to a DB) 2017-04-27 15:09:59 -04:00
Robyn Speer
ae7bc5764b Merge pull request #48 from LuminosoInsight/code-review-notes
Code review notes
2017-02-15 12:29:25 -08:00
Andrew Lin
c2e1504643 Clarify the changelog. 2017-02-14 13:09:12 -05:00
Andrew Lin
1363f9d2e0 Correct a case in transliterate.py. 2017-02-14 13:08:23 -05:00
Andrew Lin
72e3678e89 Merge pull request #47 from LuminosoInsight/all-1.6-changes
All 1.6 changes
2017-02-01 15:36:38 -05:00
Robyn Speer
a099a5a881 Remove ninja2dot script, which is no longer used 2017-02-01 14:49:44 -05:00
Robyn Speer
7dec335f74 describe the current problem with 'cyrtranslit' as a dependency 2017-01-31 18:25:52 -05:00
Robyn Speer
19b72132e7 Fix some outdated numbers in English examples 2017-01-31 18:25:41 -05:00
Robyn Speer
abd0820a32 Handle smashing numbers only at the end of tokenize().
This does make the code a lot clearer.
2017-01-11 19:04:19 -05:00
Robyn Speer
93306e55a0 Update README with new examples and URL 2017-01-09 15:13:19 -05:00
Robyn Speer
9a6beb0089 test that number-smashing still happens in freq lookups 2017-01-06 19:20:41 -05:00
Robyn Speer
573ecc53d0 Don't smash numbers in *all* tokenization, just when looking up freqs
I forgot momentarily that the output of the tokenizer is used by other
code.
2017-01-06 19:18:52 -05:00
Robyn Speer
3cb3c38f47 update the README, citing OpenSubtitles 2016 2017-01-06 19:04:40 -05:00
Robyn Speer
86f22e8523 Mention that multi-digit numbers are combined together 2017-01-05 19:24:28 -05:00
Robyn Speer
48a5967e9a mention tokenization change in changelog 2017-01-05 19:19:31 -05:00
Robyn Speer
39e459ac71 Update documentation and bump version to 1.6 2017-01-05 19:18:06 -05:00
Robyn Speer
23c7c8e936 update data from Exquisite Corpus in English and Swedish 2017-01-05 19:17:51 -05:00
Robyn Speer
7dc3f03ebd import new wordlists from Exquisite Corpus 2017-01-05 17:59:26 -05:00
Robyn Speer
de32a15b4f Merge branch 'transliterate-serbian' into all-1.6-changes 2017-01-05 17:57:52 -05:00
Robyn Speer
d66d04210f transliterate: organize the 'borrowed letters' better 2017-01-05 13:23:20 -05:00
Robyn Speer
87b03325db transliterate: Handle unexpected Russian invasions 2017-01-04 18:51:00 -05:00
Robyn Speer
c27e7f9b76 remove wordfreq_builder (obsoleted by exquisite-corpus) 2017-01-04 17:45:53 -05:00
Robyn Speer
6211b35fb3 Add transliteration of Cyrillic Serbian 2016-12-29 18:27:17 -05:00
Robyn Speer
0aa7ad46ae fixes to tokenization 2016-12-13 14:43:29 -05:00
Robyn Speer
d6d528de74 Replace multi-digit sequences with zeroes 2016-12-09 15:55:08 -05:00
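Replacing multi-digit sequences with zeroes can be sketched with a regular expression. This is a minimal illustration under assumptions: the name `smash_numbers` and the exact pattern are chosen for the example, and the commit's own pattern may treat separators or other cases differently.

```python
import re

# Two or more consecutive digits; single digits are left alone.
MULTI_DIGIT = re.compile(r"\d{2,}")


def smash_numbers(token):
    """Replace each digit in a multi-digit run with '0'."""
    return MULTI_DIGIT.sub(lambda m: "0" * len(m.group(0)), token)


smashed = [smash_numbers(t) for t in ["1999", "7", "24h"]]
```

Under this scheme all four-digit numbers share one frequency entry. Later commits in this log apply the step only when looking up frequencies, so the tokenizer's own output keeps the original digits.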