Commit Graph

622 Commits

Author SHA1 Message Date
Rob Speer
a6bb267f89 fix mention of dependencies in README 2018-03-14 15:01:08 -04:00
Rob Speer
bac3dcb620 Subtle changes to CJK frequencies
This is the result of re-running exquisite-corpus via wordfreq 2.  The
frequencies for most languages were identical. Small changes that move
words by a few places in the list appeared in Chinese, Japanese, and
Korean. There are also even smaller changes in Bengali and Hindi.

The source of the CJK change is that Roman letters are case-folded
_before_ Jieba or MeCab tokenization, which changes their output in a
few cases.

In Hindi, one word changed frequency in the top 500. In Bengali, none of
those words changed frequency, but the data file is still different.
I'm not sure I have such a solid explanation here, except that these
languages use the regex tokenizer, and we just updated the regex
dependency, which could affect some edge cases of these languages.
2018-03-14 11:36:02 -04:00
Rob Speer
e64f409c55 cache the language info (avoids 10x slowdown) 2018-03-09 14:54:03 -05:00
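The kind of caching described in e64f409c55 can be pictured with a minimal sketch. It assumes an expensive per-language lookup whose result never changes for a given language code. The name `get_language_info`, the `CALLS` list, and the returned fields are hypothetical and are not taken from the wordfreq source.

```python
from functools import lru_cache

CALLS = []  # records each time the expensive lookup body actually runs


@lru_cache(maxsize=None)
def get_language_info(language: str) -> tuple:
    """Hypothetical stand-in for an expensive per-language lookup."""
    CALLS.append(language)
    # Assumed, illustrative result: (language code, tokenizer name).
    tokenizer = "jieba" if language == "zh" else "regex"
    return (language, tokenizer)
```

Repeated calls with the same language code are answered from the cache, so the function body runs once per language.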
Rob Speer
11e758672e avoid log spam: only warn about an unsupported language once 2018-03-09 11:50:15 -05:00
Rob Speer
49a603ea63 update the README 2018-03-08 18:16:15 -05:00
Rob Speer
92784d1768 wordlist updates from new exquisite-corpus 2018-03-08 18:16:00 -05:00
Rob Speer
1594ba3ad6 Test that we can leave the wordlist unspecified and get 'large' freqs 2018-03-08 18:09:57 -05:00
Rob Speer
47dac3b0b8 Traditional Chinese should be preserved through tokenization 2018-03-08 18:08:55 -05:00
Rob Speer
5a5acec9ff reorganize wordlists into 'small', 'large', and 'best' 2018-03-08 17:52:44 -05:00
Rob Speer
67e4475763 fix az-Latn transliteration, and test 2018-03-08 16:47:36 -05:00
Rob Speer
a42cf312ef setup: update version number and dependencies 2018-03-08 16:26:24 -05:00
Rob Speer
45b9bcdbcb Separate preprocessing from tokenization 2018-03-08 16:26:17 -05:00
Rob Speer
846606d892 minor fixes to README 2018-02-28 16:14:50 -05:00
Rob Speer
ad677e12fd
Merge pull request #54 from LuminosoInsight/fix-deps
Fix setup.py (version number and msgpack dependency)
2018-02-28 12:46:46 -08:00
Rob Speer
aadb19c9a3 bump version to 1.7.0, belatedly 2018-02-28 15:15:47 -05:00
Rob Speer
db56528fb6 update msgpack-python dependency to msgpack 2018-02-28 15:14:51 -05:00
Rob Speer
843ed92223 update citation to v1.7 2017-09-27 13:36:30 -04:00
Andrew Lin
721a1e9fd9 Merge pull request #51 from LuminosoInsight/version1.7
Version 1.7: update tokenization, update Wikipedia data, add languages
2017-09-08 17:02:05 -04:00
Rob Speer
61b2e4062d remove unnecessary enumeration from top_n.py 2017-09-08 16:52:06 -04:00
Rob Speer
396b0f78df update README for 1.7; sort language list in English order 2017-08-25 17:38:31 -04:00
Rob Speer
e3352392cc v1.7: update tokenization, update data, add bn and mk 2017-08-25 17:37:48 -04:00
Rob Speer
dcef5813b3 Tokenize by graphemes, not codepoints (#50)
* Tokenize by graphemes, not codepoints

* Add more documentation to TOKEN_RE

* Remove extra line break

* Update docstring - Brahmic scripts are no longer an exception

* approve using version 2017.07.28 of regex
2017-08-08 11:35:28 -04:00
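The difference between codepoints and graphemes in dcef5813b3 can be illustrated with a rough sketch that uses only the standard library. It is an approximation and an assumption of this example, not the project's tokenizer: it attaches characters with a nonzero canonical combining class to the preceding character, which misses spacing marks and other cases covered by full grapheme cluster segmentation (the third-party `regex` module matches full clusters with `\X`). The function name `approximate_graphemes` is hypothetical.

```python
import unicodedata


def approximate_graphemes(text: str) -> list:
    """Group each base character with the combining marks that follow it.

    Approximation only: relies on the canonical combining class, so it does
    not implement full Unicode grapheme cluster rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters
```

For example, "e" followed by U+0301 (combining acute accent) is two codepoints but is grouped here as a single unit.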
Andrew Lin
baf6771e97 Merge pull request #49 from LuminosoInsight/restore-langcodes
Use langcodes when tokenizing again
2017-05-10 16:20:06 -04:00
Rob Speer
37b4914970 v1.6.1: depend on langcodes 1.4 2017-05-10 13:26:23 -04:00
Rob Speer
d6cdef6039 Use langcodes when tokenizing again (it no longer connects to a DB) 2017-04-27 15:09:59 -04:00
Rob Speer
97042e6f60 Merge pull request #48 from LuminosoInsight/code-review-notes
Code review notes
2017-02-15 12:29:25 -08:00
Andrew Lin
f28a193015 Clarify the changelog. 2017-02-14 13:09:12 -05:00
Andrew Lin
e21bcc2a58 Correct a case in transliterate.py. 2017-02-14 13:08:23 -05:00
Andrew Lin
21b331e898 Merge pull request #47 from LuminosoInsight/all-1.6-changes
All 1.6 changes
2017-02-01 15:36:38 -05:00
Rob Speer
b5b653f0a1 Remove ninja2dot script, which is no longer used 2017-02-01 14:49:44 -05:00
Rob Speer
391a723662 describe the current problem with 'cyrtranslit' as a dependency 2017-01-31 18:25:52 -05:00
Rob Speer
7fa5e7fc22 Fix some outdated numbers in English examples 2017-01-31 18:25:41 -05:00
Rob Speer
68e4ce16cf Handle smashing numbers only at the end of tokenize().
This does make the code a lot clearer.
2017-01-11 19:04:19 -05:00
Rob Speer
e6114bf0fa Update README with new examples and URL 2017-01-09 15:13:19 -05:00
Rob Speer
f03a37e19c test that number-smashing still happens in freq lookups 2017-01-06 19:20:41 -05:00
Rob Speer
4dfa800cd8 Don't smash numbers in *all* tokenization, just when looking up freqs
I forgot momentarily that the output of the tokenizer is used by other
code.
2017-01-06 19:18:52 -05:00
Rob Speer
d2bb5b78f3 update the README, citing OpenSubtitles 2016 2017-01-06 19:04:40 -05:00
Rob Speer
3f9c8449ff Mention that multi-digit numbers are combined together 2017-01-05 19:24:28 -05:00
Rob Speer
a05a1c8d5c mention tokenization change in changelog 2017-01-05 19:19:31 -05:00
Rob Speer
803ebc25bb Update documentation and bump version to 1.6 2017-01-05 19:18:06 -05:00
Rob Speer
f9238ac30f update data from Exquisite Corpus in English and Swedish 2017-01-05 19:17:51 -05:00
Rob Speer
f671a1db7f import new wordlists from Exquisite Corpus 2017-01-05 17:59:26 -05:00
Rob Speer
847b85c5b8 Merge branch 'transliterate-serbian' into all-1.6-changes 2017-01-05 17:57:52 -05:00
Rob Speer
e4f40a0ce9 transliterate: organize the 'borrowed letters' better 2017-01-05 13:23:20 -05:00
Rob Speer
99eac54b31 transliterate: Handle unexpected Russian invasions 2017-01-04 18:51:00 -05:00
Rob Speer
6171b3d066 remove wordfreq_builder (obsoleted by exquisite-corpus) 2017-01-04 17:45:53 -05:00
Rob Speer
b3e5d1c9e9 Add transliteration of Cyrillic Serbian 2016-12-29 18:27:17 -05:00
Rob Speer
d376f4e2e2 fixes to tokenization 2016-12-13 14:43:29 -05:00
Rob Speer
bb5df3b074 Replace multi-digit sequences with zeroes 2016-12-09 15:55:08 -05:00
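The idea in bb5df3b074 of replacing multi-digit sequences with zeroes can be sketched as below. This is a minimal illustration, assuming that a "multi-digit sequence" means two or more consecutive digits and that each digit becomes one "0". The pattern and the function name are choices made for this example and may differ from the actual implementation.

```python
import re

# Two or more consecutive digits (an assumption of this sketch).
MULTI_DIGIT = re.compile(r"\d{2,}")


def smash_numbers(token: str) -> str:
    """Replace every digit in a multi-digit run with '0', keeping the length."""
    return MULTI_DIGIT.sub(lambda m: "0" * len(m.group()), token)
```

Under these assumptions, tokens such as "2016" and "1999" both map to "0000", so their frequencies can be pooled under one entry, while a single digit such as "7" is left unchanged.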
Rob Speer
24e26c4c1d add a test for "aujourd'hui" 2016-12-06 17:39:40 -05:00