Robyn Speer
d8e3669a73
wordlist updates from new exquisite-corpus
2018-03-08 18:16:00 -05:00
Robyn Speer
53dc0bbb1a
Test that we can leave the wordlist unspecified and get 'large' freqs
2018-03-08 18:09:57 -05:00
Robyn Speer
8e3dff3c1c
Traditional Chinese should be preserved through tokenization
2018-03-08 18:08:55 -05:00
Robyn Speer
45064a292f
reorganize wordlists into 'small', 'large', and 'best'
2018-03-08 17:52:44 -05:00
Robyn Speer
fe85b4e124
fix az-Latn transliteration, and test
2018-03-08 16:47:36 -05:00
Robyn Speer
a4d9614e39
setup: update version number and dependencies
2018-03-08 16:26:24 -05:00
Robyn Speer
5ab5d2ea55
Separate preprocessing from tokenization
2018-03-08 16:26:17 -05:00
Robyn Speer
72646f16a1
minor fixes to README
2018-02-28 16:14:50 -05:00
Robyn Speer
cd7bfc4060
Merge pull request #54 from LuminosoInsight/fix-deps
...
Fix setup.py (version number and msgpack dependency)
2018-02-28 12:46:46 -08:00
Robyn Speer
208559ae1e
bump version to 1.7.0, belatedly
2018-02-28 15:15:47 -05:00
Robyn Speer
98cb47c774
update msgpack-python dependency to msgpack
2018-02-28 15:14:51 -05:00
Robyn Speer
ec9c94be92
update citation to v1.7
2017-09-27 13:36:30 -04:00
Andrew Lin
95a13ab4ce
Merge pull request #51 from LuminosoInsight/version1.7
...
Version 1.7: update tokenization, update Wikipedia data, add languages
2017-09-08 17:02:05 -04:00
Robyn Speer
b042f2be9d
remove unnecessary enumeration from top_n.py
2017-09-08 16:52:06 -04:00
Robyn Speer
fb4a7db6f7
update README for 1.7; sort language list in English order
2017-08-25 17:38:31 -04:00
Robyn Speer
46e32fbd36
v1.7: update tokenization, update data, add bn and mk
2017-08-25 17:37:48 -04:00
Robyn Speer
9dac967ca3
Tokenize by graphemes, not codepoints (#50)
...
* Tokenize by graphemes, not codepoints
* Add more documentation to TOKEN_RE
* Remove extra line break
* Update docstring - Brahmic scripts are no longer an exception
* approve using version 2017.07.28 of regex
2017-08-08 11:35:28 -04:00
Andrew Lin
6c118c0b6a
Merge pull request #49 from LuminosoInsight/restore-langcodes
...
Use langcodes when tokenizing again
2017-05-10 16:20:06 -04:00
Robyn Speer
aa3ed23282
v1.6.1: depend on langcodes 1.4
2017-05-10 13:26:23 -04:00
Robyn Speer
71a0ad6abb
Use langcodes when tokenizing again (it no longer connects to a DB)
2017-04-27 15:09:59 -04:00
Robyn Speer
ae7bc5764b
Merge pull request #48 from LuminosoInsight/code-review-notes
...
Code review notes
2017-02-15 12:29:25 -08:00
Andrew Lin
c2e1504643
Clarify the changelog.
2017-02-14 13:09:12 -05:00
Andrew Lin
1363f9d2e0
Correct a case in transliterate.py.
2017-02-14 13:08:23 -05:00
Andrew Lin
72e3678e89
Merge pull request #47 from LuminosoInsight/all-1.6-changes
...
All 1.6 changes
2017-02-01 15:36:38 -05:00
Robyn Speer
a099a5a881
Remove ninja2dot script, which is no longer used
2017-02-01 14:49:44 -05:00
Robyn Speer
7dec335f74
describe the current problem with 'cyrtranslit' as a dependency
2017-01-31 18:25:52 -05:00
Robyn Speer
19b72132e7
Fix some outdated numbers in English examples
2017-01-31 18:25:41 -05:00
Robyn Speer
abd0820a32
Handle smashing numbers only at the end of tokenize().
...
This does make the code a lot clearer.
2017-01-11 19:04:19 -05:00
Robyn Speer
93306e55a0
Update README with new examples and URL
2017-01-09 15:13:19 -05:00
Robyn Speer
9a6beb0089
test that number-smashing still happens in freq lookups
2017-01-06 19:20:41 -05:00
Robyn Speer
573ecc53d0
Don't smash numbers in *all* tokenization, just when looking up freqs
...
I forgot momentarily that the output of the tokenizer is used by other
code.
2017-01-06 19:18:52 -05:00
Robyn Speer
3cb3c38f47
update the README, citing OpenSubtitles 2016
2017-01-06 19:04:40 -05:00
Robyn Speer
86f22e8523
Mention that multi-digit numbers are combined together
2017-01-05 19:24:28 -05:00
Robyn Speer
48a5967e9a
mention tokenization change in changelog
2017-01-05 19:19:31 -05:00
Robyn Speer
39e459ac71
Update documentation and bump version to 1.6
2017-01-05 19:18:06 -05:00
Robyn Speer
23c7c8e936
update data from Exquisite Corpus in English and Swedish
2017-01-05 19:17:51 -05:00
Robyn Speer
7dc3f03ebd
import new wordlists from Exquisite Corpus
2017-01-05 17:59:26 -05:00
Robyn Speer
de32a15b4f
Merge branch 'transliterate-serbian' into all-1.6-changes
2017-01-05 17:57:52 -05:00
Robyn Speer
d66d04210f
transliterate: organize the 'borrowed letters' better
2017-01-05 13:23:20 -05:00
Robyn Speer
87b03325db
transliterate: Handle unexpected Russian invasions
2017-01-04 18:51:00 -05:00
Robyn Speer
c27e7f9b76
remove wordfreq_builder (obsoleted by exquisite-corpus)
2017-01-04 17:45:53 -05:00
Robyn Speer
6211b35fb3
Add transliteration of Cyrillic Serbian
2016-12-29 18:27:17 -05:00
Robyn Speer
0aa7ad46ae
fixes to tokenization
2016-12-13 14:43:29 -05:00
Robyn Speer
d6d528de74
Replace multi-digit sequences with zeroes
2016-12-09 15:55:08 -05:00
Robyn Speer
a8e2fa5acf
add a test for "aujourd'hui"
2016-12-06 17:39:40 -05:00
Robyn Speer
21a78f5eb9
Bake the 'h special case into the regex
...
This lets me remove the French-specific code I just put in.
2016-12-06 17:37:35 -05:00
Robyn Speer
82eba05f2d
eh, this is still version 1.5.2, not 1.6
2016-12-05 18:58:33 -05:00
Robyn Speer
4376636316
add a specific test in Catalan
2016-12-05 18:54:51 -05:00
Robyn Speer
ff5a8f2a65
add tests for French apostrophe tokenization
2016-12-05 18:54:51 -05:00
Robyn Speer
596368ac6e
fix tokenization of words like "l'heure"
2016-12-05 18:54:51 -05:00