Robyn Speer
ed23bf3ebe
specifically test that the long sequence underflows to 0
2021-02-18 15:09:31 -05:00
Robyn Speer
75a56b68fb
change math for INFERRED_SPACE_FACTOR to not overflow
2021-02-18 14:44:39 -05:00
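[Editor's note] These two commits concern the penalty applied per inferred word break. A minimal sketch of the numeric issue, using the constant name from the commit message (the surrounding wordfreq code is assumed, not reproduced): Python float exponentiation raises OverflowError rather than returning inf, while repeated division merely underflows.

    INFERRED_SPACE_FACTOR = 10.0

    # 10.0 ** 999 raises OverflowError ("result too large"), so instead of
    # dividing by FACTOR ** (n - 1), apply one division per inferred break:
    freq = 1e-6
    for _ in range(999):
        freq /= INFERRED_SPACE_FACTOR
    print(freq)   # 0.0 -- the long sequence underflows to 0, as the test checks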
Robyn Speer
13ce4606b2
fix regex's inconsistent word breaking around apostrophes
2020-04-28 15:19:56 -04:00
Robyn Speer
86b928f967
include data from xc rebuild
2018-07-15 01:01:35 -04:00
Robyn Speer
65692c3d81
Recognize "@" in gender-neutral word endings as part of the token
2018-07-03 13:22:56 -04:00
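[Editor's note] For illustration, a toy pattern in the spirit of this change; wordfreq's real TOKEN_RE is much larger, and this exact pattern is a hypothetical stand-in. The idea: "@" attached to word characters stays inside the token, so Spanish gender-neutral spellings like "amig@s" survive tokenization.

    import regex  # the third-party 'regex' module, which wordfreq uses

    TOKEN_RE = regex.compile(r"\w+(?:@\w*)*")   # hypothetical, simplified
    print(TOKEN_RE.findall("l@s amig@s"))       # ['l@s', 'amig@s']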
Robyn Speer
7a32b56c1c
Round frequencies to 3 significant digits
2018-06-18 15:21:33 -04:00
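[Editor's note] Rounding to three significant digits is a few lines; a sketch of one way to do it (not necessarily how the wordfreq build pipeline does):

    from math import floor, log10

    def round_to_sig_figs(x, figs=3):
        # round to 'figs' significant digits, e.g. 0.00012345 -> 0.000123
        if x == 0:
            return 0.0
        return round(x, figs - 1 - floor(log10(abs(x))))

    round_to_sig_figs(0.00012345)   # 0.000123
    round_to_sig_figs(123456)       # 123000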
Robyn Speer
42efcfc1ad
relax the test that assumed the Chinese list has few ASCII words
2018-06-15 16:29:15 -04:00
Robyn Speer
ad0f046f47
fixes to tests, including that 'test.py' wasn't found by pytest
2018-06-15 15:48:41 -04:00
Robyn Speer
a975bcedae
update tests to include new languages
...
Also, it's easy to say `>=` in pytest
2018-06-12 17:55:44 -04:00
Robyn Speer
b3c42be331
port remaining tests to pytest
2018-06-01 16:40:51 -04:00
Robyn Speer
75b4d62084
port test.py and test_chinese.py to pytest
2018-06-01 16:33:06 -04:00
Robyn Speer
666f7e51fa
Handle Japanese edge cases in simple_tokenize
2018-04-26 15:53:07 -04:00
Robyn Speer
b2663272a7
remove LAUGHTER_WORDS, which is now unused
...
This was a fun Twitter test, but we don't do that anymore
2018-03-14 17:33:35 -04:00
Robyn Speer
53dc0bbb1a
Test that we can leave the wordlist unspecified and get 'large' freqs
2018-03-08 18:09:57 -05:00
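[Editor's note] A usage sketch of what such a test plausibly asserts: word_frequency defaults its wordlist argument to 'best', which per this commit should resolve to the 'large' list where one exists for the language.

    from wordfreq import word_frequency

    # assumed behavior for a language that has a 'large' list, e.g. English
    assert word_frequency('the', 'en') == word_frequency('the', 'en', wordlist='large')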
Robyn Speer
8e3dff3c1c
Traditional Chinese should be preserved through tokenization
2018-03-08 18:08:55 -05:00
Robyn Speer
45064a292f
reorganize wordlists into 'small', 'large', and 'best'
2018-03-08 17:52:44 -05:00
Robyn Speer
fe85b4e124
fix az-Latn transliteration, and test
2018-03-08 16:47:36 -05:00
Robyn Speer
5ab5d2ea55
Separate preprocessing from tokenization
2018-03-08 16:26:17 -05:00
Robyn Speer
46e32fbd36
v1.7: update tokenization, update data, add bn and mk
2017-08-25 17:37:48 -04:00
Robyn Speer
9dac967ca3
Tokenize by graphemes, not codepoints (#50)
...
* Tokenize by graphemes, not codepoints
* Add more documentation to TOKEN_RE
* Remove extra line break
* Update docstring - Brahmic scripts are no longer an exception
* approve using version 2017.07.28 of regex
2017-08-08 11:35:28 -04:00
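[Editor's note] The distinction the commit draws, in two lines, using the regex module's \X, which matches an extended grapheme cluster:

    import regex

    s = "e\u0301"                 # 'e' + COMBINING ACUTE ACCENT
    print(len(s))                 # 2 codepoints
    print(regex.findall(r"\X", s))   # ['é'] -- a single grapheme cluster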
Robyn Speer
71a0ad6abb
Use langcodes when tokenizing again (it no longer connects to a DB)
2017-04-27 15:09:59 -04:00
Robyn Speer
9a6beb0089
test that number-smashing still happens in freq lookups
2017-01-06 19:20:41 -05:00
Robyn Speer
573ecc53d0
Don't smash numbers in *all* tokenization, just when looking up freqs
...
I forgot momentarily that the output of the tokenizer is used by other
code.
2017-01-06 19:18:52 -05:00
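[Editor's note] A sketch of the idea with an assumed helper name (not wordfreq's exact code): collapse runs of digits to zeroes so that, say, all four-digit numbers share one frequency entry, and apply it only inside the frequency lookup, leaving tokenize() output untouched.

    import re

    def smash_numbers(token):
        # '2016' -> '0000'; single digits keep their own frequencies
        return re.sub(r"\d\d+", lambda m: "0" * len(m.group()), token)

    smash_numbers("2016")   # '0000'
    smash_numbers("1")      # '1'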
Robyn Speer
7dc3f03ebd
import new wordlists from Exquisite Corpus
2017-01-05 17:59:26 -05:00
Robyn Speer
87b03325db
transliterate: Handle unexpected Russian invasions
2017-01-04 18:51:00 -05:00
Robyn Speer
6211b35fb3
Add transliteration of Cyrillic Serbian
2016-12-29 18:27:17 -05:00
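[Editor's note] A minimal sketch of the mapping involved; the letters shown use the standard Serbian Cyrillic-to-Latin correspondences, including the digraph cases, but wordfreq's actual table and code are more complete.

    SR_CYR_TO_LAT = {
        "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
        "ж": "ž", "љ": "lj", "њ": "nj", "у": "u", "џ": "dž", "ш": "š",
        # ... the rest of the alphabet omitted in this sketch
    }

    def transliterate_sr(text):
        # letters missing from the sketch table pass through unchanged
        return "".join(SR_CYR_TO_LAT.get(ch, ch) for ch in text.lower())

    transliterate_sr("љубав")   # 'ljubav'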
Robyn Speer
a8e2fa5acf
add a test for "aujourd'hui"
2016-12-06 17:39:40 -05:00
Robyn Speer
21a78f5eb9
Bake the 'h special case into the regex
...
This lets me remove the French-specific code I just put in.
2016-12-06 17:37:35 -05:00
Robyn Speer
4376636316
add a specific test in Catalan
2016-12-05 18:54:51 -05:00
Robyn Speer
ff5a8f2a65
add tests for French apostrophe tokenization
2016-12-05 18:54:51 -05:00
Robyn Speer
68c6d95131
Revise multilingual tests
...
Former-commit-id: 21246f881f
2016-07-29 12:19:12 -04:00
Robyn Speer
2a41d4dc5e
Add Common Crawl data and more languages (#39)
...
This changes the version from 1.4.2 to 1.5. Things done in this update include:
* include Common Crawl; support 11 more languages
* new frequency-merging strategy
* New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list
* Remove kinda-bad sources, i.e. Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language and in Greek no longer qualifying as a 'large' language.
* Add Korean tokenization, and include MeCab files in data
* Remove marks from more languages
* Deal with commas and cedillas in Turkish and Romanian
Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
Robyn Speer
0a2bfb2710
Tokenization in Korean, plus abjad languages (#38)
...
* Remove marks from more languages
* Add Korean tokenization, and include MeCab files in data
* add a Hebrew tokenization test
* fix terminology in docstrings about abjad scripts
* combine Japanese and Korean tokenization into the same function
Former-commit-id: fec6eddcc3
2016-07-15 15:10:25 -04:00
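[Editor's note] The MeCab part of this, as a sketch: wordfreq bundles its own dictionary files and configuration, while this uses mecab-python3's plain API and assumes a Korean dictionary such as mecab-ko-dic is installed.

    import MeCab

    # '-Owakati' makes MeCab emit tokens separated by spaces
    tagger = MeCab.Tagger("-Owakati")
    tokens = tagger.parse("안녕하세요 세계").strip().split()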
Robyn Speer
3155cf27e6
Fix tokenization of SE Asian and South Asian scripts (#37)
...
Former-commit-id: 270f6c7ca6
2016-07-01 18:00:57 -04:00
Robyn Speer
c72326e4c0
fix Arabic test, where 'lol' is no longer common
...
Former-commit-id: da79dfb247
2016-05-11 17:01:47 -04:00
Robyn Speer
f25985379c
move Thai test to where it makes more sense
...
Former-commit-id: 4ec6b56faa
2016-03-10 11:56:15 -05:00
Robyn Speer
51e260b713
Leave Thai segments alone in the default regex
...
Our regex already has a special case to leave Chinese and Japanese alone
when an appropriate tokenizer for the language isn't being used, as
Unicode's default segmentation would make every character into its own
token.
The same thing happens in Thai, and we don't even *have* an appropriate
tokenizer for Thai, so I've added a similar fallback.
Former-commit-id: 07f16e6f03
2016-02-22 14:32:59 -05:00
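[Editor's note] A simplified illustration of the fallback (wordfreq's real TOKEN_RE is far more involved): a run of Thai script is kept as one token rather than split into one token per character.

    import regex

    # the Thai-script alternative is tried first, keeping the run intact
    print(regex.findall(r"\p{Thai}+|\w+", "สวัสดี hello"))
    # ['สวัสดี', 'hello']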
Robyn Speer
4a4534c466
test_chinese: fix typo in comment
...
Former-commit-id: 2a84a926f5
2015-09-24 13:41:11 -04:00
Robyn Speer
e15a231401
Merge branch 'master' into chinese-external-wordlist
...
Conflicts:
wordfreq/chinese.py
Former-commit-id: cea2a61444
2015-09-24 13:40:08 -04:00
Andrew Lin
e7d46fb104
Revert a small syntax change introduced by a circular series of changes.
...
Former-commit-id: 09597b7cf3
2015-09-24 13:24:11 -04:00
Robyn Speer
4d00f17477
don't apply the inferred-space penalty to Japanese
...
Former-commit-id: db5eda6051
2015-09-24 12:50:06 -04:00
Robyn Speer
9a007b9948
refactor the tokenizer, add include_punctuation option
...
Former-commit-id: e8e6e0a231
2015-09-15 13:26:09 -04:00
Robyn Speer
1adbb1aaf1
add external_wordlist option to tokenize
...
Former-commit-id: 669bd16c13
2015-09-10 18:09:41 -04:00
Robyn Speer
f0c7c3a02c
Lower the frequency of phrases with inferred token boundaries
...
Former-commit-id: 5c8c36f4e3
2015-09-10 14:16:22 -04:00
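[Editor's note] The combination being penalized, sketched with assumed names: token frequencies combine like parallel resistors (1/f = sum of 1/f_i), then each inferred boundary costs a constant factor. The 2021 commits near the top of this log rework this arithmetic to avoid overflow.

    INFERRED_SPACE_FACTOR = 10.0   # value assumed for illustration

    def phrase_frequency(token_freqs):
        # combine per-token frequencies, then penalize each inferred break
        freq = 1.0 / sum(1.0 / f for f in token_freqs)
        return freq / INFERRED_SPACE_FACTOR ** (len(token_freqs) - 1)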
Robyn Speer
a4554fb87c
tokenize Chinese using jieba and our own frequencies
...
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
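[Editor's note] Mechanically, this looks like the sketch below; wordfreq supplies jieba with its own frequency dictionary rather than the default one.

    import jieba

    tokens = list(jieba.cut("我喜欢自然语言处理"))
    # e.g. ['我', '喜欢', '自然语言', '处理'] -- the exact split depends on the dictionary
    # a custom wordlist can be supplied via jieba.Tokenizer(dictionary=path)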
Robyn Speer
4704131e13
add tests for Turkish
...
Former-commit-id: fc93c8dc9c
2015-09-04 17:00:05 -04:00
Robyn Speer
8795525372
Use the regex implementation of Unicode segmentation
...
Former-commit-id: 95998205ad
2015-08-24 17:11:08 -04:00
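[Editor's note] "The regex implementation" here is the third-party regex module, whose WORD flag makes \b and \B follow Unicode's default word segmentation (UAX #29). A simplified pattern in that spirit (wordfreq's actual one differs):

    import regex

    # match a non-space character and keep extending while there is no
    # Unicode word boundary
    TOKEN_RE = regex.compile(r"\S(?:\B\S)*", regex.WORD)
    print(TOKEN_RE.findall("can't stop, won't stop"))
    # ["can't", 'stop', ',', "won't", 'stop']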
Andrew Lin
e88cf3fdaf
Document the NFKC-normalized ligature in the Arabic test.
...
Former-commit-id: 41e1dd41d8
2015-08-03 11:09:44 -04:00
Andrew Lin
b0fac15f98
Switch to more explanatory Unicode escapes when testing NFKC normalization.
...
Former-commit-id: 66c69e6fac
2015-07-31 19:23:42 -04:00
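[Editor's note] Explicit escapes make this kind of test self-documenting. An illustrative example of NFKC unpacking an Arabic lam-alef ligature (whether this is the exact character in the test is an assumption):

    import unicodedata

    # U+FEFB ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM -> LAM + ALEF
    assert unicodedata.normalize("NFKC", "\uFEFB") == "\u0644\u0627"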
Joshua Chin
af8050f1b8
ensure removal of tatweels (hopefully)
...
Former-commit-id: 173278fdd3
2015-07-20 16:48:36 -04:00
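[Editor's note] Tatweel is U+0640, a cosmetic elongation character with no NFKC decomposition, so normalization alone won't drop it; the removal itself is a one-liner.

    word = "\u0645\u0640\u0640\u0631\u062D\u0628\u0627"   # 'م' + tatweels + 'رحبا'
    word.replace("\u0640", "")                            # -> 'مرحبا'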