Commit Graph

88 Commits

Author SHA1 Message Date
Elia Robyn Lake
ed7dccbf8b update version and documentation 2022-03-10 19:12:45 -05:00
Elia Robyn Lake
bf05b1b1dc estimate the freq distribution of numbers 2022-03-10 18:33:42 -05:00
Elia Robyn Speer
b60ac1b803 Merge remote-tracking branch 'origin/apostrophe-consistency' 2021-09-02 18:13:53 +00:00
Elia Robyn Speer
c2a9fe03f1 use ftfy's uncurl_quotes in lossy_tokenize 2021-09-02 17:47:47 +00:00
Robyn Speer
08816a21d1 Remove Malayalam; support for it isn't ready
There are Unicode normalization problems with Malayalam -- as best I understand
it, Unicode simply neglected to include normalization forms for Malayalam "chillu"
characters even though they changed how they're represented in Unicode 5.1 and
again in Unicode 9.

The result is that words that print the same end up with multiple entries, with
different codepoint sequences that don't normalize to each other.

I certainly don't know how to resolve this, and it would need to be resolved to
have something that we could reasonably call Malayalam word frequencies.
2021-03-30 14:10:58 -04:00
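The normalization problem described in the commit above can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch, not code from the repository: the constant and function names are made up for illustration, and it assumes the two spellings of chillu-n shown below (the atomic letter added in Unicode 5.1 and the older consonant + virama + ZWJ sequence) are representative of the duplicate entries the commit refers to.

```python
import unicodedata

# Two encodings of the Malayalam chillu-n that are meant to render the same.
ATOMIC_CHILLU_N = "\u0d7b"              # MALAYALAM LETTER CHILLU N (Unicode 5.1)
LEGACY_CHILLU_N = "\u0d28\u0d4d\u200d"  # NA + VIRAMA + ZERO WIDTH JOINER

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalized_forms(text):
    """Return the text under each standard Unicode normalization form."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


def forms_that_unify(a, b):
    """List the normalization forms under which a and b become identical."""
    forms_a, forms_b = normalized_forms(a), normalized_forms(b)
    return [form for form in FORMS if forms_a[form] == forms_b[form]]


# No standard form maps one spelling onto the other, so a wordlist keyed on
# normalized strings would still hold two entries for the same visible word.
print(forms_that_unify(ATOMIC_CHILLU_N, LEGACY_CHILLU_N))
```

For contrast, `forms_that_unify("\u00e9", "e\u0301")` reports all four forms, because the precomposed and decomposed spellings of "é" are canonically equivalent.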
Robyn Speer
90f0e0a88e Update table, remove Galician (only two sources) 2021-03-30 13:17:36 -04:00
Robyn Speer
8777ad0811 remove Swahili, the data isn't reliable 2021-03-29 18:15:58 -04:00
Robyn Speer
ec48c0a123 update data and tests for 2.5 2021-03-29 16:18:08 -04:00
Robyn Speer
ed23bf3ebe specifically test that the long sequence underflows to 0 2021-02-18 15:09:31 -05:00
Robyn Speer
75a56b68fb change math for INFERRED_SPACE_FACTOR to not overflow 2021-02-18 14:44:39 -05:00
Robyn Speer
ad02d96f1b update dependencies and test for consistent results 2020-09-08 16:03:33 -04:00
Robyn Speer
13ce4606b2 fix regex's inconsistent word breaking around apostrophes 2020-04-28 15:19:56 -04:00
Robyn Speer
86b928f967 include data from xc rebuild 2018-07-15 01:01:35 -04:00
Robyn Speer
65692c3d81 Recognize "@" in gender-neutral word endings as part of the token 2018-07-03 13:22:56 -04:00
Robyn Speer
7a32b56c1c Round frequencies to 3 significant digits 2018-06-18 15:21:33 -04:00
Robyn Speer
42efcfc1ad relax the test that assumed the Chinese list has few ASCII words 2018-06-15 16:29:15 -04:00
Robyn Speer
ad0f046f47 fixes to tests, including that 'test.py' wasn't found by pytest 2018-06-15 15:48:41 -04:00
Robyn Speer
a975bcedae update tests to include new languages
Also, it's easy to say `>=` in pytest
2018-06-12 17:55:44 -04:00
Robyn Speer
b3c42be331 port remaining tests to pytest 2018-06-01 16:40:51 -04:00
Robyn Speer
75b4d62084 port test.py and test_chinese.py to pytest 2018-06-01 16:33:06 -04:00
Robyn Speer
666f7e51fa Handle Japanese edge cases in simple_tokenize 2018-04-26 15:53:07 -04:00
Robyn Speer
b2663272a7 remove LAUGHTER_WORDS, which is now unused
This was a fun Twitter test, but we don't do that anymore
2018-03-14 17:33:35 -04:00
Robyn Speer
53dc0bbb1a Test that we can leave the wordlist unspecified and get 'large' freqs 2018-03-08 18:09:57 -05:00
Robyn Speer
8e3dff3c1c Traditional Chinese should be preserved through tokenization 2018-03-08 18:08:55 -05:00
Robyn Speer
45064a292f reorganize wordlists into 'small', 'large', and 'best' 2018-03-08 17:52:44 -05:00
Robyn Speer
fe85b4e124 fix az-Latn transliteration, and test 2018-03-08 16:47:36 -05:00
Robyn Speer
5ab5d2ea55 Separate preprocessing from tokenization 2018-03-08 16:26:17 -05:00
Robyn Speer
46e32fbd36 v1.7: update tokenization, update data, add bn and mk 2017-08-25 17:37:48 -04:00
Robyn Speer
9dac967ca3 Tokenize by graphemes, not codepoints (#50)
* Tokenize by graphemes, not codepoints

* Add more documentation to TOKEN_RE

* Remove extra line break

* Update docstring - Brahmic scripts are no longer an exception

* approve using version 2017.07.28 of regex
2017-08-08 11:35:28 -04:00
Robyn Speer
71a0ad6abb Use langcodes when tokenizing again (it no longer connects to a DB) 2017-04-27 15:09:59 -04:00
Robyn Speer
9a6beb0089 test that number-smashing still happens in freq lookups 2017-01-06 19:20:41 -05:00
Robyn Speer
573ecc53d0 Don't smash numbers in *all* tokenization, just when looking up freqs
I forgot momentarily that the output of the tokenizer is used by other
code.
2017-01-06 19:18:52 -05:00
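The separation described in the commit above (the tokenizer leaves numbers alone; only the frequency lookup "smashes" them) might look roughly like the following. This is a hedged sketch under stated assumptions, not the library's code: `tokenize`, `smash_numbers`, `lookup_frequency`, the `MULTI_DIGIT_RE` pattern, and the choice to replace each digit with `0` are all hypothetical stand-ins for illustration.

```python
import re

# Hypothetical pattern: a run of two or more digits.
MULTI_DIGIT_RE = re.compile(r"\d{2,}")


def tokenize(text):
    """Stand-in tokenizer: lowercases and splits on whitespace.

    Digits are left untouched, so other code that consumes the tokens
    still sees the original numbers.
    """
    return text.lower().split()


def smash_numbers(token):
    """Replace every digit in a multi-digit run with '0'."""
    return MULTI_DIGIT_RE.sub(lambda m: "0" * len(m.group()), token)


def lookup_frequency(token, freqs):
    """Look up a token, smashing numbers only at lookup time."""
    return freqs.get(smash_numbers(token), 0.0)


# Made-up frequency table for demonstration only.
freqs = {"in": 0.02, "0000": 0.0001}
tokens = tokenize("Born in 1987")
print(tokens)
print([lookup_frequency(t, freqs) for t in tokens])
```

Keeping the smashing inside the lookup means all four-digit numbers share one frequency entry while callers of the tokenizer still receive `1987` unchanged.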
Robyn Speer
7dc3f03ebd import new wordlists from Exquisite Corpus 2017-01-05 17:59:26 -05:00
Robyn Speer
87b03325db transliterate: Handle unexpected Russian invasions 2017-01-04 18:51:00 -05:00
Robyn Speer
6211b35fb3 Add transliteration of Cyrillic Serbian 2016-12-29 18:27:17 -05:00
Robyn Speer
a8e2fa5acf add a test for "aujourd'hui" 2016-12-06 17:39:40 -05:00
Robyn Speer
21a78f5eb9 Bake the 'h special case into the regex
This lets me remove the French-specific code I just put in.
2016-12-06 17:37:35 -05:00
Robyn Speer
4376636316 add a specific test in Catalan 2016-12-05 18:54:51 -05:00
Robyn Speer
ff5a8f2a65 add tests for French apostrophe tokenization 2016-12-05 18:54:51 -05:00
Robyn Speer
68c6d95131 Revise multilingual tests
Former-commit-id: 21246f881f
2016-07-29 12:19:12 -04:00
Robyn Speer
2a41d4dc5e Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5.  Things done in this update include:

* include Common Crawl; support 11 more languages

* new frequency-merging strategy

* New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list

* Remove kinda bad sources, i.e. Greek Twitter (too often kaomoji are detected as Greek) and Ukrainian Common Crawl. This results in dropping Ukrainian as an available language, and causing Greek to not be a 'large' language after all.

* Add Korean tokenization, and include MeCab files in data

* Remove marks from more languages

* Deal with commas and cedillas in Turkish and Romanian



Former-commit-id: e6a8f028e3
2016-07-28 19:23:17 -04:00
Robyn Speer
0a2bfb2710 Tokenization in Korean, plus abjad languages (#38)
* Remove marks from more languages

* Add Korean tokenization, and include MeCab files in data

* add a Hebrew tokenization test

* fix terminology in docstrings about abjad scripts

* combine Japanese and Korean tokenization into the same function


Former-commit-id: fec6eddcc3
2016-07-15 15:10:25 -04:00
Robyn Speer
3155cf27e6 Fix tokenization of SE Asian and South Asian scripts (#37)
Former-commit-id: 270f6c7ca6
2016-07-01 18:00:57 -04:00
Robyn Speer
c72326e4c0 fix Arabic test, where 'lol' is no longer common
Former-commit-id: da79dfb247
2016-05-11 17:01:47 -04:00
Robyn Speer
f25985379c move Thai test to where it makes more sense
Former-commit-id: 4ec6b56faa
2016-03-10 11:56:15 -05:00
Robyn Speer
51e260b713 Leave Thai segments alone in the default regex
Our regex already has a special case to leave Chinese and Japanese alone
when an appropriate tokenizer for the language isn't being used, as
Unicode's default segmentation would make every character into its own
token.

The same thing happens in Thai, and we don't even *have* an appropriate
tokenizer for Thai, so I've added a similar fallback.


Former-commit-id: 07f16e6f03
2016-02-22 14:32:59 -05:00
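The fallback described in the commit above, leaving Thai text in unsegmented runs rather than splitting it character by character, can be illustrated with a small regular expression. This is only a sketch: it uses the standard `re` module and a hard-coded Thai block range (U+0E00 to U+0E7F), whereas the commit concerns the project's own default regex, which is not reproduced here. The names `THAI_CLASS`, `TOKEN_RE`, and `fallback_tokenize` are hypothetical.

```python
import re

# Assumption: the Thai Unicode block stands in for "Thai script".
THAI_CLASS = r"\u0e00-\u0e7f"

# Either a whole run of Thai characters (kept as one token, since no Thai
# word segmenter is available), or a run of non-Thai word characters.
TOKEN_RE = re.compile(rf"[{THAI_CLASS}]+|[^\W{THAI_CLASS}]+")


def fallback_tokenize(text):
    """Split text into tokens, leaving each Thai run as a single token."""
    return TOKEN_RE.findall(text)


print(fallback_tokenize("สวัสดี hello"))
```

The second alternative excludes Thai characters explicitly so that a Latin word immediately followed by Thai text is not merged into the Thai run.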
Robyn Speer
4a4534c466 test_chinese: fix typo in comment
Former-commit-id: 2a84a926f5
2015-09-24 13:41:11 -04:00
Robyn Speer
e15a231401 Merge branch 'master' into chinese-external-wordlist
Conflicts:
	wordfreq/chinese.py

Former-commit-id: cea2a61444
2015-09-24 13:40:08 -04:00
Andrew Lin
e7d46fb104 Revert a small syntax change introduced by a circular series of changes.
Former-commit-id: 09597b7cf3
2015-09-24 13:24:11 -04:00
Robyn Speer
4d00f17477 don't apply the inferred-space penalty to Japanese
Former-commit-id: db5eda6051
2015-09-24 12:50:06 -04:00