45 Commits

Author SHA1 Message Date
Rob Speer
c3fd3bd734 fix Arabic test, where 'lol' is no longer common
Former-commit-id: da79dfb247
2016-05-11 17:01:47 -04:00
Rob Speer
c2eab6881e move Thai test to where it makes more sense
Former-commit-id: 4ec6b56faa
2016-03-10 11:56:15 -05:00
Rob Speer
a32162c04f Leave Thai segments alone in the default regex
Our regex already has a special case to leave Chinese and Japanese alone
when an appropriate tokenizer for the language isn't being used, as
Unicode's default segmentation would make every character into its own
token.

The same thing happens in Thai, and we don't even *have* an appropriate
tokenizer for Thai, so I've added a similar fallback.

Former-commit-id: 07f16e6f03
2016-02-22 14:32:59 -05:00
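The fallback described in a32162c04f can be illustrated with a minimal sketch. It uses the standard-library `re` module and a hard-coded Thai block range (U+0E00 to U+0E7F) instead of the project's actual regex, so the pattern and the name `simple_tokenize_sketch` are assumptions made for illustration only.

```python
import re

# Assumption: the Thai Unicode block (U+0E00-U+0E7F) stands in for a real
# script property; the project's own regex is not reproduced here.
THAI_RUN = r"[\u0E00-\u0E7F]+"
# Any other run of word characters that are not in the Thai block.
OTHER_WORD = r"[^\W\u0E00-\u0E7F]+"

TOKEN_RE = re.compile(THAI_RUN + "|" + OTHER_WORD)


def simple_tokenize_sketch(text):
    """Return word tokens, keeping each unbroken Thai run as a single token."""
    return TOKEN_RE.findall(text)
```

An unbroken Thai run comes back as one token, because the sketch never tries to split it into words or characters.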
Rob Speer
f89ac5e400 test_chinese: fix typo in comment
Former-commit-id: 2a84a926f5
2015-09-24 13:41:11 -04:00
Rob Speer
faf66e9b08 Merge branch 'master' into chinese-external-wordlist
Conflicts:
	wordfreq/chinese.py

Former-commit-id: cea2a61444
2015-09-24 13:40:08 -04:00
Andrew Lin
ee6df56514 Revert a small syntax change introduced by a circular series of changes.
Former-commit-id: 09597b7cf3
2015-09-24 13:24:11 -04:00
Rob Speer
1b7117952b don't apply the inferred-space penalty to Japanese
Former-commit-id: db5eda6051
2015-09-24 12:50:06 -04:00
Rob Speer
963e0ff785 refactor the tokenizer, add include_punctuation option
Former-commit-id: e8e6e0a231
2015-09-15 13:26:09 -04:00
Rob Speer
e3a79ab8c9 add external_wordlist option to tokenize
Former-commit-id: 669bd16c13
2015-09-10 18:09:41 -04:00
Rob Speer
a13f459f88 Lower the frequency of phrases with inferred token boundaries
Former-commit-id: 5c8c36f4e3
2015-09-10 14:16:22 -04:00
Rob Speer
91cc82f76d tokenize Chinese using jieba and our own frequencies
Former-commit-id: 2327f2e4d6
2015-09-05 03:16:56 -04:00
Rob Speer
63295fc397 add tests for Turkish
Former-commit-id: fc93c8dc9c
2015-09-04 17:00:05 -04:00
Rob Speer
f4cf46ab9c Use the regex implementation of Unicode segmentation
Former-commit-id: 95998205ad
2015-08-24 17:11:08 -04:00
Andrew Lin
10bddfe09f Document the NFKC-normalized ligature in the Arabic test.
Former-commit-id: 41e1dd41d8
2015-08-03 11:09:44 -04:00
Andrew Lin
a5553676e4 Switch to more explanatory Unicode escapes when testing NFKC normalization.
Former-commit-id: 66c69e6fac
2015-07-31 19:23:42 -04:00
Joshua Chin
423b2d8443 ensure removal of tatweels (hopefully)
Former-commit-id: 173278fdd3
2015-07-20 16:48:36 -04:00
Joshua Chin
d0e0287d71 updated comments
Former-commit-id: 131b916c57
2015-07-17 14:50:12 -04:00
Andrew Lin
081fde93e3 Express the combining of word frequencies in an explicitly associative and commutative way.
Former-commit-id: 32b4033d63
2015-07-09 15:29:05 -04:00
Joshua Chin
b145e02ce4 removed unused imports
Former-commit-id: b9578ae21e
2015-07-07 16:21:22 -04:00
Joshua Chin
927aaae920 updated minimum
Former-commit-id: 59c03e2411
2015-07-07 15:46:33 -04:00
Joshua Chin
53323f8ea7 added arabic tests
Former-commit-id: f83d31a357
2015-07-07 15:10:59 -04:00
Joshua Chin
d88470df4e changed default to minimum for word_frequency
Former-commit-id: 9aa773aa2b
2015-07-07 15:03:26 -04:00
Joshua Chin
54f66d49ee updated tests
Former-commit-id: ca66a5f883
2015-07-07 14:13:28 -04:00
Rob Speer
3bf59fec57 test and document new twitter wordlists
Former-commit-id: 14cb408100
2015-07-01 17:53:38 -04:00
Rob Speer
b84ba2bc2e update data using new build
Former-commit-id: f9a9ee7a82
2015-07-01 11:18:39 -04:00
Rob Speer
8cac81666a case-fold instead of just lowercasing tokens
Former-commit-id: 638467f600
2015-06-30 15:14:02 -04:00
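The difference between lowercasing and case-folding in 8cac81666a can be seen with Python's built-in string methods. This is a small sketch; the helper name `normalize_token_sketch` is hypothetical and is not taken from the project.

```python
def normalize_token_sketch(token):
    """Case-fold a token so that caseless comparison works across scripts."""
    # str.casefold() is more aggressive than str.lower(): for example, it
    # maps the German sharp s to "ss", which lower() leaves unchanged.
    return token.casefold()


lowered = "Straße".lower()                  # sharp s is kept
folded = normalize_token_sketch("Straße")   # sharp s becomes "ss"
```

With case-folding, "Straße" and "STRASSE" normalize to the same token, which plain lowercasing does not achieve.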
Joshua Chin
5cc3dce834 revert changes to test_not_really_random
Former-commit-id: bbf7b9de34
2015-06-30 11:29:14 -04:00
Joshua Chin
53c558ca90 changed english test to take random ascii words
Former-commit-id: a49b66880e
2015-06-29 11:05:01 -04:00
Joshua Chin
ea5470a85a changed japanese test because the most common japanese ascii word keeps changing
Former-commit-id: 5ed03b006c
2015-06-29 11:04:19 -04:00
Joshua Chin
000491c7cc Japanese people do not 'lol', they 'w'
Former-commit-id: 17f11ebd26
2015-06-29 11:01:13 -04:00
Joshua Chin
09966989fb updated tests for emoji splitting
Former-commit-id: 3bcb3e84a1
2015-06-25 11:25:51 -04:00
Rob Speer
b4600c9bd1 Switch to a more precise centibel scale.
Former-commit-id: 7862a4d2b6
2015-06-22 17:36:30 -04:00
Joshua Chin
529aa9afde updated test because the new tokenizer removes URLs
Former-commit-id: 35f472fcf9
2015-06-18 11:38:28 -04:00
Rob Speer
1f41cb083c update Japanese data; test Japanese and token combining
Former-commit-id: 611a6a35de
2015-05-28 14:01:56 -04:00
Rob Speer
a1c31d3390 remove old tests
Former-commit-id: 410912d8f0
2015-05-21 20:36:09 -04:00
Rob Speer
5b4107bd1d tests for new wordfreq with full coverage
Former-commit-id: df863a5169
2015-05-21 20:34:17 -04:00
Rob Speer
c7c8078883 A different plan for the top-level word_frequency function.
When, before, I was importing wordfreq.query at the top level, this
created a dependency loop when installing wordfreq.

The new top-level __init__.py provides just a `word_frequency` function,
which imports the real function as needed and calls it. This should
avoid the dependency loop, at the cost of making
`wordfreq.word_frequency` slightly less efficient than
`wordfreq.query.word_frequency`.

Former-commit-id: 44ccf40742
2014-02-24 18:03:31 -05:00
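The deferred-import approach described in c7c8078883 can be sketched generically. The example below is an assumption-laden illustration, not the project's `__init__.py`: the helper `make_lazy` is hypothetical, and the standard-library `json` module stands in for `wordfreq.query` so that the snippet runs on its own.

```python
import importlib


def make_lazy(module_name, func_name):
    """Build a wrapper that imports `module_name` only when it is called."""
    def wrapper(*args, **kwargs):
        # The import happens here, at call time, so merely importing the
        # package that defines the wrapper does not pull in the module.
        module = importlib.import_module(module_name)
        return getattr(module, func_name)(*args, **kwargs)
    wrapper.__name__ = func_name
    return wrapper


# Stand-in for a top-level `word_frequency` that defers to the real function.
lazy_dumps = make_lazy("json", "dumps")
```

Each call pays for a module lookup before reaching the real function, which matches the small efficiency cost the commit message mentions.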
Andrew Lin
3340367519 Remove the tests for metanl_word_frequency too. Doh.
Former-commit-id: 68d262791c
2013-11-11 13:21:25 -05:00
Rob Speer
1edee91b05 Clear wordlists before inserting them; yell at Python 2
Former-commit-id: 823b3828cd
2013-11-01 19:29:37 -04:00
Rob Speer
280eca22ce make the tests less picky about numerical exactness
Former-commit-id: 2b2bd943d2
2013-10-31 15:43:19 -04:00
Rob Speer
def8a71b44 The metanl scale is not what I thought it was.
Former-commit-id: 0d2fb21726
2013-10-31 14:38:01 -04:00
Rob Speer
2cf812a64e When strings are inconsistent between py2 and 3, don't test them on py2.
2013-10-31 13:11:13 -04:00
Rob Speer
3063b3915a Revise the build test to compare lengths of wordlists.
The test currently fails on Python 3, for some strange reason.
2013-10-30 13:22:56 -04:00
Rob Speer
be183b2564 Change default values to offsets.
2013-10-29 18:06:47 -04:00
Rob Speer
2907f7f077 now this package has tests
2013-10-29 17:21:55 -04:00