Rob Speer
4ec6b56faa
move Thai test to where it makes more sense
2016-03-10 11:56:15 -05:00
Rob Speer
07f16e6f03
Leave Thai segments alone in the default regex
Our regex already has a special case to leave Chinese and Japanese alone
when an appropriate tokenizer for the language isn't being used, as
Unicode's default segmentation would make every character into its own
token.
The same thing happens in Thai, and we don't even *have* an appropriate
tokenizer for Thai, so I've added a similar fallback.
2016-02-22 14:32:59 -05:00
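The fallback described in this commit can be sketched roughly as follows. This is an illustration only, not the project's actual regex: it uses the standard-library `re` module rather than the Unicode segmentation the project relies on, and the function name `simple_tokenize_sketch` and the choice of the Thai block (U+0E00 to U+0E7F) are assumptions made for the example.

```python
import re

# Hypothetical pattern: keep a whole run of Thai-block characters as one
# token, otherwise match runs of word characters that are not Thai.
TOKEN_RE_SKETCH = re.compile(r'[\u0E00-\u0E7F]+|[^\W\u0E00-\u0E7F]+')


def simple_tokenize_sketch(text):
    """Split text into tokens, leaving each Thai segment unsplit."""
    return TOKEN_RE_SKETCH.findall(text)


tokens = simple_tokenize_sketch("ภาษาไทย abc")
print(tokens)
```

The Thai alternative comes first so that a Thai run is never broken up character by character; a real Thai tokenizer would still be needed to find word boundaries inside that run.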
Rob Speer
2a84a926f5
test_chinese: fix typo in comment
2015-09-24 13:41:11 -04:00
Rob Speer
cea2a61444
Merge branch 'master' into chinese-external-wordlist
Conflicts:
wordfreq/chinese.py
2015-09-24 13:40:08 -04:00
Andrew Lin
09597b7cf3
Revert a small syntax change introduced by a circular series of changes.
2015-09-24 13:24:11 -04:00
Rob Speer
db5eda6051
don't apply the inferred-space penalty to Japanese
2015-09-24 12:50:06 -04:00
Rob Speer
e8e6e0a231
refactor the tokenizer, add include_punctuation option
2015-09-15 13:26:09 -04:00
Rob Speer
669bd16c13
add external_wordlist option to tokenize
2015-09-10 18:09:41 -04:00
Rob Speer
5c8c36f4e3
Lower the frequency of phrases with inferred token boundaries
2015-09-10 14:16:22 -04:00
Rob Speer
2327f2e4d6
tokenize Chinese using jieba and our own frequencies
2015-09-05 03:16:56 -04:00
Rob Speer
fc93c8dc9c
add tests for Turkish
2015-09-04 17:00:05 -04:00
Rob Speer
95998205ad
Use the regex implementation of Unicode segmentation
2015-08-24 17:11:08 -04:00
Andrew Lin
41e1dd41d8
Document the NFKC-normalized ligature in the Arabic test.
2015-08-03 11:09:44 -04:00
Andrew Lin
66c69e6fac
Switch to more explanatory Unicode escapes when testing NFKC normalization.
2015-07-31 19:23:42 -04:00
Joshua Chin
173278fdd3
ensure removal of tatweels (hopefully)
2015-07-20 16:48:36 -04:00
Joshua Chin
131b916c57
updated comments
2015-07-17 14:50:12 -04:00
Andrew Lin
32b4033d63
Express the combining of word frequencies in an explicitly associative and commutative way.
2015-07-09 15:29:05 -04:00
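One way to combine per-token frequencies so that the result does not depend on token order or grouping is to sum their reciprocals and invert the total. The sketch below assumes that formula for illustration; the commit message does not state which formula the code uses, and the name `combine_frequencies_sketch` is hypothetical.

```python
import math


def combine_frequencies_sketch(freqs):
    """Combine token frequencies as 1 / sum(1 / f).

    Addition of the reciprocals is commutative and associative (up to
    floating-point rounding), so the order of the tokens does not matter.
    """
    freqs = list(freqs)
    if not freqs or any(f <= 0 for f in freqs):
        # An unknown token (frequency 0) makes the whole phrase unknown.
        return 0.0
    return 1.0 / sum(1.0 / f for f in freqs)


a = combine_frequencies_sketch([0.01, 0.002, 0.0005])
b = combine_frequencies_sketch([0.0005, 0.01, 0.002])
print(math.isclose(a, b))
```

The combined value is always lower than the rarest token's frequency, which matches the intuition that a multi-token phrase is rarer than any of its parts.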
Joshua Chin
b9578ae21e
removed unused imports
2015-07-07 16:21:22 -04:00
Joshua Chin
59c03e2411
updated minimum
2015-07-07 15:46:33 -04:00
Joshua Chin
f83d31a357
added arabic tests
2015-07-07 15:10:59 -04:00
Joshua Chin
9aa773aa2b
changed default to minimum for word_frequency
2015-07-07 15:03:26 -04:00
Joshua Chin
ca66a5f883
updated tests
2015-07-07 14:13:28 -04:00
Rob Speer
14cb408100
test and document new twitter wordlists
2015-07-01 17:53:38 -04:00
Rob Speer
f9a9ee7a82
update data using new build
2015-07-01 11:18:39 -04:00
Rob Speer
638467f600
case-fold instead of just lowercasing tokens
2015-06-30 15:14:02 -04:00
Joshua Chin
bbf7b9de34
revert changes to test_not_really_random
2015-06-30 11:29:14 -04:00
Joshua Chin
a49b66880e
changed english test to take random ascii words
2015-06-29 11:05:01 -04:00
Joshua Chin
5ed03b006c
changed japanese test because the most common japanese ascii word keeps changing
2015-06-29 11:04:19 -04:00
Joshua Chin
17f11ebd26
Japanese people do not 'lol', they 'w'
2015-06-29 11:01:13 -04:00
Joshua Chin
3bcb3e84a1
updated tests for emoji splitting
2015-06-25 11:25:51 -04:00
Rob Speer
7862a4d2b6
Switch to a more precise centibel scale.
2015-06-22 17:36:30 -04:00
Joshua Chin
35f472fcf9
updated test because the new tokenizer removes URLs
2015-06-18 11:38:28 -04:00
Rob Speer
611a6a35de
update Japanese data; test Japanese and token combining
2015-05-28 14:01:56 -04:00
Rob Speer
410912d8f0
remove old tests
2015-05-21 20:36:09 -04:00
Rob Speer
df863a5169
tests for new wordfreq with full coverage
2015-05-21 20:34:17 -04:00
Rob Speer
44ccf40742
A different plan for the top-level word_frequency function.
When, before, I was importing wordfreq.query at the top level, this
created a dependency loop when installing wordfreq.
The new top-level __init__.py provides just a `word_frequency` function,
which imports the real function as needed and calls it. This should
avoid the dependency loop, at the cost of making
`wordfreq.word_frequency` slightly less efficient than
`wordfreq.query.word_frequency`.
2014-02-24 18:03:31 -05:00
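The deferred-import pattern this commit describes can be sketched generically. The helper name `lazy_function` is hypothetical, and `math.sqrt` stands in for `wordfreq.query.word_frequency` so that the example runs without the package installed; the real `__init__.py` may be written differently.

```python
import importlib


def lazy_function(module_name, function_name):
    """Return a wrapper that imports the real function only when called."""
    def wrapper(*args, **kwargs):
        # The import happens here, at call time, so importing the package
        # that defines the wrapper does not pull in module_name.
        module = importlib.import_module(module_name)
        return getattr(module, function_name)(*args, **kwargs)
    wrapper.__name__ = function_name
    return wrapper


# Stand-in for: word_frequency = lazy_function("wordfreq.query", "word_frequency")
lazy_sqrt = lazy_function("math", "sqrt")
print(lazy_sqrt(9.0))
```

Each call pays for a module lookup and an attribute lookup, which is the small efficiency cost the commit message mentions.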
Andrew Lin
68d262791c
Remove the tests for metanl_word_frequency too. Doh.
2013-11-11 13:21:25 -05:00
Rob Speer
823b3828cd
Clear wordlists before inserting them; yell at Python 2
2013-11-01 19:29:37 -04:00
Rob Speer
2b2bd943d2
make the tests less picky about numerical exactness
2013-10-31 15:43:19 -04:00
Rob Speer
0d2fb21726
The metanl scale is not what I thought it was.
2013-10-31 14:38:01 -04:00
Rob Speer
2cf812a64e
When strings are inconsistent between py2 and 3, don't test them on py2.
2013-10-31 13:11:13 -04:00
Rob Speer
3063b3915a
Revise the build test to compare lengths of wordlists.
The test currently fails on Python 3, for some strange reason.
2013-10-30 13:22:56 -04:00
Rob Speer
be183b2564
Change default values to offsets.
2013-10-29 18:06:47 -04:00
Rob Speer
2907f7f077
now this package has tests
2013-10-29 17:21:55 -04:00