Rob Speer
2a84a926f5
test_chinese: fix typo in comment
2015-09-24 13:41:11 -04:00
Rob Speer
cea2a61444
Merge branch 'master' into chinese-external-wordlist
...
Conflicts:
wordfreq/chinese.py
2015-09-24 13:40:08 -04:00
Andrew Lin
09597b7cf3
Revert a small syntax change introduced by a circular series of changes.
2015-09-24 13:24:11 -04:00
Rob Speer
db5eda6051
don't apply the inferred-space penalty to Japanese
2015-09-24 12:50:06 -04:00
Rob Speer
e8e6e0a231
refactor the tokenizer, add `include_punctuation` option
2015-09-15 13:26:09 -04:00
Rob Speer
669bd16c13
add `external_wordlist` option to tokenize
2015-09-10 18:09:41 -04:00
Rob Speer
5c8c36f4e3
Lower the frequency of phrases with inferred token boundaries
2015-09-10 14:16:22 -04:00
Rob Speer
2327f2e4d6
tokenize Chinese using jieba and our own frequencies
2015-09-05 03:16:56 -04:00
Rob Speer
fc93c8dc9c
add tests for Turkish
2015-09-04 17:00:05 -04:00
Rob Speer
95998205ad
Use the regex implementation of Unicode segmentation
2015-08-24 17:11:08 -04:00
Andrew Lin
41e1dd41d8
Document the NFKC-normalized ligature in the Arabic test.
2015-08-03 11:09:44 -04:00
Andrew Lin
66c69e6fac
Switch to more explanatory Unicode escapes when testing NFKC normalization.
2015-07-31 19:23:42 -04:00
Joshua Chin
173278fdd3
ensure removal of tatweels (hopefully)
2015-07-20 16:48:36 -04:00
Joshua Chin
131b916c57
updated comments
2015-07-17 14:50:12 -04:00
Andrew Lin
32b4033d63
Express the combining of word frequencies in an explicitly associative and commutative way.
2015-07-09 15:29:05 -04:00
Joshua Chin
b9578ae21e
removed unused imports
2015-07-07 16:21:22 -04:00
Joshua Chin
59c03e2411
updated minimum
2015-07-07 15:46:33 -04:00
Joshua Chin
f83d31a357
added arabic tests
2015-07-07 15:10:59 -04:00
Joshua Chin
9aa773aa2b
changed default to minimum for word_frequency
2015-07-07 15:03:26 -04:00
Joshua Chin
ca66a5f883
updated tests
2015-07-07 14:13:28 -04:00
Rob Speer
14cb408100
test and document new twitter wordlists
2015-07-01 17:53:38 -04:00
Rob Speer
f9a9ee7a82
update data using new build
2015-07-01 11:18:39 -04:00
Rob Speer
638467f600
case-fold instead of just lowercasing tokens
2015-06-30 15:14:02 -04:00
Joshua Chin
bbf7b9de34
revert changes to test_not_really_random
2015-06-30 11:29:14 -04:00
Joshua Chin
a49b66880e
changed english test to take random ascii words
2015-06-29 11:05:01 -04:00
Joshua Chin
5ed03b006c
changed japanese test because the most common japanese ascii word keeps changing
2015-06-29 11:04:19 -04:00
Joshua Chin
17f11ebd26
Japanese people do not 'lol', they 'w'
2015-06-29 11:01:13 -04:00
Joshua Chin
3bcb3e84a1
updated tests for emoji splitting
2015-06-25 11:25:51 -04:00
Rob Speer
7862a4d2b6
Switch to a more precise centibel scale.
2015-06-22 17:36:30 -04:00
Joshua Chin
35f472fcf9
updated test because the new tokenizer removes URLs
2015-06-18 11:38:28 -04:00
Rob Speer
611a6a35de
update Japanese data; test Japanese and token combining
2015-05-28 14:01:56 -04:00
Rob Speer
410912d8f0
remove old tests
2015-05-21 20:36:09 -04:00
Rob Speer
df863a5169
tests for new wordfreq with full coverage
2015-05-21 20:34:17 -04:00
Rob Speer
44ccf40742
A different plan for the top-level word_frequency function.
...
When, before, I was importing wordfreq.query at the top level, this
created a dependency loop when installing wordfreq.
The new top-level __init__.py provides just a `word_frequency` function,
which imports the real function as needed and calls it. This should
avoid the dependency loop, at the cost of making
`wordfreq.word_frequency` slightly less efficient than
`wordfreq.query.word_frequency`.
2014-02-24 18:03:31 -05:00
Andrew Lin
68d262791c
Remove the tests for metanl_word_frequency too. Doh.
2013-11-11 13:21:25 -05:00
Rob Speer
823b3828cd
Clear wordlists before inserting them; yell at Python 2
2013-11-01 19:29:37 -04:00
Rob Speer
2b2bd943d2
make the tests less picky about numerical exactness
2013-10-31 15:43:19 -04:00
Rob Speer
0d2fb21726
The metanl scale is not what I thought it was.
2013-10-31 14:38:01 -04:00
Rob Speer
2cf812a64e
When strings are inconsistent between py2 and 3, don't test them on py2.
2013-10-31 13:11:13 -04:00
Rob Speer
3063b3915a
Revise the build test to compare lengths of wordlists.
...
The test currently fails on Python 3, for some strange reason.
2013-10-30 13:22:56 -04:00
Rob Speer
be183b2564
Change default values to offsets.
2013-10-29 18:06:47 -04:00
Rob Speer
2907f7f077
now this package has tests
2013-10-29 17:21:55 -04:00