Rob Speer
863d5be522
port test.py and test_chinese.py to pytest
2018-06-01 16:33:06 -04:00
Rob Speer
aa91e1f291
Packaging updates for the new PyPI
...
I _almost_ got the description and long_description right for 2.0.1. I
even checked it on the test server. But I didn't notice that I was
handling the first line of README.md specially, and ended up setting the
project description to "wordfreq is a Python library for looking up the
frequencies of words in many".
It'll be right in the next version.
2018-05-01 17:16:53 -04:00
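As an aside on the pitfall described above, a minimal hypothetical sketch of the safer packaging pattern: keep a hand-written one-line description and pass the whole README as long_description. This is not the actual wordfreq setup.py, just the general idea.
```python
# Hypothetical sketch, not the real wordfreq setup.py: avoid deriving the
# short description from README.md's first line, and pass the full README
# as long_description instead.
from setuptools import setup

with open('README.md', encoding='utf-8') as readme_file:
    long_description = readme_file.read()

setup(
    name='wordfreq',
    description='Look up the frequencies of words in many languages.',
    long_description=long_description,
    long_description_content_type='text/markdown',
)
```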
Lance Nathan
968bc3a85a
Merge pull request #56 from LuminosoInsight/japanese-edge-cases
...
Handle Japanese edge cases in `simple_tokenize`
2018-05-01 14:57:45 -04:00
Rob Speer
0a95d96b20
update CHANGELOG for 2.0.1
2018-05-01 14:47:55 -04:00
Rob Speer
3ec92a8952
Handle Japanese edge cases in simple_tokenize
2018-04-26 15:53:07 -04:00
Lance Nathan
e3a1b470d9
Merge pull request #55 from LuminosoInsight/version2
...
Version 2, with standalone text pre-processing
2018-03-15 14:26:49 -04:00
Rob Speer
a759f38540
update the changelog
2018-03-14 17:56:29 -04:00
Rob Speer
6f1a9aaff1
remove LAUGHTER_WORDS, which is now unused
...
This was a fun Twitter test, but we don't do that anymore.
2018-03-14 17:33:35 -04:00
Rob Speer
1a761199cd
More explicit error message for a missing wordlist
2018-03-14 15:10:27 -04:00
Rob Speer
b2bdc8a854
Actually use min_score in _language_in_list
...
We don't need to set it to any value but 80 now, but we will need to if
we try to distinguish three kinds of Chinese (zh-Hans, zh-Hant, and
unified zh-Hani).
2018-03-14 15:08:52 -04:00
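For illustration, a hedged sketch of what a min_score check like this can look like, using langcodes.tag_match_score (a 0-100 match score in langcodes 1.4); the helper's name and the 80 threshold come from the commit message, but the real implementation may differ.
```python
# Illustrative sketch of a _language_in_list-style helper, not the exact
# wordfreq code.  tag_match_score() returns 0-100 (100 = exact match), so
# min_score=80 accepts close matches such as 'zh' vs 'zh-Hans'.
from langcodes import tag_match_score

def _language_in_list(language, targets, min_score=80):
    return any(tag_match_score(language, target) >= min_score
               for target in targets)
```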
Rob Speer
bb2096ae04
code review fixes to wordfreq.tokens
2018-03-14 15:07:45 -04:00
Rob Speer
430fb01e53
code review fixes to __init__
2018-03-14 15:04:59 -04:00
Rob Speer
a6bb267f89
fix mention of dependencies in README
2018-03-14 15:01:08 -04:00
Rob Speer
bac3dcb620
Subtle changes to CJK frequencies
...
This is the result of re-running exquisite-corpus via wordfreq 2. The
frequencies for most languages were identical. Small changes that move
words by a few places in the list appeared in Chinese, Japanese, and
Korean. There are also even smaller changes in Bengali and Hindi.
The source of the CJK change is that Roman letters are case-folded
_before_ Jieba or MeCab tokenization, which changes their output in a
few cases.
In Hindi, one word changed frequency in the top 500. In Bengali, none of
those words changed frequency, but the data file is still different.
I'm not sure I have such a solid explanation here, except that these
languages use the regex tokenizer, and we just updated the regex
dependency, which could affect some edge cases of these languages.
2018-03-14 11:36:02 -04:00
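To make the CJK effect concrete, a small hedged example (not the wordfreq pipeline itself) of case-folding Roman letters before running Jieba, which can change how embedded Latin tokens are segmented.
```python
# Illustration only: Jieba sees 'nba' rather than 'NBA' when the text is
# case-folded first, which can change the resulting tokens slightly.
import jieba

text = '我爱NBA篮球'
print(jieba.lcut(text))             # tokenize with original casing
print(jieba.lcut(text.casefold()))  # tokenize after case-folding
```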
Rob Speer
e64f409c55
cache the language info (avoids 10x slowdown)
2018-03-09 14:54:03 -05:00
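The caching here amounts to memoizing a per-language lookup; a minimal sketch with functools.lru_cache, assuming an illustrative get_language_info-style function.
```python
# Sketch: compute per-language settings (tokenizer choice, normalization,
# etc.) once per language instead of once per word lookup.
from functools import lru_cache

@lru_cache(maxsize=None)
def get_language_info(language):
    # ... expensive language-matching logic would go here ...
    return {'language': language}
```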
Rob Speer
11e758672e
avoid log spam: only warn about an unsupported language once
2018-03-09 11:50:15 -05:00
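A hedged sketch of the warn-once pattern this refers to, with illustrative names: remember which languages have already produced a warning so repeated lookups don't flood the log.
```python
import logging

logger = logging.getLogger(__name__)
_WARNED_LANGUAGES = set()

def _warn_unsupported(language):
    # Warn only the first time an unsupported language is requested.
    if language not in _WARNED_LANGUAGES:
        logger.warning("No wordlist available for language %r", language)
        _WARNED_LANGUAGES.add(language)
```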
Rob Speer
49a603ea63
update the README
2018-03-08 18:16:15 -05:00
Rob Speer
92784d1768
wordlist updates from new exquisite-corpus
2018-03-08 18:16:00 -05:00
Rob Speer
1594ba3ad6
Test that we can leave the wordlist unspecified and get 'large' freqs
2018-03-08 18:09:57 -05:00
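A hedged example of what such a test can look like against the public API (word_frequency accepts an optional wordlist argument); the actual test in the repo may differ.
```python
# Sketch: for a language with a 'large' list, leaving wordlist unspecified
# should give the same frequency as asking for 'large' explicitly.
from wordfreq import word_frequency

def test_default_wordlist_is_large():
    assert word_frequency('the', 'en') == word_frequency('the', 'en',
                                                         wordlist='large')
```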
Rob Speer
47dac3b0b8
Traditional Chinese should be preserved through tokenization
2018-03-08 18:08:55 -05:00
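Concretely, "preserved" means the tokens of Traditional Chinese input keep their Traditional characters rather than being converted to Simplified; a hedged illustration, not the repo's actual test.
```python
# Illustration: 國 should survive tokenization instead of becoming 国.
from wordfreq import tokenize

tokens = tokenize('中華人民共和國', 'zh')
assert any('國' in token for token in tokens)
```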
Rob Speer
5a5acec9ff
reorganize wordlists into 'small', 'large', and 'best'
2018-03-08 17:52:44 -05:00
Rob Speer
67e4475763
fix az-Latn transliteration, and test
2018-03-08 16:47:36 -05:00
Rob Speer
a42cf312ef
setup: update version number and dependencies
2018-03-08 16:26:24 -05:00
Rob Speer
45b9bcdbcb
Separate preprocessing from tokenization
2018-03-08 16:26:17 -05:00
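A rough, hypothetical sketch of what the split means conceptually (function names are illustrative, not the real wordfreq 2.0 module layout): text-level normalization becomes its own step, and tokenization consumes already-preprocessed text.
```python
import unicodedata

def preprocess_text(text, language):
    # Illustrative: normalize and case-fold; the real preprocessing step
    # is language-dependent.
    return unicodedata.normalize('NFC', text).casefold()

def tokenize(text, language):
    # Illustrative: tokenization now starts from preprocessed text.
    return preprocess_text(text, language).split()
```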
Rob Speer
846606d892
minor fixes to README
2018-02-28 16:14:50 -05:00
Rob Speer
ad677e12fd
Merge pull request #54 from LuminosoInsight/fix-deps
...
Fix setup.py (version number and msgpack dependency)
2018-02-28 12:46:46 -08:00
Rob Speer
aadb19c9a3
bump version to 1.7.0, belatedly
2018-02-28 15:15:47 -05:00
Rob Speer
db56528fb6
update msgpack-python dependency to msgpack
2018-02-28 15:14:51 -05:00
Rob Speer
843ed92223
update citation to v1.7
2017-09-27 13:36:30 -04:00
Andrew Lin
721a1e9fd9
Merge pull request #51 from LuminosoInsight/version1.7
...
Version 1.7: update tokenization, update Wikipedia data, add languages
2017-09-08 17:02:05 -04:00
Rob Speer
61b2e4062d
remove unnecessary enumeration from top_n.py
2017-09-08 16:52:06 -04:00
Rob Speer
396b0f78df
update README for 1.7; sort language list in English order
2017-08-25 17:38:31 -04:00
Rob Speer
e3352392cc
v1.7: update tokenization, update data, add bn and mk
2017-08-25 17:37:48 -04:00
Rob Speer
dcef5813b3
Tokenize by graphemes, not codepoints (#50)
...
* Tokenize by graphemes, not codepoints
* Add more documentation to TOKEN_RE
* Remove extra line break
* Update docstring - Brahmic scripts are no longer an exception
* approve using version 2017.07.28 of regex
2017-08-08 11:35:28 -04:00
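To make the grapheme/codepoint distinction concrete, a small example using the regex module's \X, which matches extended grapheme clusters (wordfreq's actual TOKEN_RE is more involved).
```python
# A base character plus its combining marks is one grapheme cluster, even
# though it is several codepoints.
import regex

word = 'ки\u0301т'                  # 'кит' with a combining acute accent
print(len(word))                    # 4 codepoints
print(regex.findall(r'\X', word))   # 3 grapheme clusters: ['к', 'и́', 'т']
```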
Andrew Lin
baf6771e97
Merge pull request #49 from LuminosoInsight/restore-langcodes
...
Use langcodes when tokenizing again
2017-05-10 16:20:06 -04:00
Rob Speer
37b4914970
v1.6.1: depend on langcodes 1.4
2017-05-10 13:26:23 -04:00
Rob Speer
d6cdef6039
Use langcodes when tokenizing again (it no longer connects to a DB)
2017-04-27 15:09:59 -04:00
Rob Speer
97042e6f60
Merge pull request #48 from LuminosoInsight/code-review-notes
...
Code review notes
2017-02-15 12:29:25 -08:00
Andrew Lin
f28a193015
Clarify the changelog.
2017-02-14 13:09:12 -05:00
Andrew Lin
e21bcc2a58
Correct a case in transliterate.py.
2017-02-14 13:08:23 -05:00
Andrew Lin
21b331e898
Merge pull request #47 from LuminosoInsight/all-1.6-changes
...
All 1.6 changes
2017-02-01 15:36:38 -05:00
Rob Speer
b5b653f0a1
Remove ninja2dot script, which is no longer used
2017-02-01 14:49:44 -05:00
Rob Speer
391a723662
describe the current problem with 'cyrtranslit' as a dependency
2017-01-31 18:25:52 -05:00
Rob Speer
7fa5e7fc22
Fix some outdated numbers in English examples
2017-01-31 18:25:41 -05:00
Rob Speer
68e4ce16cf
Handle smashing numbers only at the end of tokenize().
...
This does make the code a lot clearer.
2017-01-11 19:04:19 -05:00
Rob Speer
e6114bf0fa
Update README with new examples and URL
2017-01-09 15:13:19 -05:00
Rob Speer
f03a37e19c
test that number-smashing still happens in freq lookups
2017-01-06 19:20:41 -05:00
Rob Speer
4dfa800cd8
Don't smash numbers in *all* tokenization, just when looking up freqs
...
I forgot momentarily that the output of the tokenizer is used by other
code.
2017-01-06 19:18:52 -05:00
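A hedged sketch of the "number smashing" idea, with an illustrative replacement rule (the precise rule in wordfreq may differ): the tokenizer keeps digits intact for other callers, and only the frequency lookup collapses multi-digit tokens to a shared form.
```python
import re

MULTI_DIGIT_RE = re.compile(r'\d\d+')

def smash_numbers(token):
    # Illustrative rule: every digit of a multi-digit token becomes '0',
    # so '2016' and '1984' share one frequency entry; single digits stay.
    return MULTI_DIGIT_RE.sub(lambda m: '0' * len(m.group(0)), token)

assert smash_numbers('2016') == '0000'
assert smash_numbers('9') == '9'
```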
Rob Speer
d2bb5b78f3
update the README, citing OpenSubtitles 2016
2017-01-06 19:04:40 -05:00
Rob Speer
3f9c8449ff
Mention that multi-digit numbers are combined together
2017-01-05 19:24:28 -05:00