Robyn Speer
6235d88869
Use data from fixed XC build - mostly changes Chinese
2018-05-30 13:09:20 -04:00
Robyn Speer
5762508e7c
commit new data files (Italian changed for some reason)
2018-05-29 17:36:48 -04:00
Robyn Speer
e4cb9a23b6
update data to include xc's processing of ParaCrawl
2018-05-25 16:12:35 -04:00
Robyn Speer
8907423147
Packaging updates for the new PyPI
...
I _almost_ got the description and long_description right for 2.0.1. I
even checked it on the test server. But I didn't notice that I was
handling the first line of README.md specially, and ended up setting the
project description to "wordfreq is a Python library for looking up the
frequencies of words in many".
It'll be right in the next version.
2018-05-01 17:16:53 -04:00
Lance Nathan
316670a234
Merge pull request #56 from LuminosoInsight/japanese-edge-cases
...
Handle Japanese edge cases in `simple_tokenize`
2018-05-01 14:57:45 -04:00
Robyn Speer
e0da20b0c4
update CHANGELOG for 2.0.1
2018-05-01 14:47:55 -04:00
Robyn Speer
666f7e51fa
Handle Japanese edge cases in simple_tokenize
2018-04-26 15:53:07 -04:00
Lance Nathan
18f176dbf6
Merge pull request #55 from LuminosoInsight/version2
...
Version 2, with standalone text pre-processing
2018-03-15 14:26:49 -04:00
Robyn Speer
d9bc4af8cd
update the changelog
2018-03-14 17:56:29 -04:00
Robyn Speer
b2663272a7
remove LAUGHTER_WORDS, which is now unused
...
This was a fun Twitter test, but we don't do that anymore
2018-03-14 17:33:35 -04:00
Robyn Speer
65811d587e
More explicit error message for a missing wordlist
2018-03-14 15:10:27 -04:00
Robyn Speer
2ecf31ee81
Actually use min_score in _language_in_list
...
We don't need to set it to any value but 80 now, but we will need to if
we try to distinguish three kinds of Chinese (zh-Hans, zh-Hant, and
unified zh-Hani).
2018-03-14 15:08:52 -04:00
Robyn Speer
c57032d5cb
code review fixes to wordfreq.tokens
2018-03-14 15:07:45 -04:00
Robyn Speer
de81a23b9d
code review fixes to __init__
2018-03-14 15:04:59 -04:00
Robyn Speer
8656688b0b
fix mention of dependencies in README
2018-03-14 15:01:08 -04:00
Robyn Speer
d68d4baad2
Subtle changes to CJK frequencies
...
This is the result of re-running exquisite-corpus via wordfreq 2. The
frequencies for most languages were identical. Small changes that move
words by a few places in the list appeared in Chinese, Japanese, and
Korean. There are also even smaller changes in Bengali and Hindi.
The source of the CJK change is that Roman letters are case-folded
_before_ Jieba or MeCab tokenization, which changes their output in a
few cases.
In Hindi, one word changed frequency in the top 500. In Bengali, none of
those words changed frequency, but the data file is still different.
I'm not sure I have such a solid explanation here, except that these
languages use the regex tokenizer, and we just updated the regex
dependency, which could affect some edge cases of these languages.
2018-03-14 11:36:02 -04:00
Robyn Speer
0cb36aa74f
cache the language info (avoids 10x slowdown)
2018-03-09 14:54:03 -05:00
Robyn Speer
b162de353d
avoid log spam: only warn about an unsupported language once
2018-03-09 11:50:15 -05:00
Robyn Speer
c5f64a5de8
update the README
2018-03-08 18:16:15 -05:00
Robyn Speer
d8e3669a73
wordlist updates from new exquisite-corpus
2018-03-08 18:16:00 -05:00
Robyn Speer
53dc0bbb1a
Test that we can leave the wordlist unspecified and get 'large' freqs
2018-03-08 18:09:57 -05:00
Robyn Speer
8e3dff3c1c
Traditional Chinese should be preserved through tokenization
2018-03-08 18:08:55 -05:00
Robyn Speer
45064a292f
reorganize wordlists into 'small', 'large', and 'best'
2018-03-08 17:52:44 -05:00
Robyn Speer
fe85b4e124
fix az-Latn transliteration, and test
2018-03-08 16:47:36 -05:00
Robyn Speer
a4d9614e39
setup: update version number and dependencies
2018-03-08 16:26:24 -05:00
Robyn Speer
5ab5d2ea55
Separate preprocessing from tokenization
2018-03-08 16:26:17 -05:00
Robyn Speer
72646f16a1
minor fixes to README
2018-02-28 16:14:50 -05:00
Robyn Speer
cd7bfc4060
Merge pull request #54 from LuminosoInsight/fix-deps
...
Fix setup.py (version number and msgpack dependency)
2018-02-28 12:46:46 -08:00
Robyn Speer
208559ae1e
bump version to 1.7.0, belatedly
2018-02-28 15:15:47 -05:00
Robyn Speer
98cb47c774
update msgpack-python dependency to msgpack
2018-02-28 15:14:51 -05:00
Robyn Speer
ec9c94be92
update citation to v1.7
2017-09-27 13:36:30 -04:00
Andrew Lin
95a13ab4ce
Merge pull request #51 from LuminosoInsight/version1.7
...
Version 1.7: update tokenization, update Wikipedia data, add languages
2017-09-08 17:02:05 -04:00
Robyn Speer
b042f2be9d
remove unnecessary enumeration from top_n.py
2017-09-08 16:52:06 -04:00
Robyn Speer
fb4a7db6f7
update README for 1.7; sort language list in English order
2017-08-25 17:38:31 -04:00
Robyn Speer
46e32fbd36
v1.7: update tokenization, update data, add bn and mk
2017-08-25 17:37:48 -04:00
Robyn Speer
9dac967ca3
Tokenize by graphemes, not codepoints (#50)
...
* Tokenize by graphemes, not codepoints
* Add more documentation to TOKEN_RE
* Remove extra line break
* Update docstring - Brahmic scripts are no longer an exception
* approve using version 2017.07.28 of regex
2017-08-08 11:35:28 -04:00
Andrew Lin
6c118c0b6a
Merge pull request #49 from LuminosoInsight/restore-langcodes
...
Use langcodes when tokenizing again
2017-05-10 16:20:06 -04:00
Robyn Speer
aa3ed23282
v1.6.1: depend on langcodes 1.4
2017-05-10 13:26:23 -04:00
Robyn Speer
71a0ad6abb
Use langcodes when tokenizing again (it no longer connects to a DB)
2017-04-27 15:09:59 -04:00
Robyn Speer
ae7bc5764b
Merge pull request #48 from LuminosoInsight/code-review-notes
...
Code review notes
2017-02-15 12:29:25 -08:00
Andrew Lin
c2e1504643
Clarify the changelog.
2017-02-14 13:09:12 -05:00
Andrew Lin
1363f9d2e0
Correct a case in transliterate.py.
2017-02-14 13:08:23 -05:00
Andrew Lin
72e3678e89
Merge pull request #47 from LuminosoInsight/all-1.6-changes
...
All 1.6 changes
2017-02-01 15:36:38 -05:00
Robyn Speer
a099a5a881
Remove ninja2dot script, which is no longer used
2017-02-01 14:49:44 -05:00
Robyn Speer
7dec335f74
describe the current problem with 'cyrtranslit' as a dependency
2017-01-31 18:25:52 -05:00
Robyn Speer
19b72132e7
Fix some outdated numbers in English examples
2017-01-31 18:25:41 -05:00
Robyn Speer
abd0820a32
Handle smashing numbers only at the end of tokenize().
...
This does make the code a lot clearer.
2017-01-11 19:04:19 -05:00
Robyn Speer
93306e55a0
Update README with new examples and URL
2017-01-09 15:13:19 -05:00
Robyn Speer
9a6beb0089
test that number-smashing still happens in freq lookups
2017-01-06 19:20:41 -05:00
Robyn Speer
573ecc53d0
Don't smash numbers in *all* tokenization, just when looking up freqs
...
I forgot momentarily that the output of the tokenizer is used by other
code.
2017-01-06 19:18:52 -05:00