Commit Graph

564 Commits

Author SHA1 Message Date
Robyn Speer
36fd42ca08 update msgpack call in scripts/make_chinese_mapping 2019-02-05 11:16:22 -05:00
Robyn Speer
c7a14cd4ab update encoding='utf-8' to raw=False 2019-02-04 14:57:38 -05:00
Robyn Speer
4cd7b4bada Allow a wider range of 'regex' versions
The behavior of segmentation shouldn't change within this range, and it
includes the version currently used by SpaCy.
2018-10-25 11:07:55 -04:00
Lance Nathan
fa8be1962b
Merge pull request #62 from LuminosoInsight/name-update
Update my name and the Zenodo citation
2018-10-03 17:30:47 -04:00
Robyn Speer
51ca052b62 Update my name and the Zenodo citation 2018-10-03 17:27:10 -04:00
Lance Nathan
bc12599010
Merge pull request #60 from LuminosoInsight/gender-neutral-at
Recognize "@" in gender-neutral word endings as part of the token
2018-07-24 18:16:31 -04:00
Rob Speer
d9fc6ec42c update the changelog for version 2.2 2018-07-23 16:38:39 -04:00
Rob Speer
0644c8920a Update README to describe @ tokenization 2018-07-23 11:21:44 -04:00
Rob Speer
d06a6a48c5 include data from xc rebuild 2018-07-15 01:01:35 -04:00
Rob Speer
b2d242e8bf Recognize "@" in gender-neutral word endings as part of the token 2018-07-03 13:22:56 -04:00
Rob Speer
ca9cf7d90f update the CHANGELOG for MeCab fix 2018-06-26 11:31:03 -04:00
Lance Nathan
3961a28973
Merge pull request #59 from LuminosoInsight/korean-install-fixes
Korean install fixes
2018-06-26 11:08:06 -04:00
Lance Nathan
a619ba6457
Merge pull request #58 from LuminosoInsight/significant-figures
Round wordfreq output to 3 sig. figs, and update documentation
2018-06-25 18:53:39 -04:00
Rob Speer
676686fda1 Fix instructions and search path for mecab-ko-dic
I'm starting a new Python environment on a new Ubuntu installation. You
never know when a huge yak will show up and demand to be shaved.

I tried following the directions in the README, and found that a couple
of steps were missing. I've added those.

When you follow those steps, it appears to install the MeCab Korean
dictionary in `/usr/lib/x86_64-linux-gnu/mecab/dic`, which was none
of the paths we were checking, so I've added that as a search path.
2018-06-21 15:56:54 -04:00
Rob Speer
5e05c942ac doctest the README 2018-06-18 17:11:42 -04:00
Rob Speer
1dc763c9c5 update README and CHANGELOG 2018-06-18 15:21:43 -04:00
Rob Speer
c3b32b3c4a Round frequencies to 3 significant digits 2018-06-18 15:21:33 -04:00
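
Rounding to three significant digits can be sketched as below. This is a minimal illustration under stated assumptions, not the code from this commit: `round_significant` is a hypothetical helper name, and the sample inputs are made up.

```python
def round_significant(value: float, digits: int = 3) -> float:
    """Round `value` to `digits` significant digits (hypothetical helper)."""
    # The 'g' presentation type keeps `digits` significant digits;
    # converting the formatted string back to float gives the rounded number.
    return float(f"{value:.{digits}g}")


# Made-up sample values: one small frequency and one large number.
print(round_significant(0.000012345))
print(round_significant(1234.5))
```

Going through string formatting avoids computing the decimal exponent by hand, at the cost of a little speed.
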
Lance Nathan
0911e90ba0
Merge pull request #57 from LuminosoInsight/version2.1
Version 2.1
2018-06-18 12:06:47 -04:00
Rob Speer
2b85a1cef2 update table in README: Dutch has 5 sources 2018-06-18 11:43:52 -04:00
Rob Speer
52aae3459d fix typo in previous changelog entry 2018-06-18 10:52:28 -04:00
Rob Speer
2f6b87c86b relax the test that assumed the Chinese list has few ASCII words 2018-06-15 16:29:15 -04:00
Rob Speer
57f676f4a6 fixes to tests, including that 'test.py' wasn't found by pytest 2018-06-15 15:48:41 -04:00
Rob Speer
93e3e03c60 update tests to include new languages
Also, it's easy to say `>=` in pytest
2018-06-12 17:55:44 -04:00
Rob Speer
93ddc192d8 bump version to 2.1; add test requirement for pytest 2018-06-12 17:48:24 -04:00
Rob Speer
ff4f7bf3f6 Merge remote-tracking branch 'origin/pytest' into version2.1 2018-06-12 17:46:48 -04:00
Rob Speer
db43e0e25c New data import from exquisite-corpus
Significant changes in this data include:

- Added ParaCrawl, a multilingual Web crawl, as a data source.
  This supplements the Leeds Web crawl with more modern data.

  ParaCrawl seems to provide a more balanced sample of Web pages than
  Common Crawl, which we once considered adding until we found that its
  data heavily overrepresented TripAdvisor and Urban Dictionary in a way
  that was very apparent in the word frequencies.

  ParaCrawl has a fairly subtle impact on the top terms, mostly boosting
  the frequencies of numbers and months.

- Fixes to inconsistencies where words from different sources were going
  through different processing steps. As a result of these
  inconsistencies, some word lists contained words that couldn't
  actually be looked up because they would be normalized to something
  else.

  All words should now go through the aggressive normalization of
  `lossy_tokenize`.

- Fixes to inconsistencies regarding what counts as a word.
  Non-punctuation, non-emoji symbols such as `=` were slipping through
  in some cases but not others.

- As a result of the new data, Latvian becomes a supported language and
  Czech gets promoted to a 'large' language.
2018-06-12 17:22:43 -04:00
Rob Speer
96a01b9685 port remaining tests to pytest 2018-06-01 16:40:51 -04:00
Rob Speer
863d5be522 port test.py and test_chinese.py to pytest 2018-06-01 16:33:06 -04:00
Rob Speer
8fcae9978e Use data from fixed XC build - mostly changes Chinese 2018-05-30 13:09:20 -04:00
Rob Speer
90b5246a48 commit new data files (Italian changed for some reason) 2018-05-29 17:36:48 -04:00
Rob Speer
cd434b2219 update data to include xc's processing of ParaCrawl 2018-05-25 16:12:35 -04:00
Rob Speer
aa91e1f291 Packaging updates for the new PyPI
I _almost_ got the description and long_description right for 2.0.1. I
even checked it on the test server. But I didn't notice that I was
handling the first line of README.md specially, and ended up setting the
project description to "wordfreq is a Python library for looking up the
frequencies of words in many".

It'll be right in the next version.
2018-05-01 17:16:53 -04:00
Lance Nathan
968bc3a85a
Merge pull request #56 from LuminosoInsight/japanese-edge-cases
Handle Japanese edge cases in `simple_tokenize`
2018-05-01 14:57:45 -04:00
Rob Speer
0a95d96b20 update CHANGELOG for 2.0.1 2018-05-01 14:47:55 -04:00
Rob Speer
3ec92a8952 Handle Japanese edge cases in simple_tokenize 2018-04-26 15:53:07 -04:00
Lance Nathan
e3a1b470d9
Merge pull request #55 from LuminosoInsight/version2
Version 2, with standalone text pre-processing
2018-03-15 14:26:49 -04:00
Rob Speer
a759f38540 update the changelog 2018-03-14 17:56:29 -04:00
Rob Speer
6f1a9aaff1 remove LAUGHTER_WORDS, which is now unused
This was a fun Twitter test, but we don't do that anymore
2018-03-14 17:33:35 -04:00
Rob Speer
1a761199cd More explicit error message for a missing wordlist 2018-03-14 15:10:27 -04:00
Rob Speer
b2bdc8a854 Actually use min_score in _language_in_list
We don't need to set it to any value but 80 now, but we will need to if
we try to distinguish three kinds of Chinese (zh-Hans, zh-Hant, and
unified zh-Hani).
2018-03-14 15:08:52 -04:00
Rob Speer
bb2096ae04 code review fixes to wordfreq.tokens 2018-03-14 15:07:45 -04:00
Rob Speer
430fb01e53 code review fixes to __init__ 2018-03-14 15:04:59 -04:00
Rob Speer
a6bb267f89 fix mention of dependencies in README 2018-03-14 15:01:08 -04:00
Rob Speer
bac3dcb620 Subtle changes to CJK frequencies
This is the result of re-running exquisite-corpus via wordfreq 2.  The
frequencies for most languages were identical. Small changes that move
words by a few places in the list appeared in Chinese, Japanese, and
Korean. There are also even smaller changes in Bengali and Hindi.

The source of the CJK change is that Roman letters are case-folded
_before_ Jieba or MeCab tokenization, which changes their output in a
few cases.

In Hindi, one word changed frequency in the top 500. In Bengali, none of
those words changed frequency, but the data file is still different.
I'm not sure I have such a solid explanation here, except that these
languages use the regex tokenizer, and we just updated the regex
dependency, which could affect some edge cases of these languages.
2018-03-14 11:36:02 -04:00
Rob Speer
e64f409c55 cache the language info (avoids 10x slowdown) 2018-03-09 14:54:03 -05:00
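
One common way to cache a per-language lookup is memoization with `functools.lru_cache`. The sketch below is an assumption about the general technique, not the change made in this commit: `get_language_info` and the dictionary it returns are hypothetical placeholders.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_language_info(language: str) -> dict:
    """Hypothetical stand-in for an expensive per-language lookup."""
    # Placeholder work; a real lookup would parse and match the language code.
    return {"language": language, "normalized": language.lower()}


# Repeated calls with the same argument are served from the cache.
for _ in range(1000):
    get_language_info("zh")

# One miss (the first call) and 999 hits are recorded here.
print(get_language_info.cache_info())
```

The cached dictionary is shared between callers, so in this sketch callers should treat it as read-only.
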
Rob Speer
11e758672e avoid log spam: only warn about an unsupported language once 2018-03-09 11:50:15 -05:00
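
A warn-once pattern can be sketched by remembering which languages have already triggered a warning. This is an illustrative assumption rather than the code from this commit: `warn_unsupported`, `_warned_languages`, and the message text are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)

# Languages that have already produced a warning (hypothetical module state).
_warned_languages = set()


def warn_unsupported(language: str) -> bool:
    """Log a warning the first time `language` is seen.

    Returns True if a warning was logged, False if it was suppressed.
    """
    if language in _warned_languages:
        return False
    _warned_languages.add(language)
    logger.warning("Unsupported language: %s", language)
    return True


# Only the first of these three calls logs anything.
for code in ["tlh", "tlh", "tlh"]:
    warn_unsupported(code)
```
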
Rob Speer
49a603ea63 update the README 2018-03-08 18:16:15 -05:00
Rob Speer
92784d1768 wordlist updates from new exquisite-corpus 2018-03-08 18:16:00 -05:00
Rob Speer
1594ba3ad6 Test that we can leave the wordlist unspecified and get 'large' freqs 2018-03-08 18:09:57 -05:00
Rob Speer
47dac3b0b8 Traditional Chinese should be preserved through tokenization 2018-03-08 18:08:55 -05:00