617 Commits

Author SHA1 Message Date
Robyn Speer
a3834180c9 update changelog for v2.2.1 2019-02-05 15:58:10 -05:00
Lance Nathan
96b9808550 Merge pull request #66 from LuminosoInsight/update-msgpack-call
Update msgpack parameter
2019-02-05 11:17:07 -05:00
Robyn Speer
dd72051929 update msgpack call in scripts/make_chinese_mapping 2019-02-05 11:16:22 -05:00
Robyn Speer
61a1604b38 update encoding='utf-8' to raw=False 2019-02-04 14:57:38 -05:00
Moss Collum
65a6a89993 Add Jenkinsfile to drive internal build scripts 2019-02-01 19:05:35 -05:00
Robyn Speer
d30183a7d7 Allow a wider range of 'regex' versions
The behavior of segmentation shouldn't change within this range, and it
includes the version currently used by SpaCy.
2018-10-25 11:07:55 -04:00
Lance Nathan
c1fe37bab5 Merge pull request #62 from LuminosoInsight/name-update
Update my name and the Zenodo citation
2018-10-03 17:30:47 -04:00
Robyn Speer
563e8f7444 Update my name and the Zenodo citation 2018-10-03 17:27:10 -04:00
Lance Nathan
2f8600e975 Merge pull request #60 from LuminosoInsight/gender-neutral-at
Recognize "@" in gender-neutral word endings as part of the token
2018-07-24 18:16:31 -04:00
Robyn Speer
287df17a71 update the changelog for version 2.2 2018-07-23 16:38:39 -04:00
Robyn Speer
f73406c69a Update README to describe @ tokenization 2018-07-23 11:21:44 -04:00
Robyn Speer
86b928f967 include data from xc rebuild 2018-07-15 01:01:35 -04:00
Robyn Speer
65692c3d81 Recognize "@" in gender-neutral word endings as part of the token 2018-07-03 13:22:56 -04:00
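As a rough illustration of the idea in this commit, the sketch below keeps a word-final "@" (optionally followed by "s") attached to the word before it, so that a form such as "l@s" stays one token. It uses Python's standard `re` module and a hypothetical `tokenize_with_at` helper. The pattern is an assumption made for illustration; the log does not show the expression wordfreq actually uses.

```python
import re

# Hypothetical pattern: a run of word characters ending in "@" or "@s",
# provided no further word character follows; otherwise a plain word.
TOKEN_RE = re.compile(r"\w+@s?(?!\w)|\w+")


def tokenize_with_at(text):
    """Split text into lowercase tokens, keeping a word-final "@" or "@s"."""
    return TOKEN_RE.findall(text.lower())
```

The negative lookahead is what keeps an e-mail address such as `user@example.com` from being treated as a gender-neutral ending: the "@" there is followed by more letters, so only the plain-word alternative matches.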
Robyn Speer
7bf69595bb update the CHANGELOG for MeCab fix 2018-06-26 11:31:03 -04:00
Lance Nathan
0149e9ec7f Merge pull request #59 from LuminosoInsight/korean-install-fixes
Korean install fixes
2018-06-26 11:08:06 -04:00
Lance Nathan
79caa526c3 Merge pull request #58 from LuminosoInsight/significant-figures
Round wordfreq output to 3 sig. figs, and update documentation
2018-06-25 18:53:39 -04:00
Robyn Speer
830157d8e4 Fix instructions and search path for mecab-ko-dic
I'm starting a new Python environment on a new Ubuntu installation. You
never know when a huge yak will show up and demand to be shaved.

I tried following the directions in the README, and found that a couple
of steps were missing. I've added those.

When you follow those steps, it appears to install the MeCab Korean
dictionary in `/usr/lib/x86_64-linux-gnu/mecab/dic`, which was none
of the paths we were checking, so I've added that as a search path.
2018-06-21 15:56:54 -04:00
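The search-path fix described above can be sketched as a loop over candidate directories. This is a minimal sketch, not the project's code: the function name `find_dictionary` and the first two entries of `CANDIDATE_DIRS` are hypothetical placeholders, and only the last directory comes from the commit message.

```python
import os

# Hypothetical list of candidate directories. Only the last entry is taken
# from the commit message; the others are placeholders for illustration.
CANDIDATE_DIRS = [
    "/usr/local/lib/mecab/dic",
    "/usr/lib/mecab/dic",
    "/usr/lib/x86_64-linux-gnu/mecab/dic",
]


def find_dictionary(name, candidate_dirs=CANDIDATE_DIRS):
    """Return the first existing directory for `name`, or raise OSError."""
    for base in candidate_dirs:
        path = os.path.join(base, name)
        if os.path.isdir(path):
            return path
    raise OSError(
        "Could not find the dictionary %r in any of: %s"
        % (name, ", ".join(candidate_dirs))
    )
```

Listing every directory that was tried in the error message makes a missing search path easier to spot on a new system.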
Robyn Speer
fdf064b234 doctest the README 2018-06-18 17:11:42 -04:00
Robyn Speer
c6552f923f update README and CHANGELOG 2018-06-18 15:21:43 -04:00
Robyn Speer
7a32b56c1c Round frequencies to 3 significant digits 2018-06-18 15:21:33 -04:00
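Rounding to three significant digits can be sketched with Python's "g" format specifier. The helper name `round_to_sig_figs` is hypothetical, and this is one possible approach rather than the code from the commit.

```python
def round_to_sig_figs(value, digits=3):
    """Round a number to `digits` significant figures."""
    if value == 0:
        return 0.0
    # The "g" format keeps `digits` significant digits; float() parses
    # the formatted string back into a number.
    return float("{:.{}g}".format(value, digits))
```

For example, `round_to_sig_figs(0.000123456)` gives `0.000123` and `round_to_sig_figs(1234.5)` gives `1230.0`.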
Lance Nathan
a95b360563 Merge pull request #57 from LuminosoInsight/version2.1
Version 2.1
2018-06-18 12:06:47 -04:00
Robyn Speer
39a1308770 update table in README: Dutch has 5 sources 2018-06-18 11:43:52 -04:00
Robyn Speer
0280f82496 fix typo in previous changelog entry 2018-06-18 10:52:28 -04:00
Robyn Speer
42efcfc1ad relax the test that assumed the Chinese list has few ASCII words 2018-06-15 16:29:15 -04:00
Robyn Speer
ad0f046f47 fixes to tests, including that 'test.py' wasn't found by pytest 2018-06-15 15:48:41 -04:00
Robyn Speer
a975bcedae update tests to include new languages
Also, it's easy to say `>=` in pytest
2018-06-12 17:55:44 -04:00
Robyn Speer
4b7e3d9655 bump version to 2.1; add test requirement for pytest 2018-06-12 17:48:24 -04:00
Robyn Speer
3259c4a375 Merge remote-tracking branch 'origin/pytest' into version2.1 2018-06-12 17:46:48 -04:00
Robyn Speer
d5f7335d90 New data import from exquisite-corpus
Significant changes in this data include:

- Added ParaCrawl, a multilingual Web crawl, as a data source.
  This supplements the Leeds Web crawl with more modern data.

  ParaCrawl seems to provide a more balanced sample of Web pages than
  Common Crawl. We once considered adding Common Crawl, but found that
  its data heavily overrepresented TripAdvisor and Urban Dictionary in a
  way that was very apparent in the word frequencies.

  ParaCrawl has a fairly subtle impact on the top terms, mostly boosting
  the frequencies of numbers and months.

- Fixes to inconsistencies where words from different sources were going
  through different processing steps. As a result of these
  inconsistencies, some word lists contained words that couldn't
  actually be looked up because they would be normalized to something
  else.

  All words should now go through the aggressive normalization of
  `lossy_tokenize`.

- Fixes to inconsistencies regarding what counts as a word.
  Non-punctuation, non-emoji symbols such as `=` were slipping through
  in some cases but not others.

- As a result of the new data, Latvian becomes a supported language and
  Czech gets promoted to a 'large' language.
2018-06-12 17:22:43 -04:00
Robyn Speer
b3c42be331 port remaining tests to pytest 2018-06-01 16:40:51 -04:00
Robyn Speer
75b4d62084 port test.py and test_chinese.py to pytest 2018-06-01 16:33:06 -04:00
Robyn Speer
6235d88869 Use data from fixed XC build - mostly changes Chinese 2018-05-30 13:09:20 -04:00
Robyn Speer
5762508e7c commit new data files (Italian changed for some reason) 2018-05-29 17:36:48 -04:00
Robyn Speer
e4cb9a23b6 update data to include xc's processing of ParaCrawl 2018-05-25 16:12:35 -04:00
Robyn Speer
8907423147 Packaging updates for the new PyPI
I _almost_ got the description and long_description right for 2.0.1. I
even checked it on the test server. But I didn't notice that I was
handling the first line of README.md specially, and ended up setting the
project description to "wordfreq is a Python library for looking up the
frequencies of words in many".

It'll be right in the next version.
2018-05-01 17:16:53 -04:00
Lance Nathan
316670a234 Merge pull request #56 from LuminosoInsight/japanese-edge-cases
Handle Japanese edge cases in `simple_tokenize`
2018-05-01 14:57:45 -04:00
Robyn Speer
e0da20b0c4 update CHANGELOG for 2.0.1 2018-05-01 14:47:55 -04:00
Robyn Speer
666f7e51fa Handle Japanese edge cases in simple_tokenize 2018-04-26 15:53:07 -04:00
Lance Nathan
18f176dbf6 Merge pull request #55 from LuminosoInsight/version2
Version 2, with standalone text pre-processing
2018-03-15 14:26:49 -04:00
Robyn Speer
d9bc4af8cd update the changelog 2018-03-14 17:56:29 -04:00
Robyn Speer
b2663272a7 remove LAUGHTER_WORDS, which is now unused
This was a fun Twitter test, but we don't do that anymore
2018-03-14 17:33:35 -04:00
Robyn Speer
65811d587e More explicit error message for a missing wordlist 2018-03-14 15:10:27 -04:00
Robyn Speer
2ecf31ee81 Actually use min_score in _language_in_list
We don't need to set it to any value but 80 now, but we will need to if
we try to distinguish three kinds of Chinese (zh-Hans, zh-Hant, and
unified zh-Hani).
2018-03-14 15:08:52 -04:00
Robyn Speer
c57032d5cb code review fixes to wordfreq.tokens 2018-03-14 15:07:45 -04:00
Robyn Speer
de81a23b9d code review fixes to __init__ 2018-03-14 15:04:59 -04:00
Robyn Speer
8656688b0b fix mention of dependencies in README 2018-03-14 15:01:08 -04:00
Robyn Speer
d68d4baad2 Subtle changes to CJK frequencies
This is the result of re-running exquisite-corpus via wordfreq 2.  The
frequencies for most languages were identical. Small changes that move
words by a few places in the list appeared in Chinese, Japanese, and
Korean. There are also even smaller changes in Bengali and Hindi.

The source of the CJK change is that Roman letters are case-folded
_before_ Jieba or MeCab tokenization, which changes their output in a
few cases.

In Hindi, one word changed frequency in the top 500. In Bengali, none of
those words changed frequency, but the data file is still different.
I'm not sure I have such a solid explanation here, except that these
languages use the regex tokenizer, and we just updated the regex
dependency, which could affect some edge cases of these languages.
2018-03-14 11:36:02 -04:00
Robyn Speer
0cb36aa74f cache the language info (avoids 10x slowdown) 2018-03-09 14:54:03 -05:00
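One common way to cache a per-language lookup like this is `functools.lru_cache`. The sketch below assumes a hypothetical `get_language_info` function whose body stands in for the expensive work; the log does not show how the commit implements the cache.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_language_info(language_code):
    """Hypothetical per-language lookup, computed once per distinct code."""
    # Stand-in for expensive work such as parsing and matching language tags.
    base_language = language_code.lower().split("-")[0]
    return {"base_language": base_language}
```

Repeated calls with the same code return the stored result instead of recomputing it. Because the cached dictionary is shared between callers, it should be treated as read-only.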
Robyn Speer
b162de353d avoid log spam: only warn about an unsupported language once 2018-03-09 11:50:15 -05:00
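Warning only once per language can be sketched by remembering which codes have already triggered a warning. The names `warn_unsupported_once` and `_warned_languages`, and the message text, are hypothetical and chosen for illustration.

```python
import logging

logger = logging.getLogger(__name__)

# Language codes that have already produced a warning in this process.
_warned_languages = set()


def warn_unsupported_once(language_code):
    """Log a warning for an unsupported language only the first time.

    Returns True if a warning was logged, False if it was suppressed.
    """
    if language_code in _warned_languages:
        return False
    _warned_languages.add(language_code)
    logger.warning("Language %r is not supported.", language_code)
    return True
```

The set lives for the lifetime of the process, so each unsupported language produces one log line no matter how many lookups follow.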
Robyn Speer
c5f64a5de8 update the README 2018-03-08 18:16:15 -05:00