Commit Graph

209 Commits

Author SHA1 Message Date
Robyn Speer
d5f7335d90 New data import from exquisite-corpus
Significant changes in this data include:

- Added ParaCrawl, a multilingual Web crawl, as a data source.
  This supplements the Leeds Web crawl with more modern data.

  ParaCrawl seems to provide a more balanced sample of Web pages than
  Common Crawl, which we once considered adding but rejected: its data
  heavily overrepresented TripAdvisor and Urban Dictionary, in a way
  that was very apparent in the word frequencies.

  ParaCrawl has a fairly subtle impact on the top terms, mostly boosting
  the frequencies of numbers and months.

- Fixes to inconsistencies where words from different sources were going
  through different processing steps. As a result of these
  inconsistencies, some word lists contained words that couldn't
  actually be looked up because they would be normalized to something
  else.

  All words should now go through the aggressive normalization of
  `lossy_tokenize` (a sketch of the difference follows this commit).

- Fixes to inconsistencies regarding what counts as a word.
  Non-punctuation, non-emoji symbols such as `=` were slipping through
  in some cases but not others.

- As a result of the new data, Latvian becomes a supported language and
  Czech gets promoted to a 'large' language.
2018-06-12 17:22:43 -04:00
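
A minimal sketch of the distinction, assuming the wordfreq 2.x API where `tokenize` and `lossy_tokenize` are both importable from the top-level package:

```python
from wordfreq import tokenize, lossy_tokenize

# tokenize() keeps the text intact; lossy_tokenize() adds the aggressive
# normalization used for frequency lookups, such as smashing multi-digit
# sequences to zeroes.
print(tokenize("it's 1985", 'en'))        # ["it's", '1985']
print(lossy_tokenize("it's 1985", 'en'))  # ["it's", '0000']
```

Running every source's words through the same lossy step is what guarantees that a word stored in a list can also be looked up.
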
Robyn Speer
6235d88869 Use data from fixed XC build - mostly changes Chinese 2018-05-30 13:09:20 -04:00
Robyn Speer
5762508e7c commit new data files (Italian changed for some reason) 2018-05-29 17:36:48 -04:00
Robyn Speer
e4cb9a23b6 update data to include xc's processing of ParaCrawl 2018-05-25 16:12:35 -04:00
Robyn Speer
666f7e51fa Handle Japanese edge cases in simple_tokenize 2018-04-26 15:53:07 -04:00
Robyn Speer
65811d587e More explicit error message for a missing wordlist 2018-03-14 15:10:27 -04:00
Robyn Speer
2ecf31ee81 Actually use min_score in _language_in_list
We don't need to set it to anything other than 80 for now, but we will
need to if we try to distinguish three kinds of Chinese (zh-Hans,
zh-Hant, and unified zh-Hani).
2018-03-14 15:08:52 -04:00
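
A hedged reconstruction of the fixed helper, assuming langcodes' `best_match` function (which returns a `('und', 0)` pair when nothing meets the threshold):

```python
from langcodes import best_match

def _language_in_list(language, targets, min_score=80):
    # Pass the threshold through to the matcher instead of accepting
    # it and then ignoring it.
    matched = best_match(language, targets, min_score=min_score)
    return matched[1] > 0
```

Distinguishing zh-Hans, zh-Hant, and unified zh-Hani would then be a matter of calling this with a stricter `min_score`.
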
Robyn Speer
c57032d5cb code review fixes to wordfreq.tokens 2018-03-14 15:07:45 -04:00
Robyn Speer
de81a23b9d code review fixes to __init__ 2018-03-14 15:04:59 -04:00
Robyn Speer
d68d4baad2 Subtle changes to CJK frequencies
This is the result of re-running exquisite-corpus via wordfreq 2.  The
frequencies for most languages were identical. Small changes that move
words by a few places in the list appeared in Chinese, Japanese, and
Korean. There are also even smaller changes in Bengali and Hindi.

The source of the CJK change is that Roman letters are case-folded
_before_ Jieba or MeCab tokenization, which changes their output in a
few cases.

In Hindi, one word in the top 500 changed frequency. In Bengali, none of
those words changed frequency, but the data file is still different. I
don't have as solid an explanation here, except that these languages use
the regex tokenizer, and we just updated the regex dependency, which
could affect some edge cases in these languages.
2018-03-14 11:36:02 -04:00
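
A small illustration of the ordering change, using only the standard library; the point is that Jieba or MeCab now receives text that has already been case-folded:

```python
text = 'MeCabで解析'

# Before: tokenize first, then case-fold each token.
# After:  case-fold first, then tokenize.
folded = text.casefold()
print(folded)  # 'mecabで解析'

# A dictionary-based tokenizer can segment 'mecab' differently from
# 'MeCab', which is where the small CJK frequency shifts come from.
```
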
Robyn Speer
0cb36aa74f cache the language info (avoids 10x slowdown) 2018-03-09 14:54:03 -05:00
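
A plausible sketch of the fix; `get_language_info` stands in for the per-language settings helper being memoized, and its return value here is invented:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_language_info(language):
    # Working out tokenizer and normalization settings for a language
    # is expensive; memoizing on the language code means it happens
    # once per language instead of once per lookup.
    return {'tokenizer': 'regex', 'normal_form': 'NFKC'}  # illustrative
```
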
Robyn Speer
b162de353d avoid log spam: only warn about an unsupported language once 2018-03-09 11:50:15 -05:00
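
A sketch of the warn-once pattern, assuming a module-level set of language codes that have already been warned about:

```python
import logging

logger = logging.getLogger(__name__)
_warned_languages = set()

def _warn_unsupported(language):
    # Warn only the first time a language is requested, so bulk
    # lookups don't flood the log with identical messages.
    if language not in _warned_languages:
        logger.warning('No wordlist available for language %r', language)
        _warned_languages.add(language)
```
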
Robyn Speer
d8e3669a73 wordlist updates from new exquisite-corpus 2018-03-08 18:16:00 -05:00
Robyn Speer
8e3dff3c1c Traditional Chinese should be preserved through tokenization 2018-03-08 18:08:55 -05:00
Robyn Speer
45064a292f reorganize wordlists into 'small', 'large', and 'best' 2018-03-08 17:52:44 -05:00
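
The three sizes map onto the public API: `word_frequency` takes a `wordlist` argument, and `'best'` (the default) picks the largest list available for the language:

```python
from wordfreq import word_frequency, available_languages

print(word_frequency('the', 'en', wordlist='small'))
print(word_frequency('the', 'en', wordlist='large'))
print(word_frequency('the', 'en'))  # wordlist='best'

# Languages that only have a 'small' list won't appear here:
print(sorted(available_languages(wordlist='large')))
```
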
Robyn Speer
fe85b4e124 fix az-Latn transliteration, and test 2018-03-08 16:47:36 -05:00
Robyn Speer
5ab5d2ea55 Separate preprocessing from tokenization 2018-03-08 16:26:17 -05:00
Robyn Speer
46e32fbd36 v1.7: update tokenization, update data, add bn and mk 2017-08-25 17:37:48 -04:00
Robyn Speer
9dac967ca3 Tokenize by graphemes, not codepoints (#50)
* Tokenize by graphemes, not codepoints

* Add more documentation to TOKEN_RE

* Remove extra line break

* Update docstring - Brahmic scripts are no longer an exception

* approve using version 2017.07.28 of regex
2017-08-08 11:35:28 -04:00
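
The difference in one snippet, using the regex module's `\X` pattern for extended grapheme clusters; exact cluster boundaries depend on the Unicode rules your regex version implements:

```python
import regex

word = 'नमस्ते'  # Devanagari: six codepoints
print(len(word))                   # 6
print(regex.findall(r'\X', word))  # fewer units: matras and viramas
                                   # stay attached to their consonants
```
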
Robyn Speer
71a0ad6abb Use langcodes when tokenizing again (it no longer connects to a DB) 2017-04-27 15:09:59 -04:00
Andrew Lin
1363f9d2e0 Correct a case in transliterate.py. 2017-02-14 13:08:23 -05:00
Robyn Speer
7dec335f74 describe the current problem with 'cyrtranslit' as a dependency 2017-01-31 18:25:52 -05:00
Robyn Speer
abd0820a32 Handle smashing numbers only at the end of tokenize().
This does make the code a lot clearer.
2017-01-11 19:04:19 -05:00
Robyn Speer
573ecc53d0 Don't smash numbers in *all* tokenization, just when looking up freqs
I forgot momentarily that the output of the tokenizer is used by other
code.
2017-01-06 19:18:52 -05:00
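
A minimal sketch of the number smashing that this commit and the one above are repositioning, assuming a helper along these lines (the real pattern may differ):

```python
import re

MULTI_DIGIT_RE = re.compile(r'\d\d+')

def smash_numbers(token):
    # Replace each multi-digit sequence with an equal-length run of
    # zeroes, so e.g. '1985' and '2016' share one frequency entry.
    return MULTI_DIGIT_RE.sub(lambda m: '0' * len(m.group()), token)

print(smash_numbers('1985'))  # '0000'
print(smash_numbers('7'))     # '7' (single digits are left alone)
```
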
Robyn Speer
23c7c8e936 update data from Exquisite Corpus in English and Swedish 2017-01-05 19:17:51 -05:00
Robyn Speer
7dc3f03ebd import new wordlists from Exquisite Corpus 2017-01-05 17:59:26 -05:00
Robyn Speer
d66d04210f transliterate: organize the 'borrowed letters' better 2017-01-05 13:23:20 -05:00
Robyn Speer
87b03325db transliterate: Handle unexpected Russian invasions 2017-01-04 18:51:00 -05:00
Robyn Speer
6211b35fb3 Add transliteration of Cyrillic Serbian 2016-12-29 18:27:17 -05:00
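
A toy sketch of the idea using a character map; the real transliteration code also has to handle the 'borrowed letters' dealt with in the commits above, and this mapping is deliberately tiny:

```python
# str.translate accepts multi-character replacements, which covers the
# Serbian Cyrillic letters that expand to Latin digraphs.
SR_CYRL_TO_LATN = str.maketrans({
    'с': 's', 'р': 'r', 'б': 'b', 'и': 'i', 'ј': 'j', 'а': 'a',
    'љ': 'lj', 'њ': 'nj', 'џ': 'dž',
})

print('србија'.translate(SR_CYRL_TO_LATN))  # 'srbija'
```
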
Robyn Speer
0aa7ad46ae fixes to tokenization 2016-12-13 14:43:29 -05:00
Robyn Speer
d6d528de74 Replace multi-digit sequences with zeroes 2016-12-09 15:55:08 -05:00
Robyn Speer
21a78f5eb9 Bake the `'h` special case into the regex
This lets me remove the French-specific code I just put in.
2016-12-06 17:37:35 -05:00
Robyn Speer
596368ac6e fix tokenization of words like "l'heure" 2016-12-05 18:54:51 -05:00
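
A hedged sketch of the elision handling behind this commit and the `'h` special case above; the real fix bakes this into TOKEN_RE, and this regex is a hypothetical stand-in:

```python
import regex

# Split tokens like "l'heure" into "l'" + "heure". The \b keeps
# word-internal apostrophes, as in "aujourd'hui", intact.
ELISION_RE = regex.compile(r"(?i)\b(qu|jusqu|lorsqu|[ldjtmncs])'")

def split_elisions(text):
    return ELISION_RE.sub(r"\1' ", text).split()

print(split_elisions("l'heure qu'il est"))
# ["l'", 'heure', "qu'", 'il', 'est']
```
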
Robyn Speer
e1d6e7d96f Allow MeCab to work in Japanese or Korean without the other 2016-08-19 11:41:35 -04:00
Robyn Speer
88c93f6204 Remove unnecessary variable from make_mecab_analyzer
2016-08-04 15:17:02 -04:00
Robyn Speer
6440d81676 consolidate logic about MeCab path length
2016-08-04 15:16:20 -04:00
Robyn Speer
c11998e506 Getting a newer mecab-ko-dic changed the Korean frequencies
2016-08-02 16:10:41 -04:00
Robyn Speer
bc1cfc35c8 update find_mecab_dictionary docstring
2016-08-02 12:53:46 -04:00
Robyn Speer
9e55f8fed1 remove my ad-hoc names for dictionary packages
2016-08-01 17:39:35 -04:00
Robyn Speer
2787bfd647 stop including MeCab dictionaries in the package
2016-08-01 17:37:41 -04:00
Robyn Speer
875dd5669f fix MeCab error message
2016-07-29 17:30:02 -04:00
Robyn Speer
94712c8312 Look for MeCab dictionaries in various places besides this package
2016-07-29 17:27:15 -04:00
Robyn Speer
15667ea023 Code review fixes: avoid repeatedly constructing sets
2016-07-29 12:32:26 -04:00
Robyn Speer
2a41d4dc5e Add Common Crawl data and more languages (#39)
This changes the version from 1.4.2 to 1.5.  Things done in this update include:

* include Common Crawl; support 11 more languages

* new frequency-merging strategy

* New sources: Chinese from Wikipedia (mostly Trad.), Dutch big list

* Remove some low-quality sources: Greek Twitter (kaomoji are too often detected as Greek) and Ukrainian Common Crawl. This drops Ukrainian as an available language, and it turns out Greek is not a 'large' language after all.

* Add Korean tokenization, and include MeCab files in data

* Remove marks from more languages

* Deal with commas and cedillas in Turkish and Romanian
2016-07-28 19:23:17 -04:00
Robyn Speer
0a2bfb2710 Tokenization in Korean, plus abjad languages (#38)
* Remove marks from more languages

* Add Korean tokenization, and include MeCab files in data

* add a Hebrew tokenization test

* fix terminology in docstrings about abjad scripts

* combine Japanese and Korean tokenization into the same function
2016-07-15 15:10:25 -04:00
Robyn Speer
3155cf27e6 Fix tokenization of SE Asian and South Asian scripts (#37)
2016-07-01 18:00:57 -04:00
Robyn Speer
a0d93e0ce8 limit Reddit data to just English
2016-04-15 17:01:21 -04:00
Robyn Speer
cecf852040 update wordlists for new builder settings
2016-03-28 12:26:47 -04:00
Robyn Speer
0c7527140c Discard text detected as an uncommon language; add large German list
2016-03-28 12:26:02 -04:00
Robyn Speer
fbc19995ab Merge remote-tracking branch 'origin/master' into big-list
Conflicts:
	wordfreq_builder/wordfreq_builder/cli/merge_counts.py
2016-03-24 14:11:44 -04:00