Significant changes in this data include:
- Added ParaCrawl, a multilingual Web crawl, as a data source.
This supplements the Leeds Web crawl with more modern data.
ParaCrawl seems to provide a more balanced sample of Web pages than
Common Crawl, which we once considered adding, but found that its data
heavily overrepresented TripAdvisor and Urban Dictionary in a way that
was very apparent in the word frequencies.
ParaCrawl has a fairly subtle impact on the top terms, mostly boosting
the frequencies of numbers and months.
- Fixes to inconsistencies where words from different sources were going
through different processing steps. As a result of these
inconsistencies, some word lists contained words that couldn't
actually be looked up because they would be normalized to something
else.
All words should now go through the aggressive normalization of
`lossy_tokenize` (see the sketch after this list).
- Fixes to inconsistencies regarding what counts as a word.
Non-punctuation, non-emoji symbols such as `=` were slipping through
in some cases but not others.
- As a result of the new data, Latvian becomes a supported language and
Czech gets promoted to a 'large' language.
We don't need to set it to anything other than 80 for now, but we will
need to if we try to distinguish three kinds of Chinese (zh-Hans,
zh-Hant, and unified zh-Hani).
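
To make the normalization fixes above concrete (both the `lossy_tokenize` step and the stricter definition of a word), here is a minimal sketch of how a candidate word-list entry can be checked. The `list_entry` helper is hypothetical, not part of wordfreq or exquisite-corpus, and the exact normalized forms (case folding, digit handling, and so on) depend on the wordfreq version; the point is only that an entry must already be in the form that lookups normalize to, and that symbol-only strings produce no tokens at all.

```python
from wordfreq import lossy_tokenize  # in some versions: wordfreq.tokens

def list_entry(word, lang):
    """Return the form of `word` that belongs in a word list, or None.

    An entry is only useful if it is already in its normalized form;
    symbol-only strings such as '=' produce no tokens, and strings
    that split into several tokens don't belong in the list either.
    """
    tokens = lossy_tokenize(word, lang)
    if len(tokens) != 1:
        return None
    return tokens[0]

for candidate in ["Flanders", "=", "2016"]:
    # 'Flanders' comes back case-folded, '=' yields no entry, and the
    # digits of '2016' may be smashed into a generic numeric form,
    # depending on the wordfreq version.
    print(repr(candidate), "->", repr(list_entry(candidate, "en")))
```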
This is the result of re-running exquisite-corpus via wordfreq 2. The
frequencies for most languages were identical. Small changes that move
words by a few places in the list appeared in Chinese, Japanese, and
Korean. There are also even smaller changes in Bengali and Hindi.
The source of the CJK change is that Roman letters are case-folded
_before_ Jieba or MeCab tokenization, which changes their output in a
few cases.
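
Here is a sketch of that ordering for Chinese, calling Jieba directly. The two helpers are hypothetical and are not the exquisite-corpus pipeline; they only show why case-folding before segmentation can give different tokens than case-folding afterwards.

```python
import jieba  # the Chinese tokenizer used for the zh word counts

def fold_then_tokenize(text):
    # New order: case-fold Roman letters first, then segment.
    return jieba.lcut(text.casefold())

def tokenize_then_fold(text):
    # Old order: segment first, lower-case each token afterwards.
    return [token.casefold() for token in jieba.lcut(text)]

sample = "我的iPhone坏了"
print(fold_then_tokenize(sample))
print(tokenize_then_fold(sample))
# For most strings the two orders agree, which matches the observation
# that only a few frequencies shifted in the regenerated data.
```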
In Hindi, one word changed frequency in the top 500. In Bengali, none of
those words changed frequency, but the data file is still different.
I'm not sure I have as solid an explanation here, except that these
languages use the regex tokenizer, and we just updated the `regex`
dependency, which could affect some edge cases in these languages.
* Tokenize by graphemes, not codepoints (see the sketch after this list)
* Add more documentation to TOKEN_RE
* Remove extra line break
* Update docstring - Brahmic scripts are no longer an exception
* Approve using version 2017.07.28 of regex
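
For the 'tokenize by graphemes, not codepoints' change above, here is a small illustration using the `regex` module's `\X` pattern, which matches extended grapheme clusters. This is only a sketch of the concept, not wordfreq's actual TOKEN_RE.

```python
import regex  # the third-party 'regex' module, not the stdlib 're'

# A Devanagari word: each consonant plus its vowel sign or nasalization
# mark is one grapheme cluster built from several codepoints, so the
# two ways of splitting disagree.
word = "हिंदी"

codepoints = list(word)
graphemes = regex.findall(r"\X", word)

print(len(codepoints), codepoints)  # one item per codepoint
print(len(graphemes), graphemes)    # one item per user-perceived character

# Matching grapheme clusters keeps combining marks attached to their
# base characters, which is why Brahmic scripts no longer need to be
# treated as an exception.
```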
This changes the version from 1.4.2 to 1.5. Things done in this update include:
* Include Common Crawl; support 11 more languages
* New frequency-merging strategy
* New sources: Chinese from Wikipedia (mostly Traditional characters), and a big Dutch word list
* Remove low-quality sources: Greek Twitter (too often, kaomoji were detected as Greek) and Ukrainian Common Crawl. As a result, Ukrainian is dropped as an available language, and Greek is no longer a 'large' language after all.
* Add Korean tokenization, and include MeCab files in data
* Remove marks from more languages
* Deal with commas and cedillas in Turkish and Romanian (see the sketch after this list)
* add a Hebrew tokenization test
* fix terminology in docstrings about abjad scripts
* combine Japanese and Korean tokenization into the same function
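
Regarding the commas-and-cedillas item a few bullets up: Romanian's correct letters use a comma below (ș, ț), but legacy-encoded text often contains the cedilla forms (ş, ţ), while Turkish genuinely uses the cedilla ş. The sketch below shows the kind of per-language cleanup involved; the names and the direction of the mapping are assumptions for illustration, not wordfreq's actual code.

```python
# Map the legacy cedilla forms to the comma-below letters Romanian
# actually uses; map stray comma-below letters back to the cedilla
# form for Turkish.
ROMANIAN_FIX = str.maketrans({
    "\u015e": "\u0218",  # Ş (cedilla) -> Ș (comma below)
    "\u015f": "\u0219",  # ş (cedilla) -> ș (comma below)
    "\u0162": "\u021a",  # Ţ (cedilla) -> Ț (comma below)
    "\u0163": "\u021b",  # ţ (cedilla) -> ț (comma below)
})
TURKISH_FIX = str.maketrans({
    "\u0218": "\u015e",  # Ș (comma below) -> Ş (cedilla)
    "\u0219": "\u015f",  # ș (comma below) -> ş (cedilla)
})

def normalize_s_t(text, lang):
    """Hypothetical helper: pick one consistent letter form per language."""
    if lang == "ro":
        return text.translate(ROMANIAN_FIX)
    if lang == "tr":
        return text.translate(TURKISH_FIX)
    return text

print(normalize_s_t("ace\u015fti", "ro"))  # cedilla ş becomes comma-below ș
print(normalize_s_t("i\u0219te", "tr"))    # stray comma-below ș becomes ş
```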