Update the "initial vowels" in French/Catalan

User LBeaudoux observed (https://github.com/LuminosoInsight/wordfreq/pull/82)
that "Œ and œ should be considered as vowels that might appear at the start of
a word in French".  Further investigation of the French wordfreq list revealed
words in the data starting with other vowels (such as d'yvonne, d'åland, l'ïle,
d'özil).  This PR combines LBeaudoux's PR with the additional vowels found there.

(The updated regex is also used for Catalan, but should have no actual effect.
To the best of our understanding, "y" appears in Catalan only in the digraph
"ny" and in foreign words--the Catalan wordlist contains "york", "by", "city",
several English names, and so forth, but no real Catalan words starting with
"y"; cf "ioga", "iogurt".  The wordlist in fact contained "l'fbi" and "l'nba",
but no cases of "l'" followed by a vowel like the ones found in French.)
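
As a rough illustration of the change, the sketch below compares the old and new character classes on the example words mentioned above. It uses the standard-library `re` module rather than `regex` to stay self-contained, and the helper `starts_with_initial_vowel` is hypothetical, not part of wordfreq; it only checks the first character of a word and does not reproduce the full tokenizer.

```python
import re
import unicodedata

# Character classes copied from the diff in this commit (before and after).
OLD_INITIAL_VOWEL_EXPR = '[AEHIOUÁÉÍÓÚÀÈÌÒÙÂÊÎÔÛaehiouáéíóúàèìòùâêîôû]'
NEW_INITIAL_VOWEL_EXPR = '[AEHIOUYÁÉÍÓÚÀÈÌÒÙÂÊÎÔÛÅÏÖŒaehiouyáéíóúàèìòùâêîôûåïöœ]'


def starts_with_initial_vowel(word, vowel_expr):
    """Hypothetical helper: does the first character of `word` match the class?

    The word is NFC-normalized first (an assumption of this sketch) so that
    accented letters are single code points, as they are in the classes.
    """
    word = unicodedata.normalize('NFC', word)
    return re.match(vowel_expr, word) is not None


# The part after the apostrophe in d'yvonne, d'åland, l'ïle, d'özil, plus an
# "œ" word; "homme", "fbi" and "nba" are included for contrast.
for stem in ['yvonne', 'åland', 'ïle', 'özil', 'œuvre', 'homme', 'fbi', 'nba']:
    old = starts_with_initial_vowel(stem, OLD_INITIAL_VOWEL_EXPR)
    new = starts_with_initial_vowel(stem, NEW_INITIAL_VOWEL_EXPR)
    print(f'{stem}: old={old} new={new}')
```

Under these assumptions, only the first five stems change status between the two classes; "homme" matches both, and "fbi" and "nba" match neither.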
Lance Nathan 2020-10-08 12:23:22 -04:00
parent c8229a5378
commit a31deec580

@@ -31,7 +31,7 @@ SPACELESS_EXPR = _make_spaceless_expr()
 # All vowels that might appear at the start of a word in French or Catalan,
 # plus 'h' which would be silent and imply a following vowel sound.
-INITIAL_VOWEL_EXPR = '[AEHIOUÁÉÍÓÚÀÈÌÒÙÂÊÎÔÛaehiouáéíóúàèìòùâêîôû]'
+INITIAL_VOWEL_EXPR = '[AEHIOUYÁÉÍÓÚÀÈÌÒÙÂÊÎÔÛÅÏÖŒaehiouyáéíóúàèìòùâêîôûåïöœ]'
 TOKEN_RE = regex.compile(
     r"""