mirror of
https://github.com/rspeer/wordfreq.git
synced 2024-12-23 09:21:37 +00:00
Update the "initial vowels" in French/Catalan
User LBeaudoux observed (https://github.com/LuminosoInsight/wordfreq/pull/82) that "Œ and œ should be considered as vowels that might appear at the start of a word in French". Further investigation of the French wordfreq list revealed words in the data starting with other vowels (such as d'yvonne, d'åland, l'ïle, d'özil). This PR combines LBeaudoux's PR with that finding. (The updated regex is also used for Catalan, but should have no actual effect there. To the best of our understanding, "y" appears in Catalan only in the digraph "ny" and in foreign words: the Catalan wordlist contains "york", "by", "city", several English names, and so forth, but no real Catalan words starting with "y"; cf. "ioga", "iogurt". The wordlist did in fact contain "l'fbi" and "l'nba", but no cases of "l'" followed by a vowel like the ones found in French.)
parent c8229a5378
commit a31deec580
@@ -31,7 +31,7 @@ SPACELESS_EXPR = _make_spaceless_expr()
 
 # All vowels that might appear at the start of a word in French or Catalan,
 # plus 'h' which would be silent and imply a following vowel sound.
-INITIAL_VOWEL_EXPR = '[AEHIOUÁÉÍÓÚÀÈÌÒÙÂÊÎÔÛaehiouáéíóúàèìòùâêîôû]'
+INITIAL_VOWEL_EXPR = '[AEHIOUYÁÉÍÓÚÀÈÌÒÙÂÊÎÔÛÅÏÖŒaehiouyáéíóúàèìòùâêîôûåïöœ]'
 
 TOKEN_RE = regex.compile(
     r"""
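
To illustrate the effect of the change, here is a minimal sketch comparing the old and new character classes on elided forms like the ones mentioned in the commit message. It is not wordfreq's tokenizer: it uses the standard-library `re` module rather than `regex`, and the helper `elided_word` and the pattern it builds are hypothetical, introduced only for this comparison.

```python
import re

# Character classes copied from the old and new lines of the diff.
OLD_INITIAL_VOWEL_EXPR = '[AEHIOUÁÉÍÓÚÀÈÌÒÙÂÊÎÔÛaehiouáéíóúàèìòùâêîôû]'
NEW_INITIAL_VOWEL_EXPR = '[AEHIOUYÁÉÍÓÚÀÈÌÒÙÂÊÎÔÛÅÏÖŒaehiouyáéíóúàèìòùâêîôûåïöœ]'


def elided_word(token, vowel_expr):
    """Hypothetical helper (not part of wordfreq).

    If `token` looks like one letter, an apostrophe, and a word whose first
    character is in `vowel_expr`, return the word after the apostrophe.
    Otherwise return None.
    """
    match = re.fullmatch(r"\w'(" + vowel_expr + r"\w*)", token)
    return match.group(1) if match else None


if __name__ == "__main__":
    for token in ["d'yvonne", "l'œuvre", "l'homme", "l'fbi"]:
        old = elided_word(token, OLD_INITIAL_VOWEL_EXPR)
        new = elided_word(token, NEW_INITIAL_VOWEL_EXPR)
        print(f"{token}: old={old!r} new={new!r}")
```

Under this simplified pattern, "d'yvonne" and "l'œuvre" are recognized only with the new class, "l'homme" is recognized with both, and "l'fbi" with neither, which matches the commit message's point that the Catalan cases are unaffected.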