wordfreq/scripts/gen_regex.py

import unicodedata
from ftfy import chardata
import pathlib
from pkg_resources import resource_filename


CATEGORIES = [unicodedata.category(chr(i)) for i in range(0x110000)]
DATA_PATH = pathlib.Path(resource_filename('wordfreq', 'data'))


def func_to_regex(accept_func):
    """
    Given a function that returns True or False for a numerical codepoint,
    return a regex character class accepting the characters resulting in True.
    Ranges separated only by unassigned characters are merged for efficiency.
    """
    # Where the last range would end if it also included unassigned codepoints.
    # If we need to add a codepoint right after this point, we extend the
    # range; otherwise we start a new one.
    tentative_end = None
    ranges = []

    for codepoint, category in enumerate(CATEGORIES):
        if accept_func(codepoint):
            if tentative_end == codepoint - 1:
                ranges[-1][1] = codepoint
            else:
                ranges.append([codepoint, codepoint])
            tentative_end = codepoint
        elif category == 'Cn' and tentative_end == codepoint - 1:
            tentative_end = codepoint

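    # Range endpoints are emitted unescaped, so accept_func must not accept
    # characters that are special inside a character class (']', '\', '^',
    # '-'). The functions below reject them because they are punctuation or
    # symbols.
    return '[%s]' % ''.join(chr(r[0]) + '-' + chr(r[1]) for r in ranges)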


def cache_regex_from_func(filename, func):
    """
    Generates a regex from a function that accepts a numerical codepoint,
    and caches it in the data path at filename.
    """
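    # Write as UTF-8 explicitly: the platform's default encoding may not be
    # able to represent the characters in the generated ranges.
    with (DATA_PATH / filename).open(mode='w', encoding='utf-8') as file: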
        file.write(func_to_regex(func))


def _is_emoji_codepoint(i):
    """
    Report whether a numerical codepoint is (likely) an emoji: a Unicode 'So'
    character (as future-proofed by the ftfy chardata module) but excluding
    symbols like © and ™ below U+2600 and the replacement character U+FFFD.
    """
    return chardata.CHAR_CLASS_STRING[i] == '3' and i >= 0x2600 and i != 0xfffd


def _is_non_punct_codepoint(i):
    """
    Report whether a numerical codepoint is not one of the following classes:
    - P: punctuation
    - S: symbols
    - Z: separators
    - C: other (control, format, surrogate, private-use, and unassigned)
    This will classify symbols, including emoji, as punctuation; users who
    want to accept emoji should add them separately.
    """
    return CATEGORIES[i][0] not in 'PSZC'


def _is_combining_mark_codepoint(i):
    """
    Report whether a numerical codepoint is a combining mark (Unicode 'M').
    """
    return CATEGORIES[i][0] == 'M'


if __name__ == '__main__':
    cache_regex_from_func('emoji.txt', _is_emoji_codepoint)
    cache_regex_from_func('non_punct.txt', _is_non_punct_codepoint)
    cache_regex_from_func('combining_mark.txt', _is_combining_mark_codepoint)
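

# A minimal, standalone sketch of the range-merging idea in func_to_regex,
# for readers who want to try it without ftfy or the wordfreq data directory.
# The names merged_ranges, ranges_to_char_class, is_mark and the `limit`
# parameter are hypothetical and not part of this script. Unlike the script,
# the sketch escapes range endpoints with re.escape, which is an assumption
# about what a general-purpose version might want.

```python
import re
import unicodedata


def merged_ranges(accept_func, limit=0x3000):
    """
    Collect [start, end] ranges of accepted codepoints below `limit`,
    bridging gaps that contain only unassigned ('Cn') codepoints.
    """
    tentative_end = None
    ranges = []
    for codepoint in range(limit):
        if accept_func(codepoint):
            if tentative_end == codepoint - 1:
                # Adjacent to the last range (possibly across unassigned
                # codepoints), so extend it.
                ranges[-1][1] = codepoint
            else:
                ranges.append([codepoint, codepoint])
            tentative_end = codepoint
        elif (unicodedata.category(chr(codepoint)) == 'Cn'
              and tentative_end == codepoint - 1):
            # Unassigned codepoint right after the range: remember that the
            # range could stretch over it, without committing to it yet.
            tentative_end = codepoint
    return ranges


def ranges_to_char_class(ranges):
    """Turn [start, end] pairs into a regex character class."""
    return '[%s]' % ''.join(
        re.escape(chr(lo)) + '-' + re.escape(chr(hi)) for lo, hi in ranges
    )


def is_mark(i):
    """Same test as _is_combining_mark_codepoint, without the cached table."""
    return unicodedata.category(chr(i))[0] == 'M'


ranges = merged_ranges(is_mark)
mark_re = re.compile(ranges_to_char_class(ranges))

# Strip the combining acute accent (U+0301) from a decomposed 'é'.
stripped = mark_re.sub('', 'e\u0301')
```

# Because tentative_end only advances the recorded end of a range when another
# accepted codepoint follows, every range in the sketch starts and ends on an
# accepted codepoint; unassigned codepoints are included only when they sit
# between two accepted ones.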