constant.numeric.dec.python only has 2 capture groups #198

asottile · 2020-02-22T05:08:22Z

also validated this against oniguruma via onigurumacffi

>>> REG = '''\
...        (?x)
...          (?<![\w\.])(?:
...              [1-9](?: _?[0-9] )*
...              |
...              0+
...              |
...              [0-9](?: _?[0-9] )* ([jJ])
...              |
...              0 ([0-9]+)(?![eE\.])
...          )\b
... '''
>>> import onigurumacffi
>>> reg = onigurumacffi.compile(REG)
>>> reg.number_of_captures()
2

asottile · 2020-02-22T05:08:37Z

actually, on second thought -- I should probably check the rest of the regexen too 🤔

asottile · 2020-02-22T05:29:55Z

ok there are more, but I am having a hard time wrapping my head around the include scheme

here's the script I used to find them (can probably be improved a little bit to have more/better output):

import argparse
import re
import plistlib
from typing import Any
from typing import Dict

import onigurumacffi

_BACKREF_RE = re.compile(r'((?<!\\)(?:\\\\)*)\\([0-9]+)')


def _fix_end(s: str) -> str:
    """end can have backreferences"""
    return _BACKREF_RE.sub('ZZZ', s)


def _visit_captures(reg: str, captures: Dict[str, Dict[str, Any]]) -> None:
    max_n = onigurumacffi.compile(reg).number_of_captures()
    for k, v in captures.items():
        if int(k) > max_n:
            print(f'{k} > {max_n}: {reg!r} {v}')
        _visit_rule(v)


def _visit_rule(rule: Dict[str, Any]) -> None:
    if 'match' in rule and 'captures' in rule:
        _visit_captures(rule['match'], rule['captures'])
    if 'begin' in rule:
        if 'captures' in rule:
            _visit_captures(rule['begin'], rule['captures'])
            _visit_captures(_fix_end(rule['end']), rule['captures'])
        if 'beginCaptures' in rule:
            _visit_captures(rule['begin'], rule['beginCaptures'])
        if 'endCaptures' in rule:
            _visit_captures(_fix_end(rule['end']), rule['endCaptures'])

    for sub_rule in rule.get('patterns', ()):
        _visit_rule(sub_rule)

    for sub_rule in rule.get('repository', {}).values():
        _visit_rule(sub_rule)


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument('filename')
    args = parser.parse_args()

    with open(args.filename, 'rb') as f:
        contents = plistlib.load(f)

    _visit_rule(contents)
    return 0


if __name__ == '__main__':
    exit(main())

$ python3 t.py grammars/MagicPython.tmLanguage 
2 > 1: "(\\]|(?=\\'\\'\\'))" {'name': 'invalid.illegal.newline.python'}
2 > 1: "(\\)|(?=\\'\\'\\'))" {'name': 'invalid.illegal.newline.python'}
2 > 1: "(\\)|(?=\\'\\'\\'))" {'name': 'invalid.illegal.newline.python'}
2 > 1: "(\\)|(?=\\'\\'\\'))" {'name': 'invalid.illegal.newline.python'}
2 > 1: "(\\)|(?=\\'\\'\\'))" {'name': 'invalid.illegal.newline.python'}
2 > 1: "(\\)|(?=\\'\\'\\'))" {'name': 'invalid.illegal.newline.python'}
2 > 1: "(\\)|(?=\\'\\'\\'))" {'name': 'invalid.illegal.newline.python'}
2 > 1: "(\\)|(?=\\'\\'\\'))" {'name': 'invalid.illegal.newline.python'}
2 > 1: "(\\)|(?=\\'\\'\\'))" {'name': 'invalid.illegal.newline.python'}
2 > 1: "(\\)|(?=\\'\\'\\'))" {'name': 'invalid.illegal.newline.python'}
2 > 1: '(\\]|(?="""))' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(\\)|(?="""))' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(\\)|(?="""))' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(\\)|(?="""))' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(\\)|(?="""))' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(\\)|(?="""))' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(\\)|(?="""))' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(\\)|(?="""))' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(\\)|(?="""))' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(\\)|(?="""))' {'name': 'invalid.illegal.newline.python'}
2 > 1: "(\\'\\'\\')" {'name': 'invalid.illegal.newline.python'}
2 > 1: '(""")' {'name': 'invalid.illegal.newline.python'}
2 > 1: "(\\)|(?=\\'\\'\\'))" {'name': 'invalid.illegal.newline.python'}
2 > 1: "(\\)|(?=\\'\\'\\'))" {'name': 'invalid.illegal.newline.python'}
2 > 1: "(\\)|(?=\\'\\'\\'))" {'name': 'invalid.illegal.newline.python'}
2 > 1: "(\\)|(?=\\'\\'\\'))" {'name': 'invalid.illegal.newline.python'}
2 > 1: "(\\)|(?=\\'\\'\\'))" {'name': 'invalid.illegal.newline.python'}
2 > 1: "(\\)|(?=\\'\\'\\'))" {'name': 'invalid.illegal.newline.python'}
2 > 1: "(\\)|(?=\\'\\'\\'))" {'name': 'invalid.illegal.newline.python'}
2 > 1: "(\\)|(?=\\'\\'\\'))" {'name': 'invalid.illegal.newline.python'}
2 > 1: '(\\)|(?="""))' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(\\)|(?="""))' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(\\)|(?="""))' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(\\)|(?="""))' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(\\)|(?="""))' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(\\)|(?="""))' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(\\)|(?="""))' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(\\)|(?="""))' {'name': 'invalid.illegal.newline.python'}
2 > 1: "(\\'\\'\\')" {'name': 'invalid.illegal.newline.python'}
2 > 1: '(""")' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(ZZZ)' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(ZZZ)' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(ZZZ)' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(ZZZ)' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(ZZZ)' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(ZZZ)' {'name': 'invalid.illegal.newline.python'}
2 > 1: '(ZZZ)' {'name': 'invalid.illegal.newline.python'}

asottile · 2020-04-22T21:25:37Z

@elprans @vpetrovykh anything left to do here?

asottile · 2020-07-30T17:40:25Z

@1st1 maybe?

vpetrovykh · 2020-07-31T08:22:52Z

OK, the change looks good.

As for the other hits, all of the invalid.illegal.newline.python are, as far as I can tell, the product of us using templates in our syntax.yaml files to build various flavors of strings and regexp rules. For single-line strings we have a rule for detecting the illegal newline (unterminated string), but there's no such thing for the multi-line variants, however the template still has the match group and the scope assigned to that. So barring making a more elaborate template system we'll probably keep the current setup as it appears to do no harm.

Sorry for the delay in merging this.

constant.numeric.dec.python only has 2 capture groups

Loading status checks…

bbc1a54

elprans requested a review from vpetrovykh Feb 22, 2020

elprans assigned vpetrovykh Feb 22, 2020

vpetrovykh merged commit 225fa4c into MagicStack:master Jul 31, 2020
1 check passed

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

asottile deleted the asottile:dec_illegal_captures branch Jul 31, 2020

Nov	DEC	Jan
	06
2019	2020	2021

MagicStack / MagicPython

constant.numeric.dec.python only has 2 capture groups #198

constant.numeric.dec.python only has 2 capture groups #198

asottile commented Feb 22, 2020

asottile commented Feb 22, 2020

asottile commented Feb 22, 2020

asottile commented Apr 22, 2020

asottile commented Jul 30, 2020

vpetrovykh commented Jul 31, 2020 •

edited

MagicStack / MagicPython

Join GitHub today

GitHub is where the world builds software

constant.numeric.dec.python only has 2 capture groups #198

constant.numeric.dec.python only has 2 capture groups #198

Conversation

asottile commented Feb 22, 2020

asottile commented Feb 22, 2020

asottile commented Feb 22, 2020

asottile commented Apr 22, 2020

asottile commented Jul 30, 2020

vpetrovykh commented Jul 31, 2020 • edited

Essential cookies

Always active

Analytics cookies

vpetrovykh commented Jul 31, 2020 •

edited