Message 358458 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	maxking
Recipients	barry, maxking, mkaiser, r.david.murray
Date	2019-12-15.23:05:17
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<[email protected]>
In-reply-to

Content
I tried to take a look at the code to see where the fix needs to be and I probably need some help. I looked at the parse tree for the header and it looks something like this: ContentDisposition([Token([ValueTerminal('attachment')]), ValueTerminal(';'), MimeParameters([Parameter([Attribute([CFWSList([WhiteSpaceTerminal(' ')]), ValueTerminal('filename')]), ValueTerminal('='), Value([QuotedString([BareQuotedString([EncodedWord([ValueTerminal('Schulbesuchsbestättigung.')]), WhiteSpaceTerminal(' '), EncodedWord([ValueTerminal('pdf')])])])])])])]) The offending piece of code, which seems to be working as designed is get_bare_quoted_string() in email/_header_value_parser.py. while value and value[0] != '"': if value[0] in WSP: token, value = get_fws(value) elif value[:2] == '=?': try: token, value = get_encoded_word(value) bare_quoted_string.defects.append(errors.InvalidHeaderDefect( "encoded word inside quoted string")) except errors.HeaderParseError: token, value = get_qcontent(value) else: token, value = get_qcontent(value) bare_quoted_string.append(token) It just loops and parses the values. We cannot ignore the FWS until we know that the atom before and after the FWS are encoded words. I can't seem to find a clean way to look-ahead (which can perhaps be used in get_parameters()) or look-back (which can be used after parsing the entire bare_quoted_string?) in the parse tree to delete the offending whitespace. Any example of such kind of parse-tree manipulation in the code base would be awesome!

I tried to take a look at the code to see where the fix needs to be and I probably need some help.

I looked at the parse tree for the header and it looks something like this:

ContentDisposition([Token([ValueTerminal('attachment')]), ValueTerminal(';'), MimeParameters([Parameter([Attribute([CFWSList([WhiteSpaceTerminal(' ')]), ValueTerminal('filename')]), ValueTerminal('='), Value([QuotedString([BareQuotedString([EncodedWord([ValueTerminal('Schulbesuchsbestättigung.')]), WhiteSpaceTerminal('    '), EncodedWord([ValueTerminal('pdf')])])])])])])])


The offending piece of code, which seems to be working as designed is get_bare_quoted_string() in email/_header_value_parser.py. 

    while value and value[0] != '"':
        if value[0] in WSP:
            token, value = get_fws(value)
        elif value[:2] == '=?':
            try:
                token, value = get_encoded_word(value)
                bare_quoted_string.defects.append(errors.InvalidHeaderDefect(
                    "encoded word inside quoted string"))
            except errors.HeaderParseError:
                token, value = get_qcontent(value)
        else:
            token, value = get_qcontent(value)
        bare_quoted_string.append(token)

It just loops and parses the values. We cannot ignore the FWS until we know that the atom before and after the FWS are encoded words. I can't seem to find a clean way to look-ahead (which can perhaps be used in get_parameters()) or look-back (which can be used after parsing the entire bare_quoted_string?) in the parse tree to delete the offending whitespace. 

Any example of such kind of parse-tree manipulation in the code base would be awesome!

History
Date	User	Action	Args
2019-12-15 23:05:17	maxking	set	recipients: + maxking, barry, r.david.murray, mkaiser
2019-12-15 23:05:17	maxking	set	messageid: <[email protected]>
2019-12-15 23:05:17	maxking	link	issue39040 messages
2019-12-15 23:05:17	maxking	create