bpo-33338: [lib2to3] Synchronize token.py and tokenize.py with stdlib#6572
Closed
ambv wants to merge 3 commits into python:main from
Conversation
lib2to3's token.py and tokenize.py were initially copies of the respective files from the standard library. They were copied to allow Python 3 to read Python 2's grammar. Since 2006, lib2to3 grew to be widely used as a Concrete Syntax Tree, also for parsing Python 3 code. Additions to support Python 3 grammar were made but sadly, the main token.py and tokenize.py diverged.

This change brings them back together, minimizing the differences to the bare minimum that is in fact required by lib2to3. Before this change, almost every line in lib2to3/pgen2/tokenize.py was different from tokenize.py. After this change, the diff between the two files is only 200 lines long and is entirely filled with relevant Python 2 compatibility bits.

Merging the implementations brought numerous fixes to the lib2to3 tokenizer:

- docstrings made as similar as possible
- ported `TokenInfo`
- ported `tokenize.tokenize()` and `tokenize.open()`
- removed Python 2-only implementation cruft
- made Unicode identifier handling the same
- made string prefix handling the same
- added Ellipsis to the Special group
- Untokenizer backported bugfixes: 5e6db31, 9dc3a36, 5b8d2c3, e411b66, BPO-2495
- `detect_encoding` tries to figure out a filename, and `find_cookie` uses the filename in error messages, if available
- `find_cookie` bugfix: BPO-14990
- BPO-16152: the tokenizer doesn't crash on a missing newline at the end of the stream (added `\Z` (end of string) to PseudoExtras)

Improvements to token.py:

- taken from the current Lib/token.py
- tokens renumbered to match Lib/token.py
- `__all__` properly defined
- ASYNC, AWAIT and BACKQUOTE exist under different numbers (100 + old number)
- ELLIPSIS added
- ENCODING added
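To illustrate the stdlib behavior being ported, here is a short sketch using the standard `tokenize` module: `tokenize.tokenize()` yields `TokenInfo` named tuples (starting with an ENCODING token), `...` surfaces as an ELLIPSIS via `TokenInfo.exact_type`, and `detect_encoding()` honors PEP 263 coding cookies.

```python
import io
import tokenize

# tokenize.tokenize() takes a readline callable over bytes and yields
# TokenInfo named tuples; the first token reports the source encoding.
source = b"x = ...\n"
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

print(tokens[0].type == tokenize.ENCODING)  # True: first token is ENCODING
print(tokens[0].string)                     # utf-8 (the default encoding)

# '...' is a single ELLIPSIS token, exposed through TokenInfo.exact_type.
print(any(t.exact_type == tokenize.ELLIPSIS for t in tokens))  # True

# detect_encoding() reads up to two lines and recognizes PEP 263 cookies;
# encoding aliases are normalized ('latin-1' becomes 'iso-8859-1').
enc, first_lines = tokenize.detect_encoding(
    io.BytesIO(b"# -*- coding: latin-1 -*-\n").readline)
print(enc)  # iso-8859-1
```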
benjaminp
reviewed
Apr 24, 2018
'[' [listmaker] ']' |
'{' [dictsetmaker] '}' |
'`' testlist1 '`' |
NAME | NUMBER | STRING+ | '.' '.' '.')
Contributor
I'm pretty sure this is required to parse Python 2, where something like `x[. ..]` is a perfectly valid way of writing `x[...]`.
Contributor
Author
Sigh, you're right. I doubt anybody actually does this in practice though.
But yeah, for completeness we'd have to retain atoms with three dots. Technically not a loss in functionality but definitely a loss in convenience for the programmer.
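A quick check under Python 3 illustrates the point: `...` is a single ELLIPSIS token there, so dots separated by whitespace no longer parse, whereas Python 2 tokenized them as three independent `.` tokens.

```python
import ast

# In Python 3, '...' is one ELLIPSIS token, so 'x[...]' parses fine.
ast.parse("x[...]")

# In Python 2, '...' was three separate '.' tokens, making 'x[. ..]'
# legal; Python 3 rejects it. This is why lib2to3's grammar must keep
# the "'.' '.' '.'" alternative to parse legacy Python 2 code.
try:
    ast.parse("x[. ..]")
except SyntaxError:
    print("x[. ..] is a SyntaxError in Python 3")
```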
merwok
reviewed
Nov 22, 2018
- automatic encoding detection and yielding ENCODING tokens;
- Unicode identifiers are now supported;
- ELLIPSIS is its own token type now;
- Untokenizer improved with backports of 5e6db31, 9dc3a36, 5b8d2c3, e411b66, and BPO-2495.
Member
The problem was:
Warning, treated as error:
../build/NEWS:137:Bullet list ends without a blank line; unexpected unindent.
which I think means that this is the correct rst format:
- bla bla item one
  second line is indented
- bla bla second item after blank line
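For reference, a minimal well-formed version of that pattern in reST: continuation lines are indented to align with the item text, and a blank line separates the list from any following unindented text.

```rst
- first item, whose description
  continues on an indented line
- second item

Unindented text after the list needs a preceding blank line.
```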
(This is Step 1 in BPO-33337. See there for larger context.)