gh-136595: Normalize surrogate pairs in REPL input to fix UnicodeEnco…#136639
Draft
vedant713 wants to merge 3 commits into python:main from
…deEncodeError on Windows
The new REPL implementation (_pyrepl) crashes on Windows when the user inputs Unicode characters outside the Basic Multilingual Plane (≥ U+10000), such as emoji (e.g. 🐍). This happens because the Windows input layer provides surrogate pairs (UTF-16 code units) that _pyrepl attempts to process and tokenize directly, leading to unpaired surrogate handling issues.
This commit introduces a `normalize_surrogates()` helper in `Reader` to explicitly normalize surrogate pairs by encoding to UTF-16 with 'surrogatepass' and decoding back. The `get_unicode()` method is patched to use this normalization so that any code consuming REPL input (e.g. syntax highlighting via tokenize) receives valid Unicode text.
This resolves UnicodeEncodeError crashes in the REPL when typing emoji or other non-BMP characters on Windows.
Fixes python#136595
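A minimal sketch of the round trip the description refers to, for illustration only. The function body is an assumption based on the description's wording (encode to UTF-16 with 'surrogatepass', then decode back); it is written as a standalone function rather than as the `Reader` helper, and the PR's actual code may differ.

```python
def normalize_surrogates(text: str) -> str:
    """Combine UTF-16 surrogate pairs into single code points.

    Assumed implementation, not copied from the PR.
    """
    # 'surrogatepass' lets the encoder emit surrogate code units as-is,
    # so a high/low pair becomes a valid UTF-16 byte sequence.
    data = text.encode("utf-16", "surrogatepass")
    # Strict decoding then reads each pair back as one non-BMP character.
    return data.decode("utf-16")


# The snake emoji delivered as two separate surrogate code units.
raw = "\ud83d\udc0d"
fixed = normalize_surrogates(raw)
print(len(raw), len(fixed))    # 2 1
print(fixed == "\U0001F40D")   # True
```

Text that contains no surrogates passes through unchanged, since the BOM written by the encoder is consumed again by the decoder.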
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the …
Member
This implementation fails if there are lone surrogate characters. Even after fixing this, it will not completely solve the original issue for the case of lone surrogate characters -- we need to handle this at the encoding to UTF-8 step. See also a different (regular expression based) implementation in #121219.
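The two points in this review comment can be illustrated with a short sketch. It first shows the UTF-16 round trip raising on a lone surrogate, then a regular expression based alternative that combines only well-formed pairs. The names `combine_surrogate_pairs` and `_SURROGATE_PAIR` are hypothetical, and this is not the code from #121219; it only demonstrates the general idea and why a lone surrogate still needs handling when encoding to UTF-8.

```python
import re

# A high surrogate immediately followed by a low surrogate.
_SURROGATE_PAIR = re.compile("[\ud800-\udbff][\udc00-\udfff]")


def combine_surrogate_pairs(text: str) -> str:
    """Replace each well-formed surrogate pair with its code point.

    Hypothetical helper; lone surrogates are left untouched.
    """
    def _join(match: re.Match) -> str:
        high, low = match.group()
        return chr(0x10000 + ((ord(high) - 0xD800) << 10) + (ord(low) - 0xDC00))

    return _SURROGATE_PAIR.sub(_join, text)


lone = "a\ud83db"  # a high surrogate with no matching low surrogate

# 1. The encode/decode round trip raises on a lone surrogate.
try:
    lone.encode("utf-16", "surrogatepass").decode("utf-16")
    round_trip_failed = False
except UnicodeDecodeError:
    round_trip_failed = True

# 2. The regex version does not raise, but the lone surrogate survives.
combined = combine_surrogate_pairs("\ud83d\udc0d" + lone)

# 3. So strict UTF-8 encoding still fails afterwards.
try:
    combined.encode("utf-8")
    utf8_failed = False
except UnicodeEncodeError:
    utf8_failed = True

# One possible way to handle it at the UTF-8 step: pick an error handler.
safe_bytes = combined.encode("utf-8", "replace")
```

Here `"replace"` is only one choice of error handler; which handler suits the REPL is a design decision this sketch does not settle.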