Impl more codecs and _codecs._unregister_error by youknowone · Pull Request #7025 · RustPython/RustPython

youknowone · 2026-02-06T03:08:07Z

Summary by CodeRabbit

New Features
- Added ability to unregister custom codec error handlers (operation returns success/failure and protects built-in handlers)
Bug Fixes
- Improved error messages with codec-specific context during encode/decode operations
- Updated encoding name normalization and UTF-8 alias handling
- Ensured codec error notes are preserved across lookups and caching

coderabbitai · 2026-02-06T03:08:17Z

📝 Walkthrough

Walkthrough

Adds an API to unregister codec error handlers, prevents removing built-ins, and returns a boolean on removal; replaces encoding normalization logic; routes encode/decode/text error paths through an inspector that annotates exceptions with codec-specific notes; exposes a Python binding for unregistering errors.

Changes

Cohort / File(s)	Summary
Codec registry & error handling `crates/vm/src/codecs.rs`	Adds `pub fn unregister_error(&self, name: &str, vm: &VirtualMachine) -> PyResult<bool>`; guards against unregistering built-in handlers; operates on inner error map to remove custom handlers.
Encoding normalization & aliasing `crates/vm/src/codecs.rs`	Replaces normalization algorithm: collapse non-alnum/non-dot to `_`, strip non-ASCII, lowercase ASCII; short-circuits if unchanged; treats `cp65001` as canonical UTF-8 alias in CP mapping.
Encode/decode error annotation `crates/vm/src/codecs.rs`	Wraps encode/decode/text call paths with `inspect_err` and introduces `add_codec_note` to attach codec-specific notes to exceptions; updates several error message wordings.
Python stdlib binding `crates/vm/src/stdlib/codecs.rs`	Adds `_unregister_error(errors: PyStrRef, vm: &VirtualMachine) -> PyResult<bool>` exposed to Python, validating input and delegating to the VM codec registry.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Better codecs and fix lots of test_io and other expectedFailures #6997: Modifies the same codecs modules and touches codec lookup/normalization and error-handling paths.

Suggested reviewers

ShaharNaveh

Poem

I nibble bytes and hop through code,
Unregistering traps in a tidy mode,
I stitch names to underscores fine,
Whispering notes when errors whine,
A floppy-eared maintainer, light and bold. 🐇

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main changes: implementing additional codec functionality and adding the _unregister_error method to the _codecs module.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-02-06T04:10:08Z

Code has been automatically formatted

The code in this PR has been formatted using:

cargo fmt --all
Please pull the latest changes before pushing again:

git pull origin codecs

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@crates/vm/src/codecs.rs`:
- Around line 471-472: The cp65001 branch is using a case-sensitive equality
check while the utf branch lowercases the encoding; modify the comparison for
"cp65001" in the function handling encoding (where encoding is matched and
returns Some(Self::Utf8)) to perform a case-insensitive check—e.g., normalize
`encoding` with `to_lowercase()` or compare using a case-insensitive method
before testing for "cp65001" so variants like "CP65001" also map to Self::Utf8.

In `@crates/vm/src/stdlib/codecs.rs`:
- Around line 101-106: _add validation to _unregister_error matching
lookup_error: before calling vm.state.codec_registry.unregister_error(), check
the provided errors name for embedded NUL and surrogate problems like
lookup_error does — if the string contains '\0' return a ValueError("embedded
null character"), and perform the same UTF-8/surrogate check (the same
PyUnicode_AsUTF8AndSize-style validation used in lookup_error) so
surrogate-containing names raise a UnicodeEncodeError; essentially copy the
validation branch from lookup_error (the null-byte check and the call that
forces the surrogate/encoding error) into _unregister_error, then delegate to
unregister_error only after those checks pass.

🧹 Nitpick comments (1)

crates/vm/src/codecs.rs (1)
399-401: Add explicit parentheses around the && sub-expression for clarity.

The expression relies on && binding tighter than ||, which is correct, but easily misread — and clippy::precedence may flag the ! combined with binary operators. Explicit parentheses eliminate any ambiguity.
♻️ Suggested fix
     let needs_transform = encoding
         .bytes()
-        .any(|b| b.is_ascii_uppercase() || !b.is_ascii_alphanumeric() && b != b'.');
+        .any(|b| b.is_ascii_uppercase() || (!b.is_ascii_alphanumeric() && b != b'.'));
As per coding guidelines, "Always run clippy to lint code (cargo clippy) before completing tasks and fix any warnings or lints introduced by your changes".

coderabbitai · 2026-02-06T05:33:02Z

crates/vm/src/codecs.rs

+        } else if encoding == "cp65001" {
            Some(Self::Utf8)


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Find all call sites of StandardEncoding::parse rg -n --type=rust 'StandardEncoding::parse' -C3

Repository: RustPython/RustPython

Length of output: 1150

🏁 Script executed:

# Also check the context around lines 471-472 in the codecs.rs file rg -n --type=rust 'cp65001|utf' crates/vm/src/codecs.rs -C5

Repository: RustPython/RustPython

Length of output: 1116

🏁 Script executed:

# Find the StandardEncoding::parse function definition to understand the full implementation ast-grep --pattern 'fn parse($_, $_) { $$$ }'

Repository: RustPython/RustPython

Length of output: 47

🏁 Script executed:

#!/bin/bash # Find the context struct definition to understand ctx.encoding rg -n --type=rust 'struct.*Ctx|struct.*Context' crates/vm/src/codecs.rs -A10 | head -50

Repository: RustPython/RustPython

Length of output: 770

🏁 Script executed:

# Find the functions that handle encode/decode errors mentioned in the review rg -n --type=rust 'call_native_encode_error|call_native_decode_error' crates/vm/src/codecs.rs -B5 -A10

Repository: RustPython/RustPython

Length of output: 3859

🏁 Script executed:

# Search for where encoding attribute comes from in exception handling rg -n --type=rust 'encoding.*attribute|exception.*encoding' crates/vm/src/codecs.rs -C5

Repository: RustPython/RustPython

Length of output: 47

🏁 Script executed:

#!/bin/bash # Get the full function for call_native_encode_error sed -n '984,1022p' crates/vm/src/codecs.rs

Repository: RustPython/RustPython

Length of output: 1346

🏁 Script executed:

# Get the full function for call_native_decode_error sed -n '1024,1080p' crates/vm/src/codecs.rs

Repository: RustPython/RustPython

Length of output: 1956

🏁 Script executed:

# Look for where StandardEncoding::parse is called with the extracted encoding rg -n --type=rust 'StandardEncoding::parse' crates/vm/src/codecs.rs -B10 -A2

Repository: RustPython/RustPython

Length of output: 1103

Use case-insensitive comparison for cp65001 to match the UTF branch behavior.

The utf prefix branch normalizes the encoding via encoding.to_lowercase() before checking, but the cp65001 branch uses a case-sensitive equality check. This inconsistency means "CP65001" or "Cp65001" would not match when the encoding comes from an exception's encoding attribute, which is extracted directly without prior normalization.

🐛 Suggested fix

- } else if encoding == "cp65001" { + } else if encoding.eq_ignore_ascii_case("cp65001") {

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

} else if encoding == "cp65001" {

Some(Self::Utf8)

} else if encoding.eq_ignore_ascii_case("cp65001") {

Some(Self::Utf8)

🤖 Prompt for AI Agents

In `@crates/vm/src/codecs.rs` around lines 471 - 472, The cp65001 branch is using a case-sensitive equality check while the utf branch lowercases the encoding; modify the comparison for "cp65001" in the function handling encoding (where encoding is matched and returns Some(Self::Utf8)) to perform a case-insensitive check—e.g., normalize `encoding` with `to_lowercase()` or compare using a case-insensitive method before testing for "cp65001" so variants like "CP65001" also map to Self::Utf8.

crates/vm/src/stdlib/codecs.rs

- Rewrite _pycodecs.py: escape_decode, charmap_decode, unicode_escape_decode/encode, raw_unicode_escape_decode/encode - Add _codecs._unregister_error with built-in handler protection and null byte/surrogate validation - Normalize encoding name in register_manual - Add codec error notes via add_note - Fix cp65001 encoding constant in StandardEncoding::parse - Remove 35 expectedFailure markers from test_codecs, test_pickle, test_datetime, pickletools

github-actions · 2026-02-06T08:45:09Z

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[x] lib: cpython/Lib/codecs.py
[ ] test: cpython/Lib/test/test_codecs.py (TODO: 11)
[ ] test: cpython/Lib/test/test_codeccallbacks.py (TODO: 9)
[ ] test: cpython/Lib/test/test_codecencodings_cn.py
[ ] test: cpython/Lib/test/test_codecencodings_hk.py
[ ] test: cpython/Lib/test/test_codecencodings_iso2022.py
[ ] test: cpython/Lib/test/test_codecencodings_jp.py
[ ] test: cpython/Lib/test/test_codecencodings_kr.py
[ ] test: cpython/Lib/test/test_codecencodings_tw.py
[ ] test: cpython/Lib/test/test_codecmaps_cn.py
[ ] test: cpython/Lib/test/test_codecmaps_hk.py
[ ] test: cpython/Lib/test/test_codecmaps_jp.py
[ ] test: cpython/Lib/test/test_codecmaps_kr.py
[ ] test: cpython/Lib/test/test_codecmaps_tw.py
[ ] test: cpython/Lib/test/test_charmapcodec.py
[ ] test: cpython/Lib/test/test_multibytecodec.py

dependencies:

codecs

dependent tests: (119 tests)

[x] lib: cpython/Lib/pickletools.py
[ ] test: cpython/Lib/test/test_pickletools.py (TODO: 8)

dependencies:

pickletools

dependent tests: (1 tests)

pickletools: test_pickletools

[x] lib: cpython/Lib/pickle.py
[ ] lib: cpython/Lib/_compat_pickle.py
[ ] test: cpython/Lib/test/test_pickle.py (TODO: 22)
[ ] test: cpython/Lib/test/test_picklebuffer.py (TODO: 12)
[ ] test: cpython/Lib/test/test_pickletools.py (TODO: 8)

dependencies:

dependent tests: (70 tests)

pickle: test_annotationlib test_array test_asyncio test_builtin test_bytes test_bz2 test_codecs test_collections test_concurrent_futures test_coroutines test_csv test_ctypes test_decimal test_defaultdict test_deque test_descr test_dict test_dictviews test_email test_enum test_enumerate test_exceptions test_fractions test_functools test_generators test_genericalias test_http_cookies test_importlib test_inspect test_io test_ipaddress test_iter test_itertools test_list test_logging test_lzma test_memoryio test_memoryview test_opcache test_operator test_ordered_dict test_os test_pathlib test_pickle test_pickletools test_platform test_plistlib test_positional_only_arg test_posix test_random test_range test_set test_shelve test_slice test_socket test_statistics test_str test_string test_trace test_tuple test_types test_typing test_unittest test_uuid test_xml_dom_minicompat test_xml_etree test_zipfile test_zlib test_zoneinfo
- logging.handlers: test_concurrent_futures

Legend:

[+] path exists in CPython
[x] up-to-date, [ ] outdated

youknowone force-pushed the codecs branch from 6f33d97 to d95fcac Compare February 6, 2026 04:09

youknowone marked this pull request as ready for review February 6, 2026 05:22

coderabbitai bot reviewed Feb 6, 2026

View reviewed changes

youknowone force-pushed the codecs branch from 16f6d62 to 053980d Compare February 6, 2026 08:43

youknowone changed the title ~~Impl more codecs~~ Impl more codecs and _codecs._unregister_error Feb 6, 2026

youknowone merged commit 8127000 into RustPython:main Feb 6, 2026
13 of 14 checks passed

youknowone deleted the codecs branch February 6, 2026 11:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Impl more codecs and _codecs._unregister_error#7025

Impl more codecs and _codecs._unregister_error#7025
youknowone merged 1 commit intoRustPython:mainfrom
youknowone:codecs

youknowone commented Feb 6, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Feb 6, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

github-actions bot commented Feb 6, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Feb 6, 2026

Uh oh!

Uh oh!

github-actions bot commented Feb 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

youknowone commented Feb 6, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Feb 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Uh oh!

github-actions bot commented Feb 6, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 6, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions bot commented Feb 6, 2026

📦 Library Dependencies

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

youknowone commented Feb 6, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Feb 6, 2026 •

edited

Loading