Skip to content

Impl more codecs and _codecs._unregister_error#7025

Merged
youknowone merged 1 commit intoRustPython:mainfrom
youknowone:codecs
Feb 6, 2026
Merged

Impl more codecs and _codecs._unregister_error#7025
youknowone merged 1 commit intoRustPython:mainfrom
youknowone:codecs

Conversation

@youknowone
Copy link
Member

@youknowone youknowone commented Feb 6, 2026

Summary by CodeRabbit

  • New Features

    • Added ability to unregister custom codec error handlers (operation returns success/failure and protects built-in handlers)
  • Bug Fixes

    • Improved error messages with codec-specific context during encode/decode operations
    • Updated encoding name normalization and UTF-8 alias handling
    • Ensured codec error notes are preserved across lookups and caching

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 6, 2026

📝 Walkthrough

Walkthrough

Adds an API to unregister codec error handlers, prevents removing built-ins, and returns a boolean on removal; replaces encoding normalization logic; routes encode/decode/text error paths through an inspector that annotates exceptions with codec-specific notes; exposes a Python binding for unregistering errors.

Changes

Cohort / File(s) Summary
Codec registry & error handling
crates/vm/src/codecs.rs
Adds pub fn unregister_error(&self, name: &str, vm: &VirtualMachine) -> PyResult<bool>; guards against unregistering built-in handlers; operates on inner error map to remove custom handlers.
Encoding normalization & aliasing
crates/vm/src/codecs.rs
Replaces normalization algorithm: collapse non-alnum/non-dot to _, strip non-ASCII, lowercase ASCII; short-circuits if unchanged; treats cp65001 as canonical UTF-8 alias in CP mapping.
Encode/decode error annotation
crates/vm/src/codecs.rs
Wraps encode/decode/text call paths with inspect_err and introduces add_codec_note to attach codec-specific notes to exceptions; updates several error message wordings.
Python stdlib binding
crates/vm/src/stdlib/codecs.rs
Adds _unregister_error(errors: PyStrRef, vm: &VirtualMachine) -> PyResult<bool> exposed to Python, validating input and delegating to the VM codec registry.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • ShaharNaveh

Poem

I nibble bytes and hop through code,
Unregistering traps in a tidy mode,
I stitch names to underscores fine,
Whispering notes when errors whine,
A floppy-eared maintainer, light and bold. 🐇

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main changes: implementing additional codec functionality and adding the _unregister_error method to the _codecs module.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Contributor

github-actions bot commented Feb 6, 2026

Code has been automatically formatted

The code in this PR has been formatted using:

  • cargo fmt --all
    Please pull the latest changes before pushing again:
git pull origin codecs

@youknowone youknowone marked this pull request as ready for review February 6, 2026 05:22
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@crates/vm/src/codecs.rs`:
- Around line 471-472: The cp65001 branch is using a case-sensitive equality
check while the utf branch lowercases the encoding; modify the comparison for
"cp65001" in the function handling encoding (where encoding is matched and
returns Some(Self::Utf8)) to perform a case-insensitive check—e.g., normalize
`encoding` with `to_lowercase()` or compare using a case-insensitive method
before testing for "cp65001" so variants like "CP65001" also map to Self::Utf8.

In `@crates/vm/src/stdlib/codecs.rs`:
- Around line 101-106: _add validation to _unregister_error matching
lookup_error: before calling vm.state.codec_registry.unregister_error(), check
the provided errors name for embedded NUL and surrogate problems like
lookup_error does — if the string contains '\0' return a ValueError("embedded
null character"), and perform the same UTF-8/surrogate check (the same
PyUnicode_AsUTF8AndSize-style validation used in lookup_error) so
surrogate-containing names raise a UnicodeEncodeError; essentially copy the
validation branch from lookup_error (the null-byte check and the call that
forces the surrogate/encoding error) into _unregister_error, then delegate to
unregister_error only after those checks pass.
🧹 Nitpick comments (1)
crates/vm/src/codecs.rs (1)

399-401: Add explicit parentheses around the && sub-expression for clarity.

The expression relies on && binding tighter than ||, which is correct, but easily misread — and clippy::precedence may flag the ! combined with binary operators. Explicit parentheses eliminate any ambiguity.

♻️ Suggested fix
     let needs_transform = encoding
         .bytes()
-        .any(|b| b.is_ascii_uppercase() || !b.is_ascii_alphanumeric() && b != b'.');
+        .any(|b| b.is_ascii_uppercase() || (!b.is_ascii_alphanumeric() && b != b'.'));

As per coding guidelines, "Always run clippy to lint code (cargo clippy) before completing tasks and fix any warnings or lints introduced by your changes".

Comment on lines +471 to 472
} else if encoding == "cp65001" {
Some(Self::Utf8)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all call sites of StandardEncoding::parse
rg -n --type=rust 'StandardEncoding::parse' -C3

Repository: RustPython/RustPython

Length of output: 1150


🏁 Script executed:

# Also check the context around lines 471-472 in the codecs.rs file
rg -n --type=rust 'cp65001|utf' crates/vm/src/codecs.rs -C5

Repository: RustPython/RustPython

Length of output: 1116


🏁 Script executed:

# Find the StandardEncoding::parse function definition to understand the full implementation
ast-grep --pattern 'fn parse($_, $_) {
  $$$
}'

Repository: RustPython/RustPython

Length of output: 47


🏁 Script executed:

#!/bin/bash
# Find the context struct definition to understand ctx.encoding
rg -n --type=rust 'struct.*Ctx|struct.*Context' crates/vm/src/codecs.rs -A10 | head -50

Repository: RustPython/RustPython

Length of output: 770


🏁 Script executed:

# Find the functions that handle encode/decode errors mentioned in the review
rg -n --type=rust 'call_native_encode_error|call_native_decode_error' crates/vm/src/codecs.rs -B5 -A10

Repository: RustPython/RustPython

Length of output: 3859


🏁 Script executed:

# Search for where encoding attribute comes from in exception handling
rg -n --type=rust 'encoding.*attribute|exception.*encoding' crates/vm/src/codecs.rs -C5

Repository: RustPython/RustPython

Length of output: 47


🏁 Script executed:

#!/bin/bash
# Get the full function for call_native_encode_error
sed -n '984,1022p' crates/vm/src/codecs.rs

Repository: RustPython/RustPython

Length of output: 1346


🏁 Script executed:

# Get the full function for call_native_decode_error
sed -n '1024,1080p' crates/vm/src/codecs.rs

Repository: RustPython/RustPython

Length of output: 1956


🏁 Script executed:

# Look for where StandardEncoding::parse is called with the extracted encoding
rg -n --type=rust 'StandardEncoding::parse' crates/vm/src/codecs.rs -B10 -A2

Repository: RustPython/RustPython

Length of output: 1103


Use case-insensitive comparison for cp65001 to match the UTF branch behavior.

The utf prefix branch normalizes the encoding via encoding.to_lowercase() before checking, but the cp65001 branch uses a case-sensitive equality check. This inconsistency means "CP65001" or "Cp65001" would not match when the encoding comes from an exception's encoding attribute, which is extracted directly without prior normalization.

🐛 Suggested fix
-        } else if encoding == "cp65001" {
+        } else if encoding.eq_ignore_ascii_case("cp65001") {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
} else if encoding == "cp65001" {
Some(Self::Utf8)
} else if encoding.eq_ignore_ascii_case("cp65001") {
Some(Self::Utf8)
🤖 Prompt for AI Agents
In `@crates/vm/src/codecs.rs` around lines 471 - 472, The cp65001 branch is using
a case-sensitive equality check while the utf branch lowercases the encoding;
modify the comparison for "cp65001" in the function handling encoding (where
encoding is matched and returns Some(Self::Utf8)) to perform a case-insensitive
check—e.g., normalize `encoding` with `to_lowercase()` or compare using a
case-insensitive method before testing for "cp65001" so variants like "CP65001"
also map to Self::Utf8.

- Rewrite _pycodecs.py: escape_decode, charmap_decode,
  unicode_escape_decode/encode, raw_unicode_escape_decode/encode
- Add _codecs._unregister_error with built-in handler protection
  and null byte/surrogate validation
- Normalize encoding name in register_manual
- Add codec error notes via add_note
- Fix cp65001 encoding constant in StandardEncoding::parse
- Remove 35 expectedFailure markers from test_codecs,
  test_pickle, test_datetime, pickletools
@github-actions
Copy link
Contributor

github-actions bot commented Feb 6, 2026

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[x] lib: cpython/Lib/codecs.py
[ ] test: cpython/Lib/test/test_codecs.py (TODO: 11)
[ ] test: cpython/Lib/test/test_codeccallbacks.py (TODO: 9)
[ ] test: cpython/Lib/test/test_codecencodings_cn.py
[ ] test: cpython/Lib/test/test_codecencodings_hk.py
[ ] test: cpython/Lib/test/test_codecencodings_iso2022.py
[ ] test: cpython/Lib/test/test_codecencodings_jp.py
[ ] test: cpython/Lib/test/test_codecencodings_kr.py
[ ] test: cpython/Lib/test/test_codecencodings_tw.py
[ ] test: cpython/Lib/test/test_codecmaps_cn.py
[ ] test: cpython/Lib/test/test_codecmaps_hk.py
[ ] test: cpython/Lib/test/test_codecmaps_jp.py
[ ] test: cpython/Lib/test/test_codecmaps_kr.py
[ ] test: cpython/Lib/test/test_codecmaps_tw.py
[ ] test: cpython/Lib/test/test_charmapcodec.py
[ ] test: cpython/Lib/test/test_multibytecodec.py

dependencies:

  • codecs

dependent tests: (119 tests)

  • codecs: test_charmapcodec test_codeccallbacks test_codecs test_eof test_exceptions test_importlib test_io test_json test_locale test_logging test_os test_plistlib test_str test_sys
    • encodings:
      • locale: test__locale test_builtin test_c_locale_coercion test_calendar test_decimal test_format test_re test_regrtest test_time test_types test_utf8_mode
    • json: test_subprocess test_sysconfig test_tomllib test_tools test_traceback test_zoneinfo
      • importlib.metadata: test_importlib
    • pickle: test_annotationlib test_array test_asyncio test_bytes test_bz2 test_collections test_concurrent_futures test_coroutines test_csv test_ctypes test_defaultdict test_deque test_descr test_dict test_dictviews test_email test_enum test_enumerate test_fractions test_functools test_generators test_genericalias test_http_cookies test_inspect test_ipaddress test_iter test_itertools test_list test_lzma test_memoryio test_memoryview test_opcache test_operator test_ordered_dict test_pathlib test_pickle test_pickletools test_platform test_positional_only_arg test_posix test_random test_range test_set test_shelve test_slice test_socket test_statistics test_string test_trace test_tuple test_typing test_unittest test_uuid test_xml_dom_minicompat test_xml_etree test_zipfile test_zlib test_zoneinfo
      • logging.handlers: test_concurrent_futures
    • tokenize: test_linecache test_tabnanny test_tokenize test_unparse
      • inspect: test_abc test_argparse test_asyncgen test_code test_grammar test_ntpath test_patma test_posixpath test_signal test_yield_from test_zipimport
      • traceback: test_asyncio test_code_module test_contextlib test_contextlib_async test_importlib test_listcomps test_pyexpat test_setcomps test_ssl test_threadedtempfile test_threading test_unittest test_with

[x] lib: cpython/Lib/pickletools.py
[ ] test: cpython/Lib/test/test_pickletools.py (TODO: 8)

dependencies:

  • pickletools

dependent tests: (1 tests)

  • pickletools: test_pickletools

[x] lib: cpython/Lib/pickle.py
[ ] lib: cpython/Lib/_compat_pickle.py
[ ] test: cpython/Lib/test/test_pickle.py (TODO: 22)
[ ] test: cpython/Lib/test/test_picklebuffer.py (TODO: 12)
[ ] test: cpython/Lib/test/test_pickletools.py (TODO: 8)

dependencies:

  • pickle (native: itertools, sys)
    • _compat_pickle
    • _compat_pickle
    • io (native: _io, _thread, errno, sys)
    • types
    • codecs, copyreg, functools, struct

dependent tests: (70 tests)

  • pickle: test_annotationlib test_array test_asyncio test_builtin test_bytes test_bz2 test_codecs test_collections test_concurrent_futures test_coroutines test_csv test_ctypes test_decimal test_defaultdict test_deque test_descr test_dict test_dictviews test_email test_enum test_enumerate test_exceptions test_fractions test_functools test_generators test_genericalias test_http_cookies test_importlib test_inspect test_io test_ipaddress test_iter test_itertools test_list test_logging test_lzma test_memoryio test_memoryview test_opcache test_operator test_ordered_dict test_os test_pathlib test_pickle test_pickletools test_platform test_plistlib test_positional_only_arg test_posix test_random test_range test_set test_shelve test_slice test_socket test_statistics test_str test_string test_trace test_tuple test_types test_typing test_unittest test_uuid test_xml_dom_minicompat test_xml_etree test_zipfile test_zlib test_zoneinfo
    • logging.handlers: test_concurrent_futures

Legend:

  • [+] path exists in CPython
  • [x] up-to-date, [ ] outdated

@youknowone youknowone changed the title Impl more codecs Impl more codecs and _codecs._unregister_error Feb 6, 2026
@youknowone youknowone merged commit 8127000 into RustPython:main Feb 6, 2026
13 of 14 checks passed
@youknowone youknowone deleted the codecs branch February 6, 2026 11:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant