Impl more codecs and _codecs._unregister_error#7025
Impl more codecs and _codecs._unregister_error#7025youknowone merged 1 commit intoRustPython:mainfrom
Conversation
📝 WalkthroughWalkthroughAdds an API to unregister codec error handlers, prevents removing built-ins, and returns a boolean on removal; replaces encoding normalization logic; routes encode/decode/text error paths through an inspector that annotates exceptions with codec-specific notes; exposes a Python binding for unregistering errors. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Code has been automatically formatted The code in this PR has been formatted using:
git pull origin codecs |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@crates/vm/src/codecs.rs`:
- Around line 471-472: The cp65001 branch is using a case-sensitive equality
check while the utf branch lowercases the encoding; modify the comparison for
"cp65001" in the function handling encoding (where encoding is matched and
returns Some(Self::Utf8)) to perform a case-insensitive check—e.g., normalize
`encoding` with `to_lowercase()` or compare using a case-insensitive method
before testing for "cp65001" so variants like "CP65001" also map to Self::Utf8.
In `@crates/vm/src/stdlib/codecs.rs`:
- Around line 101-106: _add validation to _unregister_error matching
lookup_error: before calling vm.state.codec_registry.unregister_error(), check
the provided errors name for embedded NUL and surrogate problems like
lookup_error does — if the string contains '\0' return a ValueError("embedded
null character"), and perform the same UTF-8/surrogate check (the same
PyUnicode_AsUTF8AndSize-style validation used in lookup_error) so
surrogate-containing names raise a UnicodeEncodeError; essentially copy the
validation branch from lookup_error (the null-byte check and the call that
forces the surrogate/encoding error) into _unregister_error, then delegate to
unregister_error only after those checks pass.
🧹 Nitpick comments (1)
crates/vm/src/codecs.rs (1)
399-401: Add explicit parentheses around the&&sub-expression for clarity.The expression relies on
&&binding tighter than||, which is correct, but easily misread — andclippy::precedencemay flag the!combined with binary operators. Explicit parentheses eliminate any ambiguity.♻️ Suggested fix
let needs_transform = encoding .bytes() - .any(|b| b.is_ascii_uppercase() || !b.is_ascii_alphanumeric() && b != b'.'); + .any(|b| b.is_ascii_uppercase() || (!b.is_ascii_alphanumeric() && b != b'.'));As per coding guidelines, "Always run clippy to lint code (
cargo clippy) before completing tasks and fix any warnings or lints introduced by your changes".
| } else if encoding == "cp65001" { | ||
| Some(Self::Utf8) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find all call sites of StandardEncoding::parse
rg -n --type=rust 'StandardEncoding::parse' -C3Repository: RustPython/RustPython
Length of output: 1150
🏁 Script executed:
# Also check the context around lines 471-472 in the codecs.rs file
rg -n --type=rust 'cp65001|utf' crates/vm/src/codecs.rs -C5Repository: RustPython/RustPython
Length of output: 1116
🏁 Script executed:
# Find the StandardEncoding::parse function definition to understand the full implementation
ast-grep --pattern 'fn parse($_, $_) {
$$$
}'Repository: RustPython/RustPython
Length of output: 47
🏁 Script executed:
#!/bin/bash
# Find the context struct definition to understand ctx.encoding
rg -n --type=rust 'struct.*Ctx|struct.*Context' crates/vm/src/codecs.rs -A10 | head -50Repository: RustPython/RustPython
Length of output: 770
🏁 Script executed:
# Find the functions that handle encode/decode errors mentioned in the review
rg -n --type=rust 'call_native_encode_error|call_native_decode_error' crates/vm/src/codecs.rs -B5 -A10Repository: RustPython/RustPython
Length of output: 3859
🏁 Script executed:
# Search for where encoding attribute comes from in exception handling
rg -n --type=rust 'encoding.*attribute|exception.*encoding' crates/vm/src/codecs.rs -C5Repository: RustPython/RustPython
Length of output: 47
🏁 Script executed:
#!/bin/bash
# Get the full function for call_native_encode_error
sed -n '984,1022p' crates/vm/src/codecs.rsRepository: RustPython/RustPython
Length of output: 1346
🏁 Script executed:
# Get the full function for call_native_decode_error
sed -n '1024,1080p' crates/vm/src/codecs.rsRepository: RustPython/RustPython
Length of output: 1956
🏁 Script executed:
# Look for where StandardEncoding::parse is called with the extracted encoding
rg -n --type=rust 'StandardEncoding::parse' crates/vm/src/codecs.rs -B10 -A2Repository: RustPython/RustPython
Length of output: 1103
Use case-insensitive comparison for cp65001 to match the UTF branch behavior.
The utf prefix branch normalizes the encoding via encoding.to_lowercase() before checking, but the cp65001 branch uses a case-sensitive equality check. This inconsistency means "CP65001" or "Cp65001" would not match when the encoding comes from an exception's encoding attribute, which is extracted directly without prior normalization.
🐛 Suggested fix
- } else if encoding == "cp65001" {
+ } else if encoding.eq_ignore_ascii_case("cp65001") {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| } else if encoding == "cp65001" { | |
| Some(Self::Utf8) | |
| } else if encoding.eq_ignore_ascii_case("cp65001") { | |
| Some(Self::Utf8) |
🤖 Prompt for AI Agents
In `@crates/vm/src/codecs.rs` around lines 471 - 472, The cp65001 branch is using
a case-sensitive equality check while the utf branch lowercases the encoding;
modify the comparison for "cp65001" in the function handling encoding (where
encoding is matched and returns Some(Self::Utf8)) to perform a case-insensitive
check—e.g., normalize `encoding` with `to_lowercase()` or compare using a
case-insensitive method before testing for "cp65001" so variants like "CP65001"
also map to Self::Utf8.
- Rewrite _pycodecs.py: escape_decode, charmap_decode, unicode_escape_decode/encode, raw_unicode_escape_decode/encode - Add _codecs._unregister_error with built-in handler protection and null byte/surrogate validation - Normalize encoding name in register_manual - Add codec error notes via add_note - Fix cp65001 encoding constant in StandardEncoding::parse - Remove 35 expectedFailure markers from test_codecs, test_pickle, test_datetime, pickletools
📦 Library DependenciesThe following Lib/ modules were modified. Here are their dependencies: [x] lib: cpython/Lib/codecs.py dependencies:
dependent tests: (119 tests)
[x] lib: cpython/Lib/pickletools.py dependencies:
dependent tests: (1 tests)
[x] lib: cpython/Lib/pickle.py dependencies:
dependent tests: (70 tests)
Legend:
|
Summary by CodeRabbit
New Features
Bug Fixes