Better codecs and fix lots of test_io and other expectedFailures#6997
Better codecs and fix lots of test_io and other expectedFailures#6997youknowone merged 4 commits intoRustPython:mainfrom
Conversation
📝 WalkthroughWalkthroughAdds partial-output handling to SurrogateEscape encoding; tightens codec-name validation and exposes Changes
Sequence Diagram(s)sequenceDiagram
participant GC as GC (finalizer)
participant Wrapper as TextIOWrapper
participant Raw as RawBuffer/FileIO
participant VM as VirtualMachine
GC->>Wrapper: invoke destructor / finalize
Wrapper->>Wrapper: check/set finalizing (AtomicBool)
alt first finalize
Wrapper->>Raw: call close()/flush()
Raw-->>Wrapper: success / failure
alt failure or not closed
Wrapper->>VM: emit ResourceWarning via iobase_finalize/_dealloc_warn
end
else reentrant finalize
Wrapper-->>GC: skip redundant finalize
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
c160d81 to
a10351b
Compare
|
Code has been automatically formatted The code in this PR has been formatted using:
git pull origin test_io |
crates/common/src/encodings.rs
Outdated
| } | ||
| } | ||
|
|
||
| pub mod utf7 { |
0194bb5 to
856c6f1
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/vm/src/stdlib/io.rs (1)
2687-2727:⚠️ Potential issue | 🟠 MajorRoute
TextIOWrapperencoding parameter throughtext_encoding()to emit EncodingWarning.Per PEP 597,
TextIOWrapper(..., encoding=None)should emitEncodingWarningwhenwarn_default_encodingis enabled, but currently it bypasses the warning mechanism by callingresolve_encoding()directly. This is inconsistent withio.open(), which correctly usestext_encoding()to emit the warning. TheNonecase inTextIOWrapper::init()should be routed throughtext_encoding()to comply with PEP 597.
🤖 Fix all issues with AI agents
In `@crates/vm/src/stdlib/io.rs`:
- Around line 2770-2862: The bool_from_index function currently calls
value.try_index() and converts to i32 which breaks truthiness and can overflow;
replace its body to directly use value.try_to_bool(vm) and return that result
(i.e., remove try_index()/try_to_primitive() logic and simply call
value.try_to_bool(vm)?), keeping the function signature (fn
bool_from_index(value: PyObjectRef, vm: &VirtualMachine) -> PyResult<bool>) so
callers (and the pattern used elsewhere with try_to_bool) continue to work.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@crates/vm/src/stdlib/codecs.rs`:
- Around line 996-1013: The replacement-handling branch currently rejects
non-ASCII code points; instead, encode the replacement PyStr using the same code
page as the fast path by calling try_encode_code_page_strict on rep_str with
encoding_str and handle errors the same way as other encoding failures: on
success extend output with the returned bytes, on failure return
vm.new_unicode_encode_error_real (preserving s, pos, encoding_str and an
appropriate error message). Replace the manual per-codepoint ASCII check with
this call in the branch that inspects replacement.downcast_ref::<PyStr>() so the
error handler can emit bytes valid for encodings like cp1252.
| if let Some(bytes) = replacement.downcast_ref::<PyBytes>() { | ||
| output.extend_from_slice(bytes); | ||
| } else if let Some(rep_str) = replacement.downcast_ref::<PyStr>() { | ||
| // Replacement string - try to encode each character | ||
| for rcp in rep_str.as_wtf8().code_points() { | ||
| let rch = rcp.to_u32(); | ||
| if rch > 127 { | ||
| return Err(vm.new_unicode_encode_error_real( | ||
| encoding_str.clone(), | ||
| s.clone(), | ||
| pos, | ||
| pos + 1, | ||
| vm.ctx | ||
| .new_str("unable to encode error handler result to ASCII"), | ||
| )); | ||
| } | ||
| output.push(rch as u8); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's check the file size and read the relevant section
wc -l crates/vm/src/stdlib/codecs.rsRepository: RustPython/RustPython
Length of output: 100
🏁 Script executed:
# Read the code around lines 996-1013 with broader context
sed -n '950,1050p' crates/vm/src/stdlib/codecs.rsRepository: RustPython/RustPython
Length of output: 3832
🏁 Script executed:
# Search for try_encode_code_page_strict function
rg "try_encode_code_page_strict" crates/vm/src/stdlib/codecs.rs -A 10Repository: RustPython/RustPython
Length of output: 834
🏁 Script executed:
# Also search for the function containing lines 996-1013
rg "replacement.downcast_ref" crates/vm/src/stdlib/codecs.rs -B 30 -A 5Repository: RustPython/RustPython
Length of output: 1599
🏁 Script executed:
# Find the encode_code_page_errors function signature and its parameters
rg "fn encode_code_page_errors" crates/vm/src/stdlib/codecs.rs -A 50Repository: RustPython/RustPython
Length of output: 2076
🏁 Script executed:
# Look for reason_str variable in the function
rg "reason_str" crates/vm/src/stdlib/codecs.rs -B 5 -A 5Repository: RustPython/RustPython
Length of output: 3130
🏁 Script executed:
# Check if cp1252 or code page tests exist to understand what's supported
fd "cp1252\|code.page" crates/vm/src/stdlib/ -type fRepository: RustPython/RustPython
Length of output: 235
🏁 Script executed:
# Search for any tests related to code page encoding with replacements
rg "encode.*replace\|error.*handler.*replace" crates/vm/src/stdlib/codecs.rs -iRepository: RustPython/RustPython
Length of output: 47
🏁 Script executed:
# Get more context around the replacement handling to see the full function
sed -n '920,1040p' crates/vm/src/stdlib/codecs.rsRepository: RustPython/RustPython
Length of output: 4373
🏁 Script executed:
# Verify PyStr has to_str() method
rg "impl.*PyStr\|fn to_str\(" crates/vm/src/stdlib/codecs.rs | head -20Repository: RustPython/RustPython
Length of output: 47
🏁 Script executed:
# Look for PyStr definition or methods
rg "pub.*to_str|impl.*PyStr" crates/ -t rs | grep -i pystr | head -10Repository: RustPython/RustPython
Length of output: 91
🏁 Script executed:
# Check if to_str is used elsewhere in this file
rg "\.to_str\(\)" crates/vm/src/stdlib/codecs.rsRepository: RustPython/RustPython
Length of output: 194
🏁 Script executed:
# Check the type of args.s to confirm it has to_str()
rg "struct CodePageEncodeArgs" crates/vm/src/stdlib/codecs.rs -A 5Repository: RustPython/RustPython
Length of output: 238
🏁 Script executed:
# Verify the code_points array/variable used in pos indexing
sed -n '900,920p' crates/vm/src/stdlib/codecs.rsRepository: RustPython/RustPython
Length of output: 723
🏁 Script executed:
# Check for any imports of ToWideString in this file
rg "use.*ToWideString\|use.*windows" crates/vm/src/stdlib/codecs.rs | head -10Repository: RustPython/RustPython
Length of output: 47
Encode replacement strings using the same code page instead of ASCII-only checks.
The current code rejects replacement strings with code points > 0x7F, which prevents valid error handler replacements for code pages like cp1252 that support characters 0x80-0xFF. Use try_encode_code_page_strict (already used in the fast path) to encode replacement strings with the target code page.
🛠️ Suggested fix (encode replacement via code page)
- } else if let Some(rep_str) = replacement.downcast_ref::<PyStr>() {
- // Replacement string - try to encode each character
- for rcp in rep_str.as_wtf8().code_points() {
- let rch = rcp.to_u32();
- if rch > 127 {
- return Err(vm.new_unicode_encode_error_real(
- encoding_str.clone(),
- s.clone(),
- pos,
- pos + 1,
- vm.ctx
- .new_str("unable to encode error handler result to ASCII"),
- ));
- }
- output.push(rch as u8);
- }
- } else {
+ } else if let Some(rep_str) = replacement.downcast_ref::<PyStr>() {
+ let rep = rep_str.to_str().ok_or_else(|| {
+ vm.new_unicode_encode_error_real(
+ encoding_str.clone(),
+ s.clone(),
+ pos,
+ pos + 1,
+ reason_str.clone(),
+ )
+ })?;
+ use crate::common::windows::ToWideString;
+ let wide: Vec<u16> = std::ffi::OsStr::new(rep).to_wide();
+ match try_encode_code_page_strict(code_page, &wide, vm)? {
+ Some(bytes) => output.extend_from_slice(&bytes),
+ None => {
+ return Err(vm.new_unicode_encode_error_real(
+ encoding_str.clone(),
+ s.clone(),
+ pos,
+ pos + 1,
+ reason_str.clone(),
+ ))
+ }
+ }
+ } else {🤖 Prompt for AI Agents
In `@crates/vm/src/stdlib/codecs.rs` around lines 996 - 1013, The
replacement-handling branch currently rejects non-ASCII code points; instead,
encode the replacement PyStr using the same code page as the fast path by
calling try_encode_code_page_strict on rep_str with encoding_str and handle
errors the same way as other encoding failures: on success extend output with
the returned bytes, on failure return vm.new_unicode_encode_error_real
(preserving s, pos, encoding_str and an appropriate error message). Replace the
manual per-codepoint ASCII check with this call in the branch that inspects
replacement.downcast_ref::<PyStr>() so the error handler can emit bytes valid
for encodings like cp1252.
📦 Library DependenciesThe following Lib/ modules were modified. Here are their dependencies: [x] lib: cpython/Lib/codecs.py dependencies:
dependent tests: (106 tests)
[ ] lib: cpython/Lib/encodings dependencies:
dependent tests: (57 tests)
[x] lib: cpython/Lib/bz2.py dependencies:
dependent tests: (1 tests)
[x] lib: cpython/Lib/fileinput.py dependencies:
dependent tests: (2 tests)
[x] lib: cpython/Lib/gzip.py dependencies:
dependent tests: (no tests depend on gzip) [ ] lib: cpython/Lib/io.py dependencies:
dependent tests: (87 tests)
[ ] lib: cpython/Lib/logging dependencies:
dependent tests: (9 tests)
[x] lib: cpython/Lib/lzma.py dependencies:
dependent tests: (1 tests)
[ ] lib: cpython/Lib/plistlib.py dependencies:
dependent tests: (1 tests)
[ ] lib: cpython/Lib/test/libregrtest dependencies:
dependent tests: (no tests depend on regrtest) [ ] test: cpython/Lib/test/test_str.py (TODO: 16) dependencies: dependent tests: (no tests depend on str) [ ] lib: cpython/Lib/tarfile.py dependencies:
dependent tests: (2 tests)
[ ] test: cpython/Lib/test/test_utf8_mode.py (TODO: 6) dependencies: dependent tests: (no tests depend on utf8_mode) Legend:
|
Summary by CodeRabbit
New Features
-X utf8command-line option to control UTF‑8 modeImprovements