Assertion failure editing invalid UTF-32 #11877
Comments
I imagine the correct approach here would actually be to use …. The worst that could happen is that, on platforms that actually do throw a signal on cast overflow/underflow (do we support them in the first place?), SIGABRT or similar will be thrown, which the current code does for all platforms when the assert fails, so it'll be strictly an improvement.
What does this even mean?
Before getting stuck in the same obsession about casting, assertions, checked conversions, etc., shouldn't we consider the actual semantics of the code and the end goals it is trying to accomplish? What the code does is convert a codepoint to UTF-8. The limit of a codepoint is defined to be 0x10FFFF (and not whatever the limit of some int type happens to be), but because $REASONS vim still handles values up to 0x7FFFFFFF. Beyond that, vim seems to silently wrap values around (to random, unrelated codepoints). The proper handling of invalid codepoints (if we don't want to error out the entire file read, which I think we don't) is to change them to the replacement char U+FFFD (�). Ideally we would replace all values over 0x10FFFF (#6325), but to fix this conversion issue we could already replace values above INT_MAX with 0xFFFD. Perhaps there is also some flag that should be set to indicate that the conversion was destructive (which it already is).
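To make the proposed behavior concrete, here is a minimal standalone sketch (not Neovim's actual code path) of what "replace anything that is not a representable codepoint with U+FFFD" could look like; the helper name sanitize_codepoint and the surrogate check are my own additions for illustration:

```c
#include <stdint.h>

#define UNICODE_MAX      0x10FFFFu  /* highest valid Unicode codepoint */
#define REPLACEMENT_CHAR 0xFFFDu    /* U+FFFD, the Unicode replacement character */

/* Map any value that is not a representable codepoint to U+FFFD before it
 * reaches the UTF-8 encoder.  Keeping the value unsigned means inputs above
 * INT_MAX are handled without implementation-defined signed conversions. */
static uint32_t sanitize_codepoint(uint32_t raw)
{
    if (raw > UNICODE_MAX) {
        return REPLACEMENT_CHAR;
    }
    if (raw >= 0xD800u && raw <= 0xDFFFu) {
        /* Surrogate halves are not valid scalar values either. */
        return REPLACEMENT_CHAR;
    }
    return raw;
}
```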
Yes, but I'm not talking about the actual conversion here but rather the ingestion of unknown data that may or may not be valid. I'm not talking about the actual Unicode-aware logic but merely the mechanics of converting individual bytes to a Unicode codepoint. Given that you are going from n bytes to a may-or-may-not-be-valid-Unicode codepoint and you want to evaluate the validity (or lack thereof) of the maybe-codepoint, you can't assume that the up-to-four-byte sequence you read will lie in the 0x000000-0x10FFFF range. If you want to make your decision about whether or not a particular sequence of four bytes is valid Unicode after having created it (vs determining whether ingesting a particular, single byte will result in the codepoint being formed being out-of-range, which is just complicating things for no good reason), then you cannot assume that this sequence of bytes will fit in the signed 4-byte integer upon which you predicate your validity test.

That said, I agree with everything else you said about lossy conversion and not aborting the process merely because an invalid codepoint was found, but I think we disagree about what the ideal behavior when an invalid codepoint is encountered should be. IMHO, in an ideal world and in a greenfield implementation, if I load a binary file as UTF-8, I would like to still be able to save it in neovim without corrupting its original contents: i.e. there should be a difference between the lossy representation of a sequence of bytes in the editor and the actual underlying value. I believe the editor should show U+FFFD but preserve the actual values unless the user selects the � character and replaces it with something else. In fact, I think I would prefer emitting invalid UTF-8 and having the terminal deal with it over actively destructive manipulation of the input bytestream (even if it causes neovim and the terminal to differ on the length, in on-screen cells, of invalid sequences, which could cause many other problems).
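To illustrate just the mechanics being argued for here, a hedged sketch of reading four raw bytes into an unsigned 32-bit accumulator and deciding validity only afterwards; the byte order and the function names are assumptions made for the example, not Neovim's implementation:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assemble four raw bytes (little-endian in this example) into an unsigned
 * 32-bit value.  Because the accumulator is uint32_t, a sequence such as
 * FF FF FF FF survives as 0xFFFFFFFF instead of being forced into a signed
 * int before we have had a chance to judge its validity. */
static uint32_t read_utf32le_unit(const uint8_t b[4])
{
    return (uint32_t)b[0]
         | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[3] << 24);
}

/* Validity is decided after the full value exists, on the unsigned value. */
static bool is_valid_scalar(uint32_t cp)
{
    return cp <= 0x10FFFFu && !(cp >= 0xD800u && cp <= 0xDFFFu);
}
```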
Again, a codepoint doesn't fit into uint8_t. Do you mean uint32_t should be used? I don't think this helps very much, because even vim's improper UTF-8 is still 31 bits and not 32 bits.
This is about reading UTF-32. Not about reading UTF-8. We are reading exactly four bytes at a time here with no assumptions.
100% agree, and #6325 preserves this, but this is a separate question from the UTF-32 to UTF-8 conversion, which was the code path discussed in this issue. Invalid UTF-32 cannot be fully preserved, as only 31 bits can be represented and not 32 (you can of course edit it as binary, but then you get no conversion for valid chars). The code path in vim is already destructive; I just want to make it deterministically destructive (always the replacement char, instead of the whims of the gods of Implementation Defined Behavior) and somewhat closer to "sane" Unicode semantics, even if not fully there.
Sorry, I had actually edited my comment but you were too quick for me. I was interrupted non-stop in writing it and ended up with a mess.
I don't think it has to be destructive at all. Ideally, edits can be converted into whatever codepage the file is specified (via fenc) as having (after all, replacing existing valid characters or adding new ones is no different from selecting the � marker and replacing that with valid characters), but unmodified content would be preserved.
No worries, but let's focus on UTF-32 reading only for now; we can consider overlong UTF-8 in the linked PR and forget about it for now. The concrete thing I'm suggesting for this particular conversion issue is to replace values over INT_MAX with the replacement char. Any objection to that?
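As a sketch of what that narrower interim fix might look like at the boundary where the decoded value is handed to code that stores codepoints in a signed int (the function name is hypothetical and this is not the actual patch):

```c
#include <limits.h>
#include <stdint.h>

/* Interim fix as suggested above: anything that cannot be represented as a
 * non-negative int becomes U+FFFD, deterministically, instead of tripping an
 * assert or relying on implementation-defined narrowing. */
static int clamp_to_int_codepoint(uint32_t raw)
{
    if (raw > (uint32_t)INT_MAX) {
        return 0xFFFD;
    }
    return (int)raw;  /* safe: value fits in 0..INT_MAX */
}
```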
Would an integration test that opens a binary file, then only calls …
It has to be destructive because vim/nvim cannot represent 2^32 codepoints, only 2^31. This is a fundamental limit that ain't gonna change anytime soon. I'm just saying we should use a deterministic � value as suggested by the Unicode standard, rather than whatever the C implementation happens to be giving us (as vim relies upon).
One destructive behavior is changed to another destructive behavior. I suppose it is subjective whether this counts as a "regression" (spacebar heating).
I thought you were saying "it is already destructive because vim uses signed types and that results in implementation-specific behavior" and was prepared to argue about theoretical vs actual destructive behavior, but no, it's a lot more broken than that: it is/was completely destructive right now, regardless of platform or implementation-specific behavior. Swapping one destructive behavior for another (much more reasonable) one is perfectly fine with me. As far as I'm now concerned, the result of such an operation can be treated as (neo)vim's equivalent of undefined behavior, and it's fair game to change it.
Hi, I'd like to try working on fixing this issue, if possible. I replicated the bug. Could you confirm if my understanding of the problem is correct?

What is going wrong: UTF-8 is designed to handle codepoints up to 0x10FFFF, but vim for some reason handles values up to 0x7FFFFFFF.

Suggested solution: replace values above INT_MAX with 0xFFFD.
The suggested solution is the correct one here, but "What is going wrong" is a separate issue, potentially addressed by #6325.
nvim --version: NVIM v0.5.0-23-gafd576ee9
vim -u DEFAULTS (version: ) behaves differently? yes
$TERM: xterm-256color

Steps to reproduce using nvim -u NORC
Download: neovim_enc_bug.txt
Actual behaviour
Expected behaviour
No crash. (Vim 8 does not crash.)