Assertion failure editing invalid UTF-32 #11877

Open
mqudsi opened this issue Feb 15, 2020 · 14 comments

Comments

@mqudsi (Contributor) commented Feb 15, 2020

  • nvim --version: NVIM v0.5.0-23-gafd576ee9
  • vim -u DEFAULTS (version: ) behaves differently? yes
  • Operating system/version: Ubuntu 18.04.1 via WSL
  • Terminal name/version: conhost
  • $TERM: xterm-256color

Steps to reproduce using nvim -u NORC

Download:
neovim_enc_bug.txt

nvim -u NORC neovim_enc_bug.txt
:e ++enc=utf-32

Actual behaviour

nvim: ../src/nvim/fileio.c:1427: int readfile(char_u *, char_u *, linenr_T, linenr_T, linenr_T, exarg_T *, int): Assertion `u8c <= INT_MAX' failed.

Expected behaviour

No crash. (Vim 8 does not crash.)

mqudsi added the bug label Feb 15, 2020
justinmk added this to the 0.4.4 milestone Feb 16, 2020
@justinmk (Member) commented Feb 16, 2020

We added this check in 7e4fd04. For casting unsigned => signed, C99 says:

result is implementation-defined or an implementation-defined signal is raised

Should we:

  • tell PVS to ignore the cast and continue with the bad conversion?
  • or show an error and refuse the conversion attempt?
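
For concreteness, here is a minimal standalone C sketch of the conversion in question (the value and variable names are illustrative, not taken from the Neovim source); it compiles and runs, but what it prints for the out-of-range cast is implementation-defined:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* A UTF-32 unit read from a file can use all 32 bits... */
    unsigned int u8c = 0xFFFFFFFDu;

    /* ...but converting a value above INT_MAX to a signed int is,
     * per C99 6.3.1.3p3, implementation-defined (or may raise an
     * implementation-defined signal). On common two's-complement
     * targets it wraps to a negative number, but nothing guarantees that. */
    int c = (int)u8c;

    printf("unsigned %u -> signed %d (INT_MAX = %d)\n", u8c, c, INT_MAX);
    return 0;
}
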
@mqudsi (Contributor, Author) commented Feb 16, 2020

I imagine the correct approach here would actually be to use uint8_t all the way down, e.g. in mbyte.h/mbyte.c (and probably just ignore the bad cast in the interim)?

The worst that could happen is that on platforms that actually do raise a signal on cast overflow/underflow (do we support any in the first place?), SIGABRT or similar will be raised, which is what the current code already does on all platforms when the assert fails, so it'll be strictly an improvement.

@bfredl (Member) commented Feb 16, 2020

I imagine the correct approach here would actually be to use uint8_t all the way down, e.g. in mbyte.h/mbyte.c (and probably just ignore the bad cast in the interim)?

What does this even mean? uint8_t cannot store codepoints, and for strings it is already used more or less exclusively in the mbyte code (via the synonym char_u*).

The worst that could happen is that on platforms that actually do raise a signal on cast overflow/underflow (do we support any in the first place?), SIGABRT or similar will be raised, which is what the current code already does on all platforms when the assert fails, so it'll be strictly an improvement.

Before getting stuck in the same obsession with casting, assertions, checked conversions, etc., shouldn't we consider the actual semantics of the code and the end goals it is trying to accomplish?

What the code does is convert a codepoint to UTF-8. The limit of a codepoint is defined to be 0x10FFFF (and not whatever the limit of some int type happens to be), but because $REASONS vim still handles values up to 0x7FFFFFFF. Beyond that, vim seems to silently wrap values around (to random, unrelated codepoints).

The proper handling of invalid codepoints (if we don't want to error out the entire file read, which I think we don't) is to change them to the replacement char U+FFFD �. Ideally we would replace all values over 0x10FFFF ( #6325 ), but to fix this conversion issue we could already replace values above INT_MAX with 0xFFFD. Perhaps there is also some flag that should be set to indicate that the conversion was destructive (which it already is).
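
A minimal sketch of that suggestion, assuming a little-endian UTF-32 unit (the function name and the lossy flag are hypothetical, not the actual fileio.c code):

#include <stdbool.h>
#include <stdint.h>

/* Decode one little-endian UTF-32 unit. A value that cannot survive
 * narrowing to a non-negative int is replaced with U+FFFD instead of
 * being pushed through an implementation-defined cast; *lossy lets
 * the caller flag that the conversion was destructive. */
static int utf32le_to_codepoint(const uint8_t p[4], bool *lossy)
{
    uint32_t u8c = (uint32_t)p[0]
                   | ((uint32_t)p[1] << 8)
                   | ((uint32_t)p[2] << 16)
                   | ((uint32_t)p[3] << 24);
    if (u8c > 0x7FFFFFFFu) {  /* above INT_MAX: clamp, don't cast */
        *lossy = true;
        return 0xFFFD;        /* U+FFFD REPLACEMENT CHARACTER */
    }
    return (int)u8c;
}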

@mqudsi (Contributor, Author) commented Feb 16, 2020

Yes, but I'm not talking about the actual conversion here, but rather the ingestion of unknown data that may or may not be valid.

I'm not talking about the actual Unicode-aware logic but merely the mechanics of converting individual bytes into a Unicode codepoint. Given that you are going from n bytes to a may-or-may-not-be-valid codepoint and you want to evaluate the validity (or lack thereof) of that maybe-codepoint, you can't assume that the up-to-four-byte sequence you read will lie in the 0x000000-0x10FFFF range. If you want to decide whether a particular sequence of four bytes is valid Unicode after having assembled it (as opposed to determining whether ingesting a particular single byte will push the codepoint being formed out of range, which just complicates things for no good reason), then you cannot assume that this sequence of bytes will fit in the signed 4-byte integer upon which you predicate your validity test.
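
A hypothetical sketch of the failure mode being described here (the names are illustrative): if the four bytes are narrowed to a signed int before the range test, an invalid unit can wrap to a negative value and slip past the check.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t raw = 0xFFFFFFFDu;  /* four bytes read from the file */

    /* Narrowing first is the bug: on typical two's-complement targets
     * this wraps to a negative value, so the range test below never
     * fires even though raw was never a valid codepoint. */
    int c = (int)raw;
    if (c > 0x10FFFF) {
        puts("rejected as invalid");
    } else {
        puts("accepted, despite never being a valid codepoint");
    }
    return 0;
}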

That said, I agree with everything else you said about lossy conversion and about not aborting the process merely because an invalid codepoint was found, but I think we disagree about what the ideal behavior on encountering an invalid codepoint should be. IMHO, in an ideal world and in a greenfield implementation, if I load a binary file as UTF-8, I would like to still be able to save it in neovim without corrupting its original contents: i.e. there should be a difference between the lossy representation of a sequence of bytes in the editor and the actual underlying value. I believe the editor should show U+FFFD but preserve the actual values unless the user selects the � character and replaces it with something else.

In fact, I think I would prefer emitting invalid UTF-8 and having the terminal deal with it over actively destructive manipulation of the input bytestream (even if that causes neovim and the terminal to disagree on the length, in on-screen cells, of invalid sequences, which could cause many other problems).

@bfredl (Member) commented Feb 16, 2020

To treat it as a signed 32-bit integer is not correct, and uint8_t should be used.

Again, a codepoint doesn't fit into uint8_t. Do you mean uint32_t should be used? I don't think this helps very much, because even vim's improper UTF-8 is still 31 bits, not 32.

you can't assume that the up-to-four-byte sequence you read will lie in the 0x000000-0x10FFFF range

This is about reading UTF-32, not about reading UTF-8. We are reading exactly four bytes at a time here, with no assumptions.

IMHO, in an ideal world and in a greenfield implementation, if I load a binary file as UTF-8, I would like to still be able to save it in neovim without corrupting its original contents

100% agree, and #6325 preserves this, but this is a separate question from the UTF-32 to UTF-8 conversion, which is the code path discussed in this issue. Invalid UTF-32 cannot be fully preserved, as only 31 bits can be represented, not 32. (You can of course edit it as binary, but then you get no conversion for valid chars.) The code path in vim is already destructive; I just want to make it deterministically destructive (always the replacement char, instead of the whims of the gods of implementation-defined behavior) and somewhat closer to "sane" Unicode semantics, even if not fully there.

@mqudsi (Contributor, Author) commented Feb 16, 2020

Sorry, I had actually edited my comment, but you were too quick for me. I was interrupted non-stop while writing it and ended up with a mess.

@mqudsi (Contributor, Author) commented Feb 16, 2020

I don't think it has to be destructive at all. Ideally, edits would be converted into whatever codepage the file is specified as having (via 'fenc'); after all, replacing existing valid characters or adding new ones is no different from selecting the � marker and replacing it with valid characters. Unmodified content, however, would be preserved.

@bfredl (Member) commented Feb 16, 2020

No worries, but let's focus only on UTF-32 reading here; we can consider overlong UTF-8 in the linked PR and forget about it for now. The concrete thing I'm suggesting for this particular conversion issue is to replace values over INT_MAX with the replacement char. Any objection to that?

@mqudsi (Contributor, Author) commented Feb 16, 2020

The concrete thing I'm suggesting for this particular conversion issue is to replace values over INT_MAX with the replacement char. Any objection to that?

Would an integration test that opens a binary file and then only calls :e ++enc=utf-32 followed by :w see a change in the on-disk file's contents relative to the previous behavior? (That wouldn't necessarily be unacceptable, but it would be worth being explicit about.)

@bfredl (Member) commented Feb 16, 2020

It has to be destructive because vim/nvim cannot represent 2^32 codepoints, only 2^31. This is a fundamental limit that ain't gonna change anytime soon. I'm just saying we should use the deterministic � value suggested by the Unicode standard, rather than whatever the C implementation happens to give us (which is what vim relies upon).

Would an integration test that opens a binary file and then only calls :e ++enc=utf-32 followed by :w see a change in the on-disk file's contents relative to the previous behavior? (That wouldn't necessarily be unacceptable, but it would be worth being explicit about.)

One destructive behavior is changed to another destructive behavior. I suppose it is subjective whether this counts as a "regression" (spacebar heating).

@mqudsi (Contributor, Author) commented Feb 18, 2020

I thought you were saying "it is already destructive because vim uses signed types, and that results in implementation-specific behavior" and was prepared to argue about theoretical vs. actual destructive behavior, but no: it's a lot more broken than that, and it is/was completely destructive right now, regardless of platform or implementation-specific behavior.

Swapping one destructive behavior for another (much more reasonable) one is perfectly fine with me. As far as I'm now concerned, the result of such an operation can be treated as (neo)vim's equivalent of undefined behavior, and it's fair game to change it.

@zubairabid commented Mar 11, 2020

Hi, I'd like to try working on fixing this issue, if possible. I have replicated the bug. Could you confirm whether my understanding of the problem is correct?

What is going wrong

UTF-8 is designed to encode codepoints only up to 0x10FFFF, but vim for some reason handles values up to 0x7FFFFFFF.

Suggested solution

Replace values above INT_MAX with 0xFFFD.

@bfredl (Member) commented Mar 11, 2020

The suggested solution is the correct one here, but "What is going wrong" is a separate issue, potentially addressed by #6325.

@zubairabid commented Mar 13, 2020

Wrote a fix in PR #12010 . It currently follows the expected behaviour @mqudsi suggested.

I didn't address the overall "What is going wrong" bit here, just the replacement of large u8c values with the appropriate standard replacement character.

jamessan modified the milestones: 0.4.4, 0.4.5 Aug 5, 2020