Assertion failure editing invalid UTF-32 #11877

Open
mqudsi opened this issue Feb 15, 2020 · 14 comments

Comments

@mqudsi (Contributor) commented Feb 15, 2020

  • nvim --version: NVIM v0.5.0-23-gafd576ee9
  • vim -u DEFAULTS (version: ) behaves differently? yes
  • Operating system/version: Ubuntu 18.04.1 via WSL
  • Terminal name/version: conhost
  • $TERM: xterm-256color

Steps to reproduce using nvim -u NORC

Download:
neovim_enc_bug.txt

nvim -u NORC neovim_enc_bug.txt
:e ++enc=utf-32

Actual behaviour

nvim: ../src/nvim/fileio.c:1427: int readfile(char_u *, char_u *, linenr_T, linenr_T, linenr_T, exarg_T *, int): Assertion `u8c <= INT_MAX' failed.

Expected behaviour

No crash. (Vim 8 does not crash.)

mqudsi added the bug label Feb 15, 2020
justinmk added this to the 0.4.4 milestone Feb 16, 2020
@justinmk (Member) commented Feb 16, 2020

We added this check in 7e4fd04. For casting unsigned => signed, C99 says:

result is implementation-defined or an implementation-defined signal is raised

Should we:

  • tell PVS to ignore the cast and continue with the bad conversion?
  • or show an error and refuse the conversion attempt?
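
For concreteness, here is a minimal standalone C sketch of the conversion in question (the value and variable names are illustrative, not taken from the Neovim source); it compiles and runs, but what it prints for the out-of-range cast is implementation-defined:

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* A UTF-32 unit read from a file can use all 32 bits... */
    unsigned int u8c = 0xFFFFFFFDu;

    /* ...but converting a value above INT_MAX to a signed int is,
     * per C99 6.3.1.3p3, implementation-defined (or may raise an
     * implementation-defined signal). On common two's-complement
     * targets it wraps to a negative number, but nothing guarantees that. */
    int c = (int)u8c;

    printf("unsigned %u -> signed %d (INT_MAX = %d)\n", u8c, c, INT_MAX);
    return 0;
}
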
@mqudsi (Contributor, Author) commented Feb 16, 2020

I imagine the correct approach here would actually be to use uint8_t all the way down, e.g. in mbyte.h/mbyte.c (and probably just ignore the bad cast in the interim)?

The worst that could happen is that on platforms that actually do raise a signal on cast overflow/underflow (do we support any in the first place?), SIGABRT or similar will be raised, which is what the current code already does on all platforms when the assert fails, so it'll be strictly an improvement.

@bfredl (Member) commented Feb 16, 2020

I imagine the correct approach here would actually be to use uint8_t all the way down, e.g. in mbyte.h/mbyte.c (and probably just ignore the bad cast in the interim)?

What does this even mean? uint8_t cannot store codepoints, and for strings it is already used more or less exclusively in the mbyte code (via the synonym char_u*).

The worst that could happen is that on platforms that actually do raise a signal on cast overflow/underflow (do we support any in the first place?), SIGABRT or similar will be raised, which is what the current code already does on all platforms when the assert fails, so it'll be strictly an improvement.

Before getting stuck in the same obsession with casting, assertions, checked conversions, etc., shouldn't we consider the actual semantics of the code and the end goals it is trying to accomplish?

What the code does is convert a codepoint to UTF-8. The limit of a codepoint is defined to be 0x10FFFF (and not whatever the limit of some int type happens to be), but because $REASONS vim still handles values up to 0x7FFFFFFF. Beyond that, vim seems to silently wrap values around (to random, unrelated codepoints).

The proper handling of invalid codepoints (if we don't want to error out the entire file read, which I think we don't) is to change them to the replacement char U+FFFD �. Ideally we would replace all values over 0x10FFFF ( #6325 ), but to fix this conversion issue we could already replace values above INT_MAX with 0xFFFD. Perhaps there is also some flag that should be set to indicate that the conversion was destructive (which it already is).
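
A minimal sketch of that suggestion, assuming a little-endian UTF-32 unit (the function name and the lossy flag are hypothetical, not the actual fileio.c code):

#include <stdbool.h>
#include <stdint.h>

/* Decode one little-endian UTF-32 unit. A value that cannot survive
 * narrowing to a non-negative int is replaced with U+FFFD instead of
 * being pushed through an implementation-defined cast; *lossy lets
 * the caller flag that the conversion was destructive. */
static int utf32le_to_codepoint(const uint8_t p[4], bool *lossy)
{
    uint32_t u8c = (uint32_t)p[0]
                   | ((uint32_t)p[1] << 8)
                   | ((uint32_t)p[2] << 16)
                   | ((uint32_t)p[3] << 24);
    if (u8c > 0x7FFFFFFFu) {  /* above INT_MAX: clamp, don't cast */
        *lossy = true;
        return 0xFFFD;        /* U+FFFD REPLACEMENT CHARACTER */
    }
    return (int)u8c;
}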

@mqudsi (Contributor, Author) commented Feb 16, 2020

Yes, but I'm not talking about the actual conversion here, but rather the ingestion of unknown data that may or may not be valid.

I'm not talking about the actual Unicode-aware logic but merely the mechanics of converting individual bytes into a Unicode codepoint. Given that you are going from n bytes to a may-or-may-not-be-valid codepoint and you want to evaluate the validity (or lack thereof) of that maybe-codepoint, you can't assume that the up-to-four-byte sequence you read will lie in the 0x000000-0x10FFFF range. If you want to decide whether a particular sequence of four bytes is valid Unicode after having assembled it (as opposed to determining whether ingesting a particular single byte will push the codepoint being formed out of range, which just complicates things for no good reason), then you cannot assume that this sequence of bytes will fit in the signed 4-byte integer upon which you predicate your validity test.
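
A hypothetical sketch of the failure mode being described here (the names are illustrative): if the four bytes are narrowed to a signed int before the range test, an invalid unit can wrap to a negative value and slip past the check.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t raw = 0xFFFFFFFDu;  /* four bytes read from the file */

    /* Narrowing first is the bug: on typical two's-complement targets
     * this wraps to a negative value, so the range test below never
     * fires even though raw was never a valid codepoint. */
    int c = (int)raw;
    if (c > 0x10FFFF) {
        puts("rejected as invalid");
    } else {
        puts("accepted, despite never being a valid codepoint");
    }
    return 0;
}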

That said, I agree with everything else you said about lossy conversion and about not aborting the process merely because an invalid codepoint was found, but I think we disagree about what the ideal behavior on encountering an invalid codepoint should be. IMHO, in an ideal world and in a greenfield implementation, if I load a binary file as UTF-8, I would like to still be able to save it in neovim without corrupting its original contents: i.e. there should be a difference between the lossy representation of a sequence of bytes in the editor and the actual underlying value. I believe the editor should show U+FFFD but preserve the actual values unless the user selects the � character and replaces it with something else.

In fact, I think I would prefer emitting invalid UTF-8 and having the terminal deal with it over actively destructive manipulation of the input bytestream (even if that causes neovim and the terminal to disagree on the length, in on-screen cells, of invalid sequences, which could cause many other problems).

@bfredl (Member) commented Feb 16, 2020

To treat it as a signed 32-bit integer is not correct, and uint8_t should be used.

Again, a codepoint doesn't fit into uint8_t. Do you mean uint32_t should be used? I don't think this helps very much, because even vim's improper UTF-8 is still 31 bits, not 32.

you can't assume that the up-to-four-byte sequence you read will lie in the 0x000000-0x10FFFF range

This is about reading UTF-32, not about reading UTF-8. We are reading exactly four bytes at a time here, with no assumptions.

IMHO, in an ideal world and in a greenfield implementation, if I load a binary file as UTF-8, I would like to still be able to save it in neovim without corrupting its original contents

100% agree, and #6325 preserves this, but this is a separate question from the UTF-32 to UTF-8 conversion, which is the code path discussed in this issue. Invalid UTF-32 cannot be fully preserved, as only 31 bits can be represented, not 32. (You can of course edit it as binary, but then you get no conversion for valid chars.) The code path in vim is already destructive; I just want to make it deterministically destructive (always the replacement char, instead of the whims of the gods of implementation-defined behavior) and somewhat closer to "sane" Unicode semantics, even if not fully there.

@mqudsi (Contributor, Author) commented Feb 16, 2020

Sorry, I had actually edited my comment, but you were too quick for me. I was interrupted non-stop while writing it and ended up with a mess.

@mqudsi (Contributor, Author) commented Feb 16, 2020

I don't think it has to be destructive at all. Ideally, edits would be converted into whatever codepage the file is specified as having (via 'fenc'); after all, replacing existing valid characters or adding new ones is no different from selecting the � marker and replacing it with valid characters. Unmodified content, however, would be preserved.

@bfredl (Member) commented Feb 16, 2020

No worries, but let's focus only on UTF-32 reading here; we can consider overlong UTF-8 in the linked PR and forget about it for now. The concrete thing I'm suggesting for this particular conversion issue is to replace values over INT_MAX with the replacement char. Any objection to that?

@mqudsi (Contributor, Author) commented Feb 16, 2020

The concrete thing I'm suggesting for this particular conversion issue is to replace values over INT_MAX with the replacement char. Any objection to that?

Would an integration test that opens a binary file and then only calls :e ++enc=utf-32 followed by :w see a change in the on-disk file's contents relative to the previous behavior? (That wouldn't necessarily be unacceptable, but it would be worth being explicit about.)

@bfredl (Member) commented Feb 16, 2020

It has to be destructive because vim/nvim cannot represent 2^32 codepoints, only 2^31. This is a fundamental limit that ain't gonna change anytime soon. I'm just saying we should use the deterministic � value suggested by the Unicode standard, rather than whatever the C implementation happens to give us (which is what vim relies upon).

Would an integration test that opens a binary file and then only calls :e ++enc=utf-32 followed by :w see a change in the on-disk file's contents relative to the previous behavior? (That wouldn't necessarily be unacceptable, but it would be worth being explicit about.)

One destructive behavior is changed to another destructive behavior. I suppose it is subjective whether this counts as a "regression" (spacebar heating).

@mqudsi (Contributor, Author) commented Feb 18, 2020

I thought you were saying "it is already destructive because vim uses signed types, and that results in implementation-specific behavior" and was prepared to argue about theoretical vs. actual destructive behavior, but no: it's a lot more broken than that, and it is/was completely destructive right now, regardless of platform or implementation-specific behavior.

Swapping one destructive behavior for another (much more reasonable) one is perfectly fine with me. As far as I'm now concerned, the result of such an operation can be treated as (neo)vim's equivalent of undefined behavior, and it's fair game to change it.

@zubairabid commented Mar 11, 2020

Hi, I'd like to try working on fixing this issue, if possible. I have replicated the bug. Could you confirm whether my understanding of the problem is correct?

What is going wrong

UTF-8 is designed to encode codepoints only up to 0x10FFFF, but vim for some reason handles values up to 0x7FFFFFFF.

Suggested solution

Replace values above INT_MAX with 0xFFFD.

@bfredl (Member) commented Mar 11, 2020

The suggested solution is the correct one here, but "What is going wrong" is a separate issue, potentially addressed by #6325.

@zubairabid commented Mar 13, 2020

Wrote a fix in PR #12010 . It currently follows the expected behaviour @mqudsi suggested.

I didn't address the overall "What is going wrong" bit here, just the replacement of large u8c values with the appropriate standard replacement character.

jamessan modified the milestones: 0.4.4, 0.4.5 Aug 5, 2020