bpo-45120: Updated windows 'cp' encodings to match 'bestfit' specification. #28189


bpo-45120: Updated windows 'cp' encodings to match 'bestfit' specification.
There is a mismatch between the specification and the actual behavior of some Windows encodings.
The specifications of some older Windows code pages list certain bytes as "UNDEFINED", whereas in practice those bytes have a defined behavior, which is documented in a separate set of "bestfit" mapping files.
For example, CP1252 has a corresponding bestfit1252:
CP1252: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
bestfit1252: https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
They have the following differences:
In CP1252, bytes \x81 \x8d \x8f \x90 \x9d map to "UNDEFINED".
In bestfit1252, they map to \u0081 \u008d \u008f \u0090 \u009d respectively.
In the Windows API, the function 'MultiByteToWideChar' in "cp1252 mode" exhibits the bestfit1252 behavior.
This issue and PR propose a correction for this behavior, updating the Windows code pages in which some code points were defined as "UNDEFINED" to use the corresponding bestfit mapping.
For example, "b'\x81'.decode('cp1252')" currently raises a UnicodeDecodeError (the byte maps to "UNDEFINED"), whereas, I think, the intended behavior would be to return '\x81', as in bestfit1252.
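The difference can be checked from Python, and the proposed mapping can be approximated without changing the codec. The following is only a minimal sketch, not code from this PR: the error handler name `c1_passthrough` and the helper `decode_like_bestfit1252` are hypothetical, and the sketch assumes that every byte the cp1252 codec rejects should map to the code point with the same numeric value, which matches the five bytes listed above.

```python
import codecs


def c1_passthrough(exc):
    """Decode error handler: map each rejected byte to the code point
    with the same numeric value (e.g. 0x81 -> U+0081)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position where decoding resumes.
    return "".join(chr(b) for b in bad), exc.end


# Hypothetical handler name, registered for use with errors="c1_passthrough".
codecs.register_error("c1_passthrough", c1_passthrough)

# The bytes that CP1252.TXT lists as UNDEFINED.
UNDEFINED_IN_CP1252 = b"\x81\x8d\x8f\x90\x9d"


def decode_like_bestfit1252(data: bytes) -> str:
    """Decode as cp1252, falling back to the same-valued code point
    for bytes the codec treats as undefined."""
    return data.decode("cp1252", errors="c1_passthrough")


if __name__ == "__main__":
    try:
        print(repr(b"\x81".decode("cp1252")))
    except UnicodeDecodeError as err:
        print("strict decode failed:", err)
    print(repr(decode_like_bestfit1252(UNDEFINED_IN_CP1252)))
```

Bytes that cp1252 already defines are unaffected by the handler, so b'\x80' still decodes to the euro sign.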
This PR:
Updates the Makefile in Tools/unicode/, adding 'windows-bestfit' along with some pre-processing of the bestfit mapping files downloaded from ftp.unicode.org.
Adds the direct output of gencodec.py for the corresponding cp encodings in Lib/encodings/.
https://bugs.python.org/issue45120