bpo-45120: Updated windows 'cp' encodings to match 'bestfit' specification. #28189

Rafaelblsilva · 2021-09-06T20:47:11Z

bpo-45120: Updated windows 'cp' encodings to match 'bestfit' specification.

There is a mismatch between the specification and the actual behavior of some Windows encodings.

Some older Windows code page specifications mark certain bytes as "UNDEFINED", whereas in practice those bytes follow a different mapping, which is documented in a companion specification named "bestfit".

For example, CP1252 has a corresponding bestfit1252.

They have the following differences:

In CP1252, bytes \x81 \x8d \x8f \x90 \x9d map to "UNDEFINED".
In bestfit1252, they map to \u0081 \u008d \u008f \u0090 \u009d respectively.
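The set of bytes a codec leaves undefined can be checked directly. The following is a minimal sketch, assuming an interpreter whose cp1252 codec does not include this change; the helper name `undefined_bytes` is hypothetical and introduced only for illustration.

```python
def undefined_bytes(encoding):
    """Return the byte values that `encoding` refuses to decode in strict mode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


# On an interpreter without the bestfit update, this should list the five
# bytes named above; with the update applied it would be empty.
undefined_cp1252 = undefined_bytes("cp1252")
print([hex(b) for b in undefined_cp1252])
```

For comparison, `undefined_bytes("latin-1")` returns an empty list, since Latin-1 maps every byte to the code point with the same value.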

In the Windows API, the function 'MultiByteToWideChar' in "cp1252 mode" exhibits the bestfit1252 behavior.

This issue and PR propose a correction for this behavior, updating the Windows code pages in which some code points were defined as "UNDEFINED" to use the corresponding bestfit mapping.

For example, b'\x81'.decode('cp1252') currently produces:

>>> b'\x81'.decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>

Whereas, I think, the intended behavior would be:

>>> b'\x81'.decode('cp1252')
'\x81'
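Until the codec tables themselves change, similar results can be approximated with a custom error handler. This is only a sketch and not part of the PR: the handler name `bestfit_c1` is hypothetical, and it simply maps each undecodable byte to the code point with the same value, which matches the bestfit1252 entries quoted above but is not guaranteed to match the bestfit tables of other code pages.

```python
import codecs


def bestfit_c1(exc):
    """Map each undecodable byte to the code point with the same value."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Return the replacement text and the position at which to resume.
        return "".join(chr(b) for b in bad), exc.end
    raise exc


# "bestfit_c1" is a made-up handler name used only for this illustration.
codecs.register_error("bestfit_c1", bestfit_c1)

decoded = b"caf\xe9 \x81".decode("cp1252", errors="bestfit_c1")
print(repr(decoded))
```

Bytes that cp1252 already defines, such as \xe9, are decoded as usual; only bytes that would otherwise raise UnicodeDecodeError reach the handler.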

This PR:

Updates the Makefile in Tools/unicode/, adding 'windows-bestfit' and some pre-processing of the bestfit mapping files downloaded from ftp.unicode.org.
Adds the direct output of gencodec.py for the corresponding cp encodings in Lib/encodings/.
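The pre-processing step can be pictured roughly as follows. This is a hedged sketch, not the code in this PR: it assumes the bestfit files contain an MBTABLE section of tab-separated "byte, code point, ;comment" lines, and the `SAMPLE` text, `parse_mbtable`, and `build_decoding_table` are all illustrative names and data rather than real file contents or tool functions.

```python
import codecs

# Abbreviated, illustrative excerpt; the exact file layout is an assumption.
SAMPLE = """\
CODEPAGE 1252
MBTABLE 4
0x41\t0x0041\t;Latin Capital Letter A
0x80\t0x20ac\t;Euro Sign
0x81\t0x0081
0xe9\t0x00e9\t;Latin Small Letter E With Acute
WCTABLE 1
0x0100\t0x41\t;Latin Capital Letter A With Macron
ENDCODEPAGE
"""


def parse_mbtable(text):
    """Collect the byte -> code point pairs listed under MBTABLE."""
    mapping = {}
    in_mbtable = False
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = line.split()
        if fields[0] == "MBTABLE":
            in_mbtable = True
            continue
        if not fields[0].startswith("0x"):
            in_mbtable = False  # any other keyword ends the section
            continue
        if in_mbtable:
            mapping[int(fields[0], 16)] = int(fields[1], 16)
    return mapping


def build_decoding_table(mapping):
    """Build a 256-character table; U+FFFE marks bytes left undefined."""
    return "".join(
        chr(mapping[b]) if b in mapping else "\ufffe" for b in range(256)
    )


mapping = parse_mbtable(SAMPLE)
table = build_decoding_table(mapping)
decoded, consumed = codecs.charmap_decode(b"A\x80\x81\xe9", "strict", table)
print(repr(decoded), consumed)
```

The resulting string has the same shape as the `decoding_table` that the generated modules in Lib/encodings/ pass to `codecs.charmap_decode`, as seen in the traceback above.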

https://bugs.python.org/issue45120

the-knights-who-say-ni · 2021-09-06T20:47:14Z

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

CLA Missing

Our records indicate the following people have not signed the CLA:

@Rafaelblsilva

For legal reasons we need all the people listed to sign the CLA before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day
before our records are updated.

You can check yourself to see if the CLA has been received.

Thanks again for the contribution, we look forward to reviewing it!

Rafaelblsilva · 2021-09-07T19:51:27Z

@malemburg, since I've used your unicode/gencodec.py tool and tweaked its Makefile, you might want to take a look at it.
Your review and opinion on this matter would also be extremely helpful, thanks!

github-actions · 2021-10-08T00:05:30Z

This PR is stale because it has been open for 30 days with no activity.

Rafaelblsilva added 2 commits September 6, 2021 16:55

Updated windows codepages in accordance to 'bestfit' behavior

facdd9a

Reverted python binary from python3 to python in makefile

faa46ef

the-knights-who-say-ni added the CLA not signed label Sep 6, 2021

bedevere-bot added the awaiting review label Sep 6, 2021

Added Misc/NEWS.d with blurb

0dd33c7

Rafaelblsilva mentioned this pull request Sep 6, 2021

Fix encoding mismatch cp1252 -> latin1 PyMySQL/mysqlclient#502

Closed

the-knights-who-say-ni added CLA signed and removed CLA not signed labels Sep 7, 2021

github-actions bot added the stale Stale PR or inactive for long period of time. label Oct 8, 2021

rafaelblsilva mannequin mentioned this pull request Apr 10, 2022

Windows cp encodings "UNDEFINED" entries update #89283

Open

ezio-melotti removed the CLA signed label Jul 13, 2022

github-actions bot removed the stale Stale PR or inactive for long period of time. label Aug 12, 2022

Rafaelblsilva closed this Oct 16, 2023
