The Wayback Machine - https://web.archive.org/web/20250620010118/https://github.com/python/cpython/pull/28189
bpo-45120: Updated windows 'cp' encodings to match 'bestfit' specification. #28189

@Rafaelblsilva Rafaelblsilva commented Sep 6, 2021

bpo-45120: Updated windows 'cp' encodings to match 'bestfit' specification.

There is a mismatch between the specification and the actual behavior of some Windows encodings.

Some older Windows code page specifications mark certain bytes as "UNDEFINED", whereas in practice Windows handles them according to an updated mapping described in a section named "bestfit".

For example, CP1252 has a corresponding bestfit1252.

They have the following differences:

  • In CP1252, bytes \x81 \x8d \x8f \x90 \x9d map to "UNDEFINED".

  • In bestfit1252, they map to \u0081 \u008d \u008f \u0090 \u009d respectively.
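The difference listed above can be checked from Python itself. The following is a minimal sketch; the helper name `undefined_bytes` is made up for illustration and is not part of this PR. On an interpreter without this change, it should report the five bytes listed for cp1252.

```python
def undefined_bytes(encoding):
    """Return the single-byte values that `encoding` refuses to decode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            # The codec's decoding table marks this byte as undefined.
            missing.append(value)
    return missing


# Show the undefined bytes of the cp1252 codec shipped with this interpreter.
print([hex(b) for b in undefined_bytes("cp1252")])
```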

In the Windows API, the function 'MultiByteToWideChar' in "cp1252 mode" exhibits the bestfit1252 behavior.

This issue and PR propose a correction for this behavior: updating the Windows code pages in which some code points were defined as "UNDEFINED" to use the corresponding bestfit mapping.

For example, b'\x81'.decode('cp1252') currently produces:

>>> b'\x81'.decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>

Whereas, I think, the intended behavior would be:

>>> b'\x81'.decode('cp1252')
'\x81'
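Until the codec tables themselves change, the intended result can be approximated with a custom error handler. This is only a sketch of a workaround, not what this PR implements: the handler name "c1_passthrough" and the function `decode_bestfit_like` are hypothetical, and the handler assumes that every undefined byte should map to the code point with the same numeric value, which matches the five cp1252 cases above but is not guaranteed for other code pages.

```python
import codecs


def _c1_passthrough(exc):
    """Map each undecodable byte to the code point with the same value."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position at which to resume decoding.
    return "".join(chr(b) for b in bad), exc.end


# "c1_passthrough" is a hypothetical handler name chosen for this sketch.
codecs.register_error("c1_passthrough", _c1_passthrough)


def decode_bestfit_like(data, encoding="cp1252"):
    """Decode `data`, passing undefined bytes through as same-valued code points."""
    return data.decode(encoding, errors="c1_passthrough")


print(repr(decode_bestfit_like(b"\x81")))
```

Bytes that the codec already defines are unaffected, since the handler only runs when decoding fails.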

This PR:

  • Updates the Makefile in Tools/unicode/, adding 'windows-bestfit' along with some pre-processing of the bestfit mapping files downloaded from ftp.unicode.org.

  • Adds the direct output of gencodec.py for the corresponding cp encodings in Lib/encodings/
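To give a rough idea of the kind of pre-processing involved, here is a sketch that pulls the byte-to-Unicode section out of a bestfit-style file. It is not the Makefile rule from this PR. The layout of SAMPLE (an MBTABLE section followed by a WCTABLE section, with ';' comments) is an assumption about the bestfit file format, and `extract_mbtable` is a hypothetical helper.

```python
# Hypothetical excerpt; the section layout is an assumption about the format.
SAMPLE = """\
CODEPAGE 1252
MBTABLE 4
0x80\t0x20ac\t;Euro Sign
0x81\t0x0081
0x8d\t0x008d
0x8e\t0x017d\t;Latin Capital Letter Z With Caron
WCTABLE 1
0x20ac\t0x80\t;Euro Sign
ENDCODEPAGE
"""


def extract_mbtable(text):
    """Collect byte -> character pairs from the MBTABLE section only."""
    mapping = {}
    in_mbtable = False
    for line in text.splitlines():
        line = line.split(";", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = line.split()
        if fields[0] == "MBTABLE":
            in_mbtable = True
        elif not fields[0].startswith("0x"):
            # Any other keyword (CODEPAGE, WCTABLE, ...) ends the section.
            in_mbtable = False
        elif in_mbtable and len(fields) >= 2:
            mapping[int(fields[0], 16)] = chr(int(fields[1], 16))
    return mapping


table = extract_mbtable(SAMPLE)
print(sorted(hex(b) for b in table))
```

Only the decoding direction is kept here; the reverse (WCTABLE) entries are skipped because a charmap decoding table needs just one character per byte.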

https://bugs.python.org/issue45120

@the-knights-who-say-ni

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept this contribution by verifying everyone involved has signed the PSF contributor agreement (CLA).

CLA Missing

Our records indicate the following people have not signed the CLA:

@Rafaelblsilva

For legal reasons we need all the people listed to sign the CLA before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

If you have recently signed the CLA, please wait at least one business day
before our records are updated.

You can check yourself to see if the CLA has been received.

Thanks again for the contribution, we look forward to reviewing it!

@Rafaelblsilva
Author

@malemburg since I've used your unicode/gencodec.py tool and tweaked its makefile you might want to take a look at it.
Also your review and opinion would be extremely helpful on this matter, thanks!

@github-actions

github-actions bot commented Oct 8, 2021

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale Stale PR or inactive for long period of time. label Oct 8, 2021
@github-actions github-actions bot removed the stale Stale PR or inactive for long period of time. label Aug 12, 2022
4 participants