
Author vstinner
Recipients ezio.melotti, jaraco, lemburg, loewis, vstinner
Date 2014-07-11.14:04:49
Content
> The BOM (byte order mark) appears in the standard input stream. When using cmd.exe, the BOM is not present. This behavior occurs in CP1252 as well as CP65001.

How do you change the console encoding? Using the chcp command?

I'm surprised that you get a UTF-8 BOM when the code page 1252 is used. Can you please check that sys.stdin.encoding is "cp1252"?


I tested PowerShell with Python 3.5 on Windows 7 with an OEM code page 850 and ANSI code page 1252:

- by default, the stdin encoding is cp850 (OEM code page) and os.device_encoding(0) returns "cp850". sys.stdin.readline() does not contain a BOM.

- when stdin is a pipe (ex: echo "abc"|python ...), the stdin encoding becomes cp1252 (ANSI code page) because os.device_encoding(0) returns None; cp1252 is the result of locale.getpreferredencoding(False) (ANSI code page). sys.stdin.readline() does not contain a BOM.
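To reproduce the two cases above, a small diagnostic script like the following can be run both directly and behind a pipe. It is only a sketch of the selection rule as I observed it (console encoding if fd 0 is a console, otherwise the locale encoding), not a copy of CPython's actual startup code. The names pick_stdin_encoding() and describe_stdin() are hypothetical helpers.

```python
import locale
import os
import sys


def pick_stdin_encoding(device_encoding, preferred_encoding):
    """Hypothetical helper: prefer the console encoding, else the locale one."""
    return device_encoding or preferred_encoding


def describe_stdin():
    """Collect the values discussed above for file descriptor 0."""
    device = os.device_encoding(0)  # None when fd 0 is not a console/terminal
    preferred = locale.getpreferredencoding(False)
    return {
        "device_encoding": device,
        "preferred_encoding": preferred,
        "expected": pick_stdin_encoding(device, preferred),
        # sys.stdin can be None (e.g. pythonw.exe), hence getattr()
        "actual": getattr(sys.stdin, "encoding", None),
    }


if __name__ == "__main__":
    for key, value in describe_stdin().items():
        print(f"{key}: {value}")
```

Comparing the "expected" and "actual" lines in each situation should show whether sys.stdin.encoding follows the rule described above on a given setup.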

If I change the console encoding using the command "chcp 65001":

- by default, the stdin encoding = os.device_encoding(0) = "cp65001".  sys.stdin.readline() does not contain a BOM.

- when stdin is a pipe, stdin encoding = locale.getpreferredencoding(False) = "cp1252" and sys.stdin.readline() *contains* the UTF-8 BOM

Note: The UTF-8 BOM is only written once, before the first character.

So the UTF-8 BOM is only written when all of these conditions are met:

- Python is running in PowerShell (The UTF-8 BOM is not written in cmd.exe, even with chcp 65001)
- sys.stdin is a pipe
- the console encoding was set manually to cp65001
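For reference, here is how such a BOM shows up on the Python side depending on the codec used to decode stdin. The input bytes are an assumption (a UTF-8 BOM followed by "abc" and a Windows newline), not a capture from a real PowerShell pipe.

```python
import codecs

# Assumed bytes received by the consumer: UTF-8 BOM, then the piped text.
raw = codecs.BOM_UTF8 + b"abc\r\n"

# Decoded with the ANSI code page, the three BOM bytes become three characters.
as_cp1252 = raw.decode("cp1252")

# Decoded as plain UTF-8, the BOM becomes the single character U+FEFF.
as_utf8 = raw.decode("utf-8")

# The "utf-8-sig" codec drops one leading BOM.
as_utf8_sig = raw.decode("utf-8-sig")

print(ascii(as_cp1252))
print(ascii(as_utf8))
print(ascii(as_utf8_sig))
```

With stdin decoded as cp1252, the line returned by sys.stdin.readline() would therefore start with the three characters "ï»¿" rather than with U+FEFF.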

--

It looks like PowerShell decodes the output of the producer program (echo, type, ...) and then encodes the output to the consumer program (ex: python).

It's possible to change the encoding of the encoder by setting the $OutputEncoding variable. Example to encode to UTF-8 without the BOM:

   $OutputEncoding = New-Object System.Text.UTF8Encoding($False)

Example to encode to UTF-8 with the BOM:

   $OutputEncoding = [System.Text.Encoding]::UTF8

Using [System.Text.Encoding]::UTF8, sys.stdin.readline() starts with a BOM even if the console encoding is cp850. If you set the console encoding to 65001 (chcp 65001) and $OutputEncoding to [System.Text.Encoding]::UTF8, you get... two UTF-8 BOMs... yeah!
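As a possible workaround on the consumer side, the raw bytes can be read from sys.stdin.buffer and every leading BOM removed before decoding. This is only a sketch: strip_utf8_boms() is a hypothetical helper, and the sample input is made up to mimic the double-BOM case. Note that the "utf-8-sig" codec alone is not enough here, since it removes only one BOM.

```python
import codecs
from typing import Tuple


def strip_utf8_boms(data: bytes) -> Tuple[bytes, int]:
    """Remove every leading UTF-8 BOM; return the rest and the BOM count."""
    count = 0
    while data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
        count += 1
    return data, count


# Made-up input mimicking the double-BOM case.
sample = codecs.BOM_UTF8 * 2 + b"abc\r\n"

payload, bom_count = strip_utf8_boms(sample)
print(bom_count, payload.decode("utf-8"))

# For comparison: "utf-8-sig" leaves the second BOM in place as U+FEFF.
print(ascii(sample.decode("utf-8-sig")))
```

In a real script, the input would come from sys.stdin.buffer.readline() instead of the sample bytes.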

I tried different producer programs: [MS-DOS] echo "abc", [PowerShell] write-output "abc", [MS-DOS] type document.txt, [PowerShell] Get-Content document.txt, python -c "print('abc')". It doesn't look like using a different program changes anything. The UTF-8 BOM is added somewhere by PowerShell between the producer and the consumer programs.

To show the console input and output encodings in PowerShell, type "[console]::InputEncoding" and "[console]::OutputEncoding".

See also:
http://stackoverflow.com/questions/22349139/utf8-output-from-powershell
History
Date                 User      Action  Args
2014-07-11 14:04:50  vstinner  set     recipients: + vstinner, lemburg, loewis, jaraco, ezio.melotti
2014-07-11 14:04:50  vstinner  set     messageid: <[email protected]>
2014-07-11 14:04:50  vstinner  link    issue21927 messages
2014-07-11 14:04:49  vstinner  create