Disturbing implicit encoding using "Add-Content" and "echo ... | out-file ..." Be careful!!

  • General discussion

  • In many tutorials I read, it's always written that the following two instructions are equivalent:

    Add-Content -path $myFile -value $line
    echo $line | Out-File -append $myFile

    But this is not true.  Those authors are probably still living in the computing world of the '80s!!  We are now in the 21st century and Unicode is everywhere!

    Try the following code and you'll see that "add-content" and "out-file" apply implicit encodings which is very very disturbing!  Be careful!  Do not mix these two commands if your strings do not have pure 7-bit ASCII characters; or else your text file is not usable!

    $myFile_ac = "c:\temp\f-add-content.txt"
    $myFile_echo = "c:\temp\f-echo.txt"
    
    $line = "12345 ¾ 67890"
    
    # This gives ANSI as implicit encoding (file size = 15 bytes)
    Add-Content -path $myFile_ac -value $line
    
    # This gives UTF-16 LE as implicit encoding (file size = 32 bytes)
    echo $line | Out-File -append $myFile_echo
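
The two file sizes quoted in the comments above can be cross-checked without writing any files. This is a minimal sketch, assuming Windows PowerShell on a system where "ANSI" means code page 1252 and where each cmdlet ends the line with CRLF; the variable names are illustrative and not from the original post.

```powershell
# Build the same 13-character test string without a non-ASCII literal,
# so the result does not depend on how this script file itself is saved.
$line    = "12345 " + [char]0xBE + " 67890"   # [char]0xBE is the three-quarters sign
$newline = "`r`n"                              # assumption: Windows CRLF line ending

# Assumption: the "ANSI" code page is Windows-1252 (typical for US/Western European systems).
$ansi  = [System.Text.Encoding]::GetEncoding(1252)
$utf16 = [System.Text.Encoding]::Unicode       # UTF-16 LE

# ANSI: one byte per character, no byte order mark.
$ansiSize = $ansi.GetByteCount($line + $newline)

# UTF-16 LE: a 2-byte byte order mark, then two bytes per character.
$utf16Size = $utf16.GetPreamble().Length + $utf16.GetByteCount($line + $newline)

"ANSI bytes:   $ansiSize"
"UTF-16 bytes: $utf16Size"
```

Under these assumptions the counts come to 15 and 32 bytes, which matches the sizes reported above.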

    For "out-file", I found that I could explicitly specify the encoding, e.g.:

    echo $line | Out-File -append -Encoding UTF8 $myFile

    However, for "add-content", I'm unable to find how to specify the encoding.  Does anyone know?

    Thanks in advance

    Friday, February 19, 2016 5:10 PM

All replies

  • ANSI/ASCII encoding is 8 bit and not 7.  We haven't used 7 bit since the old teletype era.


    \_(ツ)_/

    Friday, February 19, 2016 5:19 PM
  • This behavior of the different commands has been blogged and documented since the first version of PowerShell.  Don't mix Out-File and Add-Content without explicitly specifying the encoding, as you have just discovered.

    Here are a ton of articles referencing this: https://www.google.com/?gws_rd=ssl#newwindow=1&q=powershell+file+output+encoding+issues


    \_(ツ)_/

    Friday, February 19, 2016 5:24 PM
  • ANSI/ASCII encoding is 8 bit and not 7.  We haven't used 7 bit since the old teletype era.


    \_(ツ)_/

    Wrong.  ASCII (aka basic ASCII, simple ASCII) is 7-bit, even if the data is packed inside an 8-bit byte.

    What you're talking about is Extended ASCII, in which some characters use the eighth bit.

    On the other hand, I wrote:

    Do not mix these two commands if your strings do not have pure 7-bit ASCII characters; or else your text file is not usable!

    Friday, February 19, 2016 5:24 PM
  • Add-Content might have a parameter to specify this in v5:

    https://technet.microsoft.com/en-us/library/hh849859%28v=wps.640%29.aspx

    I think the docs are a bit out of whack.

    Anything before v5 does not have this parameter.


    Friday, February 19, 2016 5:25 PM
  • ANSI/ASCII encoding is 8 bit and not 7.  We haven't used 7 bit since the old teletype era.


    \_(ツ)_/

    Wrong.  ASCII (aka basic ASCII, simple ASCII) is 7-bit, even if the data is packed inside an 8-bit byte.

    What you're talking about is Extended ASCII, in which some characters use the eighth bit.

    On the other hand, I wrote:

    Do not mix these two commands if your strings do not have pure 7-bit ASCII characters; or else your text file is not usable!

    You forget about the ANSI part. ANSI ASCII is the old extended ASCII.  ANSI always allows the full character set.


    \_(ツ)_/

    Friday, February 19, 2016 5:38 PM
  • My best guess is that you are saying that when outputting from the console all characters are handled as Unicode, and when outputting using various methods we may get mixed output. Yes - this is true and is a side effect of the default behavior of some CmdLets. Early PowerShell was not "pure" Unicode.  The earlier CmdLets may not behave well when mixed.

    Yes - you are correct.  It is something to be aware of.  It gets us all periodically.

    I tend to stick with Out-File for consistency.  Use of -encoding is encouraged when working in a multi-cultural environment.

    All of these same issues have always plagued programmers.


    \_(ツ)_/

    Friday, February 19, 2016 5:50 PM
  • Here is a fairly complete article on ASCII.  It notes that the Windows-1252 character set is an 8-bit, ASCII-based character set closely related to ISO-8859-1.

    Here is PowerShell in the US

    PS C:\scripts> $OutputEncoding
    
    
    IsSingleByte      : True
    BodyName          : us-ascii
    EncodingName      : US-ASCII
    HeaderName        : us-ascii
    WebName           : us-ascii
    WindowsCodePage   : 1252
    IsBrowserDisplay  : False
    IsBrowserSave     : False
    IsMailNewsDisplay : True
    IsMailNewsSave    : True
    EncoderFallback   : System.Text.EncoderReplacementFallback
    DecoderFallback   : System.Text.DecoderReplacementFallback
    IsReadOnly        : True
    CodePage          : 20127

    Windows-1252 is a superset of US-ASCII and is 8 bits.  Its characters are mapped from the 7-bit ASCII set, but it allows for 8 bits. Most extended characters can only be input with the keypad.

    The name 'ASCII' is used by convention. Notice that notepad and most editors call this ANSI.
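
The difference between 7-bit US-ASCII (code page 20127, the CodePage value in the output above) and 8-bit Windows-1252 can be seen by encoding one extended character with each. This is a minimal sketch that uses only the .NET encoding classes; the variable names are illustrative.

```powershell
# One extended character: the three-quarters sign, code point U+00BE.
$ch = [string][char]0xBE

# 7-bit US-ASCII has no code for it, so the encoder substitutes a replacement byte.
$asciiBytes = [System.Text.Encoding]::ASCII.GetBytes($ch)

# 8-bit Windows-1252 stores it as a single byte with the high bit set.
$ansiBytes = [System.Text.Encoding]::GetEncoding(1252).GetBytes($ch)

"US-ASCII:     0x{0:X2}" -f $asciiBytes[0]
"Windows-1252: 0x{0:X2}" -f $ansiBytes[0]
```

US-ASCII falls back to a question mark (0x3F), while Windows-1252 keeps the character as the byte 0xBE.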


    \_(ツ)_/

    Friday, February 19, 2016 6:22 PM
  • Here are the caveats and disclaimers to the ANSI arguments: https://en.wikipedia.org/wiki/Windows-1252


    \_(ツ)_/

    Friday, February 19, 2016 6:24 PM
  • Putting the off-topic ASCII discussion aside: for compatibility reasons (at least from Win7 up), I conclude that it is just not possible to specify the encoding with Add-Content, and therefore this command must be avoided at all costs (in the context and in the name of Unicode).
    Friday, February 19, 2016 6:44 PM
  • Add-Content -encoding works on Win7 and all other systems. 
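
As a minimal sketch of that claim: for file system paths, Add-Content accepts an -Encoding parameter, so both cmdlets can be given the same explicit encoding and then appended to one file. The temporary file name below is made up for illustration.

```powershell
# Same test string as in the original post, built without a non-ASCII literal.
$line   = "12345 " + [char]0xBE + " 67890"

# Hypothetical scratch file in the temp directory.
$myFile = Join-Path ([System.IO.Path]::GetTempPath()) ("encoding-demo-{0}.txt" -f [guid]::NewGuid())

# Give both cmdlets the same explicit encoding instead of relying on their defaults.
Add-Content -Path $myFile -Value $line -Encoding UTF8
$line | Out-File -Append -Encoding UTF8 -FilePath $myFile

# Read the file back with the same encoding; both lines should come back unchanged.
$readBack = @(Get-Content -Path $myFile -Encoding UTF8)
$readBack

# Clean up the scratch file.
Remove-Item -Path $myFile
```

With a matching -Encoding on every write, the two cmdlets can share a file without corrupting the non-ASCII character.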

    \_(ツ)_/

    Friday, February 19, 2016 6:48 PM
  • But PowerShell in Win7 is at version 2 by default, no?

    Just checked: PowerShell in my Win 7 is version 2.

    But that article said it's for v5


    • Edited by Horinius Friday, February 19, 2016 6:59 PM
    Friday, February 19, 2016 6:58 PM
  • We should all be running WMF 4 or later. V2 is all but obsolete and does not support most PowerShell modules for new systems.


    \_(ツ)_/

    Friday, February 19, 2016 7:08 PM
  • Digression:

       As much as Microsoft would like everybody to upgrade their Win7 & 8 to Win10, this won't happen soon - not in the professional sector or the consumer sector.  Last Saturday I had a meeting with my bank and saw that they are using Win 7.  That's already very good!  Some others are still using XP in the professional world!

       The same for WMF 4 & PowerShell & whatever is out there in the real world.

    End of digression.

    ____________

    Is it possible to define "UTF8" as the default encoding, so that I don't have to repeat it like this every time:

    Out-File -Encoding UTF8 ......

    Monday, February 22, 2016 11:59 AM
  • Digression:

    For information:

    Basic ASCII is also called US-ASCII, and it effectively uses only 7 bits:

    http://www.kermitproject.org/ascii.html

    Monday, February 22, 2016 12:03 PM
  • But when Windows says ASCII it means ANSI, which is really Windows-1252, which is 8 bits and includes the extended set.

    This has been an old issue since I started programming Windows at 1.0.  We did a lot of home-rolled applications that used serial communications.  We also had systems that used Teletype machines for input and output.  Until the KSR 40 series these were all only 7 bit. The mix of bit-ness and character sets caused massive confusion at times.


    \_(ツ)_/

    Monday, February 22, 2016 2:47 PM
  • Digression:

       As much as Microsoft would like everybody to upgrade their Win7 & 8 to Win10, this won't happen soon - not in the professional sector or the consumer sector.  Last Saturday I had a meeting with my bank and saw that they are using Win 7.  That's already very good!  Some others are still using XP in the professional world!

       The same for WMF 4 & PowerShell & whatever is out there in the real world.

    End of digression.

    ____________

    Is it possible to define "UTF8" as the default encoding, so that I don't have to repeat it like this every time:

    Out-File -Encoding UTF8 ......

    There is no XP or WS2003 anymore. They have been discontinued and out of support for years.  The documentation says clearly WMF 3.0 and later.
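
On the quoted question about a default encoding: PowerShell 3.0 and later provide the $PSDefaultParameterValues preference variable, which may be one way to avoid repeating -Encoding UTF8 on every call. This is a minimal sketch, assuming PowerShell 3.0 or later; the temporary file name is made up for illustration.

```powershell
# Assumption: PowerShell 3.0 or later, where $PSDefaultParameterValues exists.
# The setting lasts for the current session only; a profile script can make it permanent.
$PSDefaultParameterValues['Out-File:Encoding'] = 'UTF8'

# Same test string as in the original post, built without a non-ASCII literal.
$line   = "12345 " + [char]0xBE + " 67890"

# Hypothetical scratch file in the temp directory.
$myFile = Join-Path ([System.IO.Path]::GetTempPath()) ("default-encoding-demo-{0}.txt" -f [guid]::NewGuid())

# No -Encoding here: the default registered above is applied.
$line | Out-File -FilePath $myFile

# Inspect the raw bytes to see which encoding was actually written.
$bytes = [System.IO.File]::ReadAllBytes($myFile)
"File size: $($bytes.Length) bytes"

# Clean up the scratch file and the session default.
Remove-Item -Path $myFile
$PSDefaultParameterValues.Remove('Out-File:Encoding')
```

The file comes out well under the 32 bytes that the UTF-16 default would produce. A wildcard key such as '*:Encoding' should apply the same default to every cmdlet that has an -Encoding parameter.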


    \_(ツ)_/

    Monday, February 22, 2016 2:51 PM

  • But when Windows says ASCII it means ANSI, which is really Windows-1252, which is 8 bits and includes the extended set.

    Not true either.  The notion of ASCII is very confused for Microsoft/Windows.  Look at this article which is published by Microsoft:

    https://msdn.microsoft.com/en-us/library/4z4t9ed1%28v=vs.71%29.aspx

    The article is very short.  There are three occurrences of the term ASCII:

    1. "The ASCII character code charts contain the ...."
         Here, it's implying 8 bit but its meaning was not clear at first reading.

    2. "... of the extended ASCII (American ..."
         This part is unambiguous ==> 8 bit

    3. "The extended character set includes the ASCII character set and 128 other characters for graphics..."
        Here ASCII stands for 7-bit

    It's not clear how to interpret the term ASCII for them.  Or maybe

    ASCII character code charts ==> 8 bit
    ASCII character set ==> 7 bit?

    Anyway, Microsoft articles have never had a reputation for quality or clarity, so I wouldn't expect too much from them.  That is why I chose my words very carefully in my first post.  I wrote "7-bit ASCII" to make it unambiguous.

    Nevertheless:

    This has been an old issue since I started programming Windows at 1.0.  We did a lot of home-rolled applications that used serial communications.  We also had systems that used Teletype machines for input and output.  Until the KSR 40 series these were all only 7 bit. The mix of bit-ness and character sets caused massive confusion at times.

    Yes, I know this confusion is common, even among experienced programmers.  It's sad.  It's the fault of bad education or educators, and I feel sorry for you and others in the same situation.  It's never too late to correct the wrongs.

    Tuesday, February 23, 2016 9:43 AM
  • There is no XP or WS2003 anymore. They have been discontinued and out of support for years.  The documentation says clearly WMF 3.0 and later.


    There is a difference between "being supported by Microsoft" and "being supported by the public".  Just because Microsoft decided to kill XP doesn't mean that every XP machine on Earth has stopped working.  Don't get me wrong.  I don't use XP and I'm persuading my entourage to move to Win7 or higher.

    But the reality is that there are still XP machines out there.  Thinking that XP has disappeared because it is not supported is just too naïve.

    End of digression

    So I concluded it's not possible to define "UTF-8" as default encoding and I have to specify "-Encoding UTF8" in every command of "Out-File".

    Tuesday, February 23, 2016 9:49 AM
  • You miss the point. Microsoft uses 1252, which is a superset of ASCII2 (Extended ASCII).  MS refers to it as ASCII and ANSI (in Notepad).  This is because Windows-1252 supports classic ASCII and ANSI ASCII, so MS has been loose with the terms.  Windows is NOT ASCII.  It is Windows-1252, which is a superset of ISO.  Microsoft can get away with this because it pretty much wrote the standards for ISO and for US Extended ASCII.

    It is also safe to say that standard ASCII no longer exists as a standard and is rolled into the ANSI and ISO standards, which is what we all use now.  Even Windows-1252 is disappearing.

    Remember that MS calls 1252 US-ASCII:

    PS C:\scripts> $OutputEncoding
    
    
    IsSingleByte      : True
    BodyName          : us-ascii
    EncodingName      : US-ASCII
    HeaderName        : us-ascii
    WebName           : us-ascii
    WindowsCodePage   : 1252

    But it is in no way a 7 bit character set.  It is clearly 8 bit.

    So in the end your team loses and has to buy the beer.  Sorry.

     


    \_(ツ)_/



    • Edited by jrv Tuesday, February 23, 2016 9:57 AM
    Tuesday, February 23, 2016 9:56 AM