Question about PowerShell and encoding

  • Question

  • hi all,

    Sorry for my English. I have a question about file encoding. I have a file encoded in ISO-8859-2, but if I call the OpenText function in PowerShell, CurrentEncoding is UTF8. How do I change it? The file is large, and if I try to open it with Get-Content it takes about an hour. Thanks.


    Falcon

    Saturday, July 12, 2014 4:56 PM

Answers

  • This article may help to explain how this is used and how it has evolved:

    http://msdn.microsoft.com/en-us/library/windows/desktop/dd317752(v=vs.85).aspx


    ¯\_(ツ)_/¯

    • Marked as answer by Marek G_ Saturday, July 12, 2014 9:46 PM
    Saturday, July 12, 2014 5:21 PM
  • Cmdlets like Get-Content don't allow you to select more than a few common encodings, but you can access the underlying .NET Framework to do whatever you like. For example:

    $iso8859_2 = [System.Text.Encoding]::GetEncoding('ISO-8859-2')
    $path = 'C:\Some\File.txt'
    
    # Opening a file and enumerating all of its lines:
    
    foreach ($line in [System.IO.File]::ReadLines($path, $iso8859_2)) { }
    
    # Obtaining a streamreader to the file, for finer control
    # over how you read it:
    
    $streamReader = New-Object System.IO.StreamReader($path, $iso8859_2)
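
    For a large file, the StreamReader approach above can be turned into a line-by-line loop that never holds the whole file in memory. The following is a minimal sketch, not a definitive implementation: the temporary file and its five bytes are assumptions made up for illustration and stand in for the real ISO-8859-2 file.

    ```powershell
    # Assumption: a small temp file stands in for the large ISO-8859-2 file.
    # The bytes spell one four-letter word in ISO-8859-2, followed by a newline.
    $path = [System.IO.Path]::GetTempFileName()
    [System.IO.File]::WriteAllBytes($path, [byte[]] (0xA3, 0xF3, 0x64, 0xBC, 0x0A))

    $iso8859_2 = [System.Text.Encoding]::GetEncoding('ISO-8859-2')
    $reader = New-Object System.IO.StreamReader($path, $iso8859_2)
    $lineCount = 0
    $firstLine = $null
    try {
        # ReadLine returns $null at end of file, so only one line is in memory at a time.
        while ($null -ne ($line = $reader.ReadLine())) {
            if ($lineCount -eq 0) { $firstLine = $line }
            $lineCount++
        }
    }
    finally {
        # Dispose releases the file handle even if the loop throws.
        $reader.Dispose()
    }
    Remove-Item -LiteralPath $path
    ```

    Passing the encoding to the StreamReader constructor is what replaces the UTF-8 default that OpenText applies.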


    • Edited by David Wyatt Saturday, July 12, 2014 5:23 PM
    • Marked as answer by Marek G_ Saturday, July 12, 2014 9:46 PM
    Saturday, July 12, 2014 5:22 PM
  • ISO-8859-2 is not an encoding; it is a character set. Files do not specify character sets. Character sets are specified by a system or by a document such as an HTML or XML document.

    I believe that ISO-8859-2 is the default Windows Central European character set. It is known as Latin II.  ISO-8859-1 is Latin I and is what is used in US Windows.  (See: http://en.wikipedia.org/wiki/ISO/IEC_8859-1)

    Out-File and other file creation commands have an encoding which determines how the characters are stored. The default for Windows 7 and later is Unicode.

    Type: help out-file -parameter encoding

    ISO-8859-2 can be stored in any encoding except ASCII-7 as it is only 8 bits.  Windows always stores ASCII-7 as ASCII.

    All Unicode encodings can store ISO-8859-2.

    Here is the ISO-8859 to Unicode  spec: ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT


    ¯\_(ツ)_/¯

    I'm not sure it's a good idea to make the distinction between a character set and an encoding here.  When it comes to interpreting the bytes in a file, those two concepts are linked.  Even though Unicode encoding can be used to represent every character that ISO-8859-2 can represent, they don't always have the same byte values.  This code outputs their 57 differences:

    $iso8859_2 = [System.Text.Encoding]::GetEncoding('ISO-8859-2')
    
    $differences = for ($isoCode = 0; $isoCode -lt 256; $isoCode++)
    {
        $isoString = $iso8859_2.GetString($isoCode)
        $unicodeBytes = [System.Text.Encoding]::Unicode.GetBytes($isoString)
    
        $unicodeCode = [System.BitConverter]::ToInt16($unicodeBytes, 0)
    
        if ($unicodeCode -ne $isoCode)
        {
            [pscustomobject] @{
                Character = $isoString
                ISO8859_2_Code = $isoCode
                Unicode_Code = $unicodeCode
            }
        }
    }
    
    $differences | Format-Table -AutoSize
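
    To make the point about byte values concrete, here is a small sketch that decodes a single byte under two encodings. The byte 0xA1 is chosen only as an illustration; it is not taken from the original file.

    ```powershell
    # One byte, interpreted under two different single-byte encodings.
    $bytes = [byte[]] (0xA1)

    $asLatin2 = [System.Text.Encoding]::GetEncoding('ISO-8859-2').GetString($bytes)
    $asLatin1 = [System.Text.Encoding]::GetEncoding('ISO-8859-1').GetString($bytes)

    # Code points of the decoded characters.
    $latin2CodePoint = [int]($asLatin2[0])   # U+0104, A with ogonek
    $latin1CodePoint = [int]($asLatin1[0])   # U+00A1, inverted exclamation mark

    # The same ISO-8859-2 character stored as UTF-16 (little endian) takes two bytes.
    $utf16Bytes = [System.Text.Encoding]::Unicode.GetBytes($asLatin2)

    'ISO-8859-2: U+{0:X4}   ISO-8859-1: U+{1:X4}' -f $latin2CodePoint, $latin1CodePoint
    ```

    Reading the file with the wrong encoding therefore does not fail; it silently produces different characters.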
    

    • Marked as answer by Marek G_ Saturday, July 12, 2014 9:46 PM
    Saturday, July 12, 2014 5:31 PM

All replies

  • ISO-8859-2 is not an encoding; it is a character set. Files do not specify character sets. Character sets are specified by a system or by a document such as an HTML or XML document.

    I believe that ISO-8859-2 is the default Windows Central European character set. It is known as Latin II.  ISO-8859-1 is Latin I and is what is used in US Windows.  (See: http://en.wikipedia.org/wiki/ISO/IEC_8859-1)

    Out-File and other file creation commands have an encoding which determines how the characters are stored. The default for Windows 7 and later is Unicode.

    Type: help out-file -parameter encoding

    ISO-8859-2 can be stored in any encoding except ASCII-7 as it is only 8 bits.  Windows always stores ASCII-7 as ASCII.

    All Unicode encodings can store ISO-8859-2.

    Here is the ISO-8859 to Unicode  spec: ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-2.TXT


    ¯\_(ツ)_/¯

    Saturday, July 12, 2014 5:15 PM
  • Here is how to get all of the encodings available:

    [system.text.encoding]::GetEncodings()

    You can use GetEncoding.

    [system.text.encoding]::GetEncoding(1252)  # Windows-1252 (ISO-8859-1 itself is code page 28591)
    [system.text.encoding]::GetEncoding(1250)  # Windows-1250 (ISO-8859-2 itself is code page 28592)
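
    The Windows code pages and the ISO-8859 encodings are separate entries in .NET, which a quick check can confirm. This is a sketch under the assumption that both encodings are available in the session; the byte 0xA5 is picked purely as an example of a position where the two differ.

    ```powershell
    $iso88592 = [System.Text.Encoding]::GetEncoding('iso-8859-2')
    $win1250  = [System.Text.Encoding]::GetEncoding(1250)

    # The ISO name resolves to its own code page, not to 1250.
    $isoCodePage = $iso88592.CodePage   # 28592
    $winWebName  = $win1250.WebName     # windows-1250 (casing may vary by .NET version)

    # The same byte decodes to different characters under each encoding.
    $bytes   = [byte[]] (0xA5)
    $isoChar = $iso88592.GetString($bytes)   # U+013D, L with caron
    $winChar = $win1250.GetString($bytes)    # U+0104, A with ogonek

    'ISO-8859-2: U+{0:X4}   Windows-1250: U+{1:X4}' -f [int]($isoChar[0]), [int]($winChar[0])
    ```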


    ¯\_(ツ)_/¯

    Saturday, July 12, 2014 5:27 PM
  • See Windows character set anomalies. Windows-1252 is actually an extended version of ISO-8859-1, so the two sets will not map one to one.


    ¯\_(ツ)_/¯

    Saturday, July 12, 2014 5:34 PM
  • Hey @jrv,

    I know this is an old thread, but it's sufficiently complex to fry my brain. I am looking to ensure that a text string is assigned to the proper character set before being base64 encoded for use in an HTTP Authorization header. As far as I understand it, these should use "iso-8859-1". This will be used in PowerShell.

    If you run [System.Text.Encoding]::GetEncoding("ISO-8859-1"), you get the following

    C:\Users\swin.HPELITEBOOK> [System.Text.Encoding]::GetEncoding("ISO-8859-1")
    
    
    IsSingleByte      : True
    BodyName          : iso-8859-1
    EncodingName      : Western European (ISO)
    HeaderName        : iso-8859-1
    WebName           : iso-8859-1
    WindowsCodePage   : 1252
    IsBrowserDisplay  : True
    IsBrowserSave     : True
    IsMailNewsDisplay : True
    IsMailNewsSave    : True
    EncoderFallback   : System.Text.InternalEncoderBestFitFallback
    DecoderFallback   : System.Text.InternalDecoderBestFitFallback
    IsReadOnly        : True
    CodePage          : 28591

    And if you run [System.Text.Encoding]::GetEncoding(1252) you get:

    C:\Users\swin.HPELITEBOOK> [System.Text.Encoding]::GetEncoding(1252)
    
    
    IsSingleByte      : True
    BodyName          : iso-8859-1
    EncodingName      : Western European (Windows)
    HeaderName        : Windows-1252
    WebName           : Windows-1252
    WindowsCodePage   : 1252
    IsBrowserDisplay  : True
    IsBrowserSave     : True
    IsMailNewsDisplay : True
    IsMailNewsSave    : True
    EncoderFallback   : System.Text.InternalEncoderBestFitFallback
    DecoderFallback   : System.Text.InternalDecoderBestFitFallback
    IsReadOnly        : True
    CodePage          : 1252
    
    

    Lastly, if you run [system.text.encoding]::GetEncodings()  |findstr iso-8859-1, you get:

    C:\Users\swin.HPELITEBOOK> [system.text.encoding]::GetEncodings()  |findstr iso-8859-1
       28591 iso-8859-1              Western European (ISO)
       28603 iso-8859-13             Estonian (ISO)
       28605 iso-8859-15             Latin 9 (ISO)

    So the CodePage differs (28591 vs. 1252), but the WindowsCodePage remains the same (1252).

    Does it therefore matter if you use

     [System.Text.Encoding]::GetEncoding(1252)

    or

    [System.Text.Encoding]::GetEncoding(28591)

    or even

    [System.Text.Encoding]::GetEncoding("iso-8859-1")

    Cheers

    Chris


    Monday, August 7, 2017 5:17 PM
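
    The three calls in the question above can be compared directly. The sketch below is illustrative only: the credential string is hypothetical, and the Euro sign is used simply as one example of a character that Windows-1252 has and ISO-8859-1 does not.

    ```powershell
    $credential = 'user:pass'   # hypothetical credentials, ASCII only

    $byName = [System.Text.Encoding]::GetEncoding('iso-8859-1')
    $byId   = [System.Text.Encoding]::GetEncoding(28591)
    $cp1252 = [System.Text.Encoding]::GetEncoding(1252)

    # Build the value of a Basic Authorization header from ISO-8859-1 bytes.
    $token  = [System.Convert]::ToBase64String($byName.GetBytes($credential))
    $header = "Basic $token"

    # The name and the number 28591 resolve to the same code page.
    $sameCodePage = ($byName.CodePage -eq $byId.CodePage)

    # For ASCII-only text, Windows-1252 yields the same token.
    $tokenCp1252 = [System.Convert]::ToBase64String($cp1252.GetBytes($credential))

    # The Euro sign (U+20AC) is byte 0x80 in Windows-1252 but has no ISO-8859-1 byte.
    $euro        = [string][char]0x20AC
    $cp1252Bytes = $cp1252.GetBytes($euro)
    $latin1Bytes = $byName.GetBytes($euro)
    ```

    Under these assumptions the choice only matters once the string contains characters outside ASCII, in particular those that Windows-1252 places in the 0x80 to 0x9F range.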