none
ASCII values for extended characters RRS feed

  • Dotaz

  • Hi all,

    I am building a T-SQL script where I need to get the ASCII value of a character. That was easy enough using the ASCII() function. When analyzing the text data, I found that it contains some extended characters, for example, the "a" with an accent: "á". This is ASCII char 160:

    ASCII Table for char(160)

    However, when I get the ASCII in T-SQL, it returns 225:

    T-SQL ASCII of accented a

    Why is this? I am currently looking for more information that will explain what is going on. I suspect that it might have to do with character sets. On this hunch I got the character set that the database is using:

    Database character set

    Thank you for your time and assistance. Saga


    You can't take the sky from me

    pátek 18. října 2019 21:07

Odpovědi

  • To be pedantic, ASCII values are in the range from 32 to 127. That is what is covered by the ASCII standard. Not more.

    Today, there are a lot of character sets in use, but about all of them agree with ASCII with the range 32 to 127. However, for the code points beyond 127 they may disagree.

    In Windows terms, a character set is called a code page. It we restrict ourselves to code pages used for character sets west of India (i.e. Latin, Cyrillic, Greek, Hebrew and a few more scripts), these code pages encompass code points up to 255, that is as much as you can fit into a single byte.

    Obviously, code pages for Latin and Greek will be different, since they will include different characters. A code page for Greek includes a glyph for alpha which is missing from a Latin code page, which on the other hand may include á.

    To make things more complicated, when these eight-bit sets were introduced in the end of the eighties, different vendors invented their own "extended ASCII". I don't know the exact history, but it was the definition used by Digital that prevailed as the standard known as Latin-1 which later became the foundation for Unicode. In Latin-1 code point 160 is no-break-space, and lowercase a with acute accent is found at position 225.
    The chart you are looking at is for a different set, I would guess one of CP437 and CP850 which were the code pages that IBM devised and used for their PC. Both of these are intended for Western Europe, but CP850 is better for multi-national use. Windows still use any of these in the command-line window for backwards compatibility.

    SQL Server does have some collations that support these legacy code pages, and you can see which code page a collation is for by running

    SELECT collationproperty('collationname', 'CodePage')

    The collation you have in your screen shot uses code page 1252, which is the same as Latin-1.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    pátek 18. října 2019 21:46
  • I don't know what reference you used but "extended ASCII" is a misnomer. ASCII characters are in the 0-127 code point range and the remaining 128-255 code point characters are determined by the code page. This should have been called out in the reference.

    You are on the right track that the different characters and code points are related to the collation since the collation determines the code page SQL Server uses for char/varchar data. SQL_Latin_General_CP1_CI_AS uses Windows code page 1252 whereas the reference might use code page 437. To wit:

    SELECT ASCII((SELECT 'á' COLLATE SQL_Latin1_General_CP1_CI_AS)); --225
    SELECT ASCII((SELECT 'á' COLLATE SQL_Latin1_General_CP437_CI_AS)); --160

    So the same character can have different code points depending on the collation's code page.


    Dan Guzman, Data Platform MVP, http://www.dbdelta.com

    pátek 18. října 2019 21:56
  • My two cent:

    1. character set is not the same as code page

    UNICODE is a character set for example but you have several Encoding for this character set like UTF-16LE, UTF-16BE, UTF-7, UTF-8, and so on. The Code Page is the name of the Encoding, for example the above encoding has the code pages: 1200, 1201, 65000, 65001, and so on. So we can see that a single character set (UNICODE) have multiple Code Points, since it has multiple Character Encoding.

    In short here are the main terms:

    > "Character" is a logical entity used by people.
    > “Encoded Value” is set of zero and one which is what Computers use 
    > “Code Points” are simply a range of numbers, which used to map between characters and the encoded values.
    > "Character set" is a group of characters, mapped to specific range of code points. Same character SET name can be used for different groups of characters like in the case of Extended ASCII, but these will be mapped to the same range of Code Points.
    > “Character Encoding” is the algorithm we use to encode the code points into Encoded values
    > "Code page” is a number/name which represent the “Character Encoding”

    The terms are a bit different from one documentation to the other (unfortunately) which lead to a lot of confusion.

    As Erland said, different vendors invented their own Encoding for the character set "extended ASCII". The above image shows Microsoft code points but there are other code points for IBM Encoding for example. For example Code page 862 is a code point for IBM encoding for Hebrew, which by the way it is used under DOS for Hebrew.

    ASCII standard character set was initially composed of 128 characters (7-bit code). The first 32 characters are control characters but as Dan said they are part of ASCII code points range which is 0-127.

    SQL Server uses a parameter name COLLATE to represent multiple properties which configure the rules of how to use the data. One of these properties together with the data type configures the Code Page. For example using COLLATE with data type NVARCHAR will always be UTF16/UCS2 encoding, but with data type VARCHAR different collate can have different Code Points. In SQL 2019 we simply got more COLLATE which support the encoding UTF8.

    In order to get the different properties of each COLLATE we can use the function COLLATIONPROPERTY

    In order to get the COLLATE we have, we can use the function sys.fn_helpcollations()

    Putting these two together the following query will return some of the properties of the COLLATE in the system:

    SELECT col.[name],
           COLLATIONPROPERTY(col.[name], 'CodePage') AS [CodePage],
           COLLATIONPROPERTY(col.[name], 'Version') AS [Version],
           COLLATIONPROPERTY(col.[name], 'LCID') AS [Language Codes],
           COLLATIONPROPERTY(col.[name], 'ComparisonStyle') AS [Comparison Style]
    FROM   sys.fn_helpcollations() col

    If you execute this query you can get the code points for the encoding for different COLLATE if we use data types like CHAR and VARCHAR.

    As Erland said code point 1252 is used to represent Latin
    Quote from Wikipedia: "Windows-1252 or CP-1252 (code page 1252) is a single-byte character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and some other Western languages".

    * I think Dan explain the rest well, so no need to repeat it...


    signature   Ronen Ariely
     [Personal Site]    [Blog]    [Facebook]    [Linkedin]



    sobota 19. října 2019 8:18
    Moderátor

Všechny reakce

  • To be pedantic, ASCII values are in the range from 32 to 127. That is what is covered by the ASCII standard. Not more.

    Today, there are a lot of character sets in use, but about all of them agree with ASCII with the range 32 to 127. However, for the code points beyond 127 they may disagree.

    In Windows terms, a character set is called a code page. It we restrict ourselves to code pages used for character sets west of India (i.e. Latin, Cyrillic, Greek, Hebrew and a few more scripts), these code pages encompass code points up to 255, that is as much as you can fit into a single byte.

    Obviously, code pages for Latin and Greek will be different, since they will include different characters. A code page for Greek includes a glyph for alpha which is missing from a Latin code page, which on the other hand may include á.

    To make things more complicated, when these eight-bit sets were introduced in the end of the eighties, different vendors invented their own "extended ASCII". I don't know the exact history, but it was the definition used by Digital that prevailed as the standard known as Latin-1 which later became the foundation for Unicode. In Latin-1 code point 160 is no-break-space, and lowercase a with acute accent is found at position 225.
    The chart you are looking at is for a different set, I would guess one of CP437 and CP850 which were the code pages that IBM devised and used for their PC. Both of these are intended for Western Europe, but CP850 is better for multi-national use. Windows still use any of these in the command-line window for backwards compatibility.

    SQL Server does have some collations that support these legacy code pages, and you can see which code page a collation is for by running

    SELECT collationproperty('collationname', 'CodePage')

    The collation you have in your screen shot uses code page 1252, which is the same as Latin-1.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    pátek 18. října 2019 21:46
  • I don't know what reference you used but "extended ASCII" is a misnomer. ASCII characters are in the 0-127 code point range and the remaining 128-255 code point characters are determined by the code page. This should have been called out in the reference.

    You are on the right track that the different characters and code points are related to the collation since the collation determines the code page SQL Server uses for char/varchar data. SQL_Latin_General_CP1_CI_AS uses Windows code page 1252 whereas the reference might use code page 437. To wit:

    SELECT ASCII((SELECT 'á' COLLATE SQL_Latin1_General_CP1_CI_AS)); --225
    SELECT ASCII((SELECT 'á' COLLATE SQL_Latin1_General_CP437_CI_AS)); --160

    So the same character can have different code points depending on the collation's code page.


    Dan Guzman, Data Platform MVP, http://www.dbdelta.com

    pátek 18. října 2019 21:56
  • My two cent:

    1. character set is not the same as code page

    UNICODE is a character set for example but you have several Encoding for this character set like UTF-16LE, UTF-16BE, UTF-7, UTF-8, and so on. The Code Page is the name of the Encoding, for example the above encoding has the code pages: 1200, 1201, 65000, 65001, and so on. So we can see that a single character set (UNICODE) have multiple Code Points, since it has multiple Character Encoding.

    In short here are the main terms:

    > "Character" is a logical entity used by people.
    > “Encoded Value” is set of zero and one which is what Computers use 
    > “Code Points” are simply a range of numbers, which used to map between characters and the encoded values.
    > "Character set" is a group of characters, mapped to specific range of code points. Same character SET name can be used for different groups of characters like in the case of Extended ASCII, but these will be mapped to the same range of Code Points.
    > “Character Encoding” is the algorithm we use to encode the code points into Encoded values
    > "Code page” is a number/name which represent the “Character Encoding”

    The terms are a bit different from one documentation to the other (unfortunately) which lead to a lot of confusion.

    As Erland said, different vendors invented their own Encoding for the character set "extended ASCII". The above image shows Microsoft code points but there are other code points for IBM Encoding for example. For example Code page 862 is a code point for IBM encoding for Hebrew, which by the way it is used under DOS for Hebrew.

    ASCII standard character set was initially composed of 128 characters (7-bit code). The first 32 characters are control characters but as Dan said they are part of ASCII code points range which is 0-127.

    SQL Server uses a parameter name COLLATE to represent multiple properties which configure the rules of how to use the data. One of these properties together with the data type configures the Code Page. For example using COLLATE with data type NVARCHAR will always be UTF16/UCS2 encoding, but with data type VARCHAR different collate can have different Code Points. In SQL 2019 we simply got more COLLATE which support the encoding UTF8.

    In order to get the different properties of each COLLATE we can use the function COLLATIONPROPERTY

    In order to get the COLLATE we have, we can use the function sys.fn_helpcollations()

    Putting these two together the following query will return some of the properties of the COLLATE in the system:

    SELECT col.[name],
           COLLATIONPROPERTY(col.[name], 'CodePage') AS [CodePage],
           COLLATIONPROPERTY(col.[name], 'Version') AS [Version],
           COLLATIONPROPERTY(col.[name], 'LCID') AS [Language Codes],
           COLLATIONPROPERTY(col.[name], 'ComparisonStyle') AS [Comparison Style]
    FROM   sys.fn_helpcollations() col

    If you execute this query you can get the code points for the encoding for different COLLATE if we use data types like CHAR and VARCHAR.

    As Erland said code point 1252 is used to represent Latin
    Quote from Wikipedia: "Windows-1252 or CP-1252 (code page 1252) is a single-byte character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and some other Western languages".

    * I think Dan explain the rest well, so no need to repeat it...


    signature   Ronen Ariely
     [Personal Site]    [Blog]    [Facebook]    [Linkedin]



    sobota 19. října 2019 8:18
    Moderátor
  • Erland, Dan, and Ronen - thank you for your replies. This is exactly what I was looking  for, to understand the reason behind the "odd" result I got with ASCII('á').

    I guess the function name "ASCII" is a bit misleading since it only returns the ASCII of the char for those whose value is in the 0 to 127 range. For those chars above 127, it will return the value as per the code page being used. I understand that the "extended ASCII" set is an IBM invention for their IBM PC and that it has been adopted as an unofficial "standard"

    Erland, not pedantic at all! There are so many details that, like the telephone game, can result in different interpretations. You'll get no argument from me, as I prefer not to delve too deeply into the fine details; however, I suspect that you will get an argument from Wiki, since the definition presented is as follows:

    "...ASCII encodes 128 specified characters..."

    (https://en.wikipedia.org/wiki/ASCII#Overview)

    Again, thank you all for your assistance! Saga


    You can't take the sky from me


    • Upravený SagaV9 pondělí 21. října 2019 15:05
    pondělí 21. října 2019 15:04
  • Erland, Dan, and Ronen - thank you for your replies. This is exactly what I was looking  for, to understand the reason behind the "odd" result I got with ASCII('á').

    I guess the function name "ASCII" is a bit misleading since it only returns the ASCII of the char for those whose value is in the 0 to 127 range. For those chars above 127, it will return the value as per the code page being used. I understand that the "extended ASCII" set is an IBM invention for their IBM PC and that it has been adopted as an unofficial "standard"

    Erland, not pedantic at all! There are so many details that, like the telephone game, can result in different interpretations. You'll get no argument from me, as I prefer not to delve too deeply into the fine details; however, I suspect that you will get an argument from Wiki, since the definition presented is as follows:

    "...ASCII encodes 128 specified characters..."

    (https://en.wikipedia.org/wiki/ASCII#Overview)

    Again, thank you all for your assistance! Saga


    You can't take the sky from me

    Hi Saga,

    First, thanks for the thanks 👍✔ I appreciate it.
    Unfortunately most people here forget to say thanks and I hate it.

    Back to the discussion...

    I think that you still have a confusion regarding the function ASCII.
    This function is NOT "only returns the ASCII of the char for those whose value is in the 0 to 127 range".

    The function ASCII is actually Extended ASCII function, but for short name Microsoft choose to use the name ASCII. Basically in SQL Server when we speak about ASCII then we probably mean Extended ASCII.

    The function ASCII returns the extended ASCII Code Point between 0 and 255. The encoded values of Code Points in the range 0-127 for Extended ASCII are the same as fro ASCII.

    Another function which we must to mention is the opposite direction function CHAR which get Code Point as input and the returns the character value in Extended ASCII

    Both functions returns the result according to the COLLATE you use.

    I uploaded a new post with include the entire explanation and sample code to demonstrate the behavior of the function ASCII.
    http://ariely.info/Blog/tabid/83/EntryId/244/SQL-Server-2019-the-functions-CHAR-and-ASCII-return-unexpected-result.aspx

    I HIGHLY recommend you to go over over the post from start to end, and if there is still something which is not clear then please ask.

    Have a great day😀


    signature   Ronen Ariely
     [Personal Site]    [Blog]    [Facebook]    [Linkedin]

    úterý 22. října 2019 22:09
    Moderátor