none
DATALENGTH is not counting no-breakspace (0xA0) characters in SQL Server RRS feed

  • Question

  • DATALENGTH is not counting no-breakspace (0xA0) characters while returning the number of Bytes in a string in SQL Server. In PostgreSQL, OCTET_LENGTH which count the number of bytes in a specified string includes 0xA0 as well while returning result set. Is this a bug in SQL Server?

    Example - 

    Both of these below return 9 as output - 

    SELECT DATALENGTH('Blah Blah'); --contains no-breakspace

    You can look up the no-break space character on the above select via => https://unicodelookup.com/


    SELECT DATALENGTH('Blah Blah');

    Is there any other function available in SQL server that would count this in as well ? LEN did not help. 

     


    ramya


    Friday, November 22, 2019 9:16 PM

Answers

  • I am migrating a SQL server database to PostgreSQL and trying to validate data migrated. When the SQL Server DATALENGTH function returns 9 bytes, the PostgreSQL function OCTET_LENGTH returns 10 bytes 

    The string is UTF-8 encoded in PostgreSQL. The resultant number of bytes is 10 because 8 bytes are needed for the 8 ASCII characters in the 0-127 code point range plus 2 bytes for the no-breakspace character.

    Only 9 bytes are needed with SQL Server varchar because non-ASCII characters in the 128-255 code point range are mapped according to the collation code page. So only 1 byte is needed to store the no-breakspace character without Unicode encoding.


    Dan Guzman, Data Platform MVP, http://www.dbdelta.com

    Saturday, November 23, 2019 2:54 AM

All replies

  • I do not see the difference between both SELECT statements.

    A Fan of SSIS, SSRS and SSAS

    Friday, November 22, 2019 9:28 PM
  • You can lookup the difference with - https://unicodelookup.com/

    ramya

    Friday, November 22, 2019 9:31 PM
  • -- 9
    SELECT DATALENGTH('Blah Blah');
    -- 11
    SELECT DATALENGTH('Blah Blah
    ');

    A Fan of SSIS, SSRS and SSAS

    Friday, November 22, 2019 9:34 PM
  • May I know which version of SQL Server you're trying? I have attached a screenshot of what I see in 13.x


    ramya

    Friday, November 22, 2019 9:40 PM
  • Try this:

    SELECT DATALENGTH('Blah Blah
    ');


    A Fan of SSIS, SSRS and SSAS

    Friday, November 22, 2019 9:44 PM
  • oh, the break is intentional :) got it! but my concern hers is I couldn't detect no-break space character

    ramya

    Friday, November 22, 2019 9:57 PM
  • You may check if it exists using

    DECLARE @s varchar(20) = 'Blah Blah
    ';
    SELECT CHARINDEX(CHAR(13), @s);


    A Fan of SSIS, SSRS and SSAS

    Friday, November 22, 2019 10:02 PM
  • DATALENGTH is not counting no-breakspace (0xA0) characters while returning the number of Bytes in a string in SQL Server. In PostgreSQL, OCTET_LENGTH which count the number of bytes in a specified string includes 0xA0 as well while returning result set. Is this a bug in SQL Server?

    Example - 

    Both of these below return 9 as output - 

    SELECT DATALENGTH('Blah Blah'); --contains no-breakspace

    So what you do expect this to return? I can't but see that there are nine bytes in that string.

    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Friday, November 22, 2019 11:09 PM
  • Thank you for your response Erland. Let me explain what I am trying to achieve here. I am migrating a SQL server database to PostgreSQL and trying to validate data migrated. When the SQL Server DATALENGTH function returns 9 bytes, the PostgreSQL function OCTET_LENGTH returns 10 bytes - 

    postgres=> SELECT octet_length('blah blah');

     octet_length 

    --------------

               10

    (1 row)

    CHAR_LENGTH however 9 returns but this function just counts the number of character not returns the number of bytes. 

    postgres=> SELECT char_length('blah blah');

     char_length 

    -------------

               9

    (1 row)

    postgres=> 

    And so I am trying to figure out why there is a mismatch with result set of the data returned from DATALENGTH and OCTET_LENGTH functions.

    <style type="text/css">p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #000000; background-color: #ffffff} p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #000000; background-color: #ffffff; min-height: 13.0px} span.s1 {font-variant-ligatures: no-common-ligatures} </style>

    ramya


    Saturday, November 23, 2019 2:01 AM
  • I am migrating a SQL server database to PostgreSQL and trying to validate data migrated. When the SQL Server DATALENGTH function returns 9 bytes, the PostgreSQL function OCTET_LENGTH returns 10 bytes 

    The string is UTF-8 encoded in PostgreSQL. The resultant number of bytes is 10 because 8 bytes are needed for the 8 ASCII characters in the 0-127 code point range plus 2 bytes for the no-breakspace character.

    Only 9 bytes are needed with SQL Server varchar because non-ASCII characters in the 128-255 code point range are mapped according to the collation code page. So only 1 byte is needed to store the no-breakspace character without Unicode encoding.


    Dan Guzman, Data Platform MVP, http://www.dbdelta.com

    Saturday, November 23, 2019 2:54 AM
  • To add to Dan's answer: you will get back 10, if you do this exercise on SQL 2019 in a database with a UTF-8 collation. It is also possible that you can get 10 back in a database with a collation for a far-Eastern language, which uses multi-byte sequences to encode letters outside the ASCII range.

    You can also see the actual encoding by casting the value to binary:

    SELECT convert(varbinary(20), 'Blah Blah')


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Saturday, November 23, 2019 10:53 AM