Answered by:
DATALENGTH is not counting no-breakspace (0xA0) characters in SQL Server

Question
-
DATALENGTH is not counting no-breakspace (0xA0) characters while returning the number of Bytes in a string in SQL Server. In PostgreSQL, OCTET_LENGTH which count the number of bytes in a specified string includes 0xA0 as well while returning result set. Is this a bug in SQL Server?
Example -
Both of these below return 9 as output -
SELECT DATALENGTH('Blah Blah'); --contains no-breakspace
You can look up the no-break space character on the above select via => https://unicodelookup.com/
SELECT DATALENGTH('Blah Blah');Is there any other function available in SQL server that would count this in as well ? LEN did not help.
ramya
- Edited by Ramya Vijayaraghavan Friday, November 22, 2019 9:33 PM
Answers
-
I am migrating a SQL server database to PostgreSQL and trying to validate data migrated. When the SQL Server DATALENGTH function returns 9 bytes, the PostgreSQL function OCTET_LENGTH returns 10 bytes
The string is UTF-8 encoded in PostgreSQL. The resultant number of bytes is 10 because 8 bytes are needed for the 8 ASCII characters in the 0-127 code point range plus 2 bytes for the no-breakspace character.
Only 9 bytes are needed with SQL Server varchar because non-ASCII characters in the 128-255 code point range are mapped according to the collation code page. So only 1 byte is needed to store the no-breakspace character without Unicode encoding.
Dan Guzman, Data Platform MVP, http://www.dbdelta.com
- Marked as answer by Ramya Vijayaraghavan Sunday, November 24, 2019 4:29 AM
All replies
-
-
-
-
-
-
-
-
DATALENGTH is not counting no-breakspace (0xA0) characters while returning the number of Bytes in a string in SQL Server. In PostgreSQL, OCTET_LENGTH which count the number of bytes in a specified string includes 0xA0 as well while returning result set. Is this a bug in SQL Server?
Example -
Both of these below return 9 as output -
SELECT DATALENGTH('Blah Blah'); --contains no-breakspace
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
-
Thank you for your response Erland. Let me explain what I am trying to achieve here. I am migrating a SQL server database to PostgreSQL and trying to validate data migrated. When the SQL Server DATALENGTH function returns 9 bytes, the PostgreSQL function OCTET_LENGTH returns 10 bytes -
postgres=> SELECT octet_length('blah blah');
octet_length
--------------
10
(1 row)
CHAR_LENGTH however 9 returns but this function just counts the number of character not returns the number of bytes.
postgres=> SELECT char_length('blah blah');
char_length
-------------
9
(1 row)
postgres=>
And so I am trying to figure out why there is a mismatch with result set of the data returned from DATALENGTH and OCTET_LENGTH functions.
<style type="text/css">p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #000000; background-color: #ffffff} p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #000000; background-color: #ffffff; min-height: 13.0px} span.s1 {font-variant-ligatures: no-common-ligatures} </style>
ramya
- Edited by Ramya Vijayaraghavan Saturday, November 23, 2019 2:02 AM
-
I am migrating a SQL server database to PostgreSQL and trying to validate data migrated. When the SQL Server DATALENGTH function returns 9 bytes, the PostgreSQL function OCTET_LENGTH returns 10 bytes
The string is UTF-8 encoded in PostgreSQL. The resultant number of bytes is 10 because 8 bytes are needed for the 8 ASCII characters in the 0-127 code point range plus 2 bytes for the no-breakspace character.
Only 9 bytes are needed with SQL Server varchar because non-ASCII characters in the 128-255 code point range are mapped according to the collation code page. So only 1 byte is needed to store the no-breakspace character without Unicode encoding.
Dan Guzman, Data Platform MVP, http://www.dbdelta.com
- Marked as answer by Ramya Vijayaraghavan Sunday, November 24, 2019 4:29 AM
-
To add to Dan's answer: you will get back 10, if you do this exercise on SQL 2019 in a database with a UTF-8 collation. It is also possible that you can get 10 back in a database with a collation for a far-Eastern language, which uses multi-byte sequences to encode letters outside the ASCII range.
You can also see the actual encoding by casting the value to binary:
SELECT convert(varbinary(20), 'Blah Blah')
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se