Can I insert UTF-8 XML into SQL Server?

  • Question

  • Hi,

    Based on forum posts, it seems that only SQL Server 2019 starts to support UTF-8. Then, for earlier versions of SQL Server, can I insert UTF-8 XML into SQL Server?

    Sample:

    I have a database called XMLDB, with a table called XMLTable, with a column called XMLCol, whose type is XML.

    Can I use the following two methods to insert XML data with UTF-8 characters:

    Method 1. 

    USE XMLDB;

    INSERT INTO XMLTable VALUES ('<?xml version="1.0" encoding="utf-8" ?>
    <sample1>
    <company>测试</company>
    <city>Wheaton</city>
    <year-founded>1998</year-founded>
    <industry>software</industry>
    </sample1>')

    Method 2. 

    INSERT INTO XMLTable
    SELECT * FROM OPENROWSET(BULK N'E:\Temp\Test.xml', SINGLE_BLOB) AS x;

    where Test.xml is an XML file containing some UTF-8 characters, with the same contents as in Method 1.

    If yes, then need I always specify the encoding in the XML header explicitly, like this:

    <?xml version="1.0" encoding="utf-8" ?>

    Wednesday, July 10, 2019 12:03 AM

All replies

  • Hi tempc,

    The upcoming SQL Server 2019 support for UTF-8 refers to regular columns with data types VARCHAR(), CHAR(), and the like. I am not sure about XML data type.

    You should be able to insert your XML with non-English characters into a column of the XML data type in SQL Server 2005 and later.

    I would guess that you have DB collation compatible with Chinese ?!

    I am guessing that the company name element value is in Chinese.

    You would need to remove XML prolog with encoding="UTF-8" from the XML. This is because SQL Server currently uses UTF-16 internally for the XML data type. Only a prolog with encoding="UTF-16" is accepted for Chinese characters.

    SQL:

    DECLARE @tbl TABLE (ID int IDENTITY(1,1), XmlColumn XML);
    
    DECLARE @xml XML = N'<sample1>
    	<company>测试</company>
    	<city>Wheaton</city>
    	<year-founded>1998</year-founded>
    	<industry>software</industry>
    </sample1>';
    
    INSERT INTO @tbl (XmlColumn)
    VALUES (@xml);
    
    SELECT *, CAST(XmlColumn AS NVARCHAR(MAX))
    FROM @tbl;

    SQL to deal with the file, which is encoded as UTF-16:

    DECLARE @xml XML;
    
    SELECT @xml = XmlDoc   
    FROM OPENROWSET (BULK N'e:\Temp\Chinese.xml', SINGLE_BLOB, CODEPAGE='65001') AS Tab(XmlDoc);
    
    -- shred into a rectangular format
    ;WITH rs AS
    (
       SELECT col.value('(r)[1]','NVARCHAR(100)') AS title
       FROM @xml.nodes('/root') AS tab(col)
    )
    SELECT * FROM rs;
    
    -- load into a table
    DECLARE @tbl TABLE (ID int IDENTITY(1,1),XmlColumn xml);
    
    INSERT INTO @tbl (XmlColumn)  
    SELECT * FROM OPENROWSET(BULK N'e:\Temp\Chinese.xml', SINGLE_BLOB, CODEPAGE='65001') AS x;
    
    SELECT * FROM @tbl;


    XML file:

    <?xml version="1.0" encoding="UTF-16"?>
    <root>
    	<r>测试</r>
    </root>
    • Edited by Yitzhak Khabinsky Thursday, July 11, 2019 4:55 AM
    • Marked as answer by tempc Saturday, July 20, 2019 11:32 AM
    Wednesday, July 10, 2019 12:36 AM
  • Hi, Yitzhak,

    Yes, my collation is Chinese.

    As for encoding="UTF-8": as I understand it, SQL Server requires the encoding information for the XML data or file. When importing XML data, it interprets the data based on the "encoding" information and then converts it into its internal storage format, which is UTF-16.

    I checked the documentation at https://docs.microsoft.com/en-us/sql/relational-databases/xml/load-xml-data?view=sql-server-2017 and it says: "If no such server code page exists, you have to add an explicit XML declaration with the correct encoding."
    • Edited by tempc Wednesday, July 10, 2019 2:55 AM
    Wednesday, July 10, 2019 2:52 AM
  • Hi tempc,

    Just run both code samples I provided to see how it works.

    By the way, for Chinese text the XML file needs both a BOM and UTF-16 encoding, not UTF-8.

    Wednesday, July 10, 2019 3:23 AM
  • Good day all 😃

    Here are my 2 cents.

    >>  it seems that only SQL Server 2019 starts to support UTF-8.

    Not accurate. SQL Server 2019 is the first version that supports storing text in UTF-8 encoding in a simple text type. This is done by storing the text in simple data types like CHAR and VARCHAR under one of the new UTF-8 collations.

    Before version 2019 there were several levels of UTF-8 support, but to store Unicode text we could not use the simple types; we had to use the national character types NCHAR/NVARCHAR, and the text was stored in UTF-16 (from 2012; UCS-2 in older versions).

    * We could always store the content as binary rather than as text.
    * Support for exporting and importing flat files saved in UTF-8 encoding, using the bcp utility and the BULK INSERT Transact-SQL command, was added in SQL Server 2014 SP2. This means that the external file is in UTF-8, but during the import the data is implicitly converted to UTF-16 if we store it in the NCHAR, NVARCHAR, or XML data type.
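
    As a small illustration of the first bullet, UTF-8 bytes held as binary can be handed to the XML parser directly. This is only a sketch: it assumes that casting VARBINARY to XML makes the parser honor the encoding named in the XML declaration, as the documentation on creating XML instances describes. The variable names and the sample document are made up for illustration.

    ```sql
    -- Build the UTF-8 byte sequence of:
    --   <?xml version="1.0" encoding="UTF-8"?><r>测试</r>
    -- The ASCII parts are cast from VARCHAR literals; 0xE6B58BE8AF95 is the
    -- UTF-8 encoding of the two characters U+6D4B and U+8BD5.
    DECLARE @utf8 VARBINARY(MAX) =
          CAST('<?xml version="1.0" encoding="UTF-8"?><r>' AS VARBINARY(MAX))
        + 0xE6B58BE8AF95
        + CAST('</r>' AS VARBINARY(MAX));

    -- The input is binary, so the parser relies on the declaration (UTF-8 here).
    DECLARE @doc XML = CAST(@utf8 AS XML);

    -- Read the element back as NVARCHAR to check that the characters survived.
    SELECT @doc.value('(/r)[1]', 'NVARCHAR(10)') AS Decoded;
    ```

    If the assumption holds, the declaration and the bytes agree, so there is nothing for the parser to reject, and the value is stored as UTF-16 like any other XML value.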

    >> for early versions of SQL Server, can I insert UTF-8 XML into SQL Server.

    By the way, I usually answer YES to most "can we do" questions! The question should be how, not whether we can.

    Obviously we can, but before version 2014 SP2 we first had to EXPLICITLY CONVERT the file to UTF-16 and then import it, while from 2014 SP2 we can import a UTF-8 file directly into the SQL Server NCHAR/NVARCHAR/XML data types. In this case there is an IMPLICIT CONVERT to UTF-16.

    Notice again that the data in the database in this case is in UTF-16 even though the source is in UTF-8.

    >> Sample...<?xml version="1.0" encoding="utf-8" ?>

    Method 1 is really problematic, since INSERTing a string value directly into a table is totally different from IMPORTING DATA using a utility like bcp or using BULK INSERT. Moreover, you did not provide the DDL, so we have no idea what the column's type is.

    If you INSERT the data into NCHAR/NVARCHAR, then the XML declaration is not relevant at all. In this case you can write encoding="RonenAriely" in the text and it will not change the encoding of the data. It is simple text that has nothing to do with the encoding of the stored data, and the text is inserted as is.

    In both samples in this case, the data is stored in UTF-16 encoding (or UCS-2 in older versions). This is true in SQL Server 2019 as well.
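
    A minimal sketch of the NVARCHAR case described above; the variable name and the sample text are made up for illustration. Because NVARCHAR does not parse its content, even a nonsense encoding name is kept as plain text.

    ```sql
    -- The declaration is just characters in the string; nothing validates it.
    DECLARE @text NVARCHAR(200) =
        N'<?xml version="1.0" encoding="RonenAriely"?><HebName>א</HebName>';

    SELECT @text             AS StoredText,   -- the declaration is still there
           DATALENGTH(@text) AS StoredBytes;  -- size in bytes of the UTF-16 value
    ```

    Assigning the same literal to an XML variable instead would make the server parse the declaration, which is where the encoding rules discussed in this thread start to matter.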

    If you are using a data type like CHAR/VARCHAR with a collation that does not support UTF-8 (UTF-8 collations exist only from 2019), then the data is stored in the extended-ASCII code page of the collation you use.

    What is new in SQL Server 2019 is that we have collations that support UTF-8. If you use one of them, then the data in CHAR/VARCHAR is stored using UTF-8.
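
    To make the difference visible, here is a sketch that stores the same two characters under a UTF-8 collation and in NVARCHAR. It assumes SQL Server 2019 or later and uses Latin1_General_100_CI_AS_SC_UTF8 as one example of a UTF-8 collation; the table variable and column names are made up.

    ```sql
    -- Requires SQL Server 2019 or later; UTF-8 collations do not exist before that.
    DECLARE @t TABLE (
        Utf8Text  VARCHAR(20) COLLATE Latin1_General_100_CI_AS_SC_UTF8,
        Utf16Text NVARCHAR(20)
    );

    -- The N prefix keeps the literal in Unicode until it reaches each column.
    INSERT INTO @t (Utf8Text, Utf16Text)
    VALUES (N'测试', N'测试');

    -- Each of these two characters takes 3 bytes in UTF-8 and 2 bytes in UTF-16.
    SELECT Utf8Text,
           DATALENGTH(Utf8Text)  AS Utf8Bytes,
           Utf16Text,
           DATALENGTH(Utf16Text) AS Utf16Bytes
    FROM @t;
    ```

    I would expect 6 bytes for the VARCHAR column and 4 bytes for the NVARCHAR column, which shows that the two columns hold the same text in different encodings.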

    If you INSERT the data into the XML data type, then it is a much more complex story, since the XML data type has more limitations and rules. In this case the server actually parses the text and checks its content on any manipulation of the data. The server confirms that the value fits Microsoft SQL Server's rules for this data type, and it changes the data! The stored data is NOT the same as the INSERTED text.

    For example, if the content of the text is not valid XML, such as '<open><x></x>', then you will get an error that gives you some information about the issue. In this sample you will get "unexpected end of input", since we have an opening node <open> without a closing tag.
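
    The parsing step can be observed with a small sketch like the following. It assumes that the XML parsing error is raised at run time and can be caught by TRY...CATCH; the variable names are made up.

    ```sql
    DECLARE @text NVARCHAR(100) = N'<open><x></x>';  -- <open> is never closed

    BEGIN TRY
        -- The conversion makes the server parse the text as XML.
        DECLARE @doc XML = CAST(@text AS XML);
        SELECT @doc AS Parsed;
    END TRY
    BEGIN CATCH
        -- For malformed input we land here instead of getting a stored value.
        SELECT ERROR_NUMBER() AS ErrorNumber, ERROR_MESSAGE() AS ErrorMessage;
    END CATCH;
    ```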

    If you IMPORT data from a file, then the XML declaration must name the same encoding as the file, or have no encoding attribute at all. If the file is saved as UTF-8 then we need to use encoding="utf-8", and if the file is saved as UTF-16 then we need to use encoding="utf-16".
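
    For the file case, a minimal sketch based on Method 2 from the question. It assumes that E:\Temp\Test.xml exists, that the SQL Server service account can read it, and that its XML declaration (or BOM) matches the encoding the file was actually saved in.

    ```sql
    -- SINGLE_BLOB returns the whole file as one VARBINARY(MAX) value
    -- in a column named BulkColumn.
    DECLARE @doc XML;

    -- The bytes are passed to the XML parser unconverted, so the parser
    -- decides the encoding from the BOM or the XML declaration in the file.
    SELECT @doc = CAST(x.BulkColumn AS XML)
    FROM OPENROWSET(BULK N'E:\Temp\Test.xml', SINGLE_BLOB) AS x;

    SELECT @doc AS ImportedXml;
    ```

    Reading the file as binary avoids an intermediate character conversion, which is the reason the documentation recommends SINGLE_BLOB over SINGLE_CLOB and SINGLE_NCLOB for XML.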

    When you use INSERT as in your Method 1, you actually use CHAR/VARCHAR as the input (since in your sample you did not use the national character prefix "N" before the text). In this case the server implicitly converts the data to UTF-16 and changes the data according to the internal rules I mention below.

    In this case, if the text includes only ASCII characters and you use encoding="utf-8", then the server inserts the text without a problem:

    DECLARE @_XML XML 
    SET @_XML = 
    '<?xml version = "1.0" encoding = "UTF-8" standalone = "no" ?>
    <NodeA>
       <EnName>Ronen Ariely</EnName>
    </NodeA>'
    SELECT @_XML
    GO -- ok

    But if the text includes NON-ASCII characters, then you will get an error about an illegal XML character:

    DECLARE @_XML XML 
    SET @_XML = 
    '<?xml version = "1.0" encoding = "UTF-8" standalone = "no" ?>
    <NodeA>
       <EnName>Ronen Ariely</EnName>
       <HebName>רונן אריאלי</HebName>
    </NodeA>'
    SELECT @_XML
    GO -- Error: illegal XML character

    If you use encoding="utf-8" together with the national character prefix "N", then you will get the error "unable to switch the encoding", since the input text is now in UTF-16 while the server, which parses the text for the XML data type, expects text in UTF-8.

    DECLARE @_XML XML 
    SET @_XML = 
    N'<?xml version = "1.0" encoding = "UTF-8" standalone = "no" ?>
    <NodeA>
       <EnName>Ronen Ariely</EnName>
       <HebName>רונן אריאלי</HebName>
    </NodeA>'
    SELECT @_XML
    GO -- Error: unable to switch the encoding

    We get the same issue if we do the opposite: using a non-Unicode input (no "N" before the text) while the XML declaration inside the content says encoding="UTF-16".

    In short, THE INPUT MUST FIT THE CONTENT IF YOU ARE USING THE XML DATA TYPE.

    The right way to INSERT a value directly into the XML data type is to use the national character prefix "N" and, inside the XML content, to tell the server that you are using UTF-16, since the input is in UTF-16:

    DECLARE @_XML XML 
    SET @_XML = 
    N'<?xml version = "1.0" encoding = "UTF-16" standalone = "no" ?>
    <NodeA>
       <EnName>Ronen Ariely</EnName>
       <HebName>רונן אריאלי</HebName>
    </NodeA>'
    SELECT @_XML
    GO -- OK

     

    >> If yes, then need I always specify the encoding in the XML header explicitly, like this:  <?xml version="1.0" encoding="utf-8" ?>

    I assume that you are speaking about the XML data type.

    The answer is no, but it is recommended.

    If you IMPORT the data, then the server can get the input encoding from your command. In any case the server will try to implicitly convert to UTF-16. In SQL Server you do not have to use the XML declaration at all (not only the encoding), and in fact the server removes this part when it stores the data.

    DECLARE @_XML XML 
    SET @_XML = 
    N'<NodeA>
       <EnName>Ronen Ariely</EnName>
       <HebName>רונן אריאלי</HebName>
    </NodeA>'
    SELECT @_XML
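
    The removal of the declaration can be checked with a sketch like this one, which serializes the stored value back to text. The variable name and the sample document are made up for illustration.

    ```sql
    -- The input carries an explicit declaration that matches the N'...' literal.
    DECLARE @doc XML =
        N'<?xml version="1.0" encoding="UTF-16"?><NodeA><HebName>א</HebName></NodeA>';

    -- Serialize the stored value back to text; the <?xml ...?> declaration
    -- is not preserved by the XML data type, so it does not come back.
    SELECT CAST(@doc AS NVARCHAR(MAX)) AS StoredText;
    ```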

    >>  I am not sure about XML data type.

    In general (not SQL Server specifically), when we use XML we usually need to support Unicode, but we can use different encodings.

    >> I would guess that you have DB collation compatible with Chinese ?!

    This is not relevant for the INSERT of the data.

    The XML data type uses Unicode UTF-16 encoding for any collation you choose. The collation has other effects, such as on sorting the data, comparing values, and so on. But in the NCHAR, NVARCHAR, and XML data types the stored data is always in UTF-16 (or UCS-2 in older versions).

    You can confirm this by using the undocumented commands DBCC IND and DBCC PAGE.
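
    A lighter-weight check than the DBCC commands is to serialize an XML value to binary. This is a sketch under the assumption, taken from the documentation on XML serialization, that a cast to VARBINARY produces UTF-16 with a byte order mark in front. It shows the serialized form, not the internal storage pages that DBCC PAGE would show.

    ```sql
    DECLARE @doc XML = N'<r>א</r>';

    -- Serialize the XML value to binary and inspect the bytes.
    SELECT CAST(@doc AS VARBINARY(MAX)) AS SerializedBytes;
    ```

    If the assumption holds, the value should start with the little-endian UTF-16 byte order mark (0xFFFE) and each ASCII character should occupy two bytes.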

    >> You would need to remove XML prolog with encoding="UTF-8" from the XML.

    Check what I wrote above regarding the different cases. If this is an XML file that you import, then the XML declaration is probably needed in the file and you do not need to remove it. With the direct INSERT you simply made a mistake: your INPUT was not in UTF-8, so the XML declaration stating that it is UTF-8 was wrong and did not fit the real encoding of the data.

    >> SQL to deal with the file, encoding utf-16:

    Actually, as I explained, SQL Server from 2014 SP2 can deal with files in UTF-8 as well.

     - - - - - - - - - - - - - - -  

    Summary!

    • The XML data type is stored in UTF-16 encoding (this is not related to the server version, exactly as NVARCHAR is UTF-16)
    • SQL Server 2014 SP2 and above can import files that are in UTF-8 encoding
    • IMPORT from a file and INSERT from a value do not behave the same
    • If we use the XML data type, then the encoding of the input source must fit the encoding parameter in the XML declaration
    • The XML data type in SQL Server does not store the data "as is" in the input text! The stored data will not include the XML declaration (the server removes it during the insert), the server might change the order of attributes, and the server encodes some of the characters and will not store them as they are in the text.


    Ronen Ariely

    • Marked as answer by tempc Saturday, July 20, 2019 11:32 AM
    Thursday, July 11, 2019 2:30 AM
  • Hi Ronen,

    Thanks a lot for all the in depth clarifications.

    Please find below SQL, run on SQL Server 2017, that shows whether a variable of the XML data type accepts or rejects XML under the following conditions:

    • XML prolog is omitted.
      Accepts both English as well as non-English characters
    • XML prolog specifies encoding="UTF-16".
      Accepts both English as well as non-English character
    • XML prolog specifies encoding="UTF-8", both English or non-English characters.
      Erroring out:
      Msg 9402, Level 16, State 1, Line …
      XML parsing: line 1, character 38, unable to switch the encoding

    I would expect the last case, the @xml4 variable holding XML with the prolog encoding="UTF-8" and English characters only, to work on SQL Server 2019. It should be able to convert a UTF-8 string into UTF-16. Today, if I remove the N prefix, it works.

    DECLARE @tbl TABLE (ID int IDENTITY(1,1), XmlColumn XML);
    
    -- no prolog, non-English characters
    DECLARE @xml XML = N'<sample1>
    	<company>测试</company>
    	<city>Wheaton</city>
    	<year-founded>1998</year-founded>
    	<industry>software</industry>
    </sample1>';
    
    INSERT INTO @tbl (XmlColumn)
    VALUES (@xml);
    
    -- prolog with encoding="UTF-16", non-English characters
    DECLARE @xml2 XML = N'<?xml version="1.0" encoding="UTF-16"?>
    <sample1>
    	<company>测试</company>
    	<city>Wheaton</city>
    	<year-founded>1998</year-founded>
    	<industry>software</industry>
    </sample1>';
    
    INSERT INTO @tbl (XmlColumn)
    VALUES (@xml2);
    
    /*
    -- prolog with encoding="UTF-8", non-English characters
    -- Msg 9402, Level 16, State 1, Line 27
    -- XML parsing: line 1, character 38, unable to switch the encoding
    DECLARE @xml3 XML = N'<?xml version="1.0" encoding="UTF-8"?>
    <sample1>
    	<!--<company>测试</company>-->
    	<city>Wheaton</city>
    	<year-founded>1998</year-founded>
    	<industry>software</industry>
    </sample1>';
    
    INSERT INTO @tbl (XmlColumn)
    VALUES (@xml3);
    */
    
    -- prolog with encoding="UTF-8", English characters only
    -- Msg 9402, Level 16, State 1, Line 44
    -- XML parsing: line 1, character 38, unable to switch the encoding
    DECLARE @xml4 XML = N'<?xml version="1.0" encoding="UTF-8"?>
    <sample1>
    	<company>English</company>
    	<city>Wheaton</city>
    	<year-founded>1998</year-founded>
    	<industry>software</industry>
    </sample1>';
    
    INSERT INTO @tbl (XmlColumn)
    VALUES (@xml4);
    
    SELECT *, CAST(XmlColumn AS NVARCHAR(MAX))
    FROM @tbl;



    Thursday, July 11, 2019 4:35 AM
  • Hi tempc,

    What's the latest on your end?

    Were you able to make progress with the XML with Chinese characters?

    I hope that Ronen Ariely and I provided enough information on the subject.

    Thursday, July 11, 2019 5:05 AM
  • Good day Yitzhak,

    What you do is exactly what I explained. If this is not clear, please re-read it, and I can try to clarify further, but check this:

    Your first option (under "-- no prolog, non-English characters") works well, since you use the national character prefix "N" and the content does not contain any XML declaration!

    Therefore, there is no contradiction.

    Your second option works well, since you use the national character prefix "N" and the content does contain an XML declaration with encoding="UTF-16".

    Again, national character data types are encoded in UTF-16, which means there is no contradiction. By the way, this is the option I mentioned as preferred for a direct INSERT or a variable: an explicit XML declaration with encoding="UTF-16" together with a national character literal.

    I do not see exactly what you added to what I said above 😃
    These are the two options that I mentioned.

     -------------------------------------

    Note! What you wrote before the samples contains some inaccuracies. Please re-read what I wrote before, since it is all there!

    >> XML prolog is omitted. Accepts both English as well as non-English characters

    NOT ALWAYS! If you omit the XML declaration but do not use the national character prefix "N", and you do not use one of the new 2019 UTF-8 collations, then your input is not Unicode. Check the examples above and below, and remember (1) that SQL Server implicitly converts the data to UTF-16 when you insert it into the XML data type, but it cannot restore characters that were lost when the text was not Unicode, and (2) that the result can be affected by your current collation and the characters you use in the input.
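
    The loss described in point (1) happens before the XML parser ever sees the text, which a sketch like this can show. The outcome depends on the collation of the current database, so treat it as something to try rather than a fixed result.

    ```sql
    -- Without the N prefix the literal is VARCHAR and is converted to the code page
    -- of the database collation; characters outside that code page become '?'.
    SELECT '测试'  AS WithoutN,
           N'测试' AS WithN;

    -- The current database collation decides what the VARCHAR literal can hold.
    SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS DatabaseCollation;
    ```

    Under a Chinese collation or a 2019 UTF-8 collation both columns may look the same, while under a Latin collation the first column will probably show question marks.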

    >> XML prolog specifies encoding="UTF-16". Accepts both English as well as non-English character

    Same as before!

    If you use encoding="UTF-16" and a national character literal, then it works. This is exactly your second sample, and it is the option I described as preferred.

    But if you do not use a national character literal, then this will fail! This is the same in any version, and it is not limited to non-English characters.

    SELECT @@VERSION
    -- Microsoft SQL Server 2017 (RTM) - 14.0.1000.169 (X64)
    
    DECLARE @_XML XML 
    SET @_XML = '<?xml version = "1.0" encoding = "UTF-16" standalone = "no" ?>
    <contact-info>
       <name>Tanmay Patil</name>
       <company>TutorialsPoint</company>
       <phone>(011) 123-4567</phone>
    </contact-info>'
    SELECT @_XML
    GO -- Error! XML parsing: line 1, character 62, unable to switch the encoding

    The explanation is simple! There is a contradiction between the input, which is not Unicode (assuming I am not using a 2019 UTF-8 collation), and the XML declaration, which says that this is UTF-16.

    Check this regarding your next claim...

    DECLARE @_XML XML 
    SET @_XML = '<?xml version = "1.0" encoding = "UTF-8" ?><Jap>测试</Jap>'
    SELECT @_XML
    GO -- Might lose data; does not work well in all languages/collations!
    DECLARE @_XML XML 
    SET @_XML = '<?xml version = "1.0" encoding = "UTF-8" ?><Heb>א</Heb>'
    SELECT @_XML
    GO -- Might lose data; does not work well in all languages/collations!

     

    ... and so on...

    The magic word is contradiction: a contradiction between the XML DECLARATION and the INPUT DATA.


    Ronen Ariely


    • Edited by pituachMVP Thursday, July 11, 2019 12:42 PM
    • Marked as answer by tempc Saturday, July 20, 2019 11:31 AM
    Thursday, July 11, 2019 9:11 AM