none
Cannot bulk insert XML data into a table RRS feed

  • Question

  • Hi,

    I create a database called "XMLDB", then create a table called "XMLTable", then in the table, create a column called "XMLCol", which is of xml type.

    Then I try to bulk insert XML data into the table, with the following query:

    USE XMLDB;

    BULK INSERT [XMLTable] FROM 'D:\Data.dat' WITH (DATAFILETYPE = 'widechar', FORMATFILE = 'D:\Data.XML', MAXERRORS = 2147483647, ROWS_PER_BATCH = 12, TABLOCK);

    After executing the query, I get the following error:

    Msg 9455, Level 16, State 1, Line 3
    XML parsing: line 6735, character 7, illegal qualified name character

    I try to go to line 6735 in Data.dat, and find the following line:

    <w:t>QNAMEDEF-TOKEN  =  %xEF</w:t></w:r></w:p>

    The 7th character is 'N', which should be valid. Therefore, I cannot find the reason why SQL Server will report the error.

    The data.dat file can be downloaded from:

    https://www.dropbox.com/s/49kkt7dqng4mol5/Data.DAT?dl=0   

    The data.xml file can be downloaded from:

    https://www.dropbox.com/s/zxs3x99v2er0wa1/Data.XML?dl=0



    Sunday, June 30, 2019 3:16 AM

Answers

  • Ok, so I did take a look at that document. Line 6735 reads:

    <w:t><1></w:t></w:r>

    And that seventh character is not a lowercase L. It is the digit 1.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    • Marked as answer by tempc Monday, July 1, 2019 8:40 AM
    Sunday, June 30, 2019 8:17 PM
  • Hi tempc,

    Based on the Erland Sommarskog proposal, please run the following code, and you will see what XML documents are not well-formed:

    USE tempdb;
    GO
    
    DROP TABLE IF EXISTS XMLTable;
    
    CREATE TABLE XMLTable(XMLData NVARCHAR(MAX));
    
    BULK INSERT [XMLTable]
    FROM 'e:\Temp\XMLBulkINSERT\Data.DAT'
    WITH
    (
        DATAFILETYPE = 'widechar',
        FORMATFILE = 'e:\Temp\XMLBulkINSERT\Data_FORMATFILE.xml',
        MAXERRORS = 2147483647,
        ROWS_PER_BATCH = 12,
        TABLOCK
    );
    
    SELECT *
    	, TRY_CAST(XMLData AS XML) AS RealXML
    FROM dbo.XMLTable;
    
    

    • Marked as answer by tempc Monday, July 1, 2019 8:40 AM
    Monday, July 1, 2019 2:24 AM
  • Hi tempc,

    TRY_CAST() is supported by SQL Server 2012 and later.

    Please see the outcome of the SQL below. The RealXML column NULL value shows which XML files are not well-formed.



    • Edited by Yitzhak Khabinsky Monday, July 1, 2019 3:00 AM
    • Marked as answer by tempc Monday, July 1, 2019 8:40 AM
    Monday, July 1, 2019 2:58 AM
  • Hi tempc,

    My SQL Server version is 2017.

    You are very close. You need to check the entire SQL script I provided earlier. It has 3 parts:

    1. DDL
    2. BULK INSERT
    3. SELECT...

    The XMLData column data type is NVARCHAR(MAX). That datatype accepts anything. It should work on your 2008 also. I am using the TRY_CAST() function on the fly just to see what XML is well formed as well as what is not well-formed.

    • Edited by Yitzhak Khabinsky Monday, July 1, 2019 3:24 AM
    • Marked as answer by tempc Monday, July 1, 2019 8:40 AM
    Monday, July 1, 2019 3:22 AM
  • Hi tempc,

    You can use the CAST() function instead. It will process first 2 XML documents correctly, and will fail on the 3rd XML document with the one error at a time. So you would know what XML document is not well-formed.

    • Marked as answer by tempc Monday, July 1, 2019 8:41 AM
    Monday, July 1, 2019 4:08 AM
  • Thank you very much. I just wonder how to find the line 6735 from data.dat. Since data.dat contains multiple XML document, while SQL Server does not said in which XML document it encounters the error. Do I have to copy each XML document out the file one by one and check line 6735(if existing) for each XML document one by one?

    You are making the assumption that SQL Server has a full understanding of the operation it is performing, and therefore will give you a perfectly reasonable error message. Please keep in mind that SQL Server is a computer program and not a human. :-)

    BULK INSERT reads a binary stream of data and does not care what it is. For every record it identifies, it hands it over the the Query Processeor to insert the row. The QP does not know where the data comes from. All it sees is a single XML document, and then it realises that at line 6735 in the document there is an error.

    If you are on SQL 2008 R2 you should consider upgrading, since it is about to go out of support this month - or was that in June? In lieu of TRY_CAST (or TRY_CONVERT), you will need to run a cursor over the table and attempt conversion to xml one by one, and trap the errors with TRY-CATCH.
    A tip is to add an IDENTITY column to the import table - that makes it easier to find the rows:

    CREATE TABLE XMLData(XML nvarchar(MAX), id int IDENTITY PRIMARY KEY)

    You don't need any modification of the format file to do this.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    • Marked as answer by tempc Monday, July 1, 2019 8:41 AM
    Monday, July 1, 2019 8:30 AM

All replies

  • I don't have the time to look into this (or more precisely, the weather is too gorgeous to stay indoors!) in detail, but line 6735 is not a line in your file; it is a line number in an XML document, and I see that your file has multiple XML documents.
    I suspect that there is something flaky with the line terminators, so BULK INSERT keeps reading until things break down.

    I would recommend that you read the file into a table with nvarchar(MAX) as the data type. This will give you a couple of rows. Then you can do

    SELECT xml FROM nvarchartbl WHERE try_cast(xml AS xml) IS NULL

    to find the bad ones.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Sunday, June 30, 2019 9:38 AM
  • Hi tempc,

    I downloaded your Data.dat file. Which is supposedly a file in XML format.

    As Erland Sommarskog already pointed out, this file is not a well-formed XML file.

    The file in question contains many XML documents lumped together (concatenated) as a single file. Some of them are in English, some in Chinese.

    The file contains many XML prologs as below, and it goes on and on. 13 prologs total.

    <?xml version="1.0"?>
    <catalog>
    …
    </catalog>
    <?xml version="1.0"?>
    <bookstore>
    …
    </bookstore>
    <?xml version="1.0"?>
    <w:document 
    …
    </w:document 

    Generally speaking, each prolog section means a new XML document, and a need to have it as a separate XML file.

    You need to work at the source to generate/request proper well-formed XML files.



    Sunday, June 30, 2019 2:51 PM
  • Generally speaking, each prolog section means a new XML document, and need to have it as a separate XML file.

    There is no problem with that as such. BULK INSERT reads a stream of bytes and stops when it finds the field terminator which in this case is a suite of Chinese characters.

    However, each XML document must be correct and use the same encoding.

    Being back from a lovely day at one of the lakes in the Stockholm area, I loaded the file into a nvarchar(MAX) column as I said in my previous post. BULK INSERT finds in total 12 records. Of these, try_cast returns NULL for four of them. I tried to look at the first of the failed ones, which is the one that produced the error message in the first post. But when I realised that SSMS had not returned all characters from that document, I abandoned my analysis.

    I noted that there were some random Chinese characters in that document. I don't know if that is to be expected, or whether it may indicate a data damage of some sort.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Sunday, June 30, 2019 8:02 PM
  • Ok, so I did take a look at that document. Line 6735 reads:

    <w:t><1></w:t></w:r>

    And that seventh character is not a lowercase L. It is the digit 1.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    • Marked as answer by tempc Monday, July 1, 2019 8:40 AM
    Sunday, June 30, 2019 8:17 PM
  • Hi Erland,

    "...There is no problem with that as such. BULK INSERT reads a stream of bytes and stops when it finds the field terminator...BULK INSERT finds in total 12 records..."

    Wow. Thanks for the clarification. I didn't know about it.

    Could you please elaborate what would be the field terminator for this case, i.e. lumped together XML files/chunks.


    Sunday, June 30, 2019 8:19 PM
  • In this particular case, the field terminator is 记录终止符. Don't ask me where it comes from. But it could be any ten-byte string. But preferably the terminator should be a string that cannot appear in the data. (Because that would cause BULK INSERT to misinterpret the file.)

    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Sunday, June 30, 2019 9:09 PM
  • In this particular case, the field terminator is 记录终止符. Don't ask me where it comes from. But it could be any ten-byte string. But preferably the terminator should be a string that cannot appear in the data. (Because that would cause BULK INSERT to misinterpret the file.)

    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Yes, 记录终止符 is 10 bytes terminator selected by me, since it is long enough. I don't think it will appear in the normal record contents. Therefore, I choose it as the terminator.
    Sunday, June 30, 2019 11:49 PM
  • Ok, so I did take a look at that document. Line 6735 reads:

    <w:t><1></w:t></w:r>

    And that seventh character is not a lowercase L. It is the digit 1.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Hi,

    Thank you very much. I just wonder how to find the line 6735 from data.dat. Since data.dat contains multiple XML document, while SQL Server does not said in which XML document it encounters the error. Do I have to copy each XML document out the file one by one and check line 6735(if existing) for each XML document one by one?

    Sunday, June 30, 2019 11:55 PM
  • Hi tempc,

    Based on the Erland Sommarskog proposal, please run the following code, and you will see what XML documents are not well-formed:

    USE tempdb;
    GO
    
    DROP TABLE IF EXISTS XMLTable;
    
    CREATE TABLE XMLTable(XMLData NVARCHAR(MAX));
    
    BULK INSERT [XMLTable]
    FROM 'e:\Temp\XMLBulkINSERT\Data.DAT'
    WITH
    (
        DATAFILETYPE = 'widechar',
        FORMATFILE = 'e:\Temp\XMLBulkINSERT\Data_FORMATFILE.xml',
        MAXERRORS = 2147483647,
        ROWS_PER_BATCH = 12,
        TABLOCK
    );
    
    SELECT *
    	, TRY_CAST(XMLData AS XML) AS RealXML
    FROM dbo.XMLTable;
    
    

    • Marked as answer by tempc Monday, July 1, 2019 8:40 AM
    Monday, July 1, 2019 2:24 AM
  • Hi, Yitzhak

    Thanks but in your code snippet, BULK INSERT statement will report the same error as I see before, which also not mention in which XML file the error occurs.

    Also it seems TRY_CAST is not supported by SQL Server 2008 R2?

    Thanks

    Monday, July 1, 2019 2:42 AM
  • Hi tempc,

    TRY_CAST() is supported by SQL Server 2012 and later.

    Please see the outcome of the SQL below. The RealXML column NULL value shows which XML files are not well-formed.



    • Edited by Yitzhak Khabinsky Monday, July 1, 2019 3:00 AM
    • Marked as answer by tempc Monday, July 1, 2019 8:40 AM
    Monday, July 1, 2019 2:58 AM
  • Hi,

    I get your point.

    It seems that in my SQL Server 2008, if one of the XML document contains error, then BULK INSERT will fail the whole operation. But from your SQL Server 2012's result, it seems not be the case, BULK INSERT will ignore the XML document with errors and continue with the well-formed XML documents, which makes it possible to see which XML document has errors.

    Monday, July 1, 2019 3:11 AM
  • Hi tempc,

    My SQL Server version is 2017.

    You are very close. You need to check the entire SQL script I provided earlier. It has 3 parts:

    1. DDL
    2. BULK INSERT
    3. SELECT...

    The XMLData column data type is NVARCHAR(MAX). That datatype accepts anything. It should work on your 2008 also. I am using the TRY_CAST() function on the fly just to see what XML is well formed as well as what is not well-formed.

    • Edited by Yitzhak Khabinsky Monday, July 1, 2019 3:24 AM
    • Marked as answer by tempc Monday, July 1, 2019 8:40 AM
    Monday, July 1, 2019 3:22 AM
  • Hi,

    OK. I see. I just overlook 

    XMLData NVARCHAR(MAX)

    Now the problem is, is there an equivalent function in SQL Server 2008 that works like TRY_CAST()?


    Monday, July 1, 2019 3:51 AM
  • Hi tempc,

    You can use the CAST() function instead. It will process first 2 XML documents correctly, and will fail on the 3rd XML document with the one error at a time. So you would know what XML document is not well-formed.

    • Marked as answer by tempc Monday, July 1, 2019 8:41 AM
    Monday, July 1, 2019 4:08 AM
  • Thank you very much. I just wonder how to find the line 6735 from data.dat. Since data.dat contains multiple XML document, while SQL Server does not said in which XML document it encounters the error. Do I have to copy each XML document out the file one by one and check line 6735(if existing) for each XML document one by one?

    You are making the assumption that SQL Server has a full understanding of the operation it is performing, and therefore will give you a perfectly reasonable error message. Please keep in mind that SQL Server is a computer program and not a human. :-)

    BULK INSERT reads a binary stream of data and does not care what it is. For every record it identifies, it hands it over the the Query Processeor to insert the row. The QP does not know where the data comes from. All it sees is a single XML document, and then it realises that at line 6735 in the document there is an error.

    If you are on SQL 2008 R2 you should consider upgrading, since it is about to go out of support this month - or was that in June? In lieu of TRY_CAST (or TRY_CONVERT), you will need to run a cursor over the table and attempt conversion to xml one by one, and trap the errors with TRY-CATCH.
    A tip is to add an IDENTITY column to the import table - that makes it easier to find the rows:

    CREATE TABLE XMLData(XML nvarchar(MAX), id int IDENTITY PRIMARY KEY)

    You don't need any modification of the format file to do this.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    • Marked as answer by tempc Monday, July 1, 2019 8:41 AM
    Monday, July 1, 2019 8:30 AM
  • Hi, Erland and Yitzhak

    Thank you very much for all your helps

    Monday, July 1, 2019 8:41 AM