Using Bulk Insert to Import Inconsistent Data Format (Using Pure T-SQL)

Introduction

Some third-party applications produce reports as CSV files where each line is in a different format. These applications build each record on the fly and format each value separately: the format of each value is determined by the value itself.

For example, if there is data in the field then it will be exported inside quotation marks, but if there is no data then the field will be blank and without quotation marks. Moreover, some applications do not use a 'data type' when generating a report: if the data in a field is numeric then the application might not use any quotation marks, while in the same field on a different record non-numeric data will be exported with quotation marks. We can imagine a single CSV file with a specific column exported in 6 different formats.

In order to use bulk insert directly we have to make sure that all the data is consistent with one format.

The problem

We need to use bulk insert to import data from a CSV file into a SQL Server database using pure T-SQL. A bulk insert operation can use only one format file, and our metadata must remain consistent with it. The first step in using bulk insert is to find a set of "bulk insert" rules (like: End Of Line, End Of Column, Collation…) that fit all of the data. As mentioned above, sometimes this is not possible.

If you were told in a forum that this can't be done using pure T-SQL, then remember that I always say "never say never".

* In this article we are going to talk only about a pure T-SQL solution, as there are several solutions (such as using SSIS, CLR or a third party app) that can do it in different ways; sometimes in a better way.

Our Case Study

Our application exports the data as a CSV Comma Delimited file without a consistent format. In this example we will deal with a very common situation that fits these rules:

1. Our application uses column types (to make things easier, in this article we will focus on a string column). So a numeric column will never use quotation marks, and a string column will use quotation marks on and off according to these rules:

2. If there is data in the field then it will be exported inside quotation marks (no matter whether the data is numeric or not, since the column is a string type).

3. If there is no data then the field will be blank and without quotation marks.

The original sample data that we use looks like this:

ID    Phone Number   First Name   Last Name
1     9999999        ronen        Ariely
2     8888888        xxx1, xxx2   yyy
2     8888888        xxx1, xxx2
3                                 yyy
4     7777777
      2222222        zzz          kkk
5     1111111        5000.5       5

According to the application export rules above, our CSV file looks like this:

1,9999999,"ronen","ariely"
2,8888888,"xxx1,xxx2",yyy
2,8888888,"xxx1,xxx2",
3,,,"yyy"
4,7777777,,
,2222222,zzz,kkk
5,1111111,"5000.5","5"

* We can see from the last line that our application uses column types, so even when the value is numeric it will be inside quotation marks. But we have to remember that there are more complex situations, like applications that do not use column types. Then the last line could look like [5,5000.5,5]. And it can be even more complex if the culture formats numbers as 5,000.5; then our CSV line might look like [5,5,000.5,5].

The solution:

* Remember that this is only a workaround for our specific case. For each set of data a slightly different solution might fit. The idea of how to get to the solution is what is important here.

STEP 1: Identify the import file format

In this step we will run several tests with different format files. Our aim is to identify any potential problems and to find the best format file, one which fits as many columns as possible from the start.

Finding the problematic columns and the consistent column format

First of all you have to find a record format that fits most of the data, as well as the columns that might be inconsistent with this format. In order to do that we are going to run several tests, and then we will implement the conclusions in the next step. We will start with a simple bulk insert and continue with some more complex formats. Using the error messages and the results, we will identify the potential problems.

Let's try this in practice

Open Notepad and copy our CSV data into the file.

1,9999999,"ronen","ariely"
2,8888888,"xxx1,xxx2",yyy
2,8888888,"xxx1,xxx2",
3,,,"yyy"
4,7777777,,
,2222222,zzz,kkk
5,1111111,"5000.5","5"

Save the file as "C:\ArielyBulkInsertTesting\Test01.csv"

* Make sure that you use ANSI format when you save the file (you can use a different encoding like UNICODE, but for this example we shall use ANSI).

Open Notepad and copy our XML format data into the file.

* Using a format file can help with more complex formats. I highly recommend always using a format file.

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=","/>
    <FIELD ID="2" xsi:type="CharTerm" MAX_LENGTH="7" TERMINATOR=","/>
    <FIELD ID="3" xsi:type="CharTerm" COLLATION="Hebrew_CI_AS" MAX_LENGTH="15" TERMINATOR=","/>
    <FIELD ID="4" xsi:type="CharTerm" COLLATION="Hebrew_CI_AS" MAX_LENGTH="15" TERMINATOR="\r\n"/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="ID"          xsi:type="SQLINT"/>
    <COLUMN SOURCE="2" NAME="PhoneNumber" xsi:type="SQLINT"/>
    <COLUMN SOURCE="3" NAME="FirstName"   xsi:type="SQLNVARCHAR"/>
    <COLUMN SOURCE="4" NAME="LastName"    xsi:type="SQLNVARCHAR"/>
  </ROW>
</BCPFORMAT>

Save the file as "C:\ArielyBulkInsertTesting\Test01.xml"

Open SSMS and run this DDL to create our table:

CREATE TABLE #test (
    ID int
    , PhoneNumber int
    , FirstName varchar(15)
    , LastName varchar(15)
)
GO

Try to use this simple bulk insert query to import our data:

BULK INSERT #test FROM 'C:\ArielyBulkInsertTesting\Test01.csv'
WITH (
    FORMATFILE='C:\ArielyBulkInsertTesting\Test01.xml'
    , MAXERRORS = 1
    , KEEPNULLS
    --, DATAFILETYPE = 'native'
    --, CODEPAGE='RAW'
    --, ROWTERMINATOR='\r\n'
    --, FIRSTROW = 3
);
GO

No error… have you got a good feeling? Let's check our data

select * from #test
GO

Results:

ID    PhoneNumber  FirstName  LastName
1     9999999      "ronen"    "ariely"
2     8888888      "xxx1      xxx2",yyy
2     8888888      "xxx1      xxx2",
3     NULL         NULL       "yyy"
4     7777777      NULL       NULL
NULL  2222222      zzz        kkk
5     1111111      "5000.5"   "5"

Compare our results to the original data... Oops… that's bad…

In our case we can see that the first and second columns have no problem, but the problems start in the third column and continue into the fourth column. First of all, we have some quotation marks in the results. Moreover, the third column was split in several records and part of the data moved into the fourth column. Actually, since our format file says that the third column ends at the comma, every time we have a comma as part of the string data the value will be split. That makes sense.

When we have string data we surround the content in quotes. If our data had "a consistent format" then all string data would be enclosed in quotes, even empty data. 

Let's demonstrate a well formatted CSV. Save this data as "C:\ArielyBulkInsertTesting\Test02.csv":

1,9999999,"ronen","ariely"
2,8888888,"xxx1,xxx2","yyy"
2,8888888,"xxx1,xxx2",""
3,,"","yyy"
4,7777777,"",""
,2222222,"zzz","kkk"
5,1111111,"5000.5","5"

In that case the solution was very simple. We could use this format file:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=','/>
    <FIELD ID="2" xsi:type="CharTerm" MAX_LENGTH="7" TERMINATOR=',\"'/>
    <FIELD ID="3" xsi:type="CharTerm" COLLATION="Hebrew_CI_AS" MAX_LENGTH="15" TERMINATOR='\",\"'/>
    <FIELD ID="4" xsi:type="CharTerm" COLLATION="Hebrew_CI_AS" MAX_LENGTH="15" TERMINATOR='\"\r\n'/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="ID"          xsi:type="SQLINT"/>
    <COLUMN SOURCE="2" NAME="PhoneNumber" xsi:type="SQLINT"/>
    <COLUMN SOURCE="3" NAME="FirstName"   xsi:type="SQLNVARCHAR"/>
    <COLUMN SOURCE="4" NAME="LastName"    xsi:type="SQLNVARCHAR"/>
  </ROW>
</BCPFORMAT>

Save the file as "C:\ArielyBulkInsertTesting\Test02.xml"

Clear our table of previous data:

truncate table #test
GO

Now execute the bulk insert and the data should be placed in the table correctly. If our data was formatted in this way (with consistent format) then we would not need this article :-)

BULK INSERT #test FROM 'C:\ArielyBulkInsertTesting\Test02.csv'
WITH (
    FORMATFILE='C:\ArielyBulkInsertTesting\Test02.xml'
    , MAXERRORS = 1
    --, KEEPNULLS
    --, DATAFILETYPE = 'native'
    --, CODEPAGE='RAW'
    --, ROWTERMINATOR='\r\n'
    --, FIRSTROW = 3
);
GO

Let's continue to work on our "real" data! Clear the data

truncate table #test
GO

In some cases we might build a format file which brings us error messages. We already know that the format will not fit all records. This test will give us more information through the error message. Try to use this format file (C:\ArielyBulkInsertTesting\Test03.xml):

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=","/>
    <FIELD ID="2" xsi:type="CharTerm" MAX_LENGTH="7" TERMINATOR=',\"'/>
    <FIELD ID="3" xsi:type="CharTerm" COLLATION="Hebrew_CI_AS" MAX_LENGTH="15" TERMINATOR=','/>
    <FIELD ID="4" xsi:type="CharTerm" COLLATION="Hebrew_CI_AS" MAX_LENGTH="15" TERMINATOR='\r\n'/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="ID"          xsi:type="SQLINT"/>
    <COLUMN SOURCE="2" NAME="PhoneNumber" xsi:type="SQLINT"/>
    <COLUMN SOURCE="3" NAME="FirstName"   xsi:type="SQLNVARCHAR"/>
    <COLUMN SOURCE="4" NAME="LastName"    xsi:type="SQLNVARCHAR"/>
  </ROW>
</BCPFORMAT>

Execute our bulk insert

BULK INSERT #test FROM 'C:\ArielyBulkInsertTesting\Test01.csv'
WITH (
    FORMATFILE='C:\ArielyBulkInsertTesting\Test03.xml'
    , MAXERRORS = 10
    --, KEEPNULLS
    --, DATAFILETYPE = 'native'
    --, CODEPAGE='RAW'
    --, ROWTERMINATOR='\r\n'
    --, FIRSTROW = 3
);
GO

We get an error message which can help us a lot in this case:

Msg 4864, Level 16, State 1, Line 1 Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 4, column 2 (PhoneNumber).

Moreover, we can see that the rest of the records in our data were inserted (using SQL Server 2012). In some cases, when the data contains only a small number of inconsistent records, this can be the best way, as most of the data is inserted. We can then just check which records do not exist in the table and fix them. The error message includes the number of the first problematic row.

In conclusion, the best format file we found succeeds in inserting the first and second columns without any problem. We recognized that the problems start in the third column.


STEP 2: Insert the data into a temporary table

This is the main step, as now we can use bulk insert to import the data into SQL Server. Since we found that our data does not have a consistent format, we are going to use a temporary table to import the data.

* We don’t have to use a temporary table, as we can just use OPENROWSET to get the data and do the parsing on-the-fly. I will show this in step 3.

The basic idea is to bring all the data before the problematic point (in our case, the first and second columns) into separate columns, as they should appear in the final table, and then bring the rest of the data, from the problematic point to the end of the problematic section (or to the end of the line if there is no other way), into one column. So in our case the third and fourth columns will be imported as one column.

Let's do it. We will use this format file (save as C:\ArielyBulkInsertTesting\Test04.xml), which is similar to "Test01.xml" file, without the third column:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=","/>
    <FIELD ID="2" xsi:type="CharTerm" MAX_LENGTH="7" TERMINATOR=','/>
    <FIELD ID="4" xsi:type="CharTerm" COLLATION="Hebrew_CI_AS" MAX_LENGTH="30" TERMINATOR='\r\n'/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="ID"                 xsi:type="SQLINT"/>
    <COLUMN SOURCE="2" NAME="PhoneNumber"        xsi:type="SQLINT"/>
    <COLUMN SOURCE="4" NAME="FirstName_LastName" xsi:type="SQLNVARCHAR"/>
  </ROW>
</BCPFORMAT>

And execute this query (drop old table, create new table with 3 columns, bulk insert data, select and show the data):

DROP TABLE #test
GO
  
CREATE TABLE #test (
    ID int
    , PhoneNumber int
    , FirstName_LastName varchar(30)
)
GO
  
BULK INSERT #test FROM 'C:\ArielyBulkInsertTesting\Test01.csv'
WITH (
    FORMATFILE='C:\ArielyBulkInsertTesting\Test04.xml'
    , MAXERRORS = 10
    --, KEEPNULLS
    --, DATAFILETYPE = 'native'
    --, CODEPAGE='RAW'
    --, ROWTERMINATOR='\r\n'
    --, FIRSTROW = 3
);
GO
  
select * from #test
GO

Our results look like this:

ID    PhoneNumber  FirstName_LastName
----- -----------  ------------------------------
1     9999999      "ronen","ariely"
2     8888888      "xxx1,xxx2","yyy"
2     8888888      "xxx1,xxx2",
3     NULL         ,"yyy"
4     7777777      ,
NULL  2222222      "zzz","kkk"
5     1111111      "5000.5","5"


The goal of this article is to show an optimal use of pure T-SQL to import data which is not well formatted into the database, in a structure appropriate and effective for parsing (step 3). I will not elaborate on step 3, parsing the data; there can be hundreds of different ways to do this for each case.

I'll show some sample solutions in step 3. Those solutions are not necessarily optimal parsing solutions, but represent solutions in which I use different string functions for parsing our data. Usually when parsing data in SQL Server it is best to use CLR functions.

STEP 3: Parsing the data into the final table

Now that we imported the data, all that we need to do is parse the last column. These queries can do the job in our case:

select
    ID, PhoneNumber , FirstName_LastName
    , FN = case
        when CHARINDEX('",', FirstName_LastName, 1) > 0
        then LEFT (
                RIGHT(FirstName_LastName, LEN(FirstName_LastName) - 1)
                , CHARINDEX('",', FirstName_LastName, 1) - 2
            )
        else ''
    END
    , LN = case
        when CHARINDEX(',"', FirstName_LastName, 1) > 0
        then SUBSTRING(
            FirstName_LastName
            , CHARINDEX(',"', FirstName_LastName, 1) + 2
            , LEN(FirstName_LastName) - CHARINDEX(',"', FirstName_LastName, 1)  - 2 )
        else ''
    END
from #test
go
  
-- I use the @ char, but you should use any combination of chars that cannot appear in the data values!
-- I can clean all the " chars in one pass, as I know this char is not part of my data
select ID, PhoneNumber , FirstName_LastName
    , SUBSTRING(Temp, 0, charindex('@',Temp) ) FN
    , SUBSTRING(Temp, charindex('@',Temp) + 1, LEN(Temp) - charindex('@',Temp)) LN
from (
    select
        ID, PhoneNumber , FirstName_LastName
        , Temp = REPLACE(REPLACE(REPLACE(REPLACE (FirstName_LastName, '","', '@'), '",','@'),',"','@'),'"','')
    from #test
) T
go

After we found a way to parse the data, we can use a simple SELECT INTO query to move the data from the temporary table to the final table.
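As a hedged sketch (the target table name FinalContacts is hypothetical, and it assumes the cleanup logic of the second parsing query above, where each combined value contains exactly one separator after the REPLACE calls), such a SELECT INTO could look like this:

```sql
-- Wrap the REPLACE/CHARINDEX parsing in a derived table and materialize the
-- result into a new table in a single statement.
-- FinalContacts is a hypothetical table name; SELECT INTO creates it.
SELECT
    ID
    , PhoneNumber
    , FirstName = SUBSTRING(Temp, 0, CHARINDEX('@', Temp))
    , LastName  = SUBSTRING(Temp, CHARINDEX('@', Temp) + 1, LEN(Temp) - CHARINDEX('@', Temp))
INTO FinalContacts
FROM (
    SELECT
        ID, PhoneNumber
        -- Normalize every separator variant to @ and drop the remaining quotes,
        -- exactly as in the parsing query shown earlier.
        , Temp = REPLACE(REPLACE(REPLACE(REPLACE(FirstName_LastName, '","', '@'), '",', '@'), ',"', '@'), '"', '')
    FROM #test
) T;
```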

Usually, if this is not a one-time operation, I prefer to use one query that does it all, without declaring a temporary table. I still need the steps above to find my bulk insert query & format (steps 1+2) and the parsing functions (step 3). Next I convert my queries into an OPENROWSET import query like this (in our case study):

--FINAL TABLE
CREATE TABLE #FINAL (
    ID int
    , PhoneNumber int
    , FirstName varchar(15)
    , LastName varchar(15)
)
GO
  
insert #FINAL
select
    ID, PhoneNumber --, FirstName_LastName
    , FN = case
        when CHARINDEX('",', FirstName_LastName, 1) > 0
        then LEFT (
                RIGHT(FirstName_LastName, LEN(FirstName_LastName) - 1)
                , CHARINDEX('",', FirstName_LastName, 1) - 2
            )
        else ''
    END
    , LN = case
        when CHARINDEX(',"', FirstName_LastName, 1) > 0
        then SUBSTRING(
            FirstName_LastName
            , CHARINDEX(',"', FirstName_LastName, 1) + 2
            , LEN(FirstName_LastName) - CHARINDEX(',"', FirstName_LastName, 1)  - 2 )
        else ''
    END
FROM OPENROWSET(
    BULK N'C:\ArielyBulkInsertTesting\Test01.csv'
    , FORMATFILE = 'C:\ArielyBulkInsertTesting\Test04.xml'
) a
GO
  
select * from #FINAL
GO


Summary

The basic idea is to bring all the data in the problematic columns (or until the end of the line if there is no other way) into one column. We can use a temporary table to store the data. Then we can parse the temporary column in any way that suits us. We can use T-SQL functions or CLR functions like SPLIT. We can clean some characters using REPLACE, we can find characters using CHARINDEX, and so on. This all depends on your specific data. It has nothing to do with bulk insert anymore :-)
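As a hedged side note beyond the scope of the original article: on SQL Server 2016 and later, the built-in STRING_SPLIT function can handle simple splits without CLR. A minimal sketch:

```sql
-- Strip the quotation marks first, then split the combined value on the comma.
-- Note: STRING_SPLIT does not guarantee output order before the ordinal
-- argument added in SQL Server 2022.
SELECT value
FROM STRING_SPLIT(REPLACE('"ronen","ariely"', '"', ''), ',');
-- returns two rows: ronen and ariely
```

This only works when the separator never appears inside the data, so it does not replace the CHARINDEX-based parsing used in our case study.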

We must separate the operation into two parts:

1. Insert the data, using bulk insert, into the database (a temporary table, or using OPENROWSET) in such a way that we will be able to use it for step two

2. Parse and split the text in the last column into the final columns

* This article elaborates on step 1.

Comments

* A more complex case study in which I used this logic can be seen in the MSDN forum in this link: 
http://social.msdn.microsoft.com/Forums/en-US/5aab602e-1c6b-4316-9b7e-1b89d6c3aebf/bulk-insert-help-needed

* Usually it is much better to do the parsing using CLR functions. If you are not convinced by my recommendation then you can check this link: http://www.sqlperformance.com/2012/07/t-sql-queries/split-strings

* If you can export the file in a consistent format that fits bulk insert, then you should do it! This is only a workaround solution.

* If you can build a well formatted import file in advance, from the original import file, using a small application which will format a new file, then do it! This is a much better solution as most languages do a better job of parsing text than SQL Server (T-SQL).

* If you can manage the order of the columns during the export, then try to make sure that you move all the problematic columns to the end. This will help us use bulk insert in a more optimal way, as we will need to parse fewer columns in step 3.

* Why not import all the data into one column in a temp table instead of STEP 1 & STEP 2?

This is always an option, but probably not a good one. In our case study we use a very simple table structure with 4 columns and only 7 records, but in real life we might get a table with 20 columns or more and several million records. If we have 2 columns (out of 20) with potential problems and we can order the columns so those columns come last, then we can import most of the data (18 columns) directly into the final data structure, and we will need to import only the last two columns into one column for parsing. It is much better to separate the data into as many columns as we can and minimize the use of parsing, since parsing is a CPU-intensive operation and parsing the data after importing will take longer.
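For comparison, the "everything into one column" alternative would use a format file that reads each whole line into a single field. A minimal sketch (the column name WholeLine and the MAX_LENGTH value are hypothetical choices, not from the original article):

```xml
<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <!-- One field per line: only the row terminator is used, so commas and
         quotes stay inside the value and everything must be parsed later. -->
    <FIELD ID="1" xsi:type="CharTerm" MAX_LENGTH="500" TERMINATOR="\r\n"/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="WholeLine" xsi:type="SQLNVARCHAR"/>
  </ROW>
</BCPFORMAT>
```

This always imports successfully, but it pushes the entire splitting work into step 3, which is exactly what we try to minimize.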

When you have to use complex parsing it is much better to use CLR solutions. As I mentioned at the start, this is a pure T-SQL solution.

Resources

* This article is based on several forum questions (more than 15, which I found using Google while checking only the first several pages of search results) that remained unanswered for too long. I did not find any solutions or answers, except my own based on this logic. This is a very easy solution, but we have to think outside the box to get it :-)

There are no other references for this solution that I know of, and most forum questions that I found were closed without an answer, by sending the questioner to a different solution (like using SSIS or a third-party application), or by saying that it cannot be done using bulk insert and pure T-SQL.

Some Forum questions:

http://stackoverflow.com/questions/17552185/format-fields-during-bulk-insert-sql-2008

http://social.msdn.microsoft.com/Forums/en-US/5aab602e-1c6b-4316-9b7e-1b89d6c3aebf/bulk-insert-help-needed

This article is based on Ronen Ariely's blog at
http://ariely.info/Blog/tabid/83/EntryId/122/Using-Bulk-Insert-to-import-inconsistent-data-format-using-pure-T-SQL.aspx