Bulk load 100 GB (100 million rows) into SQL Server from text files

    Question

  • Gurus,

    We are in the process of setting up an SSIS package to load formatted text files into SQL Server. There will be around 100 million rows, spread across multiple files of around 15 GB each for a total of about 100 GB. The file format is aligned with an XML schema, as shown below... it currently takes nearly 72 hours to load this data into the SQL Server tables...

    File format

    EM|123|XYZ|30|Sales mgr|20000|AD|1 Street 1| State1|City1|US|AD|12Street 2|state 2|City2|UK|CON|2012689648|CON|42343435

    EM|113|WYZ|31|Sales grade|200|AD|12 Street 1| State2|City2|US|AD|1Street 22|state 3|City 3|UK|CON|201689648|CON|423435

    EM|143|rYZ|32|Sales Egr|2000|AD|113Street 1| State3|City3|US|AD|12Street 21|state 4|City 5|UK|CON|201269648|CON|443435

    Data will come in the above format. That is, the fields from "EM" up to the first "AD" are the employee details (Code, Name, Age, Designation, Salary), and each "AD" segment holds address details (Street, State, City, Country). There can be multiple addresses for the same employee... similarly, each "CON" segment is a contact detail (phone number), which can also occur multiple times.

    So we need to load the employee details, the address details, and the contact details into separate tables, with Code as the primary key in the Employee table and as a foreign key in the other two tables.
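
    For illustration, a rough sketch of what the three target tables might look like; the column names and data types below are assumptions based on the sample rows, not an exact design:

    -- Illustrative schema only: names and types are assumed from the sample rows.
    CREATE TABLE dbo.Employee
    (
        Code        INT            NOT NULL PRIMARY KEY,
        Name        VARCHAR(100)   NOT NULL,
        Age         INT            NULL,
        Designation VARCHAR(100)   NULL,
        Salary      DECIMAL(18, 2) NULL
    );

    CREATE TABLE dbo.Address
    (
        AddressId INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
        Code      INT          NOT NULL REFERENCES dbo.Employee (Code),
        Street    VARCHAR(200) NULL,
        State     VARCHAR(100) NULL,
        City      VARCHAR(100) NULL,
        Country   VARCHAR(100) NULL
    );

    CREATE TABLE dbo.Contact
    (
        ContactId   INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
        Code        INT         NOT NULL REFERENCES dbo.Employee (Code),
        PhoneNumber VARCHAR(20) NULL
    );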

    We designed the package with a Script Component as the source, parsed the file line by line using .NET code, created one output buffer per destination table, and added the rows in the script. The Script Component outputs are mapped to 3 OLE DB Destinations (SQL Server tables).

    Our server is a quad core with 48 GB RAM, virtualized, with 2 cores and 24 GB dedicated to the database. Our SQL Server database (simple recovery model) has its data files on a network share backed by SAN storage. To improve performance we placed each table in a different data file (primary and secondary)... but it still takes around 72 hours.

    We need guidance on the following points.

    1. Is it possible to use BCP? If yes, any pointers... (we hope BCP will perform better)

    2. Any suggestions on the solution described above.

    3. Any alternatives...

    Looking forward to your responses. Any help is much appreciated.

    Thursday, March 01, 2012 5:47 PM

All replies

  • Microsoft released a tool for SSIS called the Balanced Data Distributor. Depending on how the package is architected, you may be able to use it to load and/or transform a bit faster:

    http://blogs.msdn.com/b/sqlperf/archive/2011/05/25/the-balanced-data-distributor-for-ssis.aspx

    Thanks

    Brad


    Enterprise Applications Specialist

    Thursday, March 01, 2012 6:43 PM
  • I recommend you import the 100 million rows into a staging table (temp table) first and then move the data into the main tables. It is good that you have multiple files, as you can parallelize the task and import the files in parallel into the same destination, which will finish more quickly. Split the data loading work based on the number of cores on your server.

    After the data is loaded into staging, move it from there to your main tables using parallel tasks.
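
    For reference, a minimal sketch of the staging-to-main move described above; all table and column names are assumed for illustration, and the staging tables are assumed to already hold the parsed rows (loaded by the SSIS package, bcp, or BULK INSERT):

    -- Illustrative only: table and column names are assumptions, and the
    -- staging tables are assumed to already contain the parsed rows.
    -- Each INSERT...SELECT below can run as its own parallel task.
    -- With TABLOCK and a heap target under the simple recovery model,
    -- these inserts can be minimally logged.
    INSERT INTO dbo.Employee WITH (TABLOCK) (Code, Name, Age, Designation, Salary)
    SELECT Code, Name, Age, Designation, Salary
    FROM dbo.Staging_Employee;

    INSERT INTO dbo.Address WITH (TABLOCK) (Code, Street, State, City, Country)
    SELECT Code, Street, State, City, Country
    FROM dbo.Staging_Address;

    INSERT INTO dbo.Contact WITH (TABLOCK) (Code, PhoneNumber)
    SELECT Code, PhoneNumber
    FROM dbo.Staging_Contact;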

    Loading it this way should improve performance.

    cheers,

    Zaim Raza.

    Friday, March 02, 2012 11:55 AM
  • For an ETL project over such a huge data volume, there are two likely approaches:

    • ETL processes based on either DTS (Data Transformation Services) or SSIS (SQL Server Integration Services), or the direct Import & Export functionality of the database engine.
    • The bulk MERGE commands of SQL Server 2008, with which you can build a faster and more scalable data warehousing solution for archiving purposes. Please have a look at my article in this regard:

    http://www.sqlserverpath.org/blog/2012/02/16/data-warehousing-workshop-44/

    Moreover, please bear in mind that all such approaches involve some kind of OLTP transaction; these transactions need the smallest possible index sizes to reduce their I/O cost, and thereby we can significantly boost their performance.

    Have a look at my articles below on indexing; they may help you boost the performance of your ETL solutions:


    http://www.sqlserverpath.org/blog/2012/01/09/towards-t-sql-queries-of-0-sec-sixth-part/

    http://www.sqlserverpath.org/blog/2012/02/06/data-warehousing-workshop-14/



    Think more deeply of performance terms

    Friday, March 02, 2012 2:45 PM
  • Hi sarran,

    "...Source and parsed line by line by using .NET scripts..."

    This is the most resource-consuming part of your package, and it is why the load takes so many hours.

    Since these files are formatted, why not import the data directly into the SQL Server tables with the fast load option? It should be easy to map the output columns from the text file sources to the columns of the SQL Server tables.

    When creating an ETL process, it is best to avoid transformations that are handled line by line or row by row.

    In addition, the source connection should be of the MULTIFLATFILE type, which can be found by right-clicking the blank area of the Connection Managers pane and selecting New Connection to open the Add SSIS Connection Manager window; the MULTIFLATFILE connection type is listed there.

    Regards,
    Jerry

    Monday, March 12, 2012 7:12 AM
    Moderator
  • Hi,

    "We designed the package with a Script Component as the source, parsed the file line by line using .NET code, created one output buffer per destination table, and added the rows in the script. The Script Component outputs are mapped to 3 OLE DB Destinations (SQL Server tables)."

    What is the purpose of the Script Component design? Are there constraints that require it? And why process row by row in .NET?

    Method 1:

    Here you can use the 'fast load' option to load data into SQL Server. You can also place a Balanced Data Distributor (BDD) component before the OLE DB Destination for each destination.

    Method 2:

    1. Split the large file into smaller row sets (say 100,000 records each).

    2. Assuming 50 file sets of 100,000 records each, create a table in SQL Server that holds information about each file: FileName, FilePath, and Status (whether or not it has been processed). A rough sketch of this table appears at the end of this reply.

    Package design steps:

    1. Using a Foreach Loop Container, load the file names, file paths, and an initial status of 'unprocessed' into the status table.

    2. Using a For Loop, take one file name at a time from the table and load that file using a Data Flow Task, as shown in the picture above.

    3. Once the load is complete, mark that particular row in the status table as processed.

    You can increase performance further by scheduling the package to run multiple times in SQL Server Agent.
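
    A minimal sketch of the status table and the per-file status update described above; the table, column, and file names are illustrative assumptions:

    -- Illustrative only: names and values are assumptions.
    CREATE TABLE dbo.FileLoadStatus
    (
        FileName VARCHAR(260) NOT NULL PRIMARY KEY,
        FilePath VARCHAR(260) NOT NULL,
        Status   VARCHAR(20)  NOT NULL DEFAULT 'Unprocessed'  -- 'Unprocessed' or 'Processed'
    );

    -- Step 1: the Foreach Loop Container inserts one row per file.
    INSERT INTO dbo.FileLoadStatus (FileName, FilePath, Status)
    VALUES ('employees_01.txt', '\\fileshare\loads\employees_01.txt', 'Unprocessed');

    -- Step 3: after a file has been loaded, mark its row as processed.
    UPDATE dbo.FileLoadStatus
    SET Status = 'Processed'
    WHERE FileName = 'employees_01.txt';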

    • Proposed as answer by Yadav Karan Tuesday, March 13, 2012 5:39 AM
    • Unproposed as answer by sarran Wednesday, March 21, 2012 6:32 PM
    Tuesday, March 13, 2012 5:37 AM
  • Hi,

    Thanks for your response. But as I mentioned, in my input data the "EM" section has a fixed number of columns, which I can load directly... whereas the "AD" section may occur 0 to n times, meaning I might get no address entry or up to n addresses for an employee. So without a Script Component, how would I load these "AD" rows into their designated table?

    Wednesday, March 21, 2012 6:35 PM