SSIS - Import Multiple source files - Single SSIS package

    Question

  • Hi all,

    I have a requirement to load employee data from multiple sources such as CSV, flat file, and Excel (each file will have different columns and a unique structure, but contains employee grade information). My task is to aggregate the grade information from these source files for all employees and store it in the final SQL employee grade aggregated table. Is it possible to accomplish this task in a single SSIS 2008 package with a single data flow task? If so, how are the connections for these source files made at run time? Or can we do it with multiple data flow tasks? Please advise.

    Thanks in advance.


    Balamurugan

    Saturday, February 16, 2013 11:15 AM

Answers

  • Glad to hear it was helpful. Without knowing the details, I'd say everything looks good from a high level - assuming there are 4 file types that need to be processed.

    The only improvement I can see would depend on the source files. If you can determine the file type from the file name/extension and you are processing LOTS of data, you might want 4 separate Foreach Loop containers to take advantage of parallelism (e.g., Container 1 processes any CSV files, Container 2 processes tab-delimited files, Container 3 processes Excel files, etc.).

    Hope the rest of the development goes smoothly for you.

    -D

    • Marked as answer by Bala Murugan R Monday, February 18, 2013 3:24 AM
    Sunday, February 17, 2013 4:36 AM

All replies

  • Hi Balamurugan,

    "each file will have different columns & unique structure, but contains employee grade information."

    With the condition mentioned above, manually setting up the column mappings for each file might be time-consuming; try BULK INSERT/OPENROWSET instead:

    BULK INSERT dbo.ImportTest FROM 'C:\Textfile.txt' WITH ( FIELDTERMINATOR =',', FIRSTROW = 2 )

    INSERT INTO dbo.ImportTest SELECT * FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0', 'Excel 8.0;Database=C:\ImportData.xls', [Sheet1$])
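    Note that ad hoc OPENROWSET queries are disabled by default, so the statement above may fail until the server option is enabled (requires appropriate server permissions); a quick sketch:

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'Ad Hoc Distributed Queries', 1;
    RECONFIGURE;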

    Try this link - http://www.mssqltips.com/sqlservertip/1207/different-options-for-importing-data-into-sql-server/

    If all the files share the same structure, you can easily loop through them with a Foreach Loop container - see the sketch after the links below.

    http://www.sqlis.com/sqlis/post/Looping-over-files-with-the-Foreach-Loop.aspx

    http://www.bidn.com/blogs/mikedavis/ssis/625/loop-through-excel-file-in-ssis

    http://blogs.msdn.com/b/changliw/archive/2010/07/19/ssis-dynamically-specify-folder-path-for-foreach-file-enumerator-in-foreach-loop-editor-task.aspx
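    If the files really do share one structure, the same loop can also be sketched in plain T-SQL with dynamic BULK INSERT statements. This is only a minimal sketch - the table variable, file paths, and dbo.ImportTest target are hypothetical, and in a package the Foreach Loop would normally supply the file path through a variable:

    DECLARE @files TABLE (FilePath NVARCHAR(260));
    INSERT INTO @files VALUES ('C:\Import\grades1.csv'), ('C:\Import\grades2.csv');

    DECLARE @path NVARCHAR(260), @sql NVARCHAR(MAX);
    DECLARE file_cursor CURSOR FOR SELECT FilePath FROM @files;
    OPEN file_cursor;
    FETCH NEXT FROM file_cursor INTO @path;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- One dynamic BULK INSERT per file; all files share the same layout
        SET @sql = N'BULK INSERT dbo.ImportTest FROM ''' + @path +
                   N''' WITH (FIELDTERMINATOR = '','', FIRSTROW = 2)';
        EXEC sp_executesql @sql;
        FETCH NEXT FROM file_cursor INTO @path;
    END
    CLOSE file_cursor;
    DEALLOCATE file_cursor;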


    Thanks & Regards, sathya

    Saturday, February 16, 2013 11:42 AM
  • Thanks Sathya,

    But in our project, we need to load only the aggregated grade data for each employee from multiple data sources. Moreover, we need to validate the data from each source and provide error messages before doing the aggregation.

    Is it possible to have the bulk insert within one data flow task inside a Foreach Loop container that loops through all the files in the source folder, or do I need multiple data flow tasks to achieve this? Please advise.

    Thanks again

    balamurugan.r


    Balamurugan

    Saturday, February 16, 2013 11:52 AM
  • Hi Bala,

    If you need to handle the flow - i.e., if one insert fails, roll back the entire transaction - then put all of the BULK INSERT/OPENROWSET statements in a single Execute SQL Task with COMMIT/ROLLBACK statements.
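    As a minimal sketch of that single-batch approach (the file paths and the dbo.ImportTest staging table are hypothetical), the Execute SQL Task could run something like:

    BEGIN TRY
        BEGIN TRANSACTION;
        -- Every source file loads inside the same transaction
        BULK INSERT dbo.ImportTest FROM 'C:\Import\grades_csv.txt'
        WITH (FIELDTERMINATOR = ',', FIRSTROW = 2);
        INSERT INTO dbo.ImportTest
        SELECT * FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
            'Excel 8.0;Database=C:\Import\grades.xls', [Sheet1$]);
        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        -- Any failure rolls back every load in the batch
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
        RAISERROR('Import failed; all loads rolled back.', 16, 1);
    END CATCH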

    Or create a stored procedure like the one below and call it from an Execute SQL Task:

    CREATE PROCEDURE bulkinsert_task
    -- EXEC bulkinsert_task
    AS
    BEGIN
        BEGIN TRY
            BEGIN TRANSACTION;
            BULK INSERT dbo.ImportTest
            FROM 'E:\blackbox.txt'
            WITH (FIELDTERMINATOR = ',', FIRSTROW = 2);
            COMMIT TRANSACTION;
        END TRY
        BEGIN CATCH
            IF (@@TRANCOUNT > 0)
                ROLLBACK TRANSACTION;
        END CATCH
    END

    If you are importing using source and destination components, try an example like the one below to handle errors - http://sathyadb.blogspot.in/2012/09/sql-server-integration-services-error.html


    Thanks & Regards, sathya

    Saturday, February 16, 2013 11:56 AM
  • Thanks Sathya,

    But how can I validate the data and provide custom error messages for all the source data before doing the aggregation? Normally we would add some .NET code in a Script Component to validate all the source data in the pipeline within a data flow task. How can this be achieved with BULK INSERT for all the source files?

    In addition to that, we don't have a unique employee ID in all the source files, so we planned to use Fuzzy Lookup to match employee data from the different sources.

    For instance, employee E1 will have a name like 'Andrew' in source file 1 and 'Mr Andrew' in source file 2, etc.

    Please advise.

    thanks again

    balamurugan.r


    Balamurugan


    Saturday, February 16, 2013 12:14 PM
  • Hi Bala - I would consider breaking the SSIS package into multiple parts. It sounds like you are having to perform several distinct tasks with the data, so I would break things into chunks related to those tasks:

    1) Gather information from multiple file sources into a single dataset

    - Have a separate data flow task for each source and write the data to one or more staging tables

    2) Perform validation and assign keys

    - Use either T-SQL or dataflow tasks (or a combination of both) to perform the specific underlying operation

    3) Aggregations

    - Write a SQL query against your stage tables to aggregate the data.  Use this as the source in a dataflow task to insert into your destination table.

    In all likelihood, yes, you could accomplish this through a single data flow task, but it would be horrific to support (in my opinion).
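    To make steps 2 and 3 concrete, here is a minimal T-SQL sketch; the staging table dbo.StageGrades, the error table dbo.ImportErrors, the destination table, and the grade-range rule are all hypothetical placeholders:

    -- Step 2: validate staged rows and record custom error messages
    INSERT INTO dbo.ImportErrors (EmployeeName, SourceFile, ErrorMessage)
    SELECT EmployeeName, SourceFile,
           CASE WHEN Grade IS NULL THEN 'Missing grade'
                ELSE 'Grade out of range' END
    FROM dbo.StageGrades
    WHERE Grade IS NULL OR Grade NOT BETWEEN 1 AND 10;

    -- Step 3: aggregate only the rows that passed validation
    INSERT INTO dbo.EmployeeGradeAggregated (EmployeeKey, AvgGrade, GradeCount)
    SELECT EmployeeKey,
           AVG(CAST(Grade AS DECIMAL(5, 2))),
           COUNT(*)
    FROM dbo.StageGrades
    WHERE Grade BETWEEN 1 AND 10
    GROUP BY EmployeeKey;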

    Saturday, February 16, 2013 3:59 PM
  • Thanks a lot for your information.

    So I planned to go with multiple data flow tasks, one for each source file. Could you please confirm my understanding?

    Thanks again.


    Balamurugan

    Sunday, February 17, 2013 3:37 AM