Best approach for Loading 250+ tables in warehouse

  • Question

  • Hello Gurus,

    I am stuck designing the best approach for creating a SQL Server 2008 R2 warehouse from an existing Oracle database, and I need your expert input to help me arrive at a solution.

    We need to create Type 2 history data (using SCD) from an existing Oracle database into a fresh SQL Server instance. The source contains approximately 300 tables to load. Hardly any transformations are required apart from the SCD handling and some datatype conversions.

    From a maintainability point of view, what is the best way to achieve this? Should we create 300 data flow tasks in a single package? Is something dynamic possible, i.e. creating a custom component and iterating over it for each table, getting the table metadata through some C# code? I am only speculating about the latter, as my C# skills are limited and I don't wish to go that way.

    Thursday, November 17, 2011 6:46 AM

Answers

  • Really it comes down to what you want history of. If you want to record every change in every column as an SCD2 change, then scripting the packages, either using the SSIS object model or creating dynamic SQL, is a good option. I've implemented .NET code that generates SSIS packages to perform this style of change detection, driven by selecting tables from a list in an application.

    If you do go down the path of creating SSIS packages, I'd be cautious about implementing the SCD 2 changes using either the wizard or the Dimension Merge SCD. In my experience, the performance of the Dimension Merge SCD suffers with large tables, as it reads the entire target table into memory.

    Dynamic SQL would probably be the quickest option, particularly if your C# skills are limited. But you'd have a few considerations: does your IT policy allow linked servers? What is the performance of using a linked server and OPENQUERY? Can you perform change detection on the Oracle side, or do you need to copy all of the tables across to SQL Server each time?
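
    For illustration only, here is a minimal sketch of the linked server / OPENQUERY pattern, assuming a linked server named ORA_SRC and a hypothetical source table HR.EMPLOYEES with a LAST_MODIFIED column (none of these names come from the thread). The inner statement is passed to Oracle as-is, so the change-detection filter runs on the Oracle side:

        -- Assumed: a linked server ORA_SRC pointing at the Oracle source.
        -- Table, column and filter values are placeholders.
        SELECT  *
        FROM    OPENQUERY(ORA_SRC,
                'SELECT EMPLOYEE_ID, FIRST_NAME, LAST_NAME, SALARY, LAST_MODIFIED
                   FROM HR.EMPLOYEES
                  WHERE LAST_MODIFIED >= TRUNC(SYSDATE) - 1');   -- filtered on the Oracle side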


    David Stewart | My Microsoft BI Blog | @dstewartbi
    • Marked as answer by Aparash Friday, November 18, 2011 8:17 AM
    Thursday, November 17, 2011 8:13 AM
  • I can't unfortunately share any code, as I developed my solution under company time.
    However, these are the general steps you need to follow:

    * Create a table with all the metadata for both source and destination. It must contain columns like SourceServerName, SourceSchemaName, SourceTableName, et cetera.

    * Using this table, connect to the source and get all the metadata for the source tables. I don't know how this is done in Oracle, but in SQL Server you could use, for example, INFORMATION_SCHEMA or sys.columns. Generate the SELECT statements, using a WHERE clause to get only the new and changed rows. Add conversions where necessary, e.g. CLOB to VARCHAR. Put this statement inside an OPENQUERY call using a linked server.

    * Load the data into a staging table at the destination using a SELECT INTO. You can add checksums if you want to detect whether rows have really changed or not.

    * Use the MERGE statement to merge the data from the staging table into your destination table (a sketch of the generated staging and merge statement follows after this list).
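
    To make the last three steps concrete, here is a minimal sketch of the kind of per-table statement the generator could emit. It assumes a linked server ORA_SRC, a hypothetical source table HR.CUSTOMERS keyed on CUSTOMER_ID, a stg schema for staging, and FromDate/ToDate columns on the destination; all of these names are placeholders, not taken from the thread, and NULL-safe comparisons are omitted for brevity.

        -- Stage the extracted rows (SELECT INTO creates the staging table)
        IF OBJECT_ID('stg.CUSTOMERS') IS NOT NULL
            DROP TABLE stg.CUSTOMERS;

        SELECT  *
        INTO    stg.CUSTOMERS
        FROM    OPENQUERY(ORA_SRC, 'SELECT CUSTOMER_ID, NAME, CITY FROM HR.CUSTOMERS');

        -- SCD2 merge: close the current version when attributes change, insert brand-new
        -- keys, then re-insert the changed keys as new versions via the MERGE OUTPUT.
        INSERT INTO dbo.DimCustomer (CUSTOMER_ID, NAME, CITY, FromDate, ToDate)
        SELECT  CUSTOMER_ID, NAME, CITY, GETDATE(), NULL
        FROM
        (
            MERGE dbo.DimCustomer AS d
            USING stg.CUSTOMERS   AS s
                ON  d.CUSTOMER_ID = s.CUSTOMER_ID
                AND d.ToDate IS NULL                      -- match only the current version
            WHEN MATCHED AND (d.NAME <> s.NAME OR d.CITY <> s.CITY)
                THEN UPDATE SET ToDate = GETDATE()        -- expire the old version
            WHEN NOT MATCHED BY TARGET
                THEN INSERT (CUSTOMER_ID, NAME, CITY, FromDate, ToDate)
                     VALUES (s.CUSTOMER_ID, s.NAME, s.CITY, GETDATE(), NULL)
            OUTPUT $action, s.CUSTOMER_ID, s.NAME, s.CITY
        ) AS changes (MergeAction, CUSTOMER_ID, NAME, CITY)
        WHERE   changes.MergeAction = 'UPDATE';           -- changed keys get a fresh current row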

    If you save all the SQL statements, you can loop over these statements to transfer the data. You only need to generate the statements once, and again whenever a table's metadata changes (or when new tables are added). If you have multiple loops, you can execute the data transfers in parallel; a sketch of such a loop follows below.
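
    A minimal sketch of the loop over the saved statements, assuming a hypothetical dbo.LoadStatements table in which the generated SQL has been stored; for parallelism you could partition this table into groups and run one such loop per group (for example from several SQL Agent jobs or Execute SQL Tasks):

        DECLARE @id INT, @sql NVARCHAR(MAX);

        DECLARE load_cur CURSOR LOCAL FAST_FORWARD FOR
            SELECT StatementId, LoadSql
            FROM   dbo.LoadStatements            -- hypothetical table of generated statements
            ORDER  BY StatementId;

        OPEN load_cur;
        FETCH NEXT FROM load_cur INTO @id, @sql;

        WHILE @@FETCH_STATUS = 0
        BEGIN
            EXEC sys.sp_executesql @sql;         -- runs the saved staging + MERGE for one table
            FETCH NEXT FROM load_cur INTO @id, @sql;
        END

        CLOSE load_cur;
        DEALLOCATE load_cur;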


    MCTS, MCITP - Please mark posts as answered where appropriate.
    Answer #1: Have you tried turning it off and on again?
    Answer #2: It depends...
    • Marked as answer by Aparash Friday, November 18, 2011 8:17 AM
    Friday, November 18, 2011 7:19 AM

All replies

  • It is possible to generate dynamic SQL scripts, using the metadata from Oracle and SQL Server, that will transfer the data between the two systems over a linked server.

    Create a table that contains all the connection information for the 300 tables (table name, schema, database name, server, et cetera; so 300 rows in that table). Loop over that table and create the SQL statements on the fly using dynamic SQL and the metadata of the source tables. Watch out for datatypes that don't match, such as CLOB in Oracle. Since you have hardly any transformations, the code itself shouldn't be too difficult (just INSERTs and UPDATEs for the SCD2).
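
    As an illustration, a minimal sketch of such a driver table; all object and column names here are made up for the example:

        -- One row per table to load, so roughly 300 rows in total
        CREATE TABLE dbo.TableLoadConfig
        (
            ConfigId         INT IDENTITY(1,1) PRIMARY KEY,
            SourceServerName SYSNAME NOT NULL,   -- linked server name, e.g. ORA_SRC
            SourceSchemaName SYSNAME NOT NULL,
            SourceTableName  SYSNAME NOT NULL,
            TargetSchemaName SYSNAME NOT NULL,
            TargetTableName  SYSNAME NOT NULL,
            IsEnabled        BIT     NOT NULL DEFAULT (1)
        );

        INSERT INTO dbo.TableLoadConfig
            (SourceServerName, SourceSchemaName, SourceTableName, TargetSchemaName, TargetTableName)
        VALUES
            ('ORA_SRC', 'HR', 'EMPLOYEES', 'dbo', 'DimEmployee');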

    I've done it for DB2 and SQL Server itself, so I know it is possible :)


    MCTS, MCITP - Please mark posts as answered where appropriate.
    Answer #1: Have you tried turning it off and on again?
    Answer #2: It depends...
    Thursday, November 17, 2011 6:55 AM
  • I definitely wouldn't create 300 data flows in a single package; it's not maintainable, and I doubt Visual Studio would be usable with that many objects on the canvas!

    You could potentially use the SSIS object model to create the packages, but I'd do this as a one-off rather than using code to iterate through each table every time you execute the packages. It also depends on what level of history you're capturing. If you just want to record any change in any field, then it would be possible to generate the packages. If you want to choose certain columns to track history for, then generating the packages via the SSIS object model isn't going to be viable without setting up metadata about each table.

    Assuming you don't have an existing SSIS framework and you just want fairly simple packages to perform the copy, you could use the SQL Server Import and Export Wizard to generate a basic SSIS package for each table - http://msdn.microsoft.com/en-us/library/ms140052.aspx

    Have you thought about how you're going to do change detection to create SCD Type 2 history?


    David Stewart | My Microsoft BI Blog | @dstewartbi
    Thursday, November 17, 2011 6:56 AM
  • Thanks, David, for your reply.

    The requirement for SCD is to create an additional record with the changed value, along with a TO_DATE and FROM_DATE in DateTime format. I won't be using the built-in SCD component, as it is very slow. I've thought of implementing it using dimensionmergescd.codeplex.com.

    Assuming I use the Import wizard, would I still need to go into each and every data flow manually, add an SCD task and map it to the destination?

    Thursday, November 17, 2011 7:59 AM
  • Assuming I use the Import wizard, would I still need to go into each and every data flow manually, add an SCD task and map it to the destination?


    Yes, you would have to adapt every single dataflow manually.
    Which of course you won't have to do with the solution I proposed ;)
    MCTS, MCITP - Please mark posts as answered where appropriate.
    Answer #1: Have you tried turning it off and on again?
    Answer #2: It depends...
    Thursday, November 17, 2011 8:01 AM
  • Thanks Koen and David,

    The history data requirement isn't deep at all. Any change needs to be captured as a separate record, along with start and end dates. I'm not sure how parallel execution can be achieved in my case through the SSIS object model dynamically. If you can share a few snippets on how to achieve that, it would be great.

    I'm only using the Dimension Merge SCD because, as far as I know, the default task doesn't support Oracle. Please do share any gotchas with it if you have any.

    I'll give the dynamic SQL approach a trial and try to load multiple tables in parallel.

    Thanks again:)

    Thursday, November 17, 2011 12:30 PM
  • Hi, as already mentioned, there are two options in general: dynamic SQL or dynamic package generation using the SSIS API. In the end, which one to choose will depend on your skills. Personally, I would prefer to use SSIS when moving/transforming data between different systems.

    Here is an example of how to use the SSIS API to generate and execute packages.

    http://sqlsrvintegrationsrv.codeplex.com/releases/view/17647

    Friday, November 18, 2011 8:13 AM
  • Load the data into a staging table at the destination using a SELECT INTO. You can add checksums if you want to detect whether rows have really changed or not.
    checksum to detect changes? there are cases where the generated checksum is the same even if you have a change in your row... :)
    Monday, October 10, 2016 9:40 AM
  • checksum to detect changes? there are cases where the generated checksum is the same even if you have a change in your row... :)
    Hashes would be a better choice.
    If there are changes and the hashes are the same, you are either using a very weak hashing algorithm, or you're not using correct delimiters.
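
    For illustration, a minimal sketch of hash-based change detection with explicit delimiters. The column names are placeholders; SHA1 is shown because SQL Server 2008 R2's HASHBYTES does not support SHA2_256, which was added in SQL Server 2012.

        -- CHECKSUM can return the same value for different rows; a real hash over a
        -- delimited concatenation is a more reliable change-detection key.
        -- Note: before SQL Server 2016, HASHBYTES input is limited to 8000 bytes.
        SELECT  CUSTOMER_ID,
                HASHBYTES('SHA1',
                          ISNULL(NAME, N'') + N'|' +
                          ISNULL(CITY, N'') + N'|' +
                          ISNULL(CONVERT(NVARCHAR(30), BIRTH_DATE, 126), N'')
                ) AS RowHash
        FROM    stg.CUSTOMERS;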

    MCSE SQL Server 2012 - Please mark posts as answered where appropriate.
    My blog

    Wednesday, October 12, 2016 9:36 PM