Staging tables

  • Question

  • Hi All,

    We are in the process of creating a data warehouse using SSIS and I was wondering if someone could provide their thoughts on staging tables. How often do you bring tables from the source system into the staging database? Do you truncate all the tables in the staging database first, before inserting data from the source tables? We are thinking of truncating all the tables in the staging database before loading them with fresh data; is that the right approach?

    Thanks in advance for your help.

    Sunday, January 30, 2011 3:10 PM

Answers

  • Hi,

    I think the call on whether to retain data in the staging tables depends on the volume of data and on whether you need to work with data across days. If the data is huge there can be a performance impact, so it may not be worth holding it in the staging tables unless it is really required. Having said that, reprocessing of a specific feed can be built in as a design aspect, which would take care of correcting the data loaded into the warehouse from a specific extract when the need arises.

    In general: truncate the staging tables, load the data into them, apply business rules and the appropriate transformations, and then move the result across to the warehouse. The same logic applies for reprocessing as well; the reprocessing logic has the intelligence built in to pick up the records loaded from the specific extract being reprocessed and to reflect the net changes once reprocessing completes.
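
    A minimal T-SQL sketch of that truncate / load / transform / move flow (the table and column names are only placeholders for illustration):

        -- 1. Clear the staging table before each load
        TRUNCATE TABLE stg.Customer;

        -- 2. Load the fresh extract from the source system
        INSERT INTO stg.Customer (CustomerID, CustomerName, City, ExtractDate)
        SELECT CustomerID, CustomerName, City, GETDATE()
        FROM   SourceDB.dbo.Customer;

        -- 3. Apply business rules / transformations in place
        UPDATE stg.Customer
        SET    City = UPPER(LTRIM(RTRIM(City)));

        -- 4. Move the cleaned rows across to the warehouse
        MERGE dw.DimCustomer AS tgt
        USING stg.Customer   AS src
              ON tgt.CustomerID = src.CustomerID
        WHEN MATCHED THEN
            UPDATE SET tgt.CustomerName = src.CustomerName,
                       tgt.City         = src.City
        WHEN NOT MATCHED THEN
            INSERT (CustomerID, CustomerName, City)
            VALUES (src.CustomerID, src.CustomerName, src.City);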

    HTH.

    Regards,


    Phani
    Note: Please vote/mark the post as answered if it answers your question/helps to solve your problem.
    Monday, January 31, 2011 6:44 AM
  • Instead of truncating the staging table you can keep the records in it for a week or a month. The idea is to be able to refer back to the staging table for future reference or for troubleshooting. You need to add another field (say SequenceNumber) to each staging table to keep track of each day's records.

    Can do, but when data volumes get large, that will hurt performance downstream.

    I prefer to keep the raw files for a week or so instead - assuming you are filling the staging tables from raw (like .csv) files!

    Josh

     

    Monday, January 31, 2011 6:53 PM

All replies

  • Staging tables are normally used when developers need to store the data they want to transfer from a source to a destination, so that some processing can be done on it before loading it into the final destination. In other cases, developers use a "staging repository" to store their data and do their testing before loading it into the final database server. This keeps the processing burden on the live server to a minimum.

    So, unless you need to do any of these, you don't really need staging tables (staging server).
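
    Just for illustration (hypothetical names), a staging table is often little more than a loose copy of the source structure, usually without the constraints and indexes of the final destination so that loads stay cheap:

        -- A bare-bones staging table mirroring a source feed
        CREATE TABLE stg.Customer
        (
            CustomerID    int            NULL,
            CustomerName  nvarchar(100)  NULL,
            City          nvarchar(50)   NULL,
            ExtractDate   datetime       NULL   -- when the row was pulled from the source
        );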


    Please mark as answer if this helps. Thank you! || http://thebipalace.wordpress.com/
    Sunday, January 30, 2011 4:19 PM
  • It's a pretty commonly used database pattern.

    I very frequently use staging tables; yes, I truncate them first, then load in from a .csv file or other source.

    This is most often done when the incoming data needs to be inspected, cleaned up, transformed, and last but not least, distributed to several destinations.  All of these things can be done, to some degree, in SSIS tasks, or they can all be done within the database, in TSQL.  Generally speaking, I prefer TSQL as a tool, and generally speaking, I think I get better performance that way. 

    Your mileage may vary.
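
    Roughly what that looks like in T-SQL, assuming a .csv extract (the file path, table name and format options below are only placeholders):

        -- Empty the staging table, then pull the raw .csv straight in
        TRUNCATE TABLE stg.SalesExtract;     -- columns match the .csv layout exactly

        BULK INSERT stg.SalesExtract
        FROM 'D:\extracts\sales_20110131.csv'
        WITH (
            FIELDTERMINATOR = ',',
            ROWTERMINATOR   = '\n',
            FIRSTROW        = 2              -- skip the header row
        );

        -- Inspect, clean up, transform and distribute the rows
        -- from here in T-SQL (or in SSIS tasks) as needed.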

    Josh

     

    • Proposed as answer by Jon Kleinhans Sunday, January 30, 2011 8:04 PM
    Sunday, January 30, 2011 7:30 PM
  • Instead of truncating the staging table you can keep the records in it for a week or a month. The idea is to be able to refer back to the staging table for future reference or for troubleshooting. You need to add another field (say SequenceNumber) to each staging table to keep track of each day's records.
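
    A rough sketch of that approach (all names are hypothetical): stamp each day's load with its own SequenceNumber and purge anything older than the retention window instead of truncating:

        -- Next batch number for today's load
        DECLARE @SequenceNumber int =
            (SELECT ISNULL(MAX(SequenceNumber), 0) + 1 FROM stg.Orders);

        -- Append today's extract instead of truncating
        INSERT INTO stg.Orders (SequenceNumber, LoadDate, OrderID, OrderAmount)
        SELECT @SequenceNumber, GETDATE(), OrderID, OrderAmount
        FROM   SourceDB.dbo.Orders;

        -- Keep roughly a month of history for troubleshooting
        DELETE FROM stg.Orders
        WHERE  LoadDate < DATEADD(DAY, -30, GETDATE());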

     

     

    Monday, January 31, 2011 6:30 AM
  • Hi,

    I think the call on whether to retain data in the staging tables depends on the volume of data and on whether you need to work with data across days. If the data is huge there can be a performance impact, so it may not be worth holding it in the staging tables unless it is really required. Having said that, reprocessing of a specific feed can be built in as a design aspect, which would take care of correcting the data loaded into the warehouse from a specific extract when the need arises.

    In general: truncate the staging tables, load the data into them, apply business rules and the appropriate transformations, and then move the result across to the warehouse. The same logic applies for reprocessing as well; the reprocessing logic has the intelligence built in to pick up the records loaded from the specific extract being reprocessed and to reflect the net changes once reprocessing completes.

    HTH.

    Regards,


    Phani
    Note: Please vote/mark the post as answered if it answers your question/helps to solve your problem.
    Monday, January 31, 2011 6:44 AM
  • Instead of truncating the staging table you can keep the records in it for a week or a month. The idea is to be able to refer back to the staging table for future reference or for troubleshooting. You need to add another field (say SequenceNumber) to each staging table to keep track of each day's records.

    Can do, but when data volumes get large, that will hurt performance downstream.

    I prefer to keep the raw files for a week or so instead - assuming you are filling the staging tables from raw (like .csv) files!

    Josh

     

    Monday, January 31, 2011 6:53 PM
  • Ohh, I am late to this thread, but I have one idea!

    In staging we used to create four tables: a Base table, a Success table, a Discard table, and a History table. When data came in we pushed it into the Base table; after validation, the records that passed went into the Success table and the failed records into the Discard table. Once the Success data had been processed for OLTP or OLAP, we pushed it all into the History table.

    In this way we resolved the performance issues, since we only had to read from the Success table, and it held only the current data.
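
    A rough sketch of how that Base/Success/Discard/History split might look in T-SQL (the table names, columns, and validation rule are all hypothetical, just to illustrate the flow):

        -- Assumes the raw feed has already been loaded into stg.OrdersBase

        -- Split the Base table into Success and Discard based on validation rules
        INSERT INTO stg.OrdersSuccess (OrderID, CustomerID, OrderAmount)
        SELECT OrderID, CustomerID, OrderAmount
        FROM   stg.OrdersBase
        WHERE  CustomerID IS NOT NULL AND OrderAmount > 0;   -- example validation rule

        INSERT INTO stg.OrdersDiscard (OrderID, CustomerID, OrderAmount)
        SELECT OrderID, CustomerID, OrderAmount
        FROM   stg.OrdersBase
        WHERE  CustomerID IS NULL OR OrderAmount <= 0;

        -- After the Success rows have been processed for OLTP/OLAP,
        -- move them to History so Success only ever holds current data
        INSERT INTO stg.OrdersHistory (OrderID, CustomerID, OrderAmount, ProcessedDate)
        SELECT OrderID, CustomerID, OrderAmount, GETDATE()
        FROM   stg.OrdersSuccess;

        TRUNCATE TABLE stg.OrdersSuccess;
        TRUNCATE TABLE stg.OrdersBase;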

     


    Regards Vikas Pathak


    Wednesday, April 6, 2016 1:03 PM