none
Staging Tables RRS feed

  • Question

  • Hi There

    we do get large files from vendor and needs to be loaded in to SQL server, The plan is to have raw data on to staging and convert the datatypes and load  to the final tables. Once the final tables are loaded we do not need the staging tables.

    Is it a good practice to create objects (staging) and drop after the Final table was loaded. Can we use temporary tables here and can it accommodate millions of records in Temp tables

    Please advice

    Thanks 

    Tuesday, July 9, 2019 5:55 PM

All replies

  • Hi Rajm0019,

    1. There is nothing wrong with your approach. Load raw data into a temporary table, and MERGE it into a final destination.
    2. Though there is even better approach.
      If you can get files from the vendors in XML format as well as valid against an XSD schema contract. This way problems with data quality stay at the system of origin, the sender.
    Tuesday, July 9, 2019 6:05 PM
  • Thanks Yitzhak for the response.

    what the difference in performance when you create a physical staging table vs Temp staging tables. which executes faster?

    WHat if temp db is has space issues ? 

    Thanks

    Tuesday, July 9, 2019 6:11 PM
  • Thanks Yitzhak for the response.

    what the difference in performance when you create a physical staging table vs Temp staging tables. which executes faster?

    WHat if temp db is has space issues ? 

    Thanks

    There is no difference between a physical staging table and a temp staging table.

    Though they created in the different databases, user DB and TempDB.

    Obviously, you need to manage a proper capacity planning for both types of the DBs. 

    Tuesday, July 9, 2019 6:17 PM
  • Hi Rajm0019,

    Physical tables are USUALLY created to store the application data. They store all the persistent data that your application needs.
    Temporary tables are used to store temporary information. Most of the times you dont need to keep them after you are done with the data. For example, a reporting stored procedure might decide to store the intermediary results in a temp table and process them for better performance.
    So use TEMP tables to store all the temporary information and physical tables to store all the REAL data.

    Please refert to the following links.

    Temp Tables vs. Physical Tables

    Differences Between SQL Server Temp Tables, Physical Tables.

    Best Regards, 

    Mona




    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com

    Wednesday, July 10, 2019 8:36 AM
  • Creating objects on the fly is an option, but you should make sure you're managing your TempDB space requirements appropriately so you don't get bogged down with a large input set. 

    Another approach is physical input tables. In our system, we logically have input tables, transactional tables and output tables. When processing large data sets, we input them directly into the input tables (non-indexed, faster disk group), process/transform the data, as needed (apply lookups, transformations, etc) then we push the data into the transactional section. After processing, the input files are kept in the input tables for two weeks before being exported to flat files and purged from the input tables (short-term archive). 

    HTH

    Thursday, July 11, 2019 6:43 PM
  • Do you need to drop the table?

    It is my normal process to move raw data from files into stage table(s), validate the data, then merge into target tables, and remove the stage data from the stage table(s).

    Thursday, July 11, 2019 7:36 PM
  • Yes , We need to drop all stage tables. Here is my ETL process.

    1) Step load Raw data to staging tables, My step will check for staging tables before loading  with below 

    IF OBJECT_ID('stage.Product, 'U') IS NOT NULL
    DROP TABLE stage.Product;

    Create table stage.Product

    ( id varchar(10))

    2) Once loaded the raw data I need to some data conversion that needs to be populated to final table. But I want to make sure that the data conversion happens smoothly before truncating the final table. SO I have another staging table with data types sync with final tables.

    IF OBJECT_ID('stage1.Product, 'U') IS NOT NULL
    DROP TABLE stage1.Product;

    Create table stage1.Product

    ( id int)

    3) Once the above step succeed I will be truncating the final table and load the data from Stage table.

    4) Once step 3 is done , I am dropping objects for stage and stage1

    The SSIS package on next execution failing because the stage tables have not found as it was dropped after loading to final table , I made delay validation to true on OLEDB source but still failing on pre execute , Any advices how to handle this


    Thanks




    • Edited by Rajm0019 Thursday, July 11, 2019 8:35 PM
    Thursday, July 11, 2019 8:12 PM
  • Correct.  SSIS needs the table to exist to "design" the ETL.

    I would highly suggest you not drop the stage tables and just truncate them, unless the schema is different every time.

    If you don't want the stage tables in your actual database, create a new "staging" database and use it. That is normally how I do it. There is no real performance difference between accessing tables in 2 databases and the same database on the same server.

    I admit I do have a SSIS process which dynamically creates, populates and drops the stage tables.  Because the source schema can change out of my control.  However, it is extremely complicated to use and manage and I used BIML.  After thinking about it more, I would not do it that way again or recommend anyone else do it.


    Friday, July 12, 2019 12:10 PM