Data warehouse Loading strategy question RRS feed

  • Question

  • Hi All,

    We have an SQL Database in DWH design, which is truncated and loaded every night in a batch process. However, sometimes the DW Load fails and leaves the DW tables empty due to data issues. The business wants these tables to have the previous day's data if the batch fails.

    Some simple recommendations (on blog posts) are : 

    1) Load the new data into a separate set of tables (say secondary set), and once the load is successful, RENAME the tables (swap primary & secondary sets). Is this correct approach? What could be the performance implications of swapping table names everyday?

    2) Blue Green approach with 2 databases for dwh. But this will require a lot of code changes.

    3) Implement incremental loading (again, needs a lot of code changes)

    Can you please suggest any good approach to implement the requirement?



    Please use Marked as Answer if my post solved your problem and use Vote As Helpful if a post was useful.


    Tuesday, July 14, 2020 1:08 PM

Answers

  • Hi


    Why not have staging tables with the same structure as the original tables? The load is done into the staging table (the first step is to truncate the staging table), and then an INSERT INTO is performed from the staging table into the original.

    If the load fails, the original tables still have data, so you can fix the issue without affecting the business.

    Another approach is to load only the current day's data and issue a DELETE on the original table for that day only (if you load the data a few times a day); that is, do not DELETE the entire table, only the current day's data...
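    The staging-table pattern above can be sketched in T-SQL. This is a minimal sketch with hypothetical table names (`stg.FactSales` as the staging table, `dbo.FactSales` as the warehouse table); the target is only emptied after the staging load has already succeeded, so a source-data failure leaves yesterday's data intact:

    ```sql
    -- Hypothetical names: stg.FactSales = staging table, dbo.FactSales = warehouse table.
    BEGIN TRY
        TRUNCATE TABLE stg.FactSales;              -- step 1: empty the staging table
        -- ... load stg.FactSales from the source here (SSIS, BULK INSERT, etc.) ...

        BEGIN TRANSACTION;
            TRUNCATE TABLE dbo.FactSales;          -- only reached if the staging load succeeded
            INSERT INTO dbo.FactSales WITH (TABLOCK)  -- TABLOCK enables minimal logging
            SELECT * FROM stg.FactSales;
        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;   -- warehouse table keeps previous data on failure
        THROW;
    END CATCH;
    ```

    Note that TRUNCATE TABLE is transactional in SQL Server, so even a failure between the TRUNCATE and the INSERT rolls back to the previous day's data.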


    Best Regards, Uri Dimant SQL Server MVP, http://sqlblog.com/blogs/uri_dimant/

    MS SQL optimization: MS SQL Development and Optimization
    MS SQL Consulting: Large scale of database and data cleansing
    Remote DBA Services: Improves MS SQL Database Performance
    SQL Server Integration Services: Business Intelligence

    • Marked as answer by Ashu_Blueray Wednesday, July 15, 2020 2:17 PM
    Wednesday, July 15, 2020 4:19 AM
    Moderator

All replies

    The right incremental loading strategy is not fixed; which strategy is suitable depends largely on the design of the data source. A good data source design may directly provide the basis for incremental processing, such as identity (auto-increment) columns or date/time stamps. Triggers may also be added to the data source to write log records when data is inserted, modified, or deleted, or audit tables may record and track the details of those data operations. In such cases you can flexibly adopt one of the incremental loading strategies below and design the process in accordance with the current system.

    I. Append-only historical tables with a timestamp or identity column. The incremental processing strategy for such tables is:

        1. After the first load completes, record the maximum time point and save it in a load-record table.

        2. From the second load onward, compare against the last/maximum time point saved by the previous run, and load only the data after that point. When the load completes successfully, update the load-record table with the new last time point.

        3. If this type of table has an identity column, you can use the identity column in the same way to identify new rows.

    II. Tables with a modified-time column. In this case the strategy for incremental data processing can be:

        1. After the first load completes, record the largest UpdateDate timestamp and save it in the load-record table.

        2. On the next load, compare the timestamp in the load-record table with the UpdateDate in the source table; rows with a larger UpdateDate are newly added or modified.

        3. After the entire load succeeds, update the largest UpdateDate in the load-record table.

    However, note that not every table with a modified-time column is designed this way. It is possible that inserts only populate CreateDate and leave UpdateDate empty. In that case you may need to compare both CreateDate and UpdateDate on each load.
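    The watermark pattern in strategy II can be sketched as follows. All object names here are hypothetical (`etl.LoadWatermark` holding one high-water mark per table, `src.Orders` with CreateDate/UpdateDate columns); the key point is that the watermark is advanced only after the load succeeds:

    ```sql
    -- Hypothetical objects: etl.LoadWatermark(TableName, LastLoadedAt), src.Orders.
    DECLARE @LastMark datetime2 =
        (SELECT LastLoadedAt FROM etl.LoadWatermark WHERE TableName = N'Orders');

    -- Pick up rows inserted OR modified since the last successful load;
    -- compare CreateDate too, in case inserts leave UpdateDate NULL.
    SELECT *
    INTO   #Delta
    FROM   src.Orders
    WHERE  UpdateDate > @LastMark
       OR  (UpdateDate IS NULL AND CreateDate > @LastMark);

    -- ... merge #Delta into the warehouse table here ...

    -- Advance the watermark only after the load has succeeded
    UPDATE etl.LoadWatermark
    SET    LastLoadedAt = (SELECT MAX(COALESCE(UpdateDate, CreateDate)) FROM #Delta)
    WHERE  TableName = N'Orders';
    ```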

    Another important question is how to handle a failure during the incremental load. For efficiency you do not want to reload everything each time, so you can consider the following two approaches:

    1. SSIS package process logging and error logging;

    2. Checkpoints in the SSIS package.

    The incremental loading strategy depends heavily on the data source as well as on the actual requirements. Therefore, when designing the incremental load, think carefully about the actual needs and the incremental change characteristics the data source can provide, and test repeatedly to achieve a stable, reliable loading mechanism.

    Wednesday, July 15, 2020 2:56 AM
  • Hi


    Why not have staging tables with the same structure as the original tables? The load is done into the staging table (the first step is to truncate the staging table), and then an INSERT INTO is performed from the staging table into the original.

    If the load fails, the original tables still have data, so you can fix the issue without affecting the business.

    Another approach is to load only the current day's data and issue a DELETE on the original table for that day only (if you load the data a few times a day); that is, do not DELETE the entire table, only the current day's data...


    Best Regards, Uri Dimant SQL Server MVP, http://sqlblog.com/blogs/uri_dimant/

    MS SQL optimization: MS SQL Development and Optimization
    MS SQL Consulting: Large scale of database and data cleansing
    Remote DBA Services: Improves MS SQL Database Performance
    SQL Server Integration Services: Business Intelligence

    • Marked as answer by Ashu_Blueray Wednesday, July 15, 2020 2:17 PM
    Wednesday, July 15, 2020 4:19 AM
    Moderator
  • Hi Uri, 

    Thank you for the response. We have approx 60 tables, and some of them have 1 million+ rows. Instead of doing INSERT INTOs, can we use a table rename (swap staging with main) after the staging is loaded successfully?

    There are approx 30 stored procedures and other objects that use the main tables.

    I somehow don't feel comfortable renaming 60+ objects everyday in Production as I think it may break something. :)



    Please use Marked as Answer if my post solved your problem and use Vote As Helpful if a post was useful.

    Wednesday, July 15, 2020 8:21 AM
  • Can we use a table rename (swap staging with main) after the staging is loaded successfully? 

    You could, but it's easier to use ALTER TABLE .. SWITCH PARTITION to exchange the data from the stage to the prod table
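    SWITCH also works between two unpartitioned tables, which avoids renaming objects altogether. A minimal sketch with hypothetical table names (both tables must have identical structure and indexes and live on the same filegroup, and the switch target must be empty); the switches are metadata-only operations, so they are fast regardless of row count:

    ```sql
    -- Hypothetical tables: stg.FactSales (freshly loaded), dbo.FactSales (live),
    -- dbo.FactSales_Old (empty holding table with identical structure).
    BEGIN TRANSACTION;
        TRUNCATE TABLE dbo.FactSales_Old;                        -- switch target must be empty
        ALTER TABLE dbo.FactSales SWITCH TO dbo.FactSales_Old;   -- old data out (metadata only)
        ALTER TABLE stg.FactSales SWITCH TO dbo.FactSales;       -- new data in (metadata only)
    COMMIT TRANSACTION;
    ```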

    Olaf Helper

    [ Blog] [ Xing] [ MVP]

    Wednesday, July 15, 2020 8:29 AM
    Moderator
    I would not use table rename, as it requires more maintenance than using a simple staging table.

    Are you using an SSIS package? I was not talking about renaming tables; I meant that you would use a staging table and then insert into the main table, which should not take much time, especially if you load only the current day's data.

     

    Best Regards,Uri Dimant SQL Server MVP, http://sqlblog.com/blogs/uri_dimant/

    MS SQL optimization: MS SQL Development and Optimization
    MS SQL Consulting: Large scale of database and data cleansing
    Remote DBA Services: Improves MS SQL Database Performance
    SQL Server Integration Services: Business Intelligence

    Wednesday, July 15, 2020 9:31 AM
    Moderator
  • Can we use a table rename (swap staging with main) after the staging is loaded successfully? 

    You could, but it's easier to use ALTER TABLE .. SWITCH PARTITION to exchange the data from the stage to the prod table

    Olaf Helper

    [ Blog] [ Xing] [ MVP]

    This is a good pattern; I've used it successfully plenty of times when suitable.

    One thing to remember is that ALTER TABLE ... SWITCH takes a schema modification (Sch-M) lock: a highly blocking, single-user lock on the table. It even blocks dirty reads.

    Something to be aware of if the DW is being queried during the ETL (e.g. an auto-refresh dashboard or a trickle feed to another system): you might get random blocking or deadlocks.
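    On SQL Server 2014 and later, one way to soften the Sch-M blocking is the WAIT_AT_LOW_PRIORITY option on the switch. A sketch with hypothetical table names; here the switch waits up to a minute and then aborts itself rather than blocking readers indefinitely:

    ```sql
    -- Hypothetical tables: stg.FactSales -> dbo.FactSales.
    -- If the Sch-M lock cannot be acquired within 1 minute,
    -- abort the SWITCH itself (SELF) instead of killing the blockers.
    ALTER TABLE stg.FactSales
        SWITCH TO dbo.FactSales
        WITH (WAIT_AT_LOW_PRIORITY (MAX_DURATION = 1 MINUTES, ABORT_AFTER_WAIT = SELF));
    ```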


    Jakub @ Adelaide, Australia Blog

    Monday, July 20, 2020 3:31 AM