none
Managing Data Update

    Domanda

  • Hi Everyone,

    I have a data warehouse project I am working on and the client wants one of the dimension tables to get updated after the end of the month even if the values in the source database.

    In summary, they want new records to be inserted into the dimension table the next time the job runs (this runs every hour) once there is a new record in the source table but once an existing record is updated in the source database, the target table will only updated at the end of the month.

    Any idea on how to manage this kind of operation.

    Thanks


    me

    lunedì 4 giugno 2018 16:40

Risposte

  • I believe your dimension is Type-1. In that case, you can achieve this goal by separating the INSERT and UPDATE logic for the dimension and schedule the UPDATE logic only once a month. However, the bigger question is that what is the rationale behind this split behavior? Is it for performance purpose or any other reporting needs? Doesn't it impact your slicing and dicing? If the update only accounts for attributes that are not involved in the aggregations or analytics then it is okay (still what's the harm in updating them in the next job) however if those attributes are part of the aggregations then you would want to evaluate the impact of doing the dimension load in this fashion.

    HTH,
    Cheers!!
    Ashish
    Please mark it as Answered if it answered your question or mark it as Helpful if it helped you solve your problem.


    lunedì 4 giugno 2018 17:04
  • Let me replay this to ensure my understanding is correct

    1. You need to capture new entries in your source table as and when they get created and should be processed to your data warehouse when the job runs daily (every hour)

    2. In addition you need to have the modifications to the existing entries to be picked up one time at the end of the month

    Assuming the above what you need is a setup like this

    1. have two columns datecreated and datemodified as audit trail columns in your transaction table. Each can be tied to table using DEFAULT constraints so as to automatically default to current system time for INSERT operations. In addition have a trigger to update datemodified to current system time for update operations

    2.  Have a ETL logic which will work in two modes (D-Daily, M-Monthly)

    Assuming you will have staging table which does one to one transfer between source and DW, your ETL logic will check for datecreated > {previousload date} during daily loads (mode D)

    During month end load the logic will change to datecreated >= monthstart or datemodified >= monthstart

    3. Once inside staging table which would be truncate load the next part will be the procedure which does Merge logic to dimension using staged data from the source

    the logic will look like this

    MERGE dimtable AS t
    USING Staging AS s
    ON s.businesskey = t.businesskey
    ..
    
    WHEN MATCHED 
    UPDATE t SET col1=s.col1,... ....
    WHEN NOT MATCHED BY TARGET
    INSERT t
    VALUES(....)
    ...

    so during daily loads only new values will be present so it will do INSERT alone

    on month ends it will have modified values too and will hit the MATCHED BY target to do the updates


    Please Mark This As Answer if it solved your issue
    Please Vote This As Helpful if it helps to solve your issue
    Visakh
    ----------------------------
    My Wiki User Page
    My MSDN Page
    My Personal Blog
    My Facebook Page



    • Modificato Visakh16MVP lunedì 4 giugno 2018 18:06
    • Contrassegnato come risposta lilutchay martedì 5 giugno 2018 17:11
    lunedì 4 giugno 2018 17:06

Tutte le risposte

  • I believe your dimension is Type-1. In that case, you can achieve this goal by separating the INSERT and UPDATE logic for the dimension and schedule the UPDATE logic only once a month. However, the bigger question is that what is the rationale behind this split behavior? Is it for performance purpose or any other reporting needs? Doesn't it impact your slicing and dicing? If the update only accounts for attributes that are not involved in the aggregations or analytics then it is okay (still what's the harm in updating them in the next job) however if those attributes are part of the aggregations then you would want to evaluate the impact of doing the dimension load in this fashion.

    HTH,
    Cheers!!
    Ashish
    Please mark it as Answered if it answered your question or mark it as Helpful if it helped you solve your problem.


    lunedì 4 giugno 2018 17:04
  • Let me replay this to ensure my understanding is correct

    1. You need to capture new entries in your source table as and when they get created and should be processed to your data warehouse when the job runs daily (every hour)

    2. In addition you need to have the modifications to the existing entries to be picked up one time at the end of the month

    Assuming the above what you need is a setup like this

    1. have two columns datecreated and datemodified as audit trail columns in your transaction table. Each can be tied to table using DEFAULT constraints so as to automatically default to current system time for INSERT operations. In addition have a trigger to update datemodified to current system time for update operations

    2.  Have a ETL logic which will work in two modes (D-Daily, M-Monthly)

    Assuming you will have staging table which does one to one transfer between source and DW, your ETL logic will check for datecreated > {previousload date} during daily loads (mode D)

    During month end load the logic will change to datecreated >= monthstart or datemodified >= monthstart

    3. Once inside staging table which would be truncate load the next part will be the procedure which does Merge logic to dimension using staged data from the source

    the logic will look like this

    MERGE dimtable AS t
    USING Staging AS s
    ON s.businesskey = t.businesskey
    ..
    
    WHEN MATCHED 
    UPDATE t SET col1=s.col1,... ....
    WHEN NOT MATCHED BY TARGET
    INSERT t
    VALUES(....)
    ...

    so during daily loads only new values will be present so it will do INSERT alone

    on month ends it will have modified values too and will hit the MATCHED BY target to do the updates


    Please Mark This As Answer if it solved your issue
    Please Vote This As Helpful if it helps to solve your issue
    Visakh
    ----------------------------
    My Wiki User Page
    My MSDN Page
    My Personal Blog
    My Facebook Page



    • Modificato Visakh16MVP lunedì 4 giugno 2018 18:06
    • Contrassegnato come risposta lilutchay martedì 5 giugno 2018 17:11
    lunedì 4 giugno 2018 17:06
  • Hi Visakh16,

    Did you mean to use Staging table in the USING clause instead of the SourceTable ?


    HTH,
    Cheers!!
    Ashish
    Please mark it as Answered if it answered your question or mark it as Helpful if it helped you solve your problem.

    lunedì 4 giugno 2018 17:55
  • Hi Visakh16,

    Did you mean to use Staging table in the USING clause instead of the SourceTable ?


    HTH,
    Cheers!!
    Ashish
    Please mark it as Answered if it answered your question or mark it as Helpful if it helped you solve your problem.

    Corrected

    Thanks for the catch


    Please Mark This As Answer if it solved your issue
    Please Vote This As Helpful if it helps to solve your issue
    Visakh
    ----------------------------
    My Wiki User Page
    My MSDN Page
    My Personal Blog
    My Facebook Page

    lunedì 4 giugno 2018 18:06
  • You're most welcome!

    HTH,
    Cheers!!
    Ashish
    Please mark it as Answered if it answered your question or mark it as Helpful if it helped you solve your problem.

    lunedì 4 giugno 2018 18:24