none
type 1 and type 2 slowly change dimension

    Question

  • I am new to data warehousing

    3 questions regarding SCD type 1 and 2.

    1.

    In type 1 SCD, what's the best practice to load dimensions? Do I have to insert new and update existing based on natural key? Or can I just truncate dimension table then do a full load?

    2. In type 2 SCD, if I understand correctly, there can be more than one records with the same natural key with different surrogate keys? Then when you load the facts, how to decide what dimension surrogate key to lookup? Is it to pick up the latest record (same natural key with maximum surrogate key, or same natural key with end date is null if there is a end date implemented on the dimension table)?

    3. In type 2 SCD, there can be more than one records with the same natural key with different surrogate keys. Then these records are treated as different dimension members. So in SSAS, dimension design, these will appear to be different members but with probably same name.

    E.g.

    surrogate1, customer1, property1, enddate1

    surrogate2, customer1, property2, enddate2

    surrogate3, customer1, property3, null

    These will be three different member in customer dimension,but looks the same to end user because all customer1. Is this ok?

    Thanks

    Wednesday, April 16, 2014 8:28 AM

Answers

  • 1. It depends. If you are at Stage/Transient level, you might consider truncation and reload (presuming its not a very large load). Otherwise, former. 

    2. Generally SCD2 is accompanied with a flag column which can identify the active / inactive record sets. 

    3. It depends. You can either build up a formal de-duplication mechanism or use pre-filtered (for active records) data only. 

    Hope it clarifies. 


    Vikas S. Rajput

    • Proposed as answer by Vikas S. Rajput Wednesday, April 16, 2014 9:17 AM
    • Marked as answer by thotwielder Thursday, April 17, 2014 8:50 AM
    Wednesday, April 16, 2014 9:14 AM

All replies

  • 1. It depends. If you are at Stage/Transient level, you might consider truncation and reload (presuming its not a very large load). Otherwise, former. 

    2. Generally SCD2 is accompanied with a flag column which can identify the active / inactive record sets. 

    3. It depends. You can either build up a formal de-duplication mechanism or use pre-filtered (for active records) data only. 

    Hope it clarifies. 


    Vikas S. Rajput

    • Proposed as answer by Vikas S. Rajput Wednesday, April 16, 2014 9:17 AM
    • Marked as answer by thotwielder Thursday, April 17, 2014 8:50 AM
    Wednesday, April 16, 2014 9:14 AM
  • 1. It depends. If you are at Stage/Transient level, you might consider truncation and reload (presuming its not a very large load). Otherwise, former. 

    2. Generally SCD2 is accompanied with a flag column which can identify the active / inactive record sets. 

    3. It depends. You can either build up a formal de-duplication mechanism or use pre-filtered (for active records) data only. 

    Hope it clarifies. 


    Vikas S. Rajput

    Thanks. For 1, do you mean for initial load, I can use full load, then for incremental load, I do then insert and update?
    Wednesday, April 16, 2014 9:51 AM
  • Update is generally preferred at the later stages, like Warehouse. Transient/Stage are generally truncate and load runs, so full run generally suits well there. 

    Hope it helps.


    Vikas S. Rajput

    Thursday, April 17, 2014 11:45 AM
  • Also, please note that ST/TR/DW processes are generally treated differently from DM (or marts). Ideally, Marts refreshes shall be marked with reference / release counter so you can track changes to what data has been released to clients. 

    Thanks.


    <a href="au.linkedin.com/in/rajputsvikas‎">Vikas S. Rajput</a>

    Thursday, April 17, 2014 11:48 AM