Load fact table with SCD2 tips


  • Hi there,

    I am pointing to refresh load of fact table. I have 25 milion rows of facts (last thee years) and many dimensions. One of them is SCD2 type with one history-kept field. This dimension has about 60 000 business keys and in average every business key is stored in 10 variants.

    So I have to load fact table. I know that there is possibility of LOOKUP transformation with parameters , which is slow. I know that I can load into dataflow my facts and all dimension rows and conditionally find correct dimID, but it means, that if I have in average 10 versions of each business key, that I will get 9*25m useless rows in dataflow. I know about third variant ,where I can transform dimension valid_from and valid_to into rows for each date between such dates and cache such result and use it in lookup. But with my facts from last three years it means, that I have to cache something about 900 (days) * 60 000 rows.

    So I need to develop something better in custom script, my question is ,what are your tips, what kind of search algorithm to use.

    It is not big deal with 25m rows, I can simply use second variant and merge dim data with facts. But if I meet later 750m facts, then it will be painfull

    thanks for tips

    Tuesday, August 06, 2013 4:58 PM


All replies

  • So, the question is if you have the relationships already set. If yes, disable the relational constraints (may even perhaps drop them) and load the data using the BULK LOAD method (AKA "Fast Load") in SSIS; then re-enable constraints, or re-introduce them.

    PS: Dim tables typically hold the parent identity records you link to the fact.

    Arthur My Blog

    Tuesday, August 06, 2013 5:32 PM
  • The data for fact table is usually staged in raw\staging table, from where it is loaded into the fact table. Now, the fact table has FK referred to the respective dims. Therefore, we need to LOOK up for those PK (surrogate keys) of the dim based on business key join and then load.

    Since your dim is having only 60K records, it will be quick if you use SQL query in the LOOK up componet as

    WHERE Current_Flag='Y'
    Since you have SCD typ2 dim, I beleive you must have field such as Current_Flag to identify current records. This will further reduce the number of records returned for LOOK Up

    You, will not cache 60K records in look up...

    Thanks, hsbal

    Tuesday, August 06, 2013 8:58 PM

  • My dim has about 540k rows, I have 60k only of business keys. I am speaking about refresh load, so I can't rely on currently flagged records.  In fact I can't rely on actually flagged records also in incremental load, because I get data for last day + sometimes late records for last 2 weeks.
    Wednesday, August 07, 2013 5:06 AM
  • With a complete refresh you do disable or drop PK-FK constraints and then re-match and re-establish the relationships.

    Arthur My Blog

    Wednesday, August 07, 2013 3:47 PM
  • I understand what you say, but it wasnt my question, maybe I asked by wrong words. My question is, what is the fastest ways of refresh load in case, where you load 25m rows and you have to add correct key from dim of 600 000 rows (60000 active flags + 540 000 non active variants) ? Bulk load is clear, but most importatn is the phase when assigning scd2 dim key for fact rows in data flow.

    Sorry for late answer, I was not able to connect......

    Friday, August 16, 2013 10:52 AM
  • If it is prod change, best is to copy the 25m rcds into physical temp table, perform the operation on it with correct keys, verify the data and then replace the this correct data with the original data in the fact table...

    Thanks, hsbal

    Friday, August 16, 2013 1:05 PM