Incremental Load: What is the best approach and db design to deal with millions of records?

  • Question

  • Hi All,

Lots of questions came to my mind when I started designing an incremental approach for one of our requirements. Please suggest the best options to resolve these issues.

We have millions of records in our source and we are receiving around 3 lakh (300,000) records per day. As of now it runs as a full load and has already passed 1.5 million records, and we need to change that to an incremental approach. We started with SSIS but it was extremely slow, so we have now changed to a stored procedure. We will be doing some transformations and aggregations to generate multiple dashboards out of it. We need both detail and aggregated data to be shown in multiple dashboards. Detail-level data is required only when we get the id passed from the dashboard to the back-end SP.

The current design has staging, ETL and reporting tables. The staging tables have been designed to load all the files which we receive from the client. Multiple staging tables are joined and loaded into the ETL table. This is also a detail table, like staging, as we need detail data in the dashboard on request based on the id.

1. It is going to be a direct insert (incremental) into the staging tables, as we mostly expect new records, but there is a chance of getting existing records as well. We are using an SSIS package for the data load. Should I consider truncate and reload for the staging table? But I will miss the actual data in that case and will need to rely on the files if anything is required.

2. The ETL table is going to be loaded based on the last loaded date. Should I introduce a batch id to identify records in case the data load happens more than once a day?

3. We are using a stored procedure to insert into the ETL table. Should we go with SSIS instead? Which is the best approach?

4. The report tables are aggregates at multiple levels, except one which refers back to the ETL table to get details based on an input parameter. We are using stored procedures to populate all reporting tables.

I have some more questions to ask, but I will resolve this first part and then work on the rest.

    Please suggest.

    Thanks

    Julie

    Sunday, July 19, 2020 7:51 PM

All replies

here's my standard pattern for traditional DB loads - this should cover your questions 1-3.

re question 4: I usually use views to do the logical transformations, and that's fast enough because most reporting tools these days have their own internal columnstore db, so they end up creating an extract anyway rather than querying live.

    * take a copy of your source data
    * take a copy of your destination data that intersects your source
    * compare the two for changed values - update as required
* insert any new src rows (see the T-SQL sketch after this list)
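
In plain T-SQL the same pattern looks roughly like this - a sketch only, where src_table, destination, business_key and the col1/col2 columns are hypothetical placeholder names:

-- compare the two sets for changed values - update as required
update d
set d.col1 = s.col1,
    d.col2 = s.col2
from destination d
inner join src_table s
    on s.business_key = d.business_key
where d.col1 <> s.col1
   or d.col2 <> s.col2;   -- NULLs need extra care here, or compare a checksum/hash instead

-- insert any new src rows
insert into destination (business_key, col1, col2)
select s.business_key, s.col1, s.col2
from src_table s
where not exists (select 1 from destination d where d.business_key = s.business_key);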

    here's what it looks like in ssis

    * create a land table. this is a 1:1 copy of your src data schema. Used only for the ETL process. This is truncated during the ETL
    * create a staging table that is a 1:1 schema copy of your destination table. This is only used for the ETL process. This is truncated during the ETL
    * create a destination table. The data is persisted in this table once the load has finished

all tables need to be in the same db or at the very least on the same server - you need to do joins/filters against each table

    consider using datetime2 instead of date if you have intraday loads
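
A minimal DDL sketch of those three tables, assuming a single business_key and two tracked columns (all names here are hypothetical; load_dt uses datetime2 per the note above):

create table dbo.land (                -- 1:1 copy of the source schema, truncated each load
    business_key int           not null,
    col1         varchar(50)   null,
    col2         decimal(18,2) null,
    load_dt      datetime2(3)  not null default sysdatetime()
);

create table dbo.staging (             -- 1:1 copy of the destination schema, truncated each load
    surrogate_key int           not null,
    business_key  int           not null,
    col1          varchar(50)   null,
    col2          decimal(18,2) null,
    load_dt       datetime2(3)  not null
);

create table dbo.destination (         -- persisted once the load has finished
    surrogate_key int identity(1,1) primary key,
    business_key  int           not null,
    col1          varchar(50)   null,
    col2          decimal(18,2) null,
    load_dt       datetime2(3)  not null default sysdatetime()
);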

make a 1:1 copy of the src system data into the land table. this is truncated each load. it can be either a full load or incremental
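
A truncate-and-load sketch for that step (hypothetical names; in practice this is often an SSIS data flow from the source files rather than a cross-database insert):

truncate table dbo.land;

-- full load ...
insert into dbo.land (business_key, col1, col2)
select business_key, col1, col2
from src.dbo.source_table;
-- ... or incremental: add "where load_dt > @last_loaded_date" to pull only the newer rows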

    the data flow looks like this

    

select l.*
from land l
order by l.business_key

select d.*
from destination d
inner join land l
    on d.business_key = l.business_key
order by d.business_key

    you now have all your new data and the rows in the destination that may get updated

    use the 3rd party checksum ssis transform

create a checksum on the columns that you want to track for changes - a change will trigger an update. do this for both the land and dest pipelines

use a merge join transform to left join the src and destination data flows
keep the src columns, the src checksum, the destination surrogate key, and the destination checksum


conditional split ("if") transform
 - if d.surrogate_key is null - new row - insert into destination
 - if d.surrogate_key is not null and the checksums are not equal - updated row - insert into staging
 - if d.surrogate_key is not null and the checksums are equal - existing row unchanged - do nothing (or send it to a dummy output so you can log the number of rows)


    now do a control flow update statement to merge your stage data into your destination table

update d
set d.col1 = s.col1,      -- set each tracked column (hypothetical column names)
    d.col2 = s.col2
from destination d
inner join staging s
    on d.surrogate_key = s.surrogate_key



    this pattern will scale well - much better than using the SCD task 

it's database agnostic - doesn't matter if it's oracle, mssql, postgres, mysql etc

It minimises the work done. the merge won't block because the data is sorted at the database

    and the visual data flow is easy to follow


    Jakub @ Adelaide, Australia Blog


    • Edited by jakubk Monday, July 20, 2020 1:09 AM
    Monday, July 20, 2020 1:05 AM
  • Julie

>>Should I consider truncate and reload for staging table? But I will miss the actual data in that case and need to rely on the file if anything is required.

Sure, truncate the staging table and populate it with the incremental load, then insert into the original table.

>>ETL table is going to be loaded based on the last loaded date. Should I introduce a batch id to identify records if the data load happens more than once a day?

It is better if you have a timestamp (datetime) column so you can catch the latest loaded rows:

declare @dt DATETIME

SELECT @dt = MAX(dt) FROM target_tbl   -- the last loaded date already in the target (placeholder table names)

Then

SELECT * FROM source_tbl WHERE dt > @dt   -- populate the staging table with only the newer rows

    >>We are using stored procedure to insert ETL table. Should we go with SSIS? Which is the best approach?

I prefer SSIS as the OLE DB Destination has a Fast Load option which greatly improves performance.

>>Report tables are aggregates on multiple levels except one which refers to the ETL table to get details based on an input parameter. Using stored procedures to populate all reporting tables.

If the aggregation takes time, try creating a columnstore index on the fields you aggregate.
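
For example, a sketch with hypothetical table and column names:

-- a nonclustered columnstore index over the columns the dashboards aggregate
create nonclustered columnstore index ncci_etl_agg
    on dbo.etl_table (report_date, region, amount);

Columnstore compression plus batch mode execution usually makes GROUP BY style dashboard queries much cheaper.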


Best Regards, Uri Dimant - SQL Server MVP, http://sqlblog.com/blogs/uri_dimant/

    MS SQL optimization: MS SQL Development and Optimization
    MS SQL Consulting: Large scale of database and data cleansing
    Remote DBA Services: Improves MS SQL Database Performance
    SQL Server Integration Services: Business Intelligence

    Monday, July 20, 2020 4:11 AM
Thanks Jakub! But 3rd party tools are not allowed for us.

    Monday, July 20, 2020 11:21 AM
that's a shame. Here is the one I was talking about: http://www.sqlis.com/post/checksum-transformation.aspx

you can duplicate it in a script component transform though; it just uses .NET to create a hash

    https://www.mssqltips.com/sqlservertip/3170/using-hash-values-in-ssis-to-determine-when-to-insert-or-update-rows/


    Jakub @ Adelaide, Australia Blog

    Monday, July 20, 2020 1:14 PM
  • Thanks Dimant!
    Monday, July 20, 2020 2:59 PM
  • You can use HASHBYTES() instead to detect changes.  I do not suggest using CHECKSUM() because on wide rows, it can create the same checksum value for small changes.

    https://blog.greglow.com/2018/07/02/sql-finding-rows-that-have-changed-in-t-sql-checksum-binary_checksum-hashbytes/

Sometimes it is more work to figure out what has changed than to just merge all the records. I do not recommend using HASHBYTES on small tables, or on tables where you don't care about the "last changed date".
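
A minimal T-SQL sketch of the HASHBYTES() approach (the table and column names are hypothetical placeholders):

-- rows whose tracked columns differ between staging and destination
select s.business_key
from staging s
inner join destination d
    on d.business_key = s.business_key
where hashbytes('SHA2_256', concat(s.col1, '|', s.col2, '|', s.col3))
   <> hashbytes('SHA2_256', concat(d.col1, '|', d.col2, '|', d.col3));

Persisting the computed hash in a column on the destination avoids recomputing it for unchanged rows on every load.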

    Monday, July 20, 2020 4:03 PM
    Answerer