Our SQL database was claimed to be OLAP database that includes billions of sales records. Records are inserted into the database once everyday. To improve performance of data retrieval, the sales records are denormalized and stored into flat sales table, which is provided to users. The data file size is in terms of TB.
To update records in the fact table, there is scheduled job that drops the flat tables and rebuild table/index during weekend. This takes quite a number of hours.
We are reviewing this process and see if we only update the records that required to change. In this way we do not need to drop table and spend hours to rebuild them. However, we are not sure whether this change will degrade performance on data retrieval due to fragmentation/etc.
We need some advice on whether to do this change, and want to know the best practice on handling outdated records in denormalized table. Thank you.
I would like to ask how much data provided to users is used by users in that flat table, is it all days or just couple of past days...in that basis...
In that case you can truncate and re-insert only few days data in flat table. that will save time compared to big table drop and create/insert
If thats not feasible, I would also recommend to create partitioned table based to Business Date and update those regularily than drop and create.
Thanks and Regards, Prajesh Please use Marked as Answer if my post solved your problem and use Vote As Helpful if a post was useful.
- Proposed as answer by Prajesh Thursday, July 25, 2013 9:57 AM
Thanks for reply.
The table includes data within 3 years. It includes columns such as product description.
Do you mean we could do data refresh based on partition (with Business Date)? If we refresh data in this way, will it slow down the data retrieval due to fragmentation?
Our SQL server is standard version right now and cannot perform partitioning. Fortunately, upgrading to enterprise version is not difficult if there is solid reason.
Product description .. shouldn't that be in a dimension table?
going off what Prajesh mentioned, you can load your data into a staging table that mimic your real fact table. align the indexes between these two tables, create the partition scheme, and load the data using option (fast 10000) in your query. 10k is the default buffer size for ssis data flow task. once data is loaded into the staging table, switch the data into the real fact table, and finally truncate your staging table.
I think you should evaluate your table design. A correctly designed starschema using a well defined clustered key will allow you to update/insert/delete records without impacting fragmentation and guaranteeing retrieval respons times
Geert Vanhove DCOD ------ http://geertvanhove.wordpress.com/ ----------- Please click the Mark as Answer or Vote As Helpful if a post solves your problem or is helpful!
Look at the reporting tables and see if you can use TSQL MERGE to update only the more recent rows, and consider using partitioning to store older rows separately, so they are not impacted by the nightly or weekly updates.
Hard to say more without DDL.
I would redesign the whole thing. I couldn't quite figure out if you meant you have 1 single table that has all your data in. I would go back to basics and try and figure out how your source system stores the data.
So questions you should be asking, how can I identify records that have changed or need inserting? Is it transaction, is there a way to use modified date on the source system (does the column exist, is it accurate).
Am I interested in history capture? Assuming if you re-load this table every week you are only seeing the current view of the data.
There are so many different ways to do this. You could build an ODS and figure out what has change. You could build fact and dimensions, you could go immon, you could build a data warehouse and then load from there into this large table.
You could upgrade SQL Server, so you have partitioning.
I think you need to understand the system and the reporting requirements, no one on this forum is going to be able to help really. It is such a big question that covers, architecture, design, performance, business requirements ...
Really you only want to load what has changed.