Storing and retrieving historic data

    Question

  • Hi,

    I am looking at making my database historic and would like some pointers. Data older than 1 year does not need to be kept, so every 12 months the data is "recycled".

    So far I have come up with the following using DATEPART and all data is in the same table (tables simplified):

    1. When I insert records:

    INSERT INTO [Table] VALUES (newid(), @FKID, @Value, GETDATE())
    

    2. When I update records:

    UPDATE [Table] SET [Column] = @Value WHERE [FK] = @FKID AND (DATEPART(month, GETDATE()) = DATEPART(month, [TimeStamp]))
    

    3. When I select records:

    SELECT * FROM [Table] WHERE [FK] = @FKID AND (DATEPART(month, GETDATE()) = DATEPART(month, [TimeStamp]))
    

    4. When I delete records:

    DELETE FROM [Table] WHERE [FK] = @FKID AND (DATEPART(month, GETDATE()) = DATEPART(month, [TimeStamp]))
    

     

    Is this efficient? I was looking at partitioning, but I am not sure that it would be feasible as I want to recycle data every 12 months. Or is there a better way to do it?

    Regards,

    James

    Monday, September 13, 2010 1:51 PM


All replies

  • In all those queries, you can replace the part "(DATEPART(month, GETDATE()) = DATEPART(month, [TimeStamp]))" with the condition below.

    [TimeStamp] >= DATEADD(month,datediff(month,0,getdate()),0) and 
    [TimeStamp] < DATEADD(month,datediff(month,0,getdate())+1,0)
    


    DATEPART(month, [TimeStamp]) will not allow the optimizer to use an index on [TimeStamp] efficiently, so you can replace that condition with the one I specified above in all of those queries.
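    For example (a sketch using the placeholder table and parameter names from the original question), the SELECT would become:

    SELECT *
    FROM [Table]
    WHERE [FK] = @FKID
      -- sargable range: from the start of the current month...
      AND [TimeStamp] >= DATEADD(month, DATEDIFF(month, 0, GETDATE()), 0)
      -- ...up to (but not including) the start of the next month
      AND [TimeStamp] < DATEADD(month, DATEDIFF(month, 0, GETDATE()) + 1, 0)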

    Monday, September 13, 2010 1:56 PM
  • The statements themselves are not efficient. You don't want to perform calculations (functions) on the TimeStamp column, especially if you have an index on it.

    The general suggestion for a scenario such as yours is to move old data to a different table (perhaps in a different database) and keep only the last year's data in the current table. This way you preserve the historical data and keep only fresh data in the main table.
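    For instance, a minimal sketch of the move, assuming a [Table_History] table with the same columns as [Table] (the table and column names here are just placeholders based on the original question):

    DECLARE @Cutoff datetime;
    SET @Cutoff = DATEADD(year, -1, GETDATE());

    BEGIN TRANSACTION;

    -- Copy rows older than one year into the history table...
    INSERT INTO [Table_History] ([ID], [FK], [Column], [TimeStamp])
    SELECT [ID], [FK], [Column], [TimeStamp]
    FROM [Table]
    WHERE [TimeStamp] < @Cutoff;

    -- ...and then remove them from the main table.
    DELETE FROM [Table]
    WHERE [TimeStamp] < @Cutoff;

    COMMIT TRANSACTION;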

    Let me know if you need a more detailed response with the logistics of the process.

    Take a look at this blog post to understand why your statements are inefficient:

    Only In A Database Can You Get 1000% + Improvement By Changing A Few Lines Of Code

     


    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    Monday, September 13, 2010 2:01 PM
  • If you do not mind, may I ask why you are comparing the month of GETDATE() with the month of [TimeStamp]?

    I think if you want to move the data from the base table to a historical table, then you will need a slightly different condition.

    Monday, September 13, 2010 2:01 PM
  • Thanks for the replies.

    The comparison with GETDATE() was done to get the data relating to the current month.

    Detailed information would be helpful! I am OK with having a history table in a different filegroup; however, I want to minimize maintenance and ideally automate moving the data to the history table. Would there be performance implications with bulk-moving data from the base table to the history table?

    Regards,

    James

    Monday, September 13, 2010 2:07 PM
  • You can have a job running every night to move records older than 1 year to a new database. The first run of this job will be slow, but after that it will only move a few transactions every night (or nothing at all to move).
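    As a rough sketch, the nightly job step could move and delete the rows in a single statement (the table and column names are placeholders; OUTPUT ... INTO requires that the history table has no triggers, foreign keys, or check constraints):

    -- Rows removed from the main table are captured by the OUTPUT clause
    -- and written to the history table within the same statement.
    DELETE FROM [Table]
    OUTPUT DELETED.[ID], DELETED.[FK], DELETED.[Column], DELETED.[TimeStamp]
        INTO [Table_History] ([ID], [FK], [Column], [TimeStamp])
    WHERE [TimeStamp] < DATEADD(year, -1, GETDATE());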
    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    Monday, September 13, 2010 2:18 PM
  • The historic information is going to be for an audit trail on the records in the tables. Because of this, there needs to be a persistent unique ID. Currently the ID is my PK column, so I would not be able to enter duplicate entries for the historic data. What would be the best way to approach this problem without too many changes to the database?

    Monday, September 13, 2010 2:28 PM
  • I'm not sure I understand the problem. Your current table has a unique ID. When you move the data into the historical table, you preserve the structure of the original table and the ID of the record (from the main table). You can, of course, add an extra ID to the historical table, but the original ID should be enough (as it's already unique).
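    As an illustration only (the column names and types are guesses based on the original question; adjust them to match the real table), the history table could look like this:

    CREATE TABLE [Table_History]
    (
        [HistoryID] int IDENTITY(1,1) PRIMARY KEY, -- optional surrogate key for the history row
        [ID]        uniqueidentifier NOT NULL,     -- original ID from the main table, preserved as-is
        [FK]        int NOT NULL,                  -- placeholder type; match the main table
        [Column]    nvarchar(100) NULL,            -- placeholder type; match the main table
        [TimeStamp] datetime NOT NULL
    );
    -- If each row is only ever moved once, [ID] itself could serve as the
    -- primary key and [HistoryID] can be dropped.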
    Premature optimization is the root of all evil in programming. (c) by Donald Knuth

    Naomi Nosonovsky, Sr. Programmer-Analyst

    My blog
    Monday, September 13, 2010 2:35 PM
  • Detailed information about what? The suggestion not to use functions on the datetime column, or the performance implications? Anyway, I will give an explanation of both.

    First execute the script below (table creation, insert statement, and index creation), then run the two queries at the end separately and check their execution plans (roughly 2% vs. 98% of the batch cost). You can see that the query which checks a range performs much better than the one which uses the MONTH function (which is the same as DATEPART(month)).

    If there is an index on the datetime column and you wrap that column in an expression like MONTH(ColumnName), the optimizer cannot use the index for a seek and will scan the entire index. With the range query, the optimizer can understand the conditions and check only the relevant part of the index.

    I would also suggest you read the link Naomi posted above.

    create table testindexes
    (
        Id int identity(1,1) primary key,
        MyDate datetime
    )
    
    -- Insert 100,000 rows with semi-random dates (0 to 99,999 days after 1900-01-01)
    ;with Numbers as
    (
        select 1 as Num union all select 2 union all select 3 union all select 4 union all select 5
        union all select 6 union all select 7 union all select 8 union all select 9 union all select 0
    )
    insert into testindexes (MyDate)
    select DATEADD(day, ABS(BINARY_CHECKSUM(newid()) % 100000), 0)
    from Numbers N, Numbers N1, Numbers N2, Numbers N3, Numbers N4
    
    create index idx_mydate on testindexes(MyDate)
    
    -- Query 1: wrapping the column in functions forces a scan of the whole index
    select * from testindexes where MONTH(MyDate) = 1 and YEAR(MyDate) = 2005
    
    -- Query 2: the equivalent range predicate lets the optimizer seek the index
    select * from testindexes where MyDate >= '20050101' and MyDate < '20050201'

     

    Coming to your other question about the performance implications of moving large amounts of data from the original table to the history table: usually this is done with SQL Agent jobs scheduled at night. The performance impact of the deletions mainly comes from fragmentation. You can reduce fragmentation by choosing a better clustering key. Since your table is date-based, a common choice of clustering key is the insert date plus a unique ID for that day's records. That way, deleting a large amount of data before a particular date removes entire pages instead of random records scattered across random pages, so there will not be a fragmentation problem either.
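    For example (just a sketch with the placeholder names from the original question, and assuming the primary key is declared NONCLUSTERED so the clustered index can be built on the date):

    -- Cluster on (TimeStamp, ID): rows inserted in the same period sit on the
    -- same pages, so deleting everything older than a cutoff date removes
    -- whole pages instead of scattered rows.
    CREATE CLUSTERED INDEX IX_Table_TimeStamp_ID
        ON [Table] ([TimeStamp], [ID]);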

    Monday, September 13, 2010 2:38 PM
  • First of all, you need to tell us what you are trying to accomplish with the "historical" and "production" data. If you are doing this because of performance issues, or because you want to be able to seamlessly purge/archive the data, table partitioning would work perfectly. You will be able to switch historical data to a different table and purge/archive it instantly.

    Speaking of the criteria for the partition function - it's a hard question. The straightforward approach is to use the date column. Obviously you'll need to include this column in all indexes in order to make them aligned. Another approach (if you're using an identity PK) is to partition based on ID intervals. When you want to create another partition, you can grab the current max ID and split the partition function based on it. You need to analyze/change your queries to decide which method you want to use. If you want queries to be efficient and limit the search to the right partition, you have to include a predicate which tells the optimizer where the data resides (i.e. either a predicate on the date if you split by date, or on the ID if you split by ID).
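    As a rough sketch of the date-based variant (the boundary values, object names, and filegroup are made up for illustration; the table would have to be created on the partition scheme, and the staging table must have an identical structure on the same filegroup):

    -- Monthly partitions on the date column.
    CREATE PARTITION FUNCTION pfByMonth (datetime)
    AS RANGE RIGHT FOR VALUES ('20100701', '20100801', '20100901');

    CREATE PARTITION SCHEME psByMonth
    AS PARTITION pfByMonth ALL TO ([PRIMARY]);

    -- Once a partition only holds data older than a year, switch it out to a
    -- staging table (a metadata-only operation), then archive or drop it.
    ALTER TABLE [Table] SWITCH PARTITION 1 TO [Table_Staging];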


    Thank you!

    My blog: http://aboutsqlserver.com

    Monday, September 13, 2010 2:39 PM
  • That's just about answered my confusing question! Thanks.

    Regards,

    James

    Monday, September 13, 2010 2:39 PM