none
Change Data Capture (CDC) cleanup job only removes a few records at a time

    Question

  • Hi everyone, 

    Summary of the problem: I'm a beginner with SQL Server. For a project I need CDC to be turned on. I copy the cdc data to another (archive) database and after that the CDC tables can be cleaned immediately. So the retention time doesn't need to be high, I just put it on 1 minute and when the cleanup job runs (after the retention time is already fulfilled) it appears that it only deleted a few records (the oldest ones). Why didn't it delete everything? Sometimes it doesn't delete anything at all. After running the job a few times, the other records get deleted. I find this strange because the retention time has long passed.

    Entire explanation of what I did: I set the retention time at 1 minute (I actually wanted 0 but it was not possible) and didn't change the threshold (= 5000). I disabled the schedule since I want the cleanup job to run immediately after the CDC records are copied to my archive database and not particularly on a certain time.
    My logic for this idea was that for example there will be updates in the afternoon. The task to copy CDC records to archive database should run at 2:00 AM, after this task the cleanup job gets called. So because of the minimum retention time, all the CDC records should be removed by the cleanup job. The retention time has passed after all?
    After a suggestion I just tried to see what happened when I set up a schedule again in the job (with a retention time of an hour this time), like how CDC is meant to be used in general. I had 3 records in that table. I changed 2 records at the same time and a few minutes later another record. After the time has passed I checked the CDC table and turns out it also only deletes the oldest record. I set another schedule again and then it deleted another record (the 2nd record of the 2 that I updated at first). Why aren't they purged all at once? What am I doing wrong?

    I made a workaround where I made a new job with the task to delete all records in the CDC tables (and disabled the entire default CDC cleanup job). This works better as it removes everything but it's bothering me because I want to work with the original cleanup job and I think it should be able to work in the way that I want it to.

    Thanks,

    Kim

    PS. I made a tread like this in sql server 2014 yesterday, but it has like 10 views or so. I  thought that this subforum would be appropriate as well. If it's not allowed, I can delete the other thread ofcourse.  

    Thursday, August 14, 2014 9:43 AM

Answers

  • Hi Kim,

    I've tested and refered to the definition of the clean job stored procedure [sys].[sp_MScdc_cleanup_job].

    You are wondering why it only delete a few records even retention is set to 1 minute. Here is a simple logic for how clean job determine which record should be deleted:

    [sys].[sp_MScdc_cleanup_job] will use a low_water_mark_time as flag, then all commit_time less than low_water_mark_time will be deleted. How to get a low_water_mark_time:

    1.The clean job will find the latest commit_time from cdc.lsn_time_mapping, then minus retention time (here is 1 minute as you set) as a temp low_water_mark_time.

    2.Find the largest commit_time less than or equal the temp low_water_mark_time as the final low_water_mark_time. Below query can help you to find a low_water_mark_time in your environment, you can test (-1 is the retention):

    select sys.fn_cdc_map_lsn_to_time
    (sys.fn_cdc_map_time_to_lsn('largest less than or equal',(dateadd(minute, -1, (select max(tran_end_time) from cdc.lsn_time_mapping)))))

    For the deleting records, clean job use sys.sp_cdc_cleanup_change_tables to do it. Here is the part of the code, @p2 is the low_water_mark_time we get above:

    set @stmt = N'delete top( @p1 ) ' +

                        N' from ' + @change_table +

                        N' where __$start_lsn < @p2 '

    Now let's have a simulation:

    Suppose you have three records in cdc.dbo_table_CT which commit_time are 10:00, 10:03, 10:06. You execute the clean job at 10:08. Then low_water_mark_time should be 10:03 (10:06 minus 1min, then choose the largest less than or equal one), from above code for deleting, only 10:00 will be deleted. 

    I admit the retention is somehow confuse here. Hope above explanation can help you understand how this parameter is used.

    You may wonder what if I only have 10:00, 10:03, 10:06 three records and execute clean job at 11:00? From above logic, the low_water_mark_time should be 10:03 as well. Will it still only delete the 10:00? Actually you monitor cdc.lsn_time_mapping for some time and you can find there will be a record with tran_id = 0 inserted every 5 minutes. This design is to prevent above condition happening. So if you set retention to 1 minute, a record will exist here for at most 10 minutes. :)



    • Edited by SQL Team - MSFT Thursday, August 21, 2014 3:14 AM
    • Marked as answer by KimLee23 Thursday, August 21, 2014 8:55 AM
    Thursday, August 21, 2014 3:08 AM

All replies

  • Hi KimLee,

    I am trying to involve someone more familiar with this topic for a further look at this issue. Sometime delay might be expected from the job transferring. Your patience is greatly appreciated.

    Regards,
    Sofiya Li


    Sofiya Li
    TechNet Community Support

    Monday, August 18, 2014 7:05 AM
    Moderator
  • Hi Kim,

    I've tested and refered to the definition of the clean job stored procedure [sys].[sp_MScdc_cleanup_job].

    You are wondering why it only delete a few records even retention is set to 1 minute. Here is a simple logic for how clean job determine which record should be deleted:

    [sys].[sp_MScdc_cleanup_job] will use a low_water_mark_time as flag, then all commit_time less than low_water_mark_time will be deleted. How to get a low_water_mark_time:

    1.The clean job will find the latest commit_time from cdc.lsn_time_mapping, then minus retention time (here is 1 minute as you set) as a temp low_water_mark_time.

    2.Find the largest commit_time less than or equal the temp low_water_mark_time as the final low_water_mark_time. Below query can help you to find a low_water_mark_time in your environment, you can test (-1 is the retention):

    select sys.fn_cdc_map_lsn_to_time
    (sys.fn_cdc_map_time_to_lsn('largest less than or equal',(dateadd(minute, -1, (select max(tran_end_time) from cdc.lsn_time_mapping)))))

    For the deleting records, clean job use sys.sp_cdc_cleanup_change_tables to do it. Here is the part of the code, @p2 is the low_water_mark_time we get above:

    set @stmt = N'delete top( @p1 ) ' +

                        N' from ' + @change_table +

                        N' where __$start_lsn < @p2 '

    Now let's have a simulation:

    Suppose you have three records in cdc.dbo_table_CT which commit_time are 10:00, 10:03, 10:06. You execute the clean job at 10:08. Then low_water_mark_time should be 10:03 (10:06 minus 1min, then choose the largest less than or equal one), from above code for deleting, only 10:00 will be deleted. 

    I admit the retention is somehow confuse here. Hope above explanation can help you understand how this parameter is used.

    You may wonder what if I only have 10:00, 10:03, 10:06 three records and execute clean job at 11:00? From above logic, the low_water_mark_time should be 10:03 as well. Will it still only delete the 10:00? Actually you monitor cdc.lsn_time_mapping for some time and you can find there will be a record with tran_id = 0 inserted every 5 minutes. This design is to prevent above condition happening. So if you set retention to 1 minute, a record will exist here for at most 10 minutes. :)



    • Edited by SQL Team - MSFT Thursday, August 21, 2014 3:14 AM
    • Marked as answer by KimLee23 Thursday, August 21, 2014 8:55 AM
    Thursday, August 21, 2014 3:08 AM
  • Hi, thanks for your reply! In the meantime I had to find another to work with and made another solution to work with CDC in the way I wanted it to but I always kept wondering about this question. Thank you for clearing this up, the example explains it well.
    Thursday, August 21, 2014 8:55 AM