none
Why is DPM 2012 R2 splitting this Co-located Protection Group Set across three different tapes? RRS feed

  • Question

  • I have 10 protection groups that ran Saturday night.  I have put information about them in the following table:

    I extracted the data in the table using Get-DPMProtectionGroup CmdLet and the job information from the Monitoring page of the DPM admin console.  As you can see from the tape barcode, DPM wrote the first four PGs to one tape, then started a second tape, and finally failed when starting a third because it had run out of free tapes.

    All of the protection groups are assigned to the same "Co-Located Protection Group Set" in the "Tape Optimization Setup" dialog box.

    Prior to the jobs running, I painstakingly ensured that each of the Protection Group's "Select Long-Term Goals" and "Select Library and Tape Details" screens are identical as follows:

    Prior to running the jobs I had already disabled autoclean on the tape library.

    I have checked to see if the tape drive is reporting IO error 0x8007045D using the method described in this blog post: http://blogs.technet.com/b/dpm/archive/2012/05/14/things-you-can-do-to-help-data-protection-manager-utilize-your-tapes-full-capacity.aspx

    There seems to be no mention of 0x8007045D or -2147023779 in any of the MSDPM*.Errlog files.

    These are LTO6 tapes with a capacity of 2.5 TB each.  The total data for all the jobs put together is less than 300GB.

    Why is DPM 2012 R2 splitting these protection groups across three different tapes? 

    What else should I check for that might cause DPM to start a new tape?


    • Edited by alx9r Monday, August 18, 2014 8:53 PM
    Monday, August 18, 2014 8:52 PM

All replies

  • Hi,

    Did you check the DPMRA*.Errlog files ?  Did you add the BufferQueueSize and BusyRetryCount registry entries ?


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.

    Tuesday, August 19, 2014 12:57 AM
    Moderator
  • Hi Mike,

    Thanks for the quick response.  I did check the DPMRA*.Errlog files for "0x8007045D" and "-2147023779".  There are no instances of those strings in the DPMRA*.Errlog files. Your blog post indicates adding BufferQueueSize and BusyRetryCount as potential cures for IO_DEVICE_ERROR 0x8007045D.  Since IO_DEVICE_ERROR 0x8007045D is not mentioned in any logs, I haven't added BufferQueueSize or BusyRetryCount. 

    Should I add the BufferQueueSize and BusyRetryCount anyway?  If so, how do I know if it's helping if there were no "0x8007045D" errors in the first place?

    Is there something else I should be looking for in the logs that would indicate why DPM switched tapes?

    Thanks for your help,

    Alex

    Tuesday, August 19, 2014 4:53 PM
  • Hi,

    The registry entries are preventative in nature so they will not cause any harm.

    You can see if the tape drive is reporting end of media by performing this search.

     Find /i "0x8007044C" dpmra*.errlog


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.


    Tuesday, August 19, 2014 5:09 PM
    Moderator
  • Hi Mike,

    I must have made a mistake earlier.  I retraced my steps and searched again for "0x8007044C" and "0x8007045D" in each of DPMRA*.errlog and MSDPM*.errlog.  There are 46kB worth of lines containing "0x8007045D" in the DPMRA*.errlog.  The other three searches yielded no results.  The "0x8007045D" errors all seem to look something like this:

    0C84 1804 08/17 06:11:15.676 22 genericthreadpool.cpp(537) [0000000000D87210]  WARNING Failed: Hr: = [0x8007045d] Thread 6148 received ERROR

    0C84 1C9C 08/17 06:11:15.676 18 bufferedmediareader.cpp(782) [0000000000DA0A30]  WARNING Failed: Hr: = [0x8007045d] CBufferedMediaReader - Received unknown error. number of bytes - 0, index - 5.

    The strange thing is that all of the occurrences are from either 08/12 or after 06:00 on 08/17 which is after all the backup jobs had finished.  None of the "0x8007045D" timestamps seem to match up with 08/17 00:15 or 08/17 02:38 which are the times when DPM switched to new tapes.

    I have added BufferQueueSize and BusyRetryCount to the registry.  The co-located protection groups in question are scheduled to backup to tape on Saturday night.  Is there any way to run Saturday's backup jobs early without manually changing the schedule of all of the protection groups?

    Alex


    • Edited by alx9r Tuesday, August 19, 2014 7:01 PM
    Tuesday, August 19, 2014 6:52 PM
  • Hi,

    All the times logged in the DPM Errlog files are in GMT time - so based on local time zone, you need to subtract from the logged time.   Lets see if the registry entries worked, they should unless you have a hardware problem.


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.

    Tuesday, August 19, 2014 7:48 PM
    Moderator
  • Hi Mike,

    Can you confirm the correct path to the BusyRetryCount registry entry?  Your blog post mentions three different spellings of the final level of the hierarchy:

    • "Storport"
    • "StorPort"
    • "StorpPort"

    I'm still seeing 0x8007045d errors.  I'd like to ensure the registry entry I created is being picked up.  Is there a way to see that happening in the logs?

    Alex

    Monday, August 25, 2014 4:03 PM
  • Hi

    Gosh - good catch - I don't believe the correct entry is case sensitive - however the correct spelling is StorPort.

    The DPMRA logs are only going to detail that the tape drive reported an IO error.  The BufferQueueSize is also important in eliminating the errors.  If you have multiple tape drives, you can try disabling all but one at a time and see if it's just a particular drive, or if it's a bus problem.


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.

    Monday, August 25, 2014 5:26 PM
    Moderator
  • Hi Mike,

    I added another free tape and invoked "Resume Tape Backups..." for each of the failed protection groups.  This time there were no 0x8007045d errors.  Strangely, DPM completed backup for exactly one PG, produced the warning "Job waiting for tape" and, most bizarrely, marked the new tape as "offsite ready". 

    All of the resumed jobs should have been co-located onto a single tape.  I have a "Write period" of 1 day for this PG set. Is DPM refusing to co-locate PGs and marking this new tape as "offsite ready" because it is more than 1 day after the scheduled back up time?  If so, how do you ever get resumed backup jobs to co-locate?

    Alex

    Monday, August 25, 2014 9:44 PM
  • Hi,

    Yes, most likely that is what happened.  For D2D2T - the recovery point creation time is that of the last disk based recovery point at the time the tape backup.  If you retry a tape backup, we actually mount the shadow copy for the disk based backup from the day the tape backup was initiated.   If that disk based backup is already a day old, then we cannot write it on the same tape because it would exceed the Write Period.   You would need to change the write period so the retried backup jobs could co-locate - then change it back in time for the new round of backups.


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.

    Monday, August 25, 2014 11:18 PM
    Moderator
  • Thanks for clarifying that behavior.  I just tested with a longer write period and confirm that invoking "Resume tape backups..." on multiple PGs in the same PG set results in them being collocated.

    I don't understand the philosophy behind this behavior though.  Isn't the point of "Write Period" to limit collocation of shadow copies to those that were created during a limited period of time?  That way you end up with all of the shadow copies from say, last weekend, on a single tape.  Instead, DPM seems to be limiting collocation of shadow copies to those that can be successfully written to tape during a limited period of time.  That behavior basically guarantees tape count amplification as soon as a tape backup job fails for any reason.  Am I missing something here?

    Alex

    Tuesday, August 26, 2014 5:01 PM
  • Hi Mike,

    I'm still finding that DPM often mysteriously splits backups across multiple tapes and immediately marks those tapes as "Offsite Ready".  Here is what I have confirmed:

    • "BufferQueueSize" has value "3" at HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Agent
    • The tape drive is enumerated at SCSI\SEQUENTIAL&VEN_IBM&PROD_ULT3580-HH6\1&1C121344&0&000100
    • "BusyRetryCount" has value "250" at HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\SCSI\SEQUENTIAL&VEN_IBM&PROD_ULT3580-HH6\1&1C121344&0&000100\Device Parameters\StorPort
    • All PGs are members of the same PGSet with a Write Period of 6 days and "Don't allow backups of different retention periods to co-locate on same tape" disabled. 
    • All PGs have identical TapeBackupOptions (I have compared them using Get-DPMTapeBackupOption
    • All PGs have identical weekly and monthly schedules (I have compared them using Get-DPMPolicySchedule)
    • Find /i "0x8007044C" *.errlog reveals no errors
    • Find /i "0x8007045D" *.errlog reveals errors from September but no errors since then.

    Is there any other place I should be looking for errors that DPM might be interpreting as "End of Tape"?

    Are there reasons other than errant "End of Tape" errors that could cause DPM to abandon writing a tape part-way?

    Thanks for your help,

    Alex

    Monday, February 2, 2015 10:46 PM
  • Hi,

    You can look in the DPMRA logs for the code 0x8007044C which is  "The physical end of the tape has been reached."  See if they match times when tapes are spanned prematurely or marked offsite ready.


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.

    Tuesday, February 3, 2015 2:24 AM
    Moderator
  • Hi Mike,

    There are no 0x8007044C errors.  I ran the following search:

    cd "C:\Program Files\Microsoft System Center 2012 R2\DPM\DPM\Temp"
    Find /i "0x8007044C" *.errlog > c:\users\un1a\desktop\0x8007044C.log
    

    The resulting 0x8007044C.log file shows no error lines, only the names of each *.errlog file:

    ---------- DPMRA0.ERRLOG
    ...
    ...
    ---------- DPMRACURR.ERRLOG

    Is there any other place to look for clues to why DPM isn't filling tapes?

    Alex

    Tuesday, February 3, 2015 3:36 PM
  • Hi,

    Assuming you are making daily disk based RP's - I see no reason why DPM would not be able to fill tapes when it comes time to doing the weekly tape based backup.  Would probably need to enable verbose logging during next scheduled tape backup time and see why DPM is selecting new tapes instead of appending.   Please open a support case to help track this down.


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.

    Tuesday, February 3, 2015 9:43 PM
    Moderator
  • Hi Mike,

    I opened a case with Microsoft Support Services. It is SR number 115020312355639.

    Thanks for your help,

    Alex

    Tuesday, February 3, 2015 11:49 PM
  • Hi Alex,

    Have you been able to resolve this?

    I'm having the exact same problem as yourself (except I'm using LTO5 tapes) and have tried all the steps listed here and on that link as well.

    What would normally take 4 tapes to complete now takes 7-8 tapes .

    Using DPM 2012 R2, Windows Server 2012 R2 and IBM 3573 Tape Library.

    I'm desperately looking for a solution to this as I keep running out of backup tapes and having to purchase more and more.

    Please let me know if you find a solution.

    Thanks,

    Dallas

    Friday, February 6, 2015 9:43 AM
  • Hi Dallas,

    I am in the same boat: What should fit on one tape suddenly spread two, then three, then four tapes.  I've bought more tapes, but I'm afraid that is just a stop-gap, since it seems that the more workloads I add, the more tapes DPM spreads the backups across. 

    My Microsoft Support Services case has been open for about four days. I do not have any progress to report. I will report back here if there is progress I can share.

    Alex

    twitter: @alx9r

    Friday, February 6, 2015 3:11 PM
  • Hi,

    Please make sure each protection group is part of PGSet and that the write period and expiry tolerance are high enough - you can set high then reduce once you confirm those settings are related to the problem.


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.

    Friday, February 6, 2015 4:30 PM
    Moderator
  • Hi Mike,

    My Write period and Expiry Tolerance are 6 days and 7 days, respectively.  Based on the mechanism described in your blog post and here, those values should be more than adequate to permit a tape to fill.  Are you suggesting that they should be set even higher?  If so, what values should I set them to?

    Alex

    Friday, February 6, 2015 4:55 PM
  • Hi,

    The co-location settings in this BLOG does not really apply to DPM 2012 and later because the PGSet feature was introduced and replaced the mechanisms described in the blog.  As a troubleshooting step, yes I would increase those values and see if it makes a difference.


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.

    Friday, February 6, 2015 5:03 PM
    Moderator
  • OK I have changed both Write Period and Expiry Tolerance to 12 months.

    Alex

    Friday, February 6, 2015 5:23 PM
  • Hi Mike,

    Here's the latest:

    • 2015-02-06: changed both Write Period and Expiry Tolerance to 12 months
    • 2015-02-07: enabled verbose logging by setting TraceLogLevel to 0x43e as per this technet article
    • 2015-02-08 0300: weekly tape backup began with verbose logging
    • 2015-02-08 0926: weekly tape backup completes having used only a single tape

    Alex

    Monday, February 9, 2015 4:34 PM
  • Hi Dallas,

    Take a look at my latest post below -- I made some changes to my DPM config (increased WritePeriod and ExpiryTolerance, enabled verbose logging) and it collocated everything on one tape for our weekly. You might try doing the same thing to see if it makes a difference. Not exactly a long-term solution, but it might take some load off your tape tape budget if it works.

    Alex

    Monday, February 9, 2015 4:42 PM
  • Hi Alex,

    Thanks for that, I'll give it a go!

    I'm glad it's worked for you. Fingers crossed it works for me too!

    Dallas


    Thursday, February 12, 2015 3:57 AM
  • Hi Alex,

    I took a look at your verbose logs, unfortunately we don't log the decisions that DPM makes when determining what tape to select for the next backup.  There are a lot of decisions based on date of the disk based recovery point that is about to be written to tape, what the PGset settings are, the expiry of the data set that is about to be written, and what is already written to the tape.  So without seeing these decision points in the log, I cannot specifically tell you why your original PGset settings were problematic. 


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.

    Thursday, February 12, 2015 2:28 PM
    Moderator
  • Hi Mike,

    Thanks for taking a look at the logs.  I just wanted to point out that there were actually two changes made just before collocation succeeded:

    1. Write Period and Expiry Tolerance were increase
    2. Full Verbose logging was enabled

    This technet article states that Full Verbose logging affects performance.  If the problem with collocation is caused by a race condition then it is possible that enabling logging actually caused collocation to succeed. At this point there doesn’t seem to be any evidence pointing to which of the two changes resulted in collocation succeeding.

    I propose I run the following tests to differentiate the two:

    Test #1 - Collocation Confirmation

    Repeat last weekend's test to confirm that successful collocation wasn't an anomaly.  Configuration:

    • full verbose logging: Enabled
    • Expiry Tolerance: 1 year
    • Write Period: 1 year

    Test #2 - Logging Only

    Test to see whether enabling Full Verbose logging enabled collocation to succeed.  Configuration:

    • full verbose logging: Enabled
    • Expiry Tolerance: 7 days
    • Write Period: 6 day

    Thanks for your help,

    Alex

    Thursday, February 12, 2015 3:05 PM
  • Hi,

    I would like to propose that you just disable verbose logging since as I mentioned earlier, the logs do not contain the desired information and yes, verbose logging will effect performance.  That way the only change made were to the expiry tolerance and write period.


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.

    Thursday, February 12, 2015 3:12 PM
    Moderator
  • Hi Mike,

    Thanks for the guidance.  Based on it, here is the test configuration I'll be running on 2015-02-15:

    Test #3 - Long Expiry and Write Period, No Verbose Logging

    Configuration:

    • full verbose logging: Disabled
    • Expiry Tolerance: 1 year
    • Write Period: 1 year

    I'll post the outcome as soon as I have it.

    Thanks for your help,

    Alex

    Friday, February 13, 2015 9:01 PM
  • Hi Mike,

    It looks like my installation is affected by the console crash problem after the UR5 update. I was able to set your recommended configuration before the weekly tape backups, and I suspect the test ran, however, it'll be a couple days before I have a chance to dig into the fix and regain console access to my DPM server.  I'm hoping there is a Microsoft-supported fix to the UR5 console crash problem that emerges by then.

    Alex


    • Edited by alx9r Monday, February 16, 2015 5:53 PM
    Monday, February 16, 2015 5:52 PM
  • Hi Mike,

    I resolved that console crash problem.  It looks like the last test indeed completed with the following configuration:

    • full verbose logging: Disabled
    • Expiry Tolerance: 12 Months
    • Write Period: 12 Months

    Result: Collocation succeeded.

    This indeed points to the Expiry Tolerance and Write Period settings being at fault.  Please advise what test I should run next.

    Alex

    Friday, February 20, 2015 1:22 AM
  • The February 22nd weeklies backups ran with following configuration:

    • full verbose logging: Disabled
    • Expiry Tolerance: 1 Month
    • Write Period: 1 Month

    Result: Collocation Succeeded.

    Wednesday, February 25, 2015 9:38 PM
  • The March 1st weekly backups ran with the following configuration:

    • full verbose logging: Disabled
    • Expiry Tolerance: 2 weeks
    • Write Period: 2 weeks

    Result: Collocation Succeeded

    Monday, March 2, 2015 4:58 PM
  • In case this helps someone else, here is what I learned at the end of all this.  There is one factor that can apparently prevent co-location that I have not seen documented anywhere yet: Data sources that have not been synchronized in a while.

    Here's how I understand how this works:

    For DPM to protect a data source, it must first "sync" that data source with DPM's own copy of that data source on disk.  DPM's tape backups are always disk-to-disk-to-tape, not disk-to-tape.  When a tape backup of a data source occurs, DPM uses the latest sync'ed copy of the data source*.  If the latest copy of one data source is older enough than another data source, then colocation is prevented because the Expiry Tolerance prevents the two data sources from being eligible to be put on the same tape. 

    Exactly how the expiry tolerance logic works is being applied in any particular situation is difficult to determine because it does not seem to turn up in any of the logs.  Increasing the Write Period and the Expiry Tolerance until colocation succeeds, and then reducing them individually in small increments might be the best way to arrive at good values for your situation.

    Alex

    * I think this might actually be "the latest copy of the data source prior to the scheduled tape backup."

    Tuesday, March 10, 2015 10:57 PM
  • Hi Alex, I know this thread is old but I am having the same issue. Does setting a write period of 2 weeks mean that tapes will not be marked offsite ready until 2 weeks after they were written to?
    Sunday, May 29, 2016 6:04 AM
  • Hi tpullins,

    You asked

    Does setting a write period of 2 weeks mean that tapes will not be marked offsite ready until 2 weeks after they were written to?

    Almost.  Here's what I think is correct: Setting a write period of 2 weeks means that a tape is written to as long as the recovery point being written is dated within 2 weeks of all other recovery points on the tape.  Note that by "dated" I mean the date the recovery point was backed up from the originating server; the date it was written to tape seems to be irrelevant.  Also note that as soon as data that does not meet that rule is encountered, a new tape is selected and the old one seems to get abandoned.  The abandonment seems to be permantent.  That is, even if another recovery point is encountered whose date falls within the write period of the abandoned tape, that recovery point is written to the new tape.

    Hope that helps,

    Alex

    Sunday, May 29, 2016 10:53 PM
  • In my scenario all of my PGs have identical tape retention settings and disk backups are created at least once daily, write period is 2 days so I figure it should work as is. I will try increasing the values, however can you confirm that this will cause DPM to not mark the collocated tapes ready for offsite until after the 2 week write period setting? Reason is we rely on the built in tape management report for sending/receiving tapes offsite on a weekly basis. I'd rather not keep tapes unnecessarily onsite after the weekly backups complete.
    Monday, May 30, 2016 12:06 AM
  • can you confirm that this will cause DPM to not mark the collocated tapes ready for offsite until after the 2 week write period setting?
    Nope.  I'm still not sure exactly how DPM decides to mark collocated tapes ready for offsite.  I don't use "ready for offsite".  I'm using the SQL query mentioned here to manage tape rotation.
    Monday, May 30, 2016 12:24 AM
  • Cool. So basically you are creating a custom report from the vw_DPM_Media view? We are perfectly fine with the built-in Tape Mangement report but if this is better I am open to switching.

    Thanks for the replies. I will experiment with the collocation settings.

    Monday, May 30, 2016 8:36 PM
  • I had troubles with the reliability of vw_DPM_Media.  It would sometimes include only a subset of known tapes in the results.  That's why I am using that custom query built from the three tbl_MM_x tables.  I'm not using any of the built-in reports so I can't really comment on them.

    Monday, May 30, 2016 8:44 PM