Consistency check takes forever on large volumes

    Question

  • Hello Everyone,

    We have a site using DPM 2010, which in general is running okay backing up a 3-node cluster (HP servers, LeftHand iSCSI SAN).

    We have a file server with around 15TB of data on it, 10TB of which is on one drive. We have had problems with this server crashing or hanging, most recently due to the Windows Search protocol handlers choking on a file and freezing the server.

    The problem we have is that every time this happens, even if it's only once a month, DPM then takes nearly a week doing a consistency check over the drives.

    I have read in other forum posts that this is how DPM works: it needs to confirm that all the data on the replica matches the source and then transfer the differences.

    My question: is there a better way to engineer our system to handle the crashes? (Yes, we would like to stop them completely, but DPM itself then crashed the server by doing a Hyper-V parent backup, freezing this server for 3+ hours while trying to get a snapshot of the VM; we then had to save/restore the VM and reset it to get it working again.) This of course caused another CC and another week before we could get a tape backup.

    We are looking at other backup software if we cannot get this resolved; for example, we use ShadowProtect on 100+ other servers and it's much more robust after crashes.

    I was hoping someone could suggest how to fix this, or confirm that DPM 2012 makes the CC faster/smarter/better.

    From reading other forums, splitting the data up into different volumes will help if one becomes inconsistent, but not if the VM crashes/reboots, because then all the drives will have to CC.

    Thursday, March 15, 2012 6:23 AM

Answers

  • Hi,

    <snip>
    I was hoping someone could suggest how to fix this, or confirm that DPM 2012 makes the CC faster/smarter/better.
    </snip>

    There are currently no enhancements in DPM 2012 that specifically address the long CC time. With that said, we do realize the pain that the long CCs are causing, and we are investigating how to improve this. There won't be a quick solution, so if you can work on eliminating the server crashes or other causes of failed synchronizations, that will eliminate the need for a CC at all.


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.

    Friday, March 16, 2012 12:16 AM

All replies

  • Thank you, Mike, for taking the time to answer this. Good feedback. We will continue to watch DPM grow into a more powerful and reliable product.

    Friday, March 16, 2012 2:59 AM
  • The issue with CC jobs on large file server volumes is a huge problem with DPM. I have no idea how this can be improved, but it needs to be addressed. One file server I'm backing up also has more than 10 TB of data (but spread across several disks), and in the unfortunate event the cluster host crashes, the CC jobs take several days to complete. If DPM were able to create recovery points of the volumes it is running CC jobs on, it wouldn't be such a problem, but you basically end up without several days of backup. Not fun!
    Friday, March 16, 2012 3:56 PM
  • <snip>
    The issue with CC jobs on large file server volumes is a huge problem with DPM. [...] If DPM were able to create recovery points of the volumes it is running CC jobs on, it wouldn't be such a problem, but you basically end up without several days of backup. Not fun!
    </snip>

    Yup, this is our pain: a CC can take a full week after a crash, since we also throttle the agent during business hours, and because of this problem we are looking at other backup software.

    For example, ShadowProtect uses its own kernel driver to monitor changed blocks, so even after a crash a snapshot still takes only a few minutes. Ironically, even Microsoft's own Previous Versions happily continues to create snapshots without rescanning everything after a crash.

    So I know it can be done, and unfortunately the time has passed when it's acceptable to wait a week for a backup to complete.

    The other scary thing is that even if you had a good CC on the DPM server and the source server then crashes, you cannot put the successful backup already sitting on the DPM server onto tape. You have to wait for DPM to finish the CC.


    Friday, March 16, 2012 7:02 PM
  • Problem Reported: When Data Protection Manager needs to run a consistency check against a very large volume, the consistency check takes an excessive amount of time to complete. While the consistency check is ongoing, normal backups cannot be made, which may affect SLAs.

    The amount of time it takes to perform a consistency check against a data source has many variables.

    1) The number of files and directories on the protected volume. Millions of small files will take much longer than a handful of larger files of equal total space used.
    2) Whether SIS is enabled; SIS affects consistency check times (especially if you have SIS'd files in the recycle bin, so clear the recycle bin frequently on SIS-enabled volumes).
    3) Disk I/O speed on both the DPM server and the protected server.
    4) How busy each of the disks is (I/O per second, queue length).
    5) Network speed, bandwidth, and utilization.

    Check out the section HOW TO CHECK FOR BOTTLENECKS at the bottom if you wish to investigate items 3, 4, and 5 further. For item 1, a quick way to gauge the file count is sketched below.
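
    A rough sketch for counting files and directories from PowerShell (D:\ is a placeholder for the protected volume; note that on a multi-terabyte volume this enumeration can itself take a long time):

        # Stream the enumeration so millions of objects aren't held in memory
        $count = 0
        Get-ChildItem -Path 'D:\' -Recurse -Force -ErrorAction SilentlyContinue |
            ForEach-Object { $count++ }
        "Files and directories under D:\ : $count"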

    How to mitigate the problem if the issue is related to item 1 or 2 above:

    OPTION-1 - Convert the physical machine to a virtual machine and let DPM protect the virtual machine.

    With System Center 2012 Data Protection Manager, you can protect virtual machines on Windows 2008 R2 standalone Hyper-V servers, and DPM will perform block-level change synchronizations.

    With System Center 2012 SP1, this same block-level protection is extended to include Windows Server 2012-based clustered Hyper-V guests on CSV volumes.

    This means that DPM 2012 will track block-level changes to the .VHD files and perform true incremental backups. These backups are very fast and efficient. You also get the added benefit of being able to perform item-level restore (ILR) of individual files and directories inside the .VHDs using the recovery tab in the DPM console. In other words, you get the quick backups you would get with normal volume-level backups for file servers, plus granular restores. Should a consistency check need to be run against the VM's .VHD files, it will be much quicker, since we're doing block-level compares of the .VHD versus file-level compares for file protection.
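
    If you want to control when a consistency check runs (for example, kicking it off overnight rather than inside a throttled business-hours window), it can be scripted. A minimal sketch using the DPM Management Shell; the server name DPM01 and group name "File Servers" are placeholders, and you should verify the cmdlet names against your DPM version:

        # Find the protection group, then start a CC on each of its datasources
        $pg = Get-ProtectionGroup -DPMServerName "DPM01" |
              Where-Object { $_.FriendlyName -eq "File Servers" }
        foreach ($ds in (Get-Datasource -ProtectionGroup $pg)) {
            Start-DatasourceConsistencyCheck -Datasource $ds
        }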

    PRO: What may have taken days to complete for a file-system consistency check will take only minutes or hours for a virtual machine.
    CON: End-user recovery (EUR) will not work for protected Hyper-V virtual machines.
              As a workaround, you can enable local shadow copies on the volumes inside the virtual machine, and clients can restore previous versions from those local shadow copies (see the sketch after this list).

    CON: With Windows Server 2008 R2, Hyper-V only supports .VHD sizes of up to 2 TB. If your data set is larger than that, you would need to segregate your data using OPTION-2 below, or use a Windows Server 2012 Hyper-V server, which supports .VHDX files of up to 64 TB.
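
    A minimal sketch of that shadow-copy workaround, run from an elevated prompt inside the guest (E: is a placeholder for a data volume; size the shadow storage to your churn rate, and note that creating shadows this way requires a server SKU):

        # Reserve shadow-copy storage on the data volume, then take a snapshot;
        # schedule the "create shadow" line to run regularly
        vssadmin add shadowstorage /for=E: /on=E: /maxsize=10%
        vssadmin create shadow /for=E: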

    OPTION-2 - Use NTFS volume mount points to segregate your data across several smaller NTFS volumes.

    This option is only viable if the data can be segregated. If an application writes to the volume and cannot handle the data being reorganized under mount points, then this option will not work. If the data is written to and read from shares, then you should be able to reorganize the data to accommodate mount points.

    In my example below, Host_Volume (H:) is just a very small NTFS volume that only holds the mount-point directories and shares.

    UserShare# is a folder that is shared for user access and is the empty mount-point folder. Mounted_Volume# is the underlying volume that holds the user directories and data. DPM protects H: and the mounted volumes.

    This configuration allows a chkdsk and/or a DPM consistency check to run against only the subset of user files on one volume, which won't take as long. DPM protection continues as normal for the other volumes under protection.

    Host_Volume (H:)
       UserShare1 --> Mounted_Volume1
                          UserDir1
                          UserDir2
                          UserDir3
       UserShare2 --> Mounted_Volume2
                          UserDir4
                          UserDir5
                          UserDir6
       UserShare3 --> Mounted_Volume3
                          UserDir7
                          UserDir8
                          UserDir9
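
    A sketch for wiring up a layout like this with the built-in mountvol tool (the volume GUID below is a placeholder; run mountvol with no arguments first to list the real GUID paths):

        # Create the empty mount-point folder on the small host volume
        New-Item -ItemType Directory -Path 'H:\UserShare1' | Out-Null
        # List volume GUID paths, then mount the data volume into the folder
        mountvol
        mountvol H:\UserShare1 \\?\Volume{xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}\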


    You can “grow” a Mounted_Volume at the SAN level if you need more space on that disk for user data, then use the diskpart.exe command to grow the NTFS file system into the new free space. That way you can start with smaller LUNs and grow them over time as new users and data get added; see the sketch below.
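
    For example, from PowerShell (volume 3 is a placeholder; run "list volume" in diskpart first to find the right one):

        # After expanding the LUN on the SAN, extend NTFS into the new free
        # space by piping a script into diskpart.exe
        "select volume 3", "extend" | diskpart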

    File Systems
    http://technet.microsoft.com/en-us/library/cc938934.aspx


    HOW TO CHECK FOR BOTTLENECKS
    =============================

    If you open Resource Monitor on both servers during a consistency check and select the DPMRA and System processes, you can check disk I/O and see whether you have a bottleneck.

    You can also run Performance Monitor.

    Here are some basic perfmon counters to help narrow down the possible bottleneck (a Get-Counter sketch for sampling them follows the list):

    Perf Counters for DPM
    ******************

    Logical Disk / Physical Disk
    \% Idle Time
    • 100% to 50% idle = Healthy
    • 49% to 20% idle = Warning or Monitor
    • 19% to 0% idle = Critical or Out of Spec

    \Avg. Disk sec/Read and \Avg. Disk sec/Write
    • 0.001 to 0.015 sec (1-15 ms) = Healthy
    • 0.016 to 0.025 sec (16-25 ms) = Warning or Monitor
    • 0.026 sec (26 ms) or greater = Critical or Out of Spec

    \Current Disk Queue Length (all instances)
    • More than 80 outstanding requests for more than 6 minutes indicates a possibly excessive disk queue.

    Memory
    *******
    \Pool Nonpaged Bytes
    • Less than 60% of pool consumed = Healthy
    • 61% - 80% of pool consumed = Warning or Monitor
    • Greater than 80% of pool consumed = Critical or Out of Spec

    \Pool Paged Bytes
    • Less than 60% of pool consumed = Healthy
    • 61% - 80% of pool consumed = Warning or Monitor
    • Greater than 80% of pool consumed = Critical or Out of Spec

    \Available MBytes
    • 50% or more of memory free = Healthy
    • 25% of memory free = Monitor
    • 10% of memory free = Warning
    • Less than 100 MB or 5% of memory free = Critical or Out of Spec

    Processor
    *******
    \% Processor Time (all instances)
    • Less than 60% consumed = Healthy
    • 61% - 90% consumed = Monitor or Caution
    • 91% - 100% consumed = Critical
    *************************************************************
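
    A sketch for sampling the counters above from PowerShell instead of the perfmon GUI (run it on both the DPM server and the protected server during a CC; the interval, sample count, and output path are arbitrary choices):

        # Collect 12 samples, 5 seconds apart, into a .blg you can open in perfmon
        New-Item -ItemType Directory -Path C:\perf -Force | Out-Null
        $counters = '\LogicalDisk(*)\% Idle Time',
                    '\LogicalDisk(*)\Avg. Disk sec/Read',
                    '\LogicalDisk(*)\Avg. Disk sec/Write',
                    '\LogicalDisk(*)\Current Disk Queue Length',
                    '\Memory\Pool Nonpaged Bytes',
                    '\Memory\Pool Paged Bytes',
                    '\Memory\Available MBytes',
                    '\Processor(_Total)\% Processor Time'
        Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 12 |
            Export-Counter -Path C:\perf\dpm_cc.blg -FileFormat blg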


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.


    Tuesday, February 05, 2013 12:43 AM