consistency check - condition and performance

  • Question

  • Hi, we're using DPM 2012 SP1. DPM runs as a VM on a Hyper-V cluster, and we recently had a networking problem where the DPM VM couldn't communicate with some VMs on the cluster for a while. The problem was resolved, but DPM marked all the replicas inconsistent and wants to run a consistency check. Why is this happening? I don't understand why a communication problem should force the replicas to become inconsistent.

    My second question is about performance during a consistency check. From my observations it's a very expensive IO operation: it goes slowly at about 25-30 MB/s, but it uses 500-600 IOPS, which is a lot. Is there a way to reduce the IOPS impact? Can you perhaps force DPM to use a larger block size or something?

    Thanks in advance

    Thursday, February 7, 2013 4:53 PM
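A quick back-of-envelope check on the figures quoted in the question (the numbers come straight from the post above, not from any DPM specification) shows the average IO size implied by 25-30 MB/s at 500-600 IOPS:

```python
# Estimate the average IO size implied by the observed throughput and IOPS.
# The throughput and IOPS figures are the ones reported in the question;
# nothing here is specific to DPM itself.

def avg_io_size_kb(throughput_mb_s: float, iops: float) -> float:
    """Average size of a single IO in KB, given throughput in MB/s and IOPS."""
    return throughput_mb_s * 1024 / iops

print(avg_io_size_kb(25, 500))  # -> 51.2 (KB per IO)
print(avg_io_size_kb(30, 600))  # -> 51.2 (KB per IO)
```

Both ends of the observed range work out to roughly 50 KB per IO, which is consistent with the questioner's hunch that a larger block size would mean fewer IOPS for the same throughput.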

All replies

  • Hi,

    If the VMs were in the middle of a backup when the network glitch occurred, that would result in the data source becoming inconsistent.  Look for failed synchronization jobs.

    A consistency check is definitely an IO-intensive operation: DPM basically checks each block of the .VHDs on the protected server and compares those blocks against the DPM replica.  The amount of data transferred reflects the sum of agent communications (CRC blocks) and the actual changed data.  The time to perform the CC depends on disk and network IO speeds and on the size of the data that needs to be compared.

    FYI - we are working on improving CC performance.


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.

    Thursday, February 7, 2013 5:38 PM
    Moderator
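The block-comparison process Mike describes (checking blocks on the protected server against the replica) can be sketched roughly as follows. This is a minimal illustration only, not DPM's actual implementation; the 64 KB block size and the use of CRC-32 are assumptions made for the example.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # hypothetical block size; DPM's real value isn't documented in this thread

def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """CRC-32 of each fixed-size block of a byte buffer."""
    return [zlib.crc32(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def changed_blocks(source: bytes, replica: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Indices of blocks whose checksums differ between source and replica."""
    src = block_checksums(source, block_size)
    rep = block_checksums(replica, block_size)
    return [i for i, (a, b) in enumerate(zip(src, rep)) if a != b]

# Example: two 256 KB buffers that differ only in the third block (index 2).
source = bytearray(4 * BLOCK_SIZE)
replica = bytearray(source)
replica[2 * BLOCK_SIZE] = 0xFF
print(changed_blocks(bytes(source), bytes(replica)))  # -> [2]
```

Only the checksums plus the blocks that actually differ need to cross the network, which matches the point above that the data transferred is the sum of the CRC traffic and the changed data.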
  • Hi Mike, thanks for responding. The glitch occurred before the backup window but lasted into the time when backups were scheduled, so DPM already had no connectivity when the backup was about to happen. I assumed it would fail the current synchronization job but still keep the previous backup, which was OK.

    From what happened, I understand (and hope it's not true) that any time there's a communication problem with the agent and a backup (synchronization job) is about to happen, it will make the replica inconsistent, which just seems wrong to me. Is this true?

    From your answer I gather there's no way to control its behavior with respect to block size/IOPS? Besides network throttling, that is, but we have that configured and it doesn't do much for IOPS; it just limits the transfer rate.

    Friday, February 8, 2013 7:02 AM
  • AFAIK yes, every time a job is scheduled to run and fails, for whatever reason, the replica will become inconsistent.

    The inconsistent state is just a warning that there are probably differences between your current data and the replica; that's why it pops up so quickly.

    The previous points are what they are, previous recovery points, so they stay unaffected. But if you have a daily sync set up and it fails, I would be happy that my backup is telling me there was/is an error with a protected volume/server.

    Limiting IOPS isn't built into DPM, so external software might help you there.

    Friday, February 8, 2013 12:19 PM
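Since DPM itself has no IOPS cap (only network throttling, which limits MB/s rather than operation count), any external pacing would have to sit in the IO path. As a generic illustration of the technique, the sketch below enforces a minimum interval between operations; it is not a DPM feature, and all names here are invented for the example.

```python
import time

class IopsLimiter:
    """Pacing helper: allow at most `max_iops` operations per second by
    enforcing a minimum interval between successive calls to wait().
    Generic rate-limiting technique only; DPM exposes no such knob."""

    def __init__(self, max_iops: int):
        self.min_interval = 1.0 / max_iops
        self.next_allowed = time.monotonic()

    def wait(self) -> None:
        """Block until the next operation is permitted."""
        now = time.monotonic()
        if now < self.next_allowed:
            time.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval

# Hypothetical usage around each replica read:
#     limiter = IopsLimiter(max_iops=200)
#     for each 64 KB read:
#         limiter.wait()
#         chunk = f.read(64 * 1024)
```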
  • Well, the problem with that is that when this network communication problem occurred (twice already), it forced 100+ VM backups into an inconsistent state, which isn't pretty, because the CC takes literally days and is very IO-intensive.

    Don't get me wrong, I want to be notified when jobs fail or when there's another problem. I just fail to see why a transient communication problem should make the replica inconsistent.

    But I guess it's more or less an academic discussion.

    Friday, February 8, 2013 12:35 PM
  • Hi MarkosP,

    DPM 2012 SP1 introduced true block-level tracking of VMs and now uses the dpmfilter to detect and track the changes. If for any reason we cannot track block-level changes (e.g. a cluster node blue-screens) or there's an unplanned failover (maybe due to a storage issue), DPM has to abandon the block-level tracking and resort to a CC to be sure we capture all changed blocks.  So, what I would need to know is exactly what kind of communication issue you faced.  Did it affect CSV volumes?

    Can you make this occur at will, and if so, what would you do to repro this condition?  If we can repro it, we can investigate the behavior and either explain it or try to fix it.


    Friday, February 8, 2013 3:24 PM
    Moderator
  • Hi Mike,

    Alright, first I should say we're not backing up whole VMs; we're backing up from inside the guests (mainly volumes and SQL backups).

    The problem didn't affect CSV or the nodes; it affected only the DPM 2012 VM and one other VM, AFAIK. When this happens, the DPM VM stops communicating with SOME VMs on other nodes in the cluster, while it can still communicate with all VMs on the same node and some on other nodes. I guess this is some ARP cache issue on the physical switches (it's not resolved yet). Anyway, it can be (temporarily?) resolved by live-migrating the affected VM to another cluster node (and back to the original if desired); then communication is restored. I can't really repro it, as it has happened only twice so far, and randomly.

    During that time, DPM couldn't communicate with some of the VMs, hence lost connection to the agents running inside those VMs.

    If there's something you want me to try/check for when (if) this happens again, let me know.

    Monday, February 11, 2013 5:44 AM
  • Hi,

    OK, then the block-level tracking at the host level does not apply; instead, the dpmfilter inside the guest keeps track of block-level changes in the application data that we're protecting. DPM does not know the guest is running on a cluster, so any cluster-related communication issue would never affect ALL VMs. What could happen is that while a synchronization is taking place for protected data source(s), the DPM agents lose communication, which causes the sync(s) to fail, and that in turn could cause the replica to become inconsistent, but only for the data sources inside that VM that were being backed up at the time of the sync failure.


    Monday, February 11, 2013 4:09 PM
    Moderator