Nightly backup triggers CSV failure

  • Question

  • We are experiencing a problem with our Hyper-V cluster. Each night at 6 PM when the backups start, the cluster shared volume switches to a failed state. This only happens if we have the load split between the two nodes, with some VMs running on one node and some on the other. With the load split, I can reliably trigger the failure by manually starting multiple backup jobs against the cluster (hitting some VMs on each node).

    We have an HP P2000 G3 SAN, direct-connected by fibre to each node. The P2000 has 2 controllers in it. Each server has 2 FC HBAs, with one connected to each controller in the SAN. We have 2 volumes on the SAN, plus a snap pool. One large volume is set up as the CSV where all of our VMs are stored. The other volume is much smaller and is used as the quorum disk. Each server has the hardware VSS provider installed and is configured with MaxAllowedParallelBackups set to 3, as recommended by the TechNet article on using hardware VSS.
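
    For reference, provider registration can be double-checked on each node with the built-in VSS admin tool; the HP hardware provider should show up alongside the default Microsoft software provider:

    # Run on each cluster node to list the installed VSS providers.
    vssadmin list providers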

    We have a CarePack on all the hardware involved, so I have contacted HP support, but I don't seem to be making any progress. So far, they haven't found any problems, but they also haven't referred me to MS or otherwise said that it isn't an HP issue.

    I've seen a few other posts describing similar problems, but each seems to have a different solution, or no solution at all. Some give up and switch to the system VSS provider; others say one of the various MS hotfixes for Hyper-V resolved the issue. So far, nothing we've tried has helped.

    I did come across this article, http://support.microsoft.com/kb/2549533, which I don't think is related, but it does contain one line that gave me pause:

    "However, even when using a VSS hardware snapshot provider, the error may still occur if you have a single CSV hosting all your Guests."

    Is it not recommended to have a single CSV hosting all of the guests? I haven't found any documentation one way or the other on that.

    Any suggestions would be appreciated.  

    Monday, January 30, 2012 3:21 PM

Answers

  • See if this hotfix resolves your issue:

    http://support.microsoft.com/kb/2637197/en-US
    CSV LUNs fail if you use a VSS hardware provider to back up virtual machines on a Windows Server 2008 R2-based cluster


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.

    • Marked as answer by NeighborGeek Friday, March 2, 2012 2:55 PM
    Wednesday, February 22, 2012 6:43 PM
    Moderator

All replies

  • Hi,

    If I'm reading this correctly, the CSV disk is being put into a failed state by the Windows cluster service. This usually only occurs if the cluster cannot perform its "is alive" health check of the disk. Since hardware snapshots place more work on the SAN, perhaps there is a performance problem with the SAN. Continue working with HP, and if they cannot find the problem, you should have the Windows cluster support team look at the cluster logs to see why the disk is going into the failed state.


    Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.
    Tuesday, January 31, 2012 3:09 AM
    Moderator
  • HP is finally at least coming back and saying that it's not a problem with the SAN. They're pointing to it being a network bandwidth issue, but that doesn't seem to match what I'm seeing. For testing purposes, I have only 2 guests on the cluster: TestVM1 is on Node 1, and TestVM2 is on Node 2. If I initiate a consistency check on both guest VMs, I see the CSV switch to redirected mode for a short time, but it eventually goes to Offline Pending and then Failed status. Watching Resource Monitor during this process, I'm not seeing any significant increase in network traffic on any adapter other than the one the node should be using to communicate with DPM, and even that adapter isn't anywhere near maxed out. I don't see where there would be a network bottleneck at all.

    The SAN is connected to both nodes via FC. Each node has 4 NICs: 1 connected to a DMZ, 1 to our internal network, 1 to an internal/management network, and 1 to a private network (crossover) for the cluster. I don't think there's any reason we would need more than 1 dedicated 1 Gbps connection between the two nodes for CSV redirection, is there? I can't see how we possibly would, considering that we've moved all of our production load off the cluster and there are only 2 idle test VMs running on the SAN right now.

    Wednesday, February 15, 2012 8:45 PM
  • Hi,

    You need to configure your cluster networks as described in the following article.

    System Center Data Protection Manager 2010 Hyper-V protection: Configuring cluster networks for CSV redirected access
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;2473194
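
    In short, that article has you check which cluster network CSV redirected I/O will prefer and, if needed, pin your private network by lowering its metric. A rough PowerShell sketch (the network name is a placeholder; use your own):

    Import-Module FailoverClusters
    # The cluster-enabled network with the lowest metric is preferred
    # for CSV redirected I/O.
    Get-ClusterNetwork | Format-Table Name, Metric, AutoMetric, Role
    # Pin the private/CSV network to a low metric ("Cluster Network 1"
    # is a placeholder name).
    (Get-ClusterNetwork "Cluster Network 1").Metric = 900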

    If the CSV is going offline/failed in the cluster, there must be cluster events associated with that. What are the details of those events?
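
    Something like the following will pull those errors from the System log (the one-hour window is just an example):

    # Collect FailoverClustering errors from the System log.
    Get-WinEvent -FilterHashtable @{
        LogName      = 'System'
        ProviderName = 'Microsoft-Windows-FailoverClustering'
        Level        = 2                        # 2 = Error
        StartTime    = (Get-Date).AddHours(-1)  # example window
    } | Format-List TimeCreated, Id, Message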


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.

    Wednesday, February 15, 2012 10:11 PM
    Moderator
  • Okay, I adjusted the network config somewhat to line up with that document.  I've seen it before, and I think we were roughly in line with it, but I believe it's now as close to ideal as it can be with the networks and adapters available.  

    Events associated with the cluster disk failure: 

    Log Name:      System
    Source:        Microsoft-Windows-FailoverClustering
    Date:          2/15/2012 12:40:00 PM
    Event ID:      5121
    Task Category: Cluster Shared Volume
    Level:         Error
    Keywords:      
    User:          SYSTEM
    Computer:      NODE2.domain.com
    Description:
    Cluster Shared Volume 'Volume1' ('Cluster Disk 1') is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network through the node that owns the volume. This may result in degraded performance. If redirected access is turned on for this volume, please turn it off. If redirected access is turned off, please troubleshoot this node's connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished.
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
        <EventID>5121</EventID>
        <Version>0</Version>
        <Level>2</Level>
        <Task>38</Task>
        <Opcode>0</Opcode>
        <Keywords>0x8000000000000000</Keywords>
        <TimeCreated SystemTime="2012-02-15T18:40:00.878404200Z" />
        <EventRecordID>318813</EventRecordID>
        <Correlation />
        <Execution ProcessID="6364" ThreadID="2728" />
        <Channel>System</Channel>
        <Computer>NODE2.domain.com</Computer>
        <Security UserID="S-1-5-18" />
      </System>
      <EventData>
        <Data Name="VolumeName">Volume1</Data>
        <Data Name="ResourceName">Cluster Disk 1</Data>
      </EventData>
    </Event>

    Log Name:      System
    Source:        Microsoft-Windows-FailoverClustering
    Date:          2/15/2012 12:40:37 PM
    Event ID:      1034
    Task Category: Physical Disk Resource
    Level:         Error
    Keywords:      
    User:          SYSTEM
    Computer:      NODE2.domain.com
    Description:
    Cluster physical disk resource 'Cluster Disk 1' cannot be brought online because the associated disk could not be found. The expected signature of the disk was '4563FFE3'. If the disk was replaced or restored, in the Failover Cluster Manager snap-in, you can use the Repair function (in the properties sheet for the disk) to repair the new or restored disk. If the disk will not be replaced, delete the associated disk resource.
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
        <EventID>1034</EventID>
        <Version>0</Version>
        <Level>2</Level>
        <Task>18</Task>
        <Opcode>0</Opcode>
        <Keywords>0x8000000000000000</Keywords>
        <TimeCreated SystemTime="2012-02-15T18:40:37.687094300Z" />
        <EventRecordID>318814</EventRecordID>
        <Correlation />
        <Execution ProcessID="6508" ThreadID="10440" />
        <Channel>System</Channel>
        <Computer>NODE2.domain.com</Computer>
        <Security UserID="S-1-5-18" />
      </System>
      <EventData>
        <Data Name="ResourceName">Cluster Disk 1</Data>
        <Data Name="DiskSignature">4563FFE3</Data>
      </EventData>
    </Event>

    Log Name:      System
    Source:        Microsoft-Windows-FailoverClustering
    Date:          2/15/2012 12:40:37 PM
    Event ID:      1069
    Task Category: Resource Control Manager
    Level:         Error
    Keywords:      
    User:          SYSTEM
    Computer:      NODE2.domain.com
    Description:
    Cluster resource 'Cluster Disk 1' in clustered service or application '94f78f0e-a995-4f83-ab72-1f435a3548ba' failed.
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
        <EventID>1069</EventID>
        <Version>0</Version>
        <Level>2</Level>
        <Task>3</Task>
        <Opcode>0</Opcode>
        <Keywords>0x8000000000000000</Keywords>
        <TimeCreated SystemTime="2012-02-15T18:40:37.687094300Z" />
        <EventRecordID>318815</EventRecordID>
        <Correlation />
        <Execution ProcessID="6364" ThreadID="7708" />
        <Channel>System</Channel>
        <Computer>NODE2.domain.com</Computer>
        <Security UserID="S-1-5-18" />
      </System>
      <EventData>
        <Data Name="ResourceName">Cluster Disk 1</Data>
        <Data Name="ResourceGroup">94f78f0e-a995-4f83-ab72-1f435a3548ba</Data>
      </EventData>
    </Event>

    Log Name:      System
    Source:        Microsoft-Windows-FailoverClustering
    Date:          2/15/2012 12:40:38 PM
    Event ID:      1205
    Task Category: Resource Control Manager
    Level:         Error
    Keywords:      
    User:          SYSTEM
    Computer:      NODE2.domain.com
    Description:
    The Cluster service failed to bring clustered service or application '94f78f0e-a995-4f83-ab72-1f435a3548ba' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Microsoft-Windows-FailoverClustering" Guid="{BAF908EA-3421-4CA9-9B84-6689B8C6F85F}" />
        <EventID>1205</EventID>
        <Version>0</Version>
        <Level>2</Level>
        <Task>3</Task>
        <Opcode>0</Opcode>
        <Keywords>0x8000000000000000</Keywords>
        <TimeCreated SystemTime="2012-02-15T18:40:38.236118200Z" />
        <EventRecordID>318818</EventRecordID>
        <Correlation />
        <Execution ProcessID="6364" ThreadID="10996" />
        <Channel>System</Channel>
        <Computer>NODE2.domain.com</Computer>
        <Security UserID="S-1-5-18" />
      </System>
      <EventData>
        <Data Name="ResourceGroup">94f78f0e-a995-4f83-ab72-1f435a3548ba</Data>
      </EventData>
    </Event>

    Log Name:      System
    Source:        Foundation Agents
    Date:          2/15/2012 12:42:24 PM
    Event ID:      1168
    Task Category: Events
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      NODE2.domain.com
    Description:
    Cluster Agent: The cluster resource Cluster Disk 1 has failed. 
    [SNMP TRAP: 15006 in CPQCLUS.MIB]
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Foundation Agents" />
        <EventID Qualifiers="50229">1168</EventID>
        <Level>2</Level>
        <Task>4</Task>
        <Keywords>0x80000000000000</Keywords>
        <TimeCreated SystemTime="2012-02-15T18:42:24.000000000Z" />
        <EventRecordID>318822</EventRecordID>
        <Channel>System</Channel>
        <Computer>NODE2.domain.com</Computer>
        <Security />
      </System>
      <EventData>
        <Data>4</Data>
        <Data>4</Data>
        <Data> = </Data>
        <Data>Cluster Disk 1</Data>
        <Data>0</Data>
        <Data>0</Data>
        <Data>0</Data>
        <Data>0</Data>
        <Data>0</Data>
        <Data>0</Data>
        <Data>0</Data>
        <Data>0</Data>
        <Data>0</Data>
        <Data>0</Data>
        <Data>0</Data>
        <Binary>(lengthy binary event payload omitted)</Binary>
      </EventData>
    </Event>

    I really don't see much in those events, but hopefully you can glean more from them. There's also an informational event just before that last one, showing that the hardware VSS service stopped.
    • Edited by NeighborGeek Thursday, February 16, 2012 4:46 AM
    Thursday, February 16, 2012 4:42 AM
  • The above event messages have nothing to do with networking, and I have no clue why HP would be pointing you in that direction.

    Looking at your hardware configuration again, these are Fibre-attached disks, not iSCSI, so networking plays no part here.

    <Snip>
    We have an HP P2000 G3 SAN, direct connected by fibre to each node.  The p2000 has 2 controllers in it.  Each server has 2 FC HBA's in it, with one connected to each controller in the san.
    >snip<

    I think you need to get HP to look at the switch logs and figure out why that disk is disappearing from Windows; the error clearly states that the cluster cannot find the physical disk with the disk signature 4563FFE3.


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.

    Thursday, February 16, 2012 3:22 PM
    Moderator
  • HP says it's a network bottleneck: when the CSV switches to redirected mode, all of the disk I/O is sent over the network between nodes. I really don't see that, but I'm still jumping through all of the hoops they recommend, so we can move on and get them to keep looking.

    There is no FC switch involved; each node is directly connected to each of the 2 controllers on the P2000.

    I can't point to any hard facts that support this, but my gut feeling is that the problem is something in the hardware VSS provider, just based on bits I've read here and there from other users with similar issues. So when HP keeps coming back to me saying it's not an MSA problem, that might be technically correct: it may not be a hardware problem with the SAN, but rather a problem with the software they provide for the hardware VSS snapshot functionality. It's also certainly possible that it's a configuration issue on the cluster, but I can't find it.

    The funny thing is, the quorum disk is also on the SAN, and right now there is a 2nd CSV on the same SAN (created as part of testing this issue). Only one CSV fails when this happens: the disk that holds the VMs being backed up. To me, that says it's not a hardware failure.

    I came across the Get-ClusterLog PowerShell cmdlet yesterday, and am looking at the detailed log file it creates. So far it's not making much sense to me. Is there anything in that file that would be of use if I posted the relevant 2-minute section here?
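
    For reference, I'm generating the log roughly like this (the time span and destination are just what I happened to pick):

    Import-Module FailoverClusters
    # Dump the last 10 minutes of the cluster log from every node
    # into C:\Temp (one .log file per node).
    Get-ClusterLog -TimeSpan 10 -Destination C:\Temp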

    Thursday, February 16, 2012 4:08 PM
  • I'm having almost exactly the same symptoms in our 2-node CSV cluster; even the quorum disk and a second, smaller CSV behave exactly the same, with the same errors in the event log. The only difference is that I can't duplicate the problem by manually running DPM backups. The crashes are just random, sometimes 4 times a week and sometimes once a week. On other days the backups run smoothly with no problems at all.

    The funny part, comparing to this case: our CSV is running on a Dell EqualLogic iSCSI SAN with Dell R610s as hosts (dedicated network cards for CSV, doubled iSCSI, management, etc.).

    HIT 4.0 (the hardware VSS provider) is installed, along with the latest firmware on the EQL and almost every hotfix I can find for Windows Hyper-V/DPM/failover clustering. I'm now testing DPM with MaxAllowedParallelBackups = 1 to see if that is any remedy.

    I also contacted Dell EQL support, and they basically said the issue is MS, not Dell. I really don't know who to believe, but because there are no disconnected iSCSI connections when the CSV goes offline in the cluster, it might be some MS-related bug?


    • Edited by MikkoSu Thursday, February 16, 2012 9:35 PM
    Thursday, February 16, 2012 9:33 PM
  • I've read somewhere that this same SAN hardware is sold by Dell. I don't know what they call it, so it may or may not be the same hardware you have.

    I did realize today that when I was trying to monitor network activity during this incident yesterday, I wasn't looking at the right information. Apparently Resource Monitor doesn't show the actual traffic flowing through the physical NIC, just traffic to/from the virtual NIC used by the host. Watching again in Performance Monitor, I'm seeing more like what I'd expect, but I still don't see that my network is saturated to the point that it would be failing the disk.

    As for it being reproducible, I have found that it happens when backup jobs start for 2 guests, running on different nodes of the cluster, at the same time. If one job starts a couple of minutes later, it doesn't seem to fail. Watching Failover Cluster Manager, I can see the disk switch to Offline Pending, then switch ownership to Node 1 and go into redirected mode; after a few moments it goes back to Offline Pending, then switches to Failed at about the same time it switches ownership back to Node 2.

    It's like the first redirection works fine, but when the other node then tries to take ownership and redirect access to start the snapshot of its guest, it can't for some reason, and the CSV fails completely.
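
    For what it's worth, that sequence can also be watched from PowerShell rather than Failover Cluster Manager; this is just a crude polling sketch:

    Import-Module FailoverClusters
    # Poll CSV state and ownership once a second while the backup jobs run.
    while ($true) {
        Get-ClusterSharedVolume |
            Select-Object Name, State, OwnerNode | Format-Table -AutoSize
        Start-Sleep -Seconds 1
    }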

    Thursday, February 16, 2012 10:08 PM
  • See if this hotfix resolves your issue:

    http://support.microsoft.com/kb/2637197/en-US
    CSV LUNs fail if you use a VSS hardware provider to back up virtual machines on a Windows Server 2008 R2-based cluster


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.

    • Marked as answer by NeighborGeek Friday, March 2, 2012 2:55 PM
    Wednesday, February 22, 2012 6:43 PM
    Moderator
  • Mike -- The description in that article looks like it might be exactly what we're seeing. I actually opened an incident with MS support on the 16th, which looks like the same day that hotfix was released. Due to a few delays on my part, we only really got into working on the issue yesterday, and the DPM support rep made a few changes on the cluster nodes that seemed to help, but they also seem like more of a workaround than a direct fix. The rep is unavailable today, so I've emailed him to ask about this new article and will get his opinion on it tomorrow.

    The changes he made were actually the exact registry keys referenced in the article I asked about in the OP (http://support.microsoft.com/kb/2549533). The specific registry settings were:

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Agent\CSV]
    "CsvMaxRetryAttempt"=dword:000000C8
    "CsvAttemptWaitTime"=dword:0002bf20

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Configuration]
    "AutoRerunDelay"=dword:0000003c
    "AutoRerunNumberOfAttempts"=dword:00000005

    I'm still a bit confused by that article, since it says this is a problem related to using software VSS, but that it might possibly occur even when using hardware VSS. That gives me the impression that these registry keys should normally NOT be used with a hardware VSS provider unless this specific problem is occurring. Is that right?

    If that's the case, I think we may need to remove the registry keys that were just added, install this new hotfix, and then test to see whether that fixes the problem without these registry settings.
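
    For anyone following along, these are the same KB2549533 values applied from an elevated PowerShell prompt instead of a .reg file, with the DWORDs decoded (0x0002bf20 = 180,000 ms = 3 minutes; 0xC8 = 200):

    # Apply on each cluster node running the DPM agent (values from KB2549533).
    $csv = 'HKLM:\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Agent\CSV'
    $cfg = 'HKLM:\SOFTWARE\Microsoft\Microsoft Data Protection Manager\Configuration'
    New-Item -Path $csv -Force | Out-Null
    New-Item -Path $cfg -Force | Out-Null
    Set-ItemProperty -Path $csv -Name CsvMaxRetryAttempt -Type DWord -Value 0xC8     # 200 attempts
    Set-ItemProperty -Path $csv -Name CsvAttemptWaitTime -Type DWord -Value 0x2bf20  # 180,000 ms = 3 min
    Set-ItemProperty -Path $cfg -Name AutoRerunDelay -Type DWord -Value 0x3c         # 60
    Set-ItemProperty -Path $cfg -Name AutoRerunNumberOfAttempts -Type DWord -Value 5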



    • Edited by NeighborGeek Wednesday, February 22, 2012 7:33 PM
    Wednesday, February 22, 2012 7:31 PM
  • Hi,

    Let me address your concerns about the registry entries. They can help regardless of which VSS provider you are using. They mostly help when you are using a hardware VSS provider and have only a single CSV and many cluster nodes. Since we can take up to 3 backups per node at one time, the DPM agent on each node tries to take ownership of the CSV. It takes time to make the hardware snapshot before ownership can be moved, and the DPM agent will only try a few times by default before failing the backup. The registry keys control how long to wait and how many times to retry taking control of the CSV before failing the backup. If you have just a couple of nodes with many CSV disks, then the defaults usually work fine, since there is little contention for any single CSV. For your reported issue, I don't see how those keys would help, unless there is a timing issue where taking control of a CSV immediately after a hardware snapshot causes problems, in which case waiting longer to take control is what CsvAttemptWaitTime does, and that may have helped.
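
    To put the mechanics in pseudo-form, with the KB2549533 values plugged in (purely illustrative, not DPM's actual code; Try-ClaimCsvOwnership is a made-up placeholder for the agent's internal step):

    # Illustration only: the two keys bound a simple retry loop.
    $maxRetry = 200      # CsvMaxRetryAttempt (0xC8)
    $waitMs   = 180000   # CsvAttemptWaitTime (0x0002bf20), in milliseconds
    for ($attempt = 1; $attempt -le $maxRetry; $attempt++) {
        if (Try-ClaimCsvOwnership) { break }   # hypothetical helper
        Start-Sleep -Milliseconds $waitMs
    }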

    I hope the hotfix resolves your issue outright.


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.

    Wednesday, February 22, 2012 7:52 PM
    Moderator
  • Thanks.  So here is what I think may be happening:

    1. DPM starts the remote agent on each node to back up guests.
    2. Node 1 claims the CSV before the other node and creates a snapshot.
    3. Node 2 isn't far behind, and tries to claim the CSV.
    4. Node 2 can't claim the CSV because Node 1 still has it in redirected mode.
    5. Node 2 schedules a retry. (I don't see the default delay listed in KB2549533, but it must be fairly short.)
    6. The snapshot for Node 1 is finished and mapped to Node 1.
    7. Node 1 no longer requires exclusive access to the CSV to continue the backup.
    8. Node 2 comes back and tries again to claim the CSV.
    9. Since the backup of the guest on Node 1 is still in progress, that triggers the CSV failure described in KB2637197.

    I suspect that by adding a 3-minute delay with "CsvAttemptWaitTime"=dword:0002bf20 (180,000 ms), it's allowing enough time for the backup of Node 1's guest to finish. When Node 2 tries to claim the CSV 3 minutes later, there isn't a backup in progress anymore, so the CSV failure doesn't occur. Right now we have no production guests on the cluster, just 2 "test server" guests. Since those are freshly built 2008 R2 boxes with no load on them, the backup finishes pretty quickly. If we were to move all of our production guests back onto the cluster, the 3-minute delay may not be enough to avoid the CSV failure, since the backup running on Node 1's guest would take longer.

    Does that make sense?

    Wednesday, February 22, 2012 8:34 PM
  • Hi,

    Not quite right; the retry will succeed once the CSV comes out of redirected mode, which for a hardware snapshot should be less than 2 minutes. I don't know the technical details of the fix included in KB2637197, so I can't comment on it.


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.

    Wednesday, February 22, 2012 10:15 PM
    Moderator
  • Okay. I read KB2637197 to say that if the owner changes while the 1st backup is still ongoing, it will cause the CSV failure. When this happens, it looks like the ownership change does take place (Failover Cluster Manager shows the new owner for the CSV), but the CSV fails just as ownership changes. Regardless, I'll talk with the support rep tomorrow and hopefully try this hotfix out in the next day or two.


    Thanks!

    Wednesday, February 22, 2012 10:32 PM
  • Hi Gai-jin, did KB2637197 really solve your problem?

    I've been fighting for more than a year to back up my CSV, and opened a similar question at http://social.technet.microsoft.com/Forums/en-US/dpmhypervbackup/thread/0d428c9f-072d-4ba2-8a7a-d16b84078a47

    Thanks!
    --marcos


    marcos@mirasoft.com.br

    Thursday, April 12, 2012 2:34 PM
  • Yes, that hotfix did seem to resolve this specific issue. I'm still having another issue with the CSV, which may or may not be related to the failing-CSV problem. It started a week or so before I installed that hotfix, so it definitely was not caused by the hotfix.

    I read through your thread, and I would certainly encourage you to try the hotfix if you haven't already, as it does seem like that problem is just what you're seeing.

    As far as my other issue goes, I still have an open ticket with MS and HP, and haven't really gotten anywhere. That issue is in a separate thread, since I don't know that it's related to this one, other than both being problems with backing up the CSV.

    http://social.technet.microsoft.com/Forums/en-US/dpmhypervbackup/thread/6547c78b-2ad7-4721-b22f-bca69e18f8bc

    Thursday, April 12, 2012 5:25 PM
  • Hi Gai-jin

    Unfortunately the hotfix didn't solve my issue (http://social.technet.microsoft.com/Forums/en-US/dpmhypervbackup/thread/0d428c9f-072d-4ba2-8a7a-d16b84078a47), but thanks for the answer.

    --marcos


    marcos@mirasoft.com.br

    Saturday, April 14, 2012 12:34 AM
  • Hi,

    The hotfix solved my issue - almost. I've only had 1-2 CSV crashes since applying the patch (it was more like 20-30 before the hotfix). Those 1-2 crashes only happened after I tried to change MaxAllowedParallelBackups back from 1 to 3 (the default is 3).


    Friday, April 27, 2012 7:24 AM
  • Installing KB2406705, KB2522766 and KB2637197 has drastically reduced the error messages we were seeing in our event logs. We are also no longer experiencing resource failover during DPM backups. So far we have not had a CSV go offline nor a node crash since applying the KBs. MaxAllowedParallelBackups is set to 3; we have 26 VMs stored on two 1 TB CSVs in a Win2k8 R2 Datacenter three-node cluster, and we are backing up over 100 other resources at the same time. We have a two-member EqualLogic PS6100 series SAN using one storage pool. We have the latest EQL firmware/HIT installed and are using SCDPM 2012. ASM is configured on each node and is using a non-clustered share, as recommended here: http://social.technet.microsoft.com/Forums/en-US/dpmhypervbackup/thread/0d428c9f-072d-4ba2-8a7a-d16b84078a47.

    James

     
    Thursday, May 10, 2012 3:26 PM
  • Things ran great for almost a week, and then: crash! I came in this morning and the secondary CSV was critical and half our VMs were offline. Back to the drawing board...
    Tuesday, May 15, 2012 3:21 PM
  • Hi Jamests,

    Honestly, I gave up trying to use hardware snapshots with the EqualLogic PS6010XV. "Sometimes" you get only 5121 events, and suddenly you end up with a CSV disk-signature change that breaks everything and renders a lot of virtual machines unusable...

    I'm using software snapshots and backing up one VM at a time; the CSV remains in redirected mode for longer, and no 5121 events are logged. I've had this setup for a long time without any problem.

    I understand that for large data centers hardware snapshots are the way to go, but as I said, "it never worked for me".

    Hope that helps,
    --marcos


    marcos@mirasoft.com.br

    Thursday, May 17, 2012 5:09 PM
  • Thanks for the post, Marcos. I was able to resolve the signature-change problem by following the recommendations in the link I provided above. I have changed MaxAllowedParallelBackups to 1, as recommended in numerous forums, which seems to have resolved the CSV crashing. I will hope for the best and wait for someone to come up with a viable solution/hotfix for parallel backups of CSV VMs.

    James

    Thursday, May 17, 2012 5:25 PM