DPM 2010 backups of Hyper-V CSV Failover Cluster causes VMs to 'stop'

  • Question

  • I recently set up my DPM server to back up my failover cluster by following http://technet.microsoft.com/en-us/library/ff634192.aspx and http://support.microsoft.com/kb/2473194.  I made these modifications because VM backups were failing with the VSS error "Failed to prepare a Cluster Shared Volume (CSV) for backup as another backup using the same CSV is in progress".

    After making these changes, my VMs now randomly stop while being backed up.  I have a 3-node cluster running Server 2008 R2 without SP1.  Any help is greatly appreciated.

    Zach


    Zach Smith
    Wednesday, April 20, 2011 1:24 PM

Answers

  • It appears the discovery of an incorrect IP address and the above mentioned hotfixes resolved this problem.  If you still need help, please re-open the thread with an update.
    --------------------------------------------------------------------------------
    Regards, Michael V [MSFT] - This posting is provided "AS IS" with no warranties, and confers no rights.

     

    Tuesday, January 24, 2012 11:59 PM
    Moderator

All replies

  • Are you using Hardware Provider backups?  I have seen this before on Hyper-V R2 without SP1 as well and have yet to reproduce it on SP1.  I still get Event 5121 (Cluster Shared Volume 'Volume1' ('CSV1') is no longer directly accessible from this cluster node...), but those events seem to be benign and happen as the CSV hardware snap gets mounted.

     

    Rob McShinsky (http://www.VirtuallyAware.com)


    http://www.VirtuallyAware.com
    Wednesday, April 20, 2011 3:29 PM
  • Hi,

    I am having exactly the same issue (also on R2 SP1) and have a call open with MS. How many VMs are in your protection group, and how many are configured to back up in parallel?

    HKLM\Software\Microsoft\Microsoft Data Protection Manager\2.0\Configuration\MaxAllowedParallelBackups

    The default value is 3. I changed this to 1 and still have the issue. I get event ID 5121 as well as 5120.
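To confirm what DPM will actually read, the key quoted above can be inspected from PowerShell. This is only a sketch: the function name Get-DpmParallelBackupSetting is made up for illustration, and it simply lists whatever values exist under the key rather than assuming a particular value name.

```powershell
# Hypothetical helper: lists the values stored under the DPM
# MaxAllowedParallelBackups key, or returns $null if the key is absent.
function Get-DpmParallelBackupSetting {
    param(
        # Default is the registry path quoted in the post above.
        [string]$Path = 'HKLM:\Software\Microsoft\Microsoft Data Protection Manager\2.0\Configuration\MaxAllowedParallelBackups'
    )

    if (-not (Test-Path -Path $Path)) {
        return $null   # key not present on this machine
    }

    # Returns an object whose properties are the values under the key.
    Get-ItemProperty -Path $Path
}

# Usage (run on the machine where the key was changed):
#   Get-DpmParallelBackupSetting
```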


    Microsoft Partner
    Wednesday, April 20, 2011 4:20 PM
  • For the System VSS provider I am using serialized backups (1 parallel backup), and protection groups range from 15-25 VMs.  I arrange my protection groups by the CSV that the VM lives on.  This particular Hyper-V R2 cluster has five 1TB CSV volumes and one protection group for each CSV.  For hardware provider backups you are supposed to be able to raise the number of parallel backups, according to the article below.  During the problem times I have pushed it back to 1 in testing.

     

    Migrating from the System VSS Provider to a Hardware VSS Provider

    http://technet.microsoft.com/en-us/library/ff634216.aspx

     

    Rob McShinsky  (http://www.VirtuallyAware.com)

     

     


    http://www.VirtuallyAware.com
    Wednesday, April 20, 2011 7:14 PM
  • I have my backups serialized via the above link.  This is 1 backup per CSV / LUN.  It seems to be an issue when a CSV volume is put in redirected access mode.  I'm still investigating.

    When you say hardware provider backups, are you referring to snapshots taken within the SAN?  I am using DPM to back up my CSVs, and only DPM.

    Zach


    Zach Smith
    Thursday, April 21, 2011 7:47 PM
  • Hardware assisted VSS is when the host prepares the VM and then passes the "snapping" to the SAN.  It requires a piece of software on your host servers so that, when DPM instructs the host to take the snap, the hardware VSS provider is used instead of the built-in software VSS provider.

     

    What is the speed of your network links, and how many VMs are on this 3 node cluster? What time of day do you usually back up these systems?

     

    Rob McShinsky (http://www.VirtuallyAware.com)

     


    http://www.VirtuallyAware.com
    Thursday, April 21, 2011 8:11 PM
  • It appears there are (at least) three of us with the same issue: http://social.technet.microsoft.com/Forums/en-US/dpmhypervbackup/thread/5802b8ca-ec15-4e22-b0de-f217daa9ffce and http://social.technet.microsoft.com/Forums/en-US/dpmhypervbackup/thread/1e6e3d30-fff4-4ab0-ad65-f7a7d184fdfd

    Please share any experiences or recommendations from Microsoft on how to resolve this. Having the entire company go down on a daily basis because a backup is running is becoming quite a problem.

    Friday, April 22, 2011 6:10 AM
  • I have noticed that it is the same 6 or so VMs that go down, and only these.  Let me investigate to find what they have in common.

    As for VSS, I am using software VSS.  All networks for my cluster are 1Gb.  I have about 38 VMs in total, spread fairly equally across the 3 nodes.  The servers each run 2x12-core AMD CPUs with 160GB RAM.  One way or another this is getting figured out.  Perhaps R2 with SP1 will help.  I've never upgraded a cluster before, so I assume I'd move all VMs off one node at a time and upgrade it.  I just need to know what upgrading to SP1 would do to my cluster, because for a short while I'll have 2 nodes at R2 and 1 node at R2 SP1.

    Anyway, about the backup issue: if this hasn't been figured out by next Monday or Tuesday, I'm calling Microsoft.

    Zach


    Zach Smith
    Friday, April 22, 2011 12:22 PM
  • I'd just like to add that my two nodes are on R2 SP1, so it probably won't make a difference, but who knows.

    I'm using hardware VSS, or at least I have the hardware VSS provider installed on both nodes; from my understanding it takes precedence over the software provider. Is there a way to confirm which one is used?

    Friday, April 22, 2011 12:37 PM
  • Well I'm using software VSS - so if your hardware VSS is indeed being used - we have covered that as well. 

    So it appears - hardware/software VSS doesn't affect this nor does SP1/Non-SP1.  Once I get everything mapped out for the VMs that stop - hopefully that will help - hopefully there is something in common with all of those VMs.

    Zach


    Zach Smith
    Friday, April 22, 2011 2:00 PM
  • I don't have an explanation - but for a couple days - my VM's haven't stopped anymore when doing backups.  No clue?
    Zach Smith
    Saturday, April 23, 2011 9:30 PM
  • In my case, the problems only appear when using Hardware assisted VSS backups.  If I uninstall the Hardware VSS provider, I do not get any of the crashes.  I only get the performance robbing redirected mode of VMs for many hours a night limiting the number of VMs I can place on this cluster.

    I have seen this on RTM and SP1 of Hyper-V R2, where various VMs will reboot as a result of the Hardware assisted VSS snap process. 

     

    Another issue I am seeing: when a CSV is snapped as part of a hardware assisted VSS backup, the CSV gets marked as read only and on occasion does not return to writable.  It takes a diskpart command to remove the read only attribute from the volume.  As a result the CSV goes offline and all VMs on the volume crash.  I have seen this 4 times over about 4 months.  So far I have seen this on a cluster that is not in production yet, so I am attempting to find a way to reproduce it at will.

     

    Rob McShinsky (http://www.VirtuallyAware.com)


    http://www.VirtuallyAware.com
    Monday, April 25, 2011 11:29 AM
  • Unfortunately I am still getting this issue. I have even configured the backups to run serially (only one backup at a time across all hosts). The problem is completely random in that I cannot find a pattern (different VMs stop each time it occurs and I cannot predict when it will happen).

    I am using hardware assisted snaps. I have not setup software snaps (as I cannot afford the loss of performance).

    Anyone else have any further updates? I will be speaking with MS shortly.


    Microsoft Partner
    Tuesday, May 3, 2011 8:35 AM
  • While I'm still getting 5121 (quite expected as mentioned), my VMs haven't crashed on me for a while. I only had one crash during 3 weeks, but haven't changed anything since.

    I have no idea what caused it or when it will come back. Please provide further details after you speak with MS.

    Tuesday, May 3, 2011 9:10 AM
  • Hi,

    I have installed the following hotfixes as directed by MS on each host in my cluster:

    http://support.microsoft.com/kb/2494016
    http://support.microsoft.com/kb/2494162

    Also can anyone else confirm whether they are also getting:

    Event ID: 5120
    Cluster Shared Volume 'Volume1' ('CSV Disk') is no longer available on this node because of 'STATUS_BAD_NETWORK_PATH(c00000be)'. All I/O will temporarily be queued until a path to the volume is reestablished.

    If not how do you have your NIC bindings order configured?
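For anyone comparing notes, a quick way to see how often these events occur on a node is to filter the event log by ID. The sketch below only builds the filter; it assumes the 5120 and 5121 events are written to the System log, and the function name New-CsvEventFilter and the node name in the usage comment are made up for illustration.

```powershell
# Hypothetical helper: builds a Get-WinEvent filter for the CSV events
# discussed in this thread (5120 and 5121) over the last N days.
function New-CsvEventFilter {
    param([int]$Days = 7)

    @{
        LogName   = 'System'                  # assumption: events land in the System log
        Id        = 5120, 5121
        StartTime = (Get-Date).AddDays(-$Days)
    }
}

# Usage (node1 is a placeholder for one of your cluster nodes):
#   Get-WinEvent -ComputerName node1 -FilterHashtable (New-CsvEventFilter -Days 7) |
#       Group-Object Id
```

Grouping by Id makes it easy to compare counts between nodes and against the times the backups ran.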


    Microsoft Partner
    Thursday, May 5, 2011 2:38 PM
  • Thank you for the update. It appears those two are not even part of SP1 (which I have on all nodes). Has this resolved anything in your case?

    I was getting 5120 months ago, before our storage (HP MSA P2000 G3) was even properly set up, meaning it was unable to create hardware snapshots since the snap pools weren't created. I haven't gotten them in quite some time now and there were no changes to the NIC bindings order.

    It's currently like this:

    Management

    Heartbeat

    Virtual Network

    Live Migration

     

    Thursday, May 5, 2011 2:51 PM
  • Hi,

    I don't know if it has resolved it yet as I only just applied the hotfixes and the VMs stopping were completely random (last was a gap of 10 days between vms stopping) :(

    However I will be increasing the frequency of backups to try and force the issue to occur more frequently (as I cannot put production servers on the cluster until I am satisfied that the issue is resolved).

    Regarding your NIC binding order I would really appreciate it if you could run the following powershell script on your hyper-v hosts as I would like to see all of the binding orders (including the hidden MS failover cluster virtual adapter & Microsoft Virtual Switch Adapter).

    To run the script save it as nicBindingOrder.ps1 and run it from the powershell console as > ./nicBindingOrder.ps1 servername
    It will output the complete binding orders :)

    Thanks,

    # Get-NicOrder: report the NIC binding order of a remote machine.

    $server = $args[0]

    # Show usage when no server name (or a "?" help request) is given.
    if ($server -eq $null -or $server -match "\?") {
        Write-Host -ForegroundColor "yellow" "Usage: Get-NicOrder <servername>"
        Write-Host -ForegroundColor "yellow" "  Enter the name of a system to connect to. This script will"
        Write-Host -ForegroundColor "yellow" "  provide the network card binding order of a remote machine."
        exit
    }

    # The Tcpip Linkage key holds the binding order as a list of device GUIDs.
    $key = "System\CurrentControlSet\Services\Tcpip\Linkage"
    $type = [Microsoft.Win32.RegistryHive]::LocalMachine
    $regkey = [Microsoft.Win32.RegistryKey]::OpenRemoteBaseKey($type, $server)

    if (-not $?) {
        Write-Host -ForegroundColor "red" "Cannot check remote machine, exiting...."
        exit
    }

    $regkey = $regkey.OpenSubKey($key)
    $ArrBindingGUIDs = $regkey.GetValue("Bind")

    # Adapter details, used to translate each GUID into a readable description.
    $ArrNicList = Get-WmiObject -ComputerName $server -Query "select Description,settingid,ipaddress,ipsubnet,defaultipgateway from win32_networkadapterconfiguration"

    $ArrResults = New-Object Collections.ArrayList

    for ($i = 0; $i -lt $ArrBindingGUIDs.Length; $i++) {
        if ($ArrBindingGUIDs[$i].Contains("{")) {
            # Keep only the {GUID} part of each binding entry.
            $guid = $ArrBindingGUIDs[$i].Substring($ArrBindingGUIDs[$i].IndexOf('{'))
            foreach ($nic in $ArrNicList) {
                if ($nic.SettingID -eq $guid) {
                    $result = New-Object psobject
                    Add-Member -InputObject $result NoteProperty BindingOrder $i
                    Add-Member -InputObject $result NoteProperty Description $nic.Description
                    Add-Member -InputObject $result NoteProperty GUID $nic.SettingID
                    Add-Member -InputObject $result NoteProperty IP $nic.IPAddress
                    Add-Member -InputObject $result NoteProperty SubnetMask $nic.IPSubnet
                    Add-Member -InputObject $result NoteProperty Gateway $nic.DefaultIPGateway
                    $ArrResults.Add($result) > $null
                }
            }
        }
    }

    return $ArrResults

     


    Microsoft Partner
    Thursday, May 5, 2011 5:00 PM
  • I see. I actually had a hunch, that my VMs failed only when backing up large amounts of data, perhaps increasing the frequency of your backups will not produce the crashes. Maybe you can try to wait a few days and then backup when a lot of data has changed.

    Here is the output:

    http://cid-8af2f67a3f3cd55b.office.live.com/self.aspx/.Public/node1.txt

    http://cid-8af2f67a3f3cd55b.office.live.com/self.aspx/.Public/node2.txt

     

    Friday, May 6, 2011 8:14 AM
  • That's interesting: your bindings are different. Specifically, I notice that on Node1 the MS Failover Cluster virtual adapter is third in your binding order, as opposed to first on Node2. In my four node cluster I found that the MS Failover Cluster virtual adapter was automatically bound as the first adapter on three of my nodes, but on one of them it was listed third due to a faulty quad port NIC that I had to replace. I changed the binding orders to be exactly the same across all of my nodes (but I really don't know if it makes any difference).
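Comparing two nodes by eye gets tedious, so here is a minimal sketch of a position-by-position comparison. It assumes you have each node's adapter descriptions as an ordered list (for example, the Description column produced by the nicBindingOrder.ps1 script posted earlier); the function name Compare-BindingOrder and the adapter names in the usage comment are made up for illustration.

```powershell
# Hypothetical helper: reports every position at which two ordered
# lists of adapter descriptions differ.
function Compare-BindingOrder {
    param(
        [string[]]$ReferenceOrder,
        [string[]]$DifferenceOrder
    )

    $count = [Math]::Max($ReferenceOrder.Length, $DifferenceOrder.Length)
    for ($i = 0; $i -lt $count; $i++) {
        # A missing entry (lists of unequal length) is treated as $null.
        $ref  = if ($i -lt $ReferenceOrder.Length)  { $ReferenceOrder[$i] }  else { $null }
        $diff = if ($i -lt $DifferenceOrder.Length) { $DifferenceOrder[$i] } else { $null }

        if ($ref -ne $diff) {
            New-Object psobject -Property @{
                Position   = $i
                Reference  = $ref
                Difference = $diff
            }
        }
    }
}

# Usage with placeholder adapter names:
#   $node1 = 'Management', 'Heartbeat', 'Cluster Virtual Adapter'
#   $node2 = 'Cluster Virtual Adapter', 'Management', 'Heartbeat'
#   Compare-BindingOrder -ReferenceOrder $node1 -DifferenceOrder $node2
```

An empty result means the two nodes list their adapters in the same order.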
    Microsoft Partner
    Monday, May 9, 2011 12:20 PM
  • I noticed this as well when looking at the output, but haven't made any changes so far. Where have you read that this makes a difference? Is this just a recommendation for DPM, or for a Hyper-V cluster itself?

    Monday, May 9, 2011 1:37 PM
  • I do not have any documentation on what is recommended, but MS tech support advised me to ensure my bindings were consistent across all nodes in the cluster. Whether this makes any difference or not I really do not know, and that is why I was interested to look at how other people's cluster bindings are configured.
    Microsoft Partner
    Tuesday, May 10, 2011 8:21 AM
  • All clear, thank you for the clarification. Keep us updated, I'll do the same if anything major changes/happens.
    Tuesday, May 10, 2011 8:34 AM
  • Anyone still getting VMs stopping?
    Microsoft Partner
    Thursday, May 12, 2011 1:48 PM
  • any progress?

    Thursday, May 19, 2011 7:57 AM
  • I still haven't had any VMs stop since applying the patches, but I am still getting loads of 5120 errors.

    Event ID: 5120
    Cluster Shared Volume 'Volume1' ('CSV Disk') is no longer available on this node because of 'STATUS_BAD_NETWORK_PATH(c00000be)'. All I/O will temporarily be queued until a path to the volume is reestablished.

     


    Microsoft Partner

    Thursday, May 19, 2011 3:41 PM
  • Just an update that I also managed to resolve the 5120 errors.

    I discovered that the IP addressing on my cluster was incorrect: the Live Migration and CSV networks were using overlapping subnets, because the CSV network's 255.255.0.0 mask made its range include the Live Migration subnet:

    Incorrect Subnet Configuration:
    Live Migration: 192.168.1.x 255.255.255.0
    CSV: 192.168.2.x 255.255.0.0

    Correct Subnet Configuration:
    Live Migration: 192.168.1.x 255.255.255.0
    CSV: 192.168.2.x 255.255.255.0
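The two configurations above can be checked mechanically. Two IPv4 subnets overlap when their network addresses agree under the shorter of the two masks. The following is a minimal sketch of that test; the function names are made up for illustration, and the .10 host addresses stand in for the ".x" in the listing.

```powershell
# Hypothetical helper: converts a dotted IPv4 address or mask to a number.
function ConvertTo-IPv4Number {
    param([string]$Address)

    [long]$value = 0
    foreach ($octet in [System.Net.IPAddress]::Parse($Address).GetAddressBytes()) {
        $value = ($value * 256) + $octet
    }
    return $value
}

# Hypothetical helper: returns $true when the two subnets overlap.
function Test-SubnetOverlap {
    param(
        [string]$AddressA, [string]$MaskA,
        [string]$AddressB, [string]$MaskB
    )

    # ANDing two contiguous masks yields the shorter (less specific) one.
    $common = (ConvertTo-IPv4Number $MaskA) -band (ConvertTo-IPv4Number $MaskB)
    $netA = (ConvertTo-IPv4Number $AddressA) -band $common
    $netB = (ConvertTo-IPv4Number $AddressB) -band $common
    return ($netA -eq $netB)
}

# Incorrect configuration from the post: the CSV mask is 255.255.0.0.
Test-SubnetOverlap '192.168.1.10' '255.255.255.0' '192.168.2.10' '255.255.0.0'

# Corrected configuration: both networks use 255.255.255.0.
Test-SubnetOverlap '192.168.1.10' '255.255.255.0' '192.168.2.10' '255.255.255.0'
```

The first call returns True (the networks overlap) and the second returns False.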

    I have also seen another 5120 error occur but with a different message:

    Cluster Shared Volume 'Volume1' ('CSV Disk') is no longer available on this node because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.

    I have seen this occur when the Domain Controller used by the cluster is also hosted on the cluster and a child DPM backup is taken of the domain controller. The hosts lose connectivity to the DC and this can cause the VMs to stop.


    Microsoft Partner
    • Proposed as answer by rEMOTE_eVENT Tuesday, November 29, 2011 8:54 AM
    Monday, October 3, 2011 9:02 AM