DPM 2010 eval / 2008 R2 Server Core 4-node cluster / CSV / VSS hardware provider

  • Question

  • I am using all of the technologies in the title in my environment. After installing the VSS hardware provider, if I attempt to synchronize or create a replica of multiple VMs on the same Cluster Shared Volume, something somewhere goes wrong and kicks a node out of the cluster for a few seconds. It almost seems as though there is contention for the CSV and one of the nodes is getting the boot. My questions are below...

    1. What is the recommended method for setting up protection groups of VMs when using Cluster Shared Volumes?

        A single protection group for all CSVs (assuming I have two 500 GB CSVs), or a protection group for each of my CSVs?

    2. Should I be using the redirected access feature with the CSVs? If I turn it off, what is the impact?

    Thanks in advance.

    Chris

    • Moved by MarcReynolds Thursday, January 19, 2012 2:11 PM (From:Data Protection Manager)
    Wednesday, June 2, 2010 12:28 PM

Answers

  • Thanks Chris for the detailed events. The coordinator node of the disk will change based on which VM is being backed up and which host is hosting it. In this case we need to dig deep and collect cluster logs and/or CSV filter traces. I suggest you open a support incident with our product support to analyze it more closely.


    Thanks Shyama Hembram[MSFT] This posting is provided AS IS, with no warranties, and confers no rights.
    Friday, June 4, 2010 4:45 AM
    Moderator

All replies

  • Can you check the failover cluster events and post some relevant events to see if there is something fundamentally wrong? There is nothing special you need to do if you have got the hardware VSS provider installed.


    Thanks Shyama Hembram[MSFT] This posting is provided AS IS, with no warranties, and confers no rights.
    Wednesday, June 2, 2010 7:20 PM
    Moderator
  • I receive event 5121 in category Cluster Shared Volume - one from each of the 3 other nodes while the coordinator node switches the CSV to the node where the VM is running. This happens if I manually kick off a job in DPM for a single virtual machine, and I believe it is due to the redirected access that is turned on. It seems to be more of an informational event than the error it is displayed as. Question #2 in my previous post.

    5121 description: Cluster Shared Volume 'Volume2' ('Cluster Disk 2') is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network through the node that owns the volume. This may result in degraded performance. If redirected access is turned on for this volume, please turn it off. If redirected access is turned off, please troubleshoot this node's connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished.

    However, when my schedule backups kick off in DPM for all VMs in the protection group, which could be multiple VMs on different hosts and multiple on each of my two CSVs, I get an additional error.  Event 1135 in category Node Mgr.

    1135 - Node Mgr description: Cluster node 'Node4' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.
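For anyone else chasing this on Server Core, both events can be pulled from the command line with wevtutil. This is only a sketch; adjust the log name, event count, and event IDs as needed:

```shell
:: Last 5 node-membership events (1135) from the System log, newest first
wevtutil qe System /c:5 /rd:true /f:text /q:"*[System[(EventID=1135)]]"

:: Same idea for the CSV redirected-access event (5121)
wevtutil qe System /c:5 /rd:true /f:text /q:"*[System[(EventID=5121)]]"
```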

    Thanks,

    Chris

    Thursday, June 3, 2010 1:35 PM
  • Another thing to note: at the same time the node is unavailable to the rest of the cluster (which causes VMs to migrate), I get an application error on the node stating the below.

    Source: Application Error

    EventID: 1000

    Faulting application name: clussvc.exe, version: 6.1.7600.16385, time stamp: 0x4a5bc614

    Faulting module name: KERNELBASE.dll, version: 6.1.7600.16385, time stamp: 0x4a5bdfe0

    Exception code: 0x80000003

    Fault offset: 0x0000000000032442

    Faulting process id: 0x700

    Faulting application start time: 0x01cb018974bf2dce

    Faulting application path: C:\Windows\Cluster\clussvc.exe

    Faulting module path: C:\Windows\system32\KERNELBASE.dll

    Report Id: fbd1513c-6f74-11df-9c41-001a64aeb106

     

    Friday, June 4, 2010 12:06 PM
  • To add some more to this for anyone else out there having similar issues: I checked the MaxAllowedParallelBackups key to see what it was set to in the RTM eval build of DPM 2010, and it was set to 3. CSVs are also enabled with a VSS hardware provider installed, so this may have been something else after a fresh install, but mine was currently set to 3.

    For testing purposes, I changed this back to 1. I also changed the cluster service max heartbeat interval from 60 to 120 using the cluster.exe /prop command, since I am getting an error related to clussvc.exe not being available. I'm not sure if this is due to an application crash or a hang causing the cluster to report that it crashed, so I figured it can't hurt to up the interval.
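For reference, cluster properties can be listed and set with cluster.exe on Server Core. The specific property and value below are illustrative assumptions, not necessarily the exact change described above; check the output of the first command on your own cluster before changing anything:

```shell
:: Dump all common cluster properties (includes HangRecoveryAction and the
:: heartbeat-related SameSubnetDelay/SameSubnetThreshold values)
cluster . /prop

:: Illustrative only: set a heartbeat-related property via cluster.exe
cluster . /prop SameSubnetDelay=2000
```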

    After DPM ran its scheduled jobs against my two protection groups (one for each CSV), I came in this morning and the CSV1 protection group was all successfully backed up, with no errors. However, CSV2 had a single VM with a failed recovery point. After launching the Failover Cluster Management console, I did see the same errors that seem to correlate with this failure: events 5121 (which I kind of expect), 1038, and 1135, which is critical and is the removal of the node because of the clussvc.exe "crash".

    So, unfortunately, I don't think that changing either of those two values completely solved my issue. I will continue to post as I proceed, to help anyone else out there. Frustrating!!!

    Chris

    Tuesday, June 8, 2010 11:53 AM
  • I have been doing some more digging and am wondering if anyone can shed some light on one of the cluster properties, detailed below.

    1. HangRecoveryAction - Specifies the recovery action taken by the cluster service in response to a heartbeat countdown timeout, for which the default is 60 seconds.

    The value of this in my cluster scenario is 3, which will invoke a bugcheck and create a system Stop error when a heartbeat countdown timeout occurs. The possible values are as follows.

    • ClussvcHangActionDisable - Value of 0 - Disables the cluster heartbeat and monitoring mechanism.
    • ClussvcHangActionLog - Value of 1 - Logs an event in the system log of the Event Viewer when a heartbeat countdown timeout occurs.
    • ClussvcHangActionTerminateService - Value of 2 - Terminates the cluster service when a heartbeat countdown timeout occurs. (default)

    It states here that the default for this property is 2, yet mine is set to 3.

    Can anyone elaborate on this, or confirm what their default setting is for an R2 Server Core cluster?

    If this setting is set to cause a Stop error, would that cause my VMs to not live migrate to another node, and possibly end up being turned off and then back on, hosted on another node?
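For anyone wanting to check their own value, a quick sketch with cluster.exe. Changing this affects how the cluster reacts to a hung cluster service, so treat the second command as illustrative only:

```shell
:: Read the current hang recovery action (0-3, per the list above)
cluster . /prop HangRecoveryAction

:: Illustrative: switch from bugcheck (3) to terminating the cluster service (2)
cluster . /prop HangRecoveryAction=2
```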

    Chris

    Thursday, June 10, 2010 2:46 PM
  • Do you have any news on the problem? I am experiencing the same problem with a 3-node cluster. One node gets kicked out for a few seconds, forcing all VMs on that node to fail over. I'm using DPM 2010 RTM.

    Thank you in advance!
    Martin

    Wednesday, June 23, 2010 8:47 PM
  • Hi M.Schmidt,

     

    I am still working with MS on the issue and hope to have some more information tomorrow. How about you?

    Does anyone know if the below link is updated elsewhere?

    http://blogs.technet.com/b/dpm/archive/2010/02/05/tested-hardware-vss-provider-table.aspx

     

    Thanks,

     

    Chris

    Wednesday, July 7, 2010 7:00 PM
  • I was able to work around the error. The problem is the EqualLogic VSS provider.

    I uninstalled the HIT Kit and reinstalled it with only the VDS provider - no ASM, no EqualLogic VSS provider.

    After that I configured my DPM to use software VSS snapshots, and backups are working great now.

    Wednesday, July 7, 2010 7:13 PM
  • I am unfamiliar with the EqualLogic product. What is the VDS provider? Also, are you able to back up CSV-based VM storage, with multiple hosts accessing the CSV for different VMs in parallel, using software-based VSS?

     

    Chris

    Wednesday, July 7, 2010 7:20 PM
  • I'm experiencing this exact same problem on an 8-node Hyper-V cluster with CSV storage on an EqualLogic PS-series SAN. The EqualLogic HIT Kit is installed on each node, so the hardware provider is being used, but DPM is behaving as if only the software VSS provider were in use.

     

    Thursday, July 8, 2010 2:17 PM
  • Jon,

    My issue is that it is actually causing my cluster service to crash and all my VMs to blue screen, as they don't have the opportunity to live migrate to another node. I have 2 CSVs (well, now 3) being utilized to host about 35 VMs across 4 nodes.

    How many CSVs do you have attached to your cluster and is your cluster R2?

    Chris

    Thursday, July 8, 2010 2:29 PM
  • Any news on this? I have the exact same problem, but with a NetApp hardware VSS provider...

    4-node cluster, 4x 250 GB CSVs; 1 node gets kicked out of the cluster and I/O access is redirected.

    Wednesday, August 25, 2010 1:27 PM
  • I'm seeing the same on a two-node cluster running on HP BL465 G5 blades connected to an EVA 4400. I was using the software provider and it worked fine with 1 parallel backup. After buying the HP Business Copy license and installing the VSS hardware provider, I began seeing this behavior.

    Chris, any news on your support case with Microsoft?

    Regards,

    Nóri

    Monday, September 13, 2010 9:34 PM
  • I am seeing the identical problem on my end.

    4-node Windows Server 2008 R2 Core cluster, sharing an iSCSI-attached CSV on a Dell EqualLogic PS6000.

    Whenever parallel backups run, within a few minutes of the start of the backup one of the nodes gets kicked out of the cluster with a crash of its clussvc.exe.

    The node throws up all of its VMs, which all improperly shut down and migrate to other nodes with their tails between their legs.

    ANYONE have any updates to this issue?

    I HAVE opened a support case with Microsoft, but currently they are all scratching their heads, saying they've never seen this before.

    • Edited by cheesewhip Monday, October 25, 2010 9:13 PM grammatical error
    Monday, October 25, 2010 9:12 PM
  • Hi, we are also seeing the identical problem.

    For now we have a 4-node Windows Server 2008 R2 Datacenter cluster, with three CSV disks attached to a Dell EqualLogic PS5000.

    When we do a VSS backup we lose a node, with the following errors.

    First we get:

    Event ID 1038

    Ownership of cluster disk 'Cluster Disk 2' has been unexpectedly lost by this node. Run the Validate a Configuration wizard to check your storage configuration.

    Then we get:

    Event ID 1135

    Cluster node 'mashs003' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

    The VMs that were on the MASHS003 node crashed and were moved to the other cluster nodes, where they start up again.

    It is a very serious problem, so if anyone has a clue, let us know.

    Jörg Wiesemann

    NAB Solutions AB

    Sunday, October 31, 2010 7:04 PM
  • I have also had this happen, but it is not consistent and my backups are not running in parallel. I am running a 2-node cluster on a Dell MD3000i with Server 2008 R2.

    In the DPM beta documentation it is recommended to serially back up VMs that are on a CSV. I have run through the steps in the documentation, but I still have my host crash and all VMs migrate, about once a month it seems....

    Wednesday, November 3, 2010 1:36 PM
  • I have the same problem with a two-node Hyper-V cluster connected to an EqualLogic PS6000, HIT kit installed on both hosts. Would also appreciate some help!

     - Liam

    Wednesday, November 10, 2010 11:56 PM
  • Same here. We have an EVA8000, and we just went from using the software provider to the EVA hardware provider. We have a three-node cluster on Windows 2008 R2 and are having the same issue. Anyone have a solution?
    Thursday, December 2, 2010 10:31 PM
  • I found this. I will apply to see if it fixes it.

    http://support.microsoft.com/kb/2277439

     

    Thursday, December 2, 2010 10:37 PM
  • I found this. I will apply to see if it fixes it.

    http://support.microsoft.com/kb/2277439

     

    I hope you have better luck with this one than I did. I applied this on 11/1/2010 and am still having problems. I have a case opened with Microsoft, will keep you posted as to the results.
    Friday, December 3, 2010 8:53 PM
  • MS support told me this was a bug they are working to release a patch for. No current general release fixes or patches will fix this issue.
    Friday, December 3, 2010 9:21 PM
  • That's cute - Is there a KB article posted on this yet?
    Friday, December 3, 2010 9:25 PM
  • Not that I know of. It took them a month with me to admit that there is a bug tracked for this issue. They did have a preliminary fix they are working with and wanted me to test it in my environment, but I declined, as the servers I am protecting are already in production. I opted to use their CSV-based serialized backup in the interim, until the fix becomes publicly available.
    Friday, December 3, 2010 9:47 PM
  • Excellent. I'll see if I can't pry something out of them for my existing case. It would be nice to know when they actually fix this.

     - Liam

    Monday, December 6, 2010 6:02 PM
  • Let me put my hat in as having the issue too. HP EVA 8XXX, 4-node Hyper-V/2008 R2 cluster, DPM 2010, backing up 92 VMs across 4 CSVs. Just switched over to using the HP VSS hardware provider and have seen an event 5121 as each VM is snapped. I see the disk go offline, switch to the coordinator node, then go into redirected mode for a short period (the entire process takes about 60 seconds), then it goes back online and the VM seems to back up OK through DPM.

    Before today I had been backing these up using the built-in software provider, but due to extended I/O redirection with that method, I implemented the HP hardware provider. I did not receive this error message using the software VSS provider.

     

    Let's hope for the best.

    For those of you who have already opened a case with MS, can you send me your case #? I want to reference these.

    Rob (http://www.virtuallyaware.com - http://twitter.com/virtuallyaware)


    http://www.VirtuallyAware.com
    Thursday, December 9, 2010 3:16 AM
  • After what appears to be the customary runaround, Microsoft has admitted to me that they have this problem. They have a non-regression-tested patch that they'll try once they have verified I have the problem they think I have. Will let you know how that goes.
    Tuesday, December 14, 2010 5:29 PM
  • Any indication from MS on when the patch may be released? Whether publicly, or through a support case if I can be bothered to raise one?

    I've got the exact same issue on a 3-node Hyper-V R2 cluster, an EqualLogic PS6010XV SAN, and CSV backups of VMs through DPM 2010 and HIT (the VSS hardware provider).

    It doesn't happen with every backup; it is intermittent, with no discernible pattern. Clussvc.exe crashes on one node, VMs migrate ungracefully, and the backup continues. The service exception code is 0x80000003.

    I had been considering serialised backups before coming across this thread, and now that I've seen it, I'm convinced it's the only workaround until the patch is released. Either that or just wait for the patch. Hmmm.

    Tuesday, December 14, 2010 11:40 PM
  • Nothing yet. Still working on it with them. Will post updates as I get them.
    Tuesday, December 21, 2010 3:25 AM
  • Thanks liso. This really should be a high-priority fix. It's causing a lot of disruption to my production environment. We chose Hyper-V and DPM over VMware, and this issue is starting to cast a lot of doubt on that decision.

    Come on Microsoft, please get it together!

    Wednesday, December 22, 2010 10:44 PM
  • Similar setup: multiple Hyper-V R2 clusters, CSV, EqualLogic PS Series SANs, DPM 2010, HIT 3.4.2.5386, 100+ VMs.

    After many DPM errors, cluster failures, failed backups, and server crashes - and multiple support tickets with both Microsoft and Dell - I decided to just uninstall the hardware VSS provider (HIT). I have not had any problems since. I figured performance would be significantly worse, but that has not been the case at all. DPM is working like a charm, and I question whether the hardware VSS provider (although great in theory) is really even worth the effort. I have enabled per-node serialization on all nodes and, again, haven't had any issues since uninstalling HIT (about 5-6 weeks ago). We will soon be installing DPM in another domain, and I am recommending that we uninstall HIT from all nodes prior to install.

    FYI.

    Enabling Per Node Serialization

    http://technet.microsoft.com/en-us/library/ff634192.aspx
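As a rough sketch of what that article describes: the registry path and value name below are assumptions from my reading of it, so verify against the article before applying anything.

```shell
:: Check the current DPM 2010 parallel-backup setting (path is an assumption)
reg query "HKLM\SOFTWARE\Microsoft\Microsoft Data Protection Manager\2.0\Configuration\MaxAllowedParallelBackups"

:: Limit DPM to one parallel Hyper-V backup per node (per-node serialization)
reg add "HKLM\SOFTWARE\Microsoft\Microsoft Data Protection Manager\2.0\Configuration\MaxAllowedParallelBackups" /v "Microsoft Hyper-V" /t REG_DWORD /d 1 /f
```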

    Wednesday, December 29, 2010 11:48 PM
  •  

    Well after having my case closed on me and voicemail ignored for about 6 weeks, I managed to get a hold of someone again in the DPM department. The long and the short of it is this:

    • They acknowledge there is a problem and a patch, but they are not able to work on a fix for me until I provide the following: application/system logs from both cluster nodes during the crash, the cluster events from during the crash, and the cluster.log file from the crash
    So I will now move a few of our VMs over to the second host and see how we fare. Will keep you posted.
    Tuesday, January 18, 2011 7:20 PM
  • liso,

    Any updates on your issue?

    Chris

    Friday, January 28, 2011 1:17 PM
  • As above - is there any fix for this problem yet? I have the same problem, and it's causing us a lot of pain!!
    Wednesday, February 16, 2011 11:36 AM
  • Yes, some! After moving all of our VMs to a single host, the problem disappeared. I've been working with Microsoft on this, and they've essentially said they can't do anything else until it happens again. I moved a single VM back to the second host at the beginning of this week, so we'll see how she fares. I'll post any updates, as well as case notes, if this works out. Sorry I don't have more for you!

     - Liam

    Thursday, February 17, 2011 4:16 PM
  • Liso, any new 411 with your MS case?  We have been dealing with this for some time in our environment:

    • 2 node 2008R2 SP1 with HIT disk/ASM/Hardware provider drivers for SAN (3.5.1) installed
    • CSV heartbeat on separate 1Gb interface
    • iSCSI on separate 10Gb interface
    • DPM 2010 (3.0.7707.0) on a 2008R2 SP1 physical server, iSCSI attached
    • Dell EqualLogic PS6510 (5.0.4)

    Event 1135 hits a node seemingly at random by day, but specifically at the starting hour of our DPM backups… All VMs force-migrate to the other node. Does anyone have a properly working hotfix from MS for this???

    Tuesday, August 30, 2011 4:34 PM
  • My apologies - I thought I'd updated this. We installed this hotfix: http://support.microsoft.com/default.aspx?scid=kb;en-US;2494162 and the problem hasn't returned. Hope this helps!

     - Liam

    • Proposed as answer by liso Tuesday, August 30, 2011 6:58 PM
    Tuesday, August 30, 2011 6:52 PM
  • In addition to the other fixes already mentioned, I'll throw a couple more links in the mix, for those still working on this issue:

     

    For those using hardware VSS providers, see this list:

    http://blogs.technet.com/b/dpm/archive/2010/02/05/tested-hardware-vss-provider-table.aspx

    NOTE: From what I've seen around the web, hardware VSS providers must be written to specifically support snapshotting of Cluster Shared Volumes. Just because a vendor has a hardware VSS provider doesn't mean it works properly with CSV in Hyper-V. Check with your storage vendor if you are unsure of CSV support in their hardware VSS provider!

     

    I also recommend this blog to keep up-to-date on the latest hotfixes for Hyper-V/Failover Clustering/DPM issues:

    http://www.hyper-v.nu/

     

    See this post in particular for hotfixes related directly to DPM 2010 and Hyper-V R2:

    http://www.hyper-v.nu/archives/hvredevoort/2010/02/hyper-v-r2-hotfixes-for-dpm-2010/

    Thursday, January 19, 2012 1:32 PM