Hyper-V Service Cluster restart during DPM 2010 backup

  • Question

  • Hello,

     

    I’m having issues with DPM backups on CSV volumes.

     

    I’m trying to back up two or more VMs concurrently on a two-node Hyper-V cluster.

    The problem is this: when I back up two VMs that are on different nodes but on the same CSV, the Failover Clustering service restarts on one of the nodes.

    This causes a failover to the remaining host and downtime for the VMs.

     

    There is no issue if the backed-up VMs are on the same node or on different CSVs.

     

    The situation is very similar to the one described in the following KB: http://support.microsoft.com/kb/975354

    However, that hotfix is already installed on both nodes.

     

    Environment:

    · Hyper-V OS: Windows Server 2008 R2 Datacenter

    · SAN: EqualLogic PS6000

    · SAN hardware snapshot provider: DELL EqualLogic VSS HW Provider, version 3.3.1.4944

    · DPM: DPM 2010 RTM

      

    Below are the events logged on a host that exhibits the issue.

     

    Event 5121 at 2:00:34 AM

    Cluster Shared Volume 'Volume1' ('EQLPRD3') is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network through the node that owns the volume. This may result in degraded performance. If redirected access is turned on for this volume, please turn it off. If redirected access is turned off, please troubleshoot this node's connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished.

     

    Event 1038 at 2:01:44 AM

    Ownership of cluster disk 'EQLPRD2' has been unexpectedly lost by this node. Run the Validate a Configuration wizard to check your storage configuration.

     

    Event 1038 at 2:01:45 AM

    Ownership of cluster disk 'Quorum' has been unexpectedly lost by this node. Run the Validate a Configuration wizard to check your storage configuration.

     

    Event 4201 at 2:01:46 AM

    Isatap interface isatap.{55DFCEFD-ED71-4C57-9277-8FBD5219D184} is no longer active.

     

    Event 7031 at 2:01:51 AM

    The Cluster Service service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 60000 milliseconds: Restart the service.
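
To make the timing of these events easier to see, here is a minimal sketch that computes how many seconds after the first event (5121) each later event was logged. The event IDs and timestamps are copied from the excerpts above; the names `EVENTS`, `parse` and `offsets` are only illustrative and are not part of any Windows or DPM tooling.

```python
from datetime import datetime

# Event IDs and timestamps copied from the log excerpts above.
EVENTS = [
    (5121, "2:00:34 AM"),  # CSV no longer directly accessible (redirected I/O)
    (1038, "2:01:44 AM"),  # ownership of cluster disk 'EQLPRD2' lost
    (1038, "2:01:45 AM"),  # ownership of cluster disk 'Quorum' lost
    (4201, "2:01:46 AM"),  # ISATAP interface no longer active
    (7031, "2:01:51 AM"),  # Cluster Service terminated unexpectedly
]


def parse(ts):
    """Parse a 12-hour clock time such as '2:00:34 AM' (date is ignored)."""
    return datetime.strptime(ts, "%I:%M:%S %p")


def offsets(events):
    """Return (event_id, seconds elapsed since the first event) pairs."""
    start = parse(events[0][1])
    return [(eid, int((parse(ts) - start).total_seconds())) for eid, ts in events]


if __name__ == "__main__":
    for eid, secs in offsets(EVENTS):
        print(f"Event {eid}: +{secs}s")
```

Read this way, the volume went into redirected access first, disk ownership was lost about 70 seconds later, and the Cluster Service terminated 77 seconds after the first event.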

     

    Has anyone seen and solved this problem, or is this normal behavior (which I doubt)?

    Thanks for the help.

    • Moved by Praveen D [MSFT] Monday, July 19, 2010 6:37 AM Moving to DPM Hyper-V Protection Forum (From:Data Protection Manager)
    Thursday, July 1, 2010 1:29 PM

All replies

  • Hello,

     

    When DPM backs up a guest located on a CSV, the CSV is moved to the node running the guest, a snapshot of the CSV is taken, and the backup is performed from the mounted snapshot.  In your case, you have a VSS hardware provider, so the snapshot should take no longer than 1 or 2 minutes at the very most.  During the brief period in which the snapshot is being created, the VMs running on that CSV go into redirected I/O mode, and once the snapshot is created they go back into direct I/O mode.  At that point, if another VM is scheduled to be backed up, the CSV is moved to the host running that VM and the process repeats.

     

    What it sounds like to me is that the network between the cluster nodes is not able to handle the redirected I/O, so the cluster fails.  Please review your network topology, speed and configuration.


    Regards, Mike J [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.
    Thursday, July 1, 2010 2:31 PM
    Moderator
  • Hello,

    I am a colleague of the topic creator and I'm following this issue as well. Thank you for your answer.

     

    We use a 2 GBps NIC team for the CSV network, so it is not very likely that the network bandwidth is insufficient. Furthermore, the VMs in question are mostly inactive, so traffic is minimal (we are still in the early period of the virtualization process), and there are few of them (fewer than 10).

    We will test bandwidth though.

     

    Even if that is the case, is it expected that the cluster service terminates unexpectedly?

    I would understand if all VMs on this CSV were slowed down or even became unavailable, but not this behavior.

     

    Thanks for the help!

    Thursday, July 1, 2010 2:58 PM
  • "During that brief time period of the snapshot being created, the VMs running on that CSV will go into redirection I/O mode and once the snapshot is created go back into direct I/O mode.  At this time, if another VM is scheduled to be backed up, the CSV will be moved to the host running that VM and the process repeats."

    Just to confirm:

    Does that mean that if DPM tries to launch a VM backup on a node while the VM's CSV is in redirected I/O mode, it will wait for the current owner node to "let go" of the CSV before asking the second node to take ownership of it?

    Thursday, July 1, 2010 3:07 PM
  • Hello,

    There is a Windows 2008 hotfix to prevent that scenario from happening. It is a prerequisite before DPM can protect Hyper-V guests. Specifics for the fix are outlined in the article below.

    975354 A Hyper-V update rollup package is available for a computer that is running Windows Server 2008 R2
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;975354

    More information on CSV and DPM: http://technet.microsoft.com/en-us/library/ff634189.aspx

     

    As for the cluster service crashing, that is neither good nor normal.  I would recommend installing the latest Windows 2008 fix that includes clussvc.exe and seeing if that helps.

    978001 Cluster resources do not automatically fail over to another node when you disconnect the private and public network interfaces in a Windows Server 2008 failover cluster or in a Windows Server 2008 R2 failover cluster
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;978001

     


    Regards, Mike J [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.
    Thursday, July 1, 2010 5:00 PM
    Moderator
  • I had the exact same error, even though I have all available updates installed on my cluster nodes.

    My solution was to uninstall the EqualLogic hardware VSS provider. I then enabled serialization as per the following TechNet article: http://technet.microsoft.com/en-us/library/ff634192.aspx

    After this, the backups and the cluster were running fine.

    I am going to open a case with EqualLogic support.

    Tuesday, July 20, 2010 3:33 PM
  • I believe that the following would fix your issue:

    2277439 The Cluster service stops responding if you run backup applications in parallel in Windows Server 2008 R2
    http://support.microsoft.com/default.aspx?scid=kb;en-US;2277439


    Cheers, Tyler F [MSFT] - This posting is provided "AS IS" with no warranties, and confers no rights.
    Friday, September 3, 2010 8:12 PM
    Moderator
  • If the suggested answer above does not resolve the issue, please re-open the thread.
    Cheers, Tyler F [MSFT] - This posting is provided "AS IS" with no warranties, and confers no rights.
    Friday, September 3, 2010 8:12 PM
    Moderator
  • Hello,


    I am having the same problem. Specifically:

    Log Name:      System
    Source:        Microsoft-Windows-Iphlpsvc
    Date:          8/27/2012 6:01:42 PM
    Event ID:      4201
    Task Category: None
    Level:         Information
    Keywords:      
    User:          SYSTEM
    Computer:      NGKEAHVHR7.KETSRD.ketsds.net
    Description:
    Isatap interface isatap.{D06CD3FF-4CEE-4A09-A524-8F7CB4D62278} is no longer active.

    followed by

    Log Name:      System
    Source:        Service Control Manager
    Date:          8/27/2012 6:01:43 PM
    Event ID:      7031
    Task Category: None
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      NGKEAHVHR7.KETSRD.ketsds.net
    Description:
    The Cluster Service service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 60000 milliseconds: Restart the service.

    This is on a two-node 2008 R2 SP1 cluster (which includes the fixes referenced above) using an EqualLogic SAN and DPM 2010.  Any suggestions on how to further diagnose or correct this?

    Thanks!

    Martin

    Tuesday, August 28, 2012 3:20 PM