Storage timeout in Failover Cluster Manager

    Question

  • Environment:

    EMC Unity 500F SAN

    ESXi 6.0.0 build 3620759

    Virtual Machines - 2 node cluster with Windows Server 2012 R2 and SQL 2014

    Raw Device Mapping to storage with physical SCSI

    Issue:

    During a UnityOS upgrade on the SAN, the LUNs fail over to the other storage processor (SP) while the SP that owns them reboots. This failover should be non-disruptive, and the server should not notice it, because there are multiple paths to the SPs. During the SP failover, however, the cluster logs the following Physical Disk resource error for each disk and moves the resources to the other node.

    “Ownership of cluster disk 'diskname scrubbed' has been unexpectedly lost by this node. Run the Validate a Configuration wizard to check your storage configuration.”

    At the same time, the System event log shows the following error for all drives/mountpoints –

    “The system failed to flush data to the transaction log. Corruption may occur in VolumeId: (drive ID scrubbed), DeviceName: \Device\HarddiskVolume18.

    ({Device Busy}

    The device is currently busy.)”

    These failovers/resource moves cause application outages.

    An SR was opened with EMC, and this was their response: After reviewing the Unity logs, we can see the LUNs trespassing as expected during the upgrade, and no issues are reported.

    EMC suggested there might be an issue with the SQL/Cluster application surviving a link being down.

    Is there a way to adjust the storage timeout setting in SQL and/or Failover Cluster Manager to stop a failover?

    Any help is appreciated!
    Tuesday, May 29, 2018 18:15

All replies

  • Hi, 

    As I understand it, when the LUNs fail over to the other storage processor, the Windows cluster logs the error “Ownership of cluster disk 'diskname scrubbed' has been unexpectedly lost by this node. Run the Validate a Configuration wizard to check your storage configuration.”
    From the Windows cluster's point of view, when a node loses its connection to the cluster disks, the cluster moves those disks to another node and tries to bring them online there. This is the normal cluster failover mechanism. The cluster logs can only tell us that the node lost its connection to the storage; to find out why it lost that connection, the storage vendor needs to be involved in troubleshooting.
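
    The failover rule described above can be sketched as a toy model. This is only an illustration of the reasoning, not how the cluster service is implemented; the names Node, can_reach_storage and pick_owner are hypothetical and do not correspond to any Windows or EMC API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    # Hypothetical flag: True if this node's I/O to the cluster disk succeeds.
    can_reach_storage: bool


def pick_owner(current_owner: Node, nodes: List[Node]) -> Optional[Node]:
    """Return the node that should own the disk after a health check.

    Toy rule taken from the description above: the owner keeps the disk
    while it can reach storage; otherwise the disk moves to another node
    that can reach storage. None means no node can bring the disk online.
    """
    if current_owner.can_reach_storage:
        return current_owner
    for node in nodes:
        if node is not current_owner and node.can_reach_storage:
            return node
    return None


# The owning node loses its connection during the SP reboot,
# so the disk moves to the other node.
node1 = Node("NODE1", can_reach_storage=False)
node2 = Node("NODE2", can_reach_storage=True)
print(pick_owner(node1, [node1, node2]).name)  # NODE2
```

    In this model the only way to avoid the move is for the owning node to keep seeing the storage as reachable during the SP reboot, which is why the cause of the lost connection has to be found on the storage and path side.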

    Best Regards, 


    Frank

    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com


    Wednesday, May 30, 2018 08:29