Draining the last 2012 R2 node in a 3-node cluster while migrating to 2016 took the entire cluster offline because the disks went offline - any ideas?

  • Question

  • I have a 3-node HP cluster with each node directly attached to a JBOD device. Each node shows 4 CSVs, all healthy.

    I upgraded 2 of them from 2012 R2 by draining the node, reinstalling with 2016 and then adding it back to the cluster, and everything worked fine.

    I started the same process today and all the disks went offline, along with all the virtual machines in the cluster apart from 2 that were still on the original host because they had failed to migrate.

    I got everything back by unpausing the node, failing the machines back and restarting all the failed migrations.

    Needless to say, that was not my favourite part of the day.
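
    For reference, the per-node cycle I'm following is roughly the one sketched below - it's an outline rather than a transcript of exactly what I ran, and the node name is a placeholder:

    # Drain roles off the node that is about to be rebuilt (node name is a placeholder)
    Suspend-ClusterNode -Name "HV-NODE1" -Drain -Wait

    # Evict the node, rebuild it on 2016, then join it back to the cluster
    Remove-ClusterNode -Name "HV-NODE1"
    Add-ClusterNode -Name "HV-NODE1"

    # Bring the node back into service and fail roles back
    Resume-ClusterNode -Name "HV-NODE1" -Failback Immediate

    # Update-ClusterFunctionalLevel only runs once every node is on 2016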

    Event details for the drives show events such as 

    ===============

    Cluster Shared Volume 'Volume1' ('Cluster Virtual Disk (Volume 1)') has entered a paused state because of '(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.

    Cluster resource 'Cluster Virtual Disk (Volume 1)' of type 'Physical Disk' in clustered role '1912d37e-0360-434e-9212-3083db0d23fb' failed. The error code was '0x2' ('The system cannot find the file specified.').

    Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

    Cluster physical disk resource online failed.

    Physical Disk resource name: Cluster Virtual Disk (Volume 1)
    Device Number: 4294967295
    Device Guid: {00000000-0000-0000-0000-000000000000}
    Error Code: 3224895541
    Additional reason: AttachSpaceFailure

    Cluster resource 'Cluster Virtual Disk (Volume 1)' of type 'Physical Disk' in clustered role '1912d37e-0360-434e-9212-3083db0d23fb' failed. The error code was '0xc0380035' ('The pack does not have a quorum of healthy disks.').

    Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

    =======================
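
    For what it's worth, checking the resource and group state the way the event text suggests looks roughly like this (just a sketch using the built-in FailoverClusters cmdlets):

    # State and owner of the clustered "Physical Disk" resources backing the CSVs
    Get-ClusterResource | Where-Object { $_.ResourceType -eq "Physical Disk" } |
        Format-Table Name, State, OwnerGroup, OwnerNode -AutoSize

    # The CSVs themselves, with their current owner node
    Get-ClusterSharedVolume | Format-Table Name, State, OwnerNode -AutoSize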

    The pools are owned by this last node.

    After bringing the last node back online, I saw that all the CSVs showed an operational status of "Regenerating" for a while before going back to healthy (a rough way to watch this is sketched below).
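
    To keep an eye on the pool, the virtual disks and any regeneration jobs, something along these lines works (a sketch; I believe Get-StorageJob has to be run on the node that currently owns the pool to show the repair jobs):

    # Pool and virtual disk health
    Get-StoragePool -IsPrimordial $false | Format-Table FriendlyName, OperationalStatus, HealthStatus -AutoSize
    Get-VirtualDisk | Format-Table FriendlyName, OperationalStatus, HealthStatus -AutoSize

    # Any repair/regeneration jobs that are still running
    Get-StorageJob | Format-Table Name, JobState, PercentComplete -AutoSize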

    Has anyone seen this behaviour before and got some pointers on anything I might have missed in the process?


    http://absoblogginlutely.net

    Thursday, January 24, 2019 5:43 PM

All replies

  • Hi,

    Thanks for your question.

    I started the same process today and all the disks went offline, along with all the virtual machines in the cluster apart from 2 that were still on the original host because they had failed to migrate.>>>

    I’m a little confused by this sentence. Do you mean that the disks and machines stayed online only on the one original node, and everything else went offline?

    Please try the following steps to see if they help.

    1. Please check the connection between the nodes and the JBOD device.

    2. While performing the rolling upgrade of the OS, we also need to check that the drivers for the adapters connecting to the JBOD are correct and up to date on the new OS.

    3. Please run "diskpart" to check the current state of the disks and volumes mounted on the owner node, or monitor the state in Disk Management on the owner node through the GUI, so that we can collect more information. The "list disk" and "list volume" commands below are run at the DISKPART> prompt (a PowerShell equivalent is sketched after them).

    diskpart
    DISKPART> list disk
    DISKPART> list volume
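
    If it is easier, the same information can also be collected with the Storage module cmdlets, roughly like this:

    # Same information as diskpart, collected via PowerShell
    Get-Disk | Format-Table Number, FriendlyName, OperationalStatus, Size, PartitionStyle -AutoSize
    Get-Volume | Format-Table DriveLetter, FileSystemLabel, FileSystem, HealthStatus, Size -AutoSize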

    Hope this helps. If you have any question or concern, please feel free to let me know.

    Best regards,

    Michael


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com

    Friday, January 25, 2019 6:21 AM
    Moderator
  • Thanks for the reply Michael,

    Very shortly after running a pause/drain roles command on the server, all the cluster shared volumes on the other two nodes went offline, taking down the virtual machines that were on them. All but 2 of the servers on this machine went into a failed state because they were in the migration queue; the other 2 went into a warning state because they had not gone through the migration. I was left with 2 servers up plus the 3 hosts. However, as soon as I unpaused the server, a couple of the machines on it auto-started and booted successfully. After a couple of minutes the CSV drives became available on all cluster nodes and I was able to restart all the machines again.

    As the CSVs came back online after the resume command, I'm not convinced it's a hardware connection issue between the nodes and the JBOD. I haven't even got to changing the OS on this server yet, so it's unlikely to be a driver issue as nothing has changed on this machine. The 2 upgraded hosts have been running for over a week with no issues. The diskpart output you asked for is below, followed by what I plan to capture on the next attempt.

    DISKPART> list disk

     Disk ###  Status         Size     Free     Dyn  Gpt
     --------  -------------  -------  -------  ---  ---
     Disk 0    Online          136 GB      0 B        *
     Disk 25   Reserved       4086 GB      0 B        *
     Disk 26   Reserved       4094 GB      0 B        *
     Disk 27   Reserved       6636 GB      0 B        *
     Disk 28   Reserved       4364 GB      0 B        *
    DISKPART> list volume

     Volume ###  Ltr  Label        Fs     Type        Size     Status     Info
     ----------  ---  -----------  -----  ----------  -------  ---------  --------
     Volume 0     C                NTFS   Partition    136 GB  Healthy    Boot
     Volume 1         Recovery     NTFS   Partition    300 MB  Healthy    Hidden
     Volume 2                      FAT32  Partition     99 MB  Healthy    System
     Volume 3         Volume 2     CSVFS  Partition   4085 GB  Healthy
       C:\ClusterStorage\Volume2\
     Volume 4         Volume 1     CSVFS  Partition   4093 GB  Healthy
       C:\ClusterStorage\Volume1\
     Volume 5         Volume 3  (  CSVFS  Partition   6635 GB  Healthy
       C:\ClusterStorage\Volume3\
     Volume 6         Volume 4 (f  CSVFS  Partition   4363 GB  Healthy
       C:\ClusterStorage\Volume4\

    DISKPART>
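
    Next time I pause the node I also plan to capture the per-node CSV state while it is drained, roughly like this (a sketch; it can be run from any node with the FailoverClusters module):

    # Per-node view of each CSV (Direct, FileSystemRedirected, BlockRedirected, NoAccess)
    Get-ClusterSharedVolumeState |
        Sort-Object VolumeFriendlyName, Node |
        Format-Table VolumeFriendlyName, Node, StateInfo -AutoSize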

    I won't be able to attempt the upgrade again until another maintenance window becomes available in a couple of weeks, but I'm hoping to get some ideas about what could have caused it before I try again.


    http://absoblogginlutely.net

    Friday, January 25, 2019 12:06 PM
  • Bump - any other suggestions anyone?

    http://absoblogginlutely.net

    Thursday, January 31, 2019 6:13 PM
    Just an update that I'm still having this issue. A ticket has been logged with Microsoft PSS and I'm currently waiting on them to build a lab environment to reproduce the problem. In the meantime, this particular cluster node cannot be drained without taking the virtual machines offline.

    http://absoblogginlutely.net

    Monday, March 25, 2019 4:55 PM
  • 3 months later, I'm still trying to get somewhere with PSS.

    Today we noticed that the two new 2016 servers do not show any of the cluster shared volumes under Computer Management > Storage > Disk Management. The drives show up correctly as CSVFS on the 2012 R2 server.

    Running Get-VirtualDisk shows all drives correctly on all of the servers, as follows.

    PS C:\Windows\system32> Get-VirtualDisk

    FriendlyName ResiliencySettingName OperationalStatus HealthStatus IsManualAttach    Size
    ----------- --------------------- ----------------- ------------ --------------    ----
    Volume 1     Mirror                OK                Healthy      True              4 TB
    Volume 2     Mirror                OK                Healthy      True           3.99 TB
    Volume 4     Mirror                OK                Healthy      True           4.26 TB
    Volume 3     Parity                OK                Healthy      True           6.48 TB

    All 3 servers show (near enough) identical data for all the drives with Get-PhysicalDisk. The 2012 R2 server doesn't show drive serial numbers but 2016 does. A rough way to line the hosts up side by side is sketched after the output below.

    PS C:\Windows\system32> get-physicaldisk | sort friendlyname

    FriendlyName   SerialNumber         CanPool OperationalStatus HealthStatus Usage            Size
    ------------   ------------         ------- ----------------- ------------ -----            ----
    PhysicalDisk1  Z1Y1DTLB0000C420AEWF False   OK                Healthy      Auto-Select   1.82 TB
    PhysicalDisk10 Z1Y1BADH0000C420AASA False   OK                Healthy      Auto-Select   1.82 TB
    PhysicalDisk11 Z1X13VQG00009414W8KX False   OK                Healthy      Auto-Select   1.82 TB
    PhysicalDisk12 Z1Y1B0PY000094198L1T False   OK                Healthy      Auto-Select   1.82 TB
    PhysicalDisk13 Z1Y1BLQ40000C420AAAP False   OK                Healthy      Auto-Select   1.82 TB
    PhysicalDisk14 Z1Y1B1GJ0000C420AC48 False   OK                Healthy      Auto-Select   1.82 TB

    <snip>
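
    A quick way to line the three hosts up side by side is something like this (a sketch; the node names are placeholders and it assumes PowerShell remoting is enabled between the hosts):

    # Compare the physical disk view from all three hosts
    Invoke-Command -ComputerName "HV-NODE1","HV-NODE2","HV-NODE3" -ScriptBlock {
        Get-PhysicalDisk | Select-Object FriendlyName, SerialNumber, CanPool, OperationalStatus, HealthStatus, Size
    } | Sort-Object PSComputerName, FriendlyName |
        Format-Table PSComputerName, FriendlyName, SerialNumber, OperationalStatus, HealthStatus -AutoSize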

    The only thing I can find at the moment is that when I go to File and Storage Services > Volumes > Disks, I get errors for both of the 2016 servers:

    Error occurred during storage enumeration - The requested operation is not supported.

    There is also "The server pool does not contain all cluster members. Incomplete communication with CLUSTER. The following cluster nodes or clustered roles might be offline or have connectivity issues: CAUCLUSTERhcr." To the best of my knowledge that's part of Cluster-Aware Updating. However, we had to disable patching on the servers too, because draining a server to patch it would take the entire environment offline.

    Has anyone got any ideas why all of the cluster's drives would disappear when the owner node is drained? For reference, a rough sketch of checking and moving ownership before a drain follows below.
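
    This is only a sketch (the node name is a placeholder), but it shows how to see which node owns the clustered storage pool and how to move a CSV off a node before draining it:

    # Which node currently owns the clustered storage pool
    Get-ClusterResource | Where-Object { $_.ResourceType -eq "Storage Pool" } |
        Format-Table Name, State, OwnerNode -AutoSize

    # Move CSV ownership to another node before draining (node name is a placeholder)
    Move-ClusterSharedVolume -Name "Cluster Virtual Disk (Volume 1)" -Node "HV-NODE2"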


    http://absoblogginlutely.net

    Tuesday, April 23, 2019 9:59 PM