Partition information lost on cluster shared disk

  • Question

  • Hi everyone,


    we've got a cluster virtual disk whose partition table and volume name were lost. Has anyone experienced a similar problem and got some hints on how to recover?


    The problem occurred last Friday. I restarted node3 for Windows updates. During the restart, node1 had a bluescreen and also restarted. The failover cluster manager tried to bring the cluster resources online but failed several times. The resource failovers finally settled on node1, which came back up shortly after the crash. Many virtual disks were in an unhealthy state, but the repair process managed to repair them all, so they are now healthy again. We can't explain why node1 crashed. Since the storage pool uses dual parity, the disks should keep working even with only 2 nodes running.

    One virtual disk, however, lost its partition information.


    Network config:

    Hardware: 2x Emulex OneConnect OCe14102-NT, 2x Intel(R) Ethernet Connection X722 for 10GBASE-T

    Backbone network: on the "right" Emulex network card (the only members of this subnet are the 4 nodes)

    Client-access teaming network: Emulex "left" and Intel "left" cards in a team; 1 untagged network and 2 tagged networks


    Software Specs:

      • Windows Server 2016
      • Cluster with 4 nodes
      • Failover Cluster Manager + File Server roles running on the cluster
      • 1 storage pool with 36 HDDs / 12 SSDs (9 HDDs / 3 SSDs per node)
      • Virtual disks are configured to use dual parity:

        Get-VirtualDisk Archiv | Get-StorageTier | fl

        FriendlyName           : Archiv_capacity
        MediaType              : HDD
        NumberOfColumns        : 4
        NumberOfDataCopies     : 1
        NumberOfGroups         : 1
        ParityLayout           : Non-rotated Parity
        PhysicalDiskRedundancy : 2
        ProvisioningType       : Fixed
        ResiliencySettingName  : Parity

      Hardware Specs per Node:

      • 2x Intel Xeon Silver 4110
      • 9 HDDs of 4 TB each and 3 SSDs of 1 TB each
      • 32 GB RAM

    Additional information:

    The virtual disk is currently in a Healthy state:

    Get-VirtualDisk -FriendlyName Archiv

    FriendlyName ResiliencySettingName OperationalStatus HealthStatus IsManualAttach   Size

    ------------ --------------------- ----------------- ------------ --------------   ----
    Archiv                             OK                Healthy      True           500 GB


    The storage pool is also healthy:

    PS C:\Windows\system32> Get-StoragePool
    FriendlyName   OperationalStatus HealthStatus IsPrimordial IsReadOnly

    ------------   ----------------- ------------ ------------ ----------
    Primordial     OK                Healthy      True         False
    Primordial     OK                Healthy      True         False
    tn-sof-cluster OK                Healthy      False        False


    Since the incident, the event log (of the current master, node2) has shown various errors for this disk, such as:

    [RES] Physical Disk <Cluster Virtual Disk (Archiv)>: VolumeIsNtfs: Failed to get volume information for \\?\GLOBALROOT\Device\Harddisk13\ClusterPartition2\. Error: 1005.
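
    For what it's worth, error 1005 translates to "The volume does not contain a recognized file system", which fits the lost partition/volume information. The disk and partition layout behind the cluster disk can be inspected on the owner node with something like this (just a sketch):

    # Map the virtual disk to its underlying disk object and list partitions and volumes
    Get-VirtualDisk -FriendlyName Archiv | Get-Disk |
        Format-List Number, PartitionStyle, OperationalStatus, HealthStatus

    Get-VirtualDisk -FriendlyName Archiv | Get-Disk | Get-Partition |
        Format-Table PartitionNumber, Type, Size

    Get-VirtualDisk -FriendlyName Archiv | Get-Disk | Get-Partition | Get-Volume |
        Format-Table FileSystemLabel, FileSystem, HealthStatus, Size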


    Before the incident we also had errors that might indicate a problem:

    [API] ApipGetLocalCallerInfo: Error 3221356570 calling RpcBindingInqLocalClientPID.


    Our suspicions so far:

    We made registry changes under SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\0001 (through 0009) and set the value PnPCapabilities to 280, which disables the checkbox "Allow the computer to turn off this device to save power". Not all network adapters support this checkbox, though, so this may have had side effects.
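
    Roughly, the change was applied like this (a sketch of our approach; the subkeys 0001-0009 and the value 280 are specific to our environment, and a reboot is needed for the change to take effect):

    # Set PnPCapabilities = 280 (0x118) on the NIC class subkeys 0001..0009 to disable
    # "Allow the computer to turn off this device to save power"
    $classKey = 'HKLM:\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}'
    foreach ($i in 1..9) {
        $subKey = Join-Path $classKey ('{0:D4}' -f $i)   # 0001 .. 0009
        if (Test-Path $subKey) {
            New-ItemProperty -Path $subKey -Name PnPCapabilities -Value 280 -PropertyType DWord -Force | Out-Null
        }
    }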



    One curiosity: after the error we noticed that one of the 2 tagged networks had the wrong subnet on two nodes. This may have caused some of the failover role switches that occurred on Friday, but we don't know how it happened, since the networks had been configured correctly some time before.
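
    To spot this kind of drift, the subnets the cluster currently sees and the per-node interfaces can be listed like this (a sketch):

    # Cluster networks and their subnets/roles as seen by the cluster
    Get-ClusterNetwork | Format-Table Name, Address, AddressMask, Role, State

    # Which adapter on which node belongs to which cluster network
    Get-ClusterNetworkInterface | Format-Table Node, Network, Adapter, Address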

    We've had a similar problem in our test environment after activating jumbo frames on the network interfaces. In that case we lost more and more filesystems after moving the file server role to another server. In the end all filesystems were lost and we reinstalled the whole cluster without enabling jumbo frames.
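
    In case someone wants to compare: the jumbo frame setting can be checked per adapter via the standardized *JumboPacket keyword (a sketch; the exact keyword and display name can vary by driver):

    # Show the configured jumbo packet size on all adapters
    Get-NetAdapterAdvancedProperty -Name * -RegistryKeyword '*JumboPacket' -ErrorAction SilentlyContinue |
        Format-Table Name, DisplayName, DisplayValue, RegistryValue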

    We now suspect that mixing two different network card models in the same network team may cause this problem.

    What are your ideas? What may have caused the problem and how can we prevent this from happening again?

    We could endure the loss of this virtual disk since it was only archive data and we have a backup, but we'd like to be able to fix this problem.

    Best regards

    Tobias Kolkmann


    Wednesday, June 12, 2019 4:33 PM

All replies

  • Hello,

    Thank you for posting in our forum!

    1. Please check whether there are any related error messages in Event Viewer on node1.

    2. Please run cluster validation and upload the report to our forum.

    3. Please check whether the update installed on node3 applies to the system, and if so, try to install it on the other three nodes as well.

    Regards,

    Daniel


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Thursday, June 13, 2019 7:21 AM
    Moderator
  • Hi Daniel,


    thanks for your reply.

    1. Errors related to which problem? The bluescreen was an NTOSKRNL error with bug check code 0x00000133 (the file system shouldn't be damaged by that anyway, since up to 2 nodes can be offline and the resources should still be available). There are lots of errors from when node1 tried to host the cluster role, but they are similar to those on node2 (the current master), so I don't really know which events are most interesting. No cluster-related events have occurred on node1 since node2 became master. The last event regarding the file system is:

    Volume  is formatted as ReFS but ReFS is unable to mount it; ReFS encountered status A data integrity checksum error occurred. Data in the file stream is corrupt..

    This occurred while node1 was the role master. There is no hint on how to resolve this error, though...
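
    In case it helps, the related events can be pulled on a node with something like this (a sketch; the time window simply covers the Friday of the incident and should be adjusted as needed):

    # Critical/error System events around the incident, filtered to ReFS, disk and cluster sources
    $from = Get-Date '2019-06-07'
    $to   = Get-Date '2019-06-09'
    Get-WinEvent -FilterHashtable @{ LogName = 'System'; Level = 1, 2; StartTime = $from; EndTime = $to } |
        Where-Object { $_.ProviderName -match 'ReFS|disk|FailoverClustering' } |
        Select-Object TimeCreated, ProviderName, Id, Message |
        Format-List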


    2. Going to post the warnings below.


    3. The 2019-05 cumulative update has been installed on all nodes. The cluster validation report and Windows Update indicate that the update is missing on node2 and node4, but this is probably due to the buggy nature of this particular update: we've had several servers that wanted to install the update a second time even though it had already been installed successfully.
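
    A quick cross-check from PowerShell (a sketch; I'm assuming the node names follow the tn-sof-0x pattern from the validation report):

    # Verify that KB4494440 is reported as installed on every node
    $nodes = 'tn-sof-01', 'tn-sof-02', 'tn-sof-03', 'tn-sof-04'
    foreach ($node in $nodes) {
        Get-HotFix -Id KB4494440 -ComputerName $node -ErrorAction SilentlyContinue |
            Select-Object @{ n = 'Node'; e = { $node } }, HotFixID, InstalledOn
    }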



    Cluster validation report warnings:

    --------------------------------

    Validating cluster resource NFS-tn-sof.

    This resource is not configured to use the standard 'Pending timeout' value for the resource type. 'Pending timeout' specifies the length of time the resource can take to change states between Online and Offline before the Cluster service puts the resource in the Failed state. The default setting is generally recommended, but there may be situations where other values may be preferable. This setting can be configured in Failover Cluster Manager by selecting 'Properties' and then selecting the 'Policies' tab.

    Comment: NFS has always been running "unstable" and took a long time to start after moving the role to another node, so we tried some adjustments (including the pending timeout) to counter this problem; see the sketch after these warnings. NFS usually takes 20-60 minutes to start successfully on the active node. That shouldn't cause any file system corruption, though.

    --------------------------------

    Software Updates missing on 'tn-sof-02.[domain]':

    Software Updates missing on 'TN-SOF-04.[domain]':

    As mentioned above, the

    2019-05 Cumulative Update for Windows Server 2016 for x64-based Systems (KB4494440)

    is already installed on all systems and node2 and node4 both show a successful installation in the update log.

    --------------------------------

    --------------------------------

    All in all the cluster seems to be pretty healthy, at least according to the validation wizard. But the missing partition information is a major issue. By the way: the cluster validation wizard hasn't been very reliable in the past. Sometimes no problems at all were found, but a few minutes later, without any settings having been changed, there were errors like:

    --------------------------------

    List Storage Enclosures
    Description: List all enclosures and their health status.

    Start: 14.02.2019 16:26:33.

    An error occurred while executing the test.
    One or more errors occurred.

    ERROR CODE : 0x80131500;
    NATIVE ERROR CODE : 1.
     The I/O operation has been aborted because of either a thread exit or an application request.

    Comment: Maybe the validation runs into timeouts when the load is high? (See the sketch after these warnings.)

    --------------------------------

    --------------------------------
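
    For reference, both the pending timeout adjustment and a targeted re-run of the storage validation can be done from PowerShell (a sketch; the 20-minute value and the node names are only examples):

    # Inspect / raise the pending timeout (in milliseconds) of the NFS resource
    (Get-ClusterResource -Name 'NFS-tn-sof').PendingTimeout
    (Get-ClusterResource -Name 'NFS-tn-sof').PendingTimeout = 20 * 60 * 1000

    # Re-run only the storage category of cluster validation in a quiet window
    # (storage tests are skipped for disks that are online in the cluster)
    Test-Cluster -Node tn-sof-01, tn-sof-02, tn-sof-03, tn-sof-04 -Include 'Storage'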

    Best regards

    Tobias

    Thursday, June 13, 2019 9:36 AM
  • Maybe we can begin with a question that is not quite as complex: Are our dual parity disks configured correctly?

    PS C:\Windows\system32> Get-VirtualDisk |  ft FriendlyName,IsEnclosureAware,ResiliencySettingName,NumberOfAvailableCopies, NumberOfColumns, NumberOfDataCopies, OperationalStatus
    
    FriendlyName IsEnclosureAware ResiliencySettingName NumberOfAvailableCopies NumberOfColumns NumberOfDataCopies OperationalStatus
    ------------ ---------------- --------------------- ----------------------- --------------- ------------------ -----------------
    Archiv                                                                                                         OK
    admin_csv    False            Parity                                        4               1                  OK
    NFS-Space                                                                                                      OK
    
    
    PS C:\Windows\system32> Get-VirtualDisk |  get-storagetier | ft FriendlyName,IsEnclosureAware,ResiliencySettingName,NumberOfAvailableCopies, NumberOfColumns, NumberOfDataCopies, OperationalStatus
    
    FriendlyName         IsEnclosureAware ResiliencySettingName NumberOfAvailableCopies NumberOfColumns NumberOfDataCopies OperationalStatus
    ------------         ---------------- --------------------- ----------------------- --------------- ------------------ -----------------
    Archiv_capacity                       Parity                                                      4                  1
    NFS-Space_capacity                    Parity                                                      4                  1

    The virtual disks "Archiv" and "NFS-Space" were created with the server manager. The server manager dispays the information that this is indeed a dual parity configuration. admin_csv was created manually with New-VirtualDisk, thus there is no storagetier-reference.

    How important is the option "IsEnclosureAware"? Theoretically the cluster disks shouldn't have any problems even if two nodes are offline, thanks to the dual parity mode. But in fact we did lose one file system, which may have happened while two nodes were offline, so maybe there is a conceptual problem we haven't configured correctly.

    Our expectation is that any dual parity disk will survive the failure of up to 2 servers/storage enclosures. Important storage pool information:

    PS C:\Windows\system32> Get-StoragePool tn-sof-cluster | ft FriendlyName, EnclosureAwareDefault, HealthStatus, IsClustered, OperationalStatus, RepairPolicy, ResiliencySettingNameDefault
    
    FriendlyName   EnclosureAwareDefault HealthStatus IsClustered OperationalStatus RepairPolicy ResiliencySettingNameDefault
    ------------   --------------------- ------------ ----------- ----------------- ------------ ----------------------------
    tn-sof-cluster                 False Healthy             True OK                Parallel     Mirror

    Is our configuration correct to meet our expectation? And... do we need to configure enclosure awareness manually? Just wondering, because if the cluster doesn't know about storage enclosures, it may place up to 4 disks/columns in one enclosure and therefore risk data loss if that enclosure crashes.
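
    If it matters: on Server 2016 the fault domain of the pool and virtual disks is exposed as FaultDomainAwareness (values like PhysicalDisk, StorageEnclosure, StorageScaleUnit). For the expectation of surviving two failed servers, the fault domain would have to be at the node or enclosure level rather than PhysicalDisk. It can be checked roughly like this (a sketch):

    # Default fault domain of the pool and the fault domain of each virtual disk / tier
    Get-StoragePool tn-sof-cluster |
        Format-Table FriendlyName, FaultDomainAwarenessDefault, EnclosureAwareDefault

    Get-VirtualDisk |
        Format-Table FriendlyName, FaultDomainAwareness, IsEnclosureAware

    Get-VirtualDisk | Get-StorageTier |
        Format-Table FriendlyName, FaultDomainAwareness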

    Best regards

    Tobias

    Thursday, June 13, 2019 2:47 PM
  • Hi,

    I need some time to study this problem. I'll contact you as soon as I have made progress.

    Thank you for your understanding!

    Regards,

    Daniel


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Monday, June 17, 2019 8:43 AM
    Moderator
  • Hi Daniel,

    is this still a work in progress? No pressure intended, just wanted to check.

    Best regards

    Tobias

    Tuesday, June 25, 2019 3:52 PM
  • *push*

    Has anyone got an idea? If not, where can we get support? Professional support declines root cause analysis and data recovery, yet this seems to be a critical issue; no one feels responsible for it.

    Premier Support (https://partner.microsoft.com/en-us/support/microsoft-services-premier-support) was recommended. I haven't looked into the details yet, but it's probably an expensive general-purpose premium service rather than something for a single specific issue, correct?

    It seems a little unfair, though, that when the cluster doesn't work correctly the "victims" have to pay twice (lost data + premium support). We can't see any serious configuration problems on our part that may have caused the issue.

    The cluster seems to be getting slower each time we change the role master. Virtual disks tend to run into a timeout and only go online on a second attempt. Yesterday another disk was without partition information for around 5-10 minutes before it corrected itself automatically.
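
    While a role move is in progress it might help to watch the CSV and repair state, roughly like this (a sketch):

    # Are the cluster shared volumes direct or running in redirected access?
    Get-ClusterSharedVolume | Get-ClusterSharedVolumeState |
        Format-Table Name, Node, StateInfo, VolumeFriendlyName

    # Health of the virtual disks and any repair/regeneration jobs still running
    Get-VirtualDisk | Format-Table FriendlyName, OperationalStatus, HealthStatus
    Get-StorageJob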

    Each time we change the master role we fear that another file system may get lost.

    Best regards

    Tobias

    Wednesday, July 3, 2019 12:38 PM