SAN HPE SV3200 iScsiPrt errors crashing VMs and, after a cascade, also failover cluster nodes?

  • Question

  • We have a three-node Windows Server 2012 R2 failover cluster that has been running spotlessly for years with the HPE P4300 SAN, but after adding the HPE StoreVirtual SV3200 as a new SAN we are seeing iScsiPrt errors that HPE Support cannot fix, crashing VMs and, in one cascade, two of the three failover nodes.

    At first everything seemed to work, but after adding additional disks on the SAN a SAN controller crashed. It has been replaced under warranty, but now that we are moving our servers, and especially the SQL 2008 servers, to the new SAN, problems are starting to occur. The VHDX volumes of the SQL servers are thin provisioned.
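    Whether a given VHDX is thin (dynamically expanding) or fixed can be checked with the Hyper-V Get-VHD cmdlet; a minimal sketch, where the path is only a placeholder:

        # Show whether a VHDX is dynamic (thin) or fixed, and its current vs. maximum size
        Get-VHD -Path 'C:\ClusterStorage\Volume4\SQL01\SQL01.vhdx' |
            Format-List VhdFormat, VhdType, FileSize, Size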

    Live storage migration worked fine for the non-SQL servers. Some SQL servers froze and halted operation during the move, so we had to perform an offline move for those. Then, during high disk I/O and especially during backups, the W2012 R2 failover cluster started to behave erratically, eventually crashing VMs and in one instance rebooting two failover nodes, as a result of a flood of iScsiPrt errors in the event log:

    System iScsiPrt event ID 27 error Initiator could not find a match for the initiator task tag in the received PDU. Dump data contains the entire iSCSI header.
    System iScsiPrt event 129 warning The description for Event ID 129 from source iScsiPrt cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

    If the event originated on another computer, the display information had to be saved with the event.

    The following information was included with the event:

    \Device\RaidPort4

    the message resource is present but the message is not found in the string/message table

    System iScsiPrt event ID 39 error Initiator sent a task management command to reset the target. The target name is given in the dump data.
    System iScsiPrt event ID 9 error Target did not respond in time for a SCSI request. The CDB is given in the dump data.
    System iScsiPrt event 129 warning The description for Event ID 129 from source iScsiPrt cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

    If the event originated on another computer, the display information had to be saved with the event.

    The following information was included with the event:

    \Device\RaidPort4

    the message resource is present but the message is not found in the string/message table
    System iScsiPrt event ID 27 error Initiator could not find a match for the initiator task tag in the received PDU. Dump data contains the entire iSCSI header.
    System FailOverClustering event id 5121 Information Cluster Shared Volume 'Volume4' ('NEMCL01_CSV04') is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network to the node that owns the volume. If this results in degraded performance, please troubleshoot this node's connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished.

    After a two-hour period of these events the failover clustering services started to give errors, VMs failed, and finally two of the three failover cluster nodes rebooted because of a crash.

    So far HPE has not been able to fix this. The SV3200 logs show occasional iSCSI controller errors, but the error logging in the SVMC is minimal.

    HPE support first blamed the use of a VIP and of Sites (a label), although both are supported according to the HPE product documentation. These have been removed and the iSCSI initiator now targets the Eth0 bond IP addresses directly. As the problems persist, they then claimed we are using the LeftHand DSM MPIO driver on the initiator connections to the SV3200, which is not the case: those sessions use the standard Microsoft DSM with the Round Robin with Subset policy. Yes, the LeftHand driver is present on the system for our old SAN, but it is not configured for the SV3200 initiator sessions.
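    For reference, which DSM owns the SV3200 paths and which load-balance policy is active can be checked on each node with the built-in MPIO tooling. A minimal sketch; the disk number in the last command is only an example:

        # List the vendor/product IDs currently claimed by the Microsoft DSM
        Get-MSDSMSupportedHW

        # Summarize all MPIO disks with the owning DSM and the load-balance policy
        mpclaim.exe -s -d

        # Show the individual paths and the policy for one MPIO disk (example number)
        mpclaim.exe -s -d 6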

    We are currently facing a legal warranty standoff.

    Any pointers, or other comparable experiences with the HPE StoreVirtual SV3200 SAN?

    TIA,

    Fred

    Wednesday, January 9, 2019 9:37 AM

All replies

  • Yesterday I started an HPE remote support session at 14:30, which ended in the scenario described before. Although we were seeing occasional iScsiPrt and MPIO errors in the Windows event log, iSCSI links going down, and a storage controller failing over, everything was still OK according to HPE when we performed a live move of a SQL server.

    Then the cascade effect started at around 20:57: the SV3200 SAN produced iScsiPrt errors 129, 39 and 9 and disk error 153, which led to:

    Critical error FailOverClustering 1146  The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.

    Information Iphlpsvc 4201 Isatap interface isatap.{74343F0E-B0A7-4713-AF86-1F56E7636E5D} is no longer active. 

    At this point the affected Cluster Shared Volume CSV03 is no longer present in Failover Cluster Manager. CSV01 and CSV02 are on our operational, older P4300 SAN.

    This I understand. But what I do not understand is that this situation can create a runaway effect in which all cluster resources are no longer visible in Failover Cluster Manager! It starts with the disk from the affected SAN, then the disks from the old SAN, then of course the VMs, then the nodes, and then the cluster itself. A f*#@$g nightmare.

    Shouldn't Windows be hardened against disk failures?

    The only way to fix this is to reboot all cluster nodes, after which everything is present again, as long as there is no further traffic to the affected SAN.
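    Even when Failover Cluster Manager shows nothing, the cluster and CSV state can usually still be queried from PowerShell before resorting to a reboot. A minimal sketch, assuming the FailoverClusters module is present on the node:

        Import-Module FailoverClusters

        # List all cluster resources and their current state, even if the GUI is empty
        Get-ClusterResource | Sort-Object State |
            Format-Table Name, ResourceType, OwnerNode, State

        # Show per node whether I/O to each CSV is direct or redirected, and why
        Get-ClusterSharedVolumeState |
            Format-Table Name, Node, StateInfo, FileSystemRedirectedIOReason, BlockRedirectedIOReason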

    Event log from SVR03, which is a standby node without load:

    Information 9-1-2019 21:01:29 Service Control Manager 7036 None

    The Cluster Service service entered the stopped state.

    Error 9-1-2019 21:01:29 Service Control Manager None

    The Cluster Service service terminated with the following service-specific error: 
    The semaphore timeout period has expired.

    Error 9-1-2019 21:01:29 Service Control Manager 7031 None

    The Cluster Service service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 60000 milliseconds: Restart the service.

    Information 9-1-2019 21:01:29 Service Control Manager 7036 None

    The SMB Witness service entered the stopped state.

    Warning 9-1-2019 21:01:30 disk 153 None

    The IO operation at logical block address 0x4b8cb657 for Disk 1 (PDO name: \Device\MPIODisk6) was retried.

    Information 9-1-2019 21:02:12 WMI 5605 None

    The root\mscluster namespace is marked with the RequiresEncryption flag. Access to this namespace might be denied if the script or application does not have the appropriate authentication level. Change the authentication level to Pkt_Privacy and run the script or application again.

    Error 9-1-2019 21:02:19 Foundation Agents 1172 Events

    Cluster Agent: The cluster service on SVR03 has failed. 
    [SNMP TRAP: 15004 in CPQCLUS.MIB]

    Information 9-1-2019 21:02:30 Kernel-General 16 None

    The access history in hive \??\C:\Windows\Cluster\CLUSDB was cleared updating 174 keys and creating 23 modified pages.

    Information 9-1-2019 21:02:30 Service Control Manager 7036 None

    The Cluster Service service entered the running state.

    Information 9-1-2019 21:02:51 Service Control Manager 7036 None

    The WinHTTP Web Proxy Auto-Discovery Service service entered the running state.

    Information 9-1-2019 21:02:52 DFSR 9112 None

    The description for Event ID 9112 from source DFSR cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.

    If the event originated on another computer, the display information had to be saved with the event.

    The following information was included with the event: 

    Information 9-1-2019 21:02:54 Service Control Manager 7036 None

    The Volume Shadow Copy service entered the running state.

    Warning 9-1-2019 21:03:26 Ntfs (Microsoft-Windows-Ntfs) 140 None

    The system failed to flush data to the transaction log. Corruption may occur in VolumeId: CL01_CSV03, DeviceName: \Device\HarddiskVolume20.
    (The I/O device reported an I/O error.)

    Warning 9-1-2019 21:03:26 Ntfs (Microsoft-Windows-Ntfs) 140 None

    The system failed to flush data to the transaction log. Corruption may occur in VolumeId: CL01_CSV03, DeviceName: \Device\HarddiskVolume20.
    (A device which does not exist was specified.)

    Information 9-1-2019 21:03:26 Service Control Manager 7036 None

    The Device Setup Manager service entered the running state.

    Information 9-1-2019 21:03:26 Iphlpsvc 4200 None

    Isatap interface isatap.{74343F0E-B0A7-4713-AF86-1F56E7636E5D} with address fe80::5efe:169.254.1.77 has been brought up.

    Error 9-1-2019 21:04:19 Foundation Agents 1172 Events

    Cluster Agent: The cluster service on SVR02 has failed. 
    [SNMP TRAP: 15004 in CPQCLUS.MIB]

    Information 9-1-2019 21:06:28 Windows Error Reporting 1001 None

    Fault bucket , type 0
    Event Name: Failover clustering resource deadlock
    Response: Not available
    Cab Id: 0

    Problem signature:
    P1: CL01_CSV03
    P2: Physical Disk
    P3: ONLINERESOURCE
    P4: 
    P5: 
    P6: 
    P7: 
    P8: 
    P9: 
    P10: 

    Attached files:
    C:\ProgramData\Microsoft\Windows\WER\ReportQueue\Critical_CL01_CSV03_48da52a59bdfa677c412bca6717bdf5e31e77433_00000000_cab_637d3b8c\memory.hdmp
    C:\ProgramData\Microsoft\Windows\WER\ReportQueue\Critical_CL01_CSV03_48da52a59bdfa677c412bca6717bdf5e31e77433_00000000_cab_637d3b8c\minidump.mdmp

    These files may be available here:
    C:\ProgramData\Microsoft\Windows\WER\ReportQueue\Critical_CL01_CSV03_48da52a59bdfa677c412bca6717bdf5e31e77433_00000000_cab_637d3b8c

    Analysis symbol: 
    Rechecking for solution: 0
    Report Id: 0bbd79b3-144a-11e9-80c2-fa72e20e28e6
    Report Status: 4
    Hashed bucket: 

    Warning 9-1-2019 21:09:05 Foundation Agents 1167 Events

    Cluster Agent: The cluster resource NEMCL01_WitnessDisk has become degraded. 
    [SNMP TRAP: 15005 in CPQCLUS.MIB]

    Error 9-1-2019 21:09:05 FailoverClustering 1230 Resource Control Manager

    A component on the server did not respond in a timely fashion. This caused the cluster resource 'CL01_CSV03' (resource type 'Physical Disk', DLL 'clusres.dll') to exceed its time-out threshold. As part of cluster health detection, recovery actions will be taken. The cluster will try to automatically recover by terminating and restarting the Resource Hosting Subsystem (RHS) process that is running this resource. Verify that the underlying infrastructure (such as storage, networking, or services) that are associated with the resource are functioning correctly.

    Error 9-1-2019 21:11:20 Foundation Agents 1168 Events

    Cluster Agent: The cluster resource CL01_CSV02 has failed. 
    [SNMP TRAP: 15006 in CPQCLUS.MIB]

    Critical 9-1-2019 21:11:20 FailoverClustering 1146 Resource Control Manager

    The cluster Resource Hosting Subsystem (RHS) process was terminated and will be restarted. This is typically associated with cluster health detection and recovery of a resource. Refer to the System event log to determine which resource and resource DLL is causing the issue.

    Error 9-1-2019 21:11:20 FailoverClustering 1069 Resource Control Manager

    Cluster resource 'CL01_CSV02' of type 'Physical Disk' in clustered role '3440da38-b8d8-4bb2-887f-4214602be999' failed.

    Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it.  Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

    Warning 9-1-2019 21:11:31 Ntfs (Microsoft-Windows-Ntfs) 140 None

    The system failed to flush data to the transaction log. Corruption may occur in VolumeId: CL01_CSV02, DeviceName: \Device\HarddiskVolume24.
    (A device which does not exist was specified.)





    Thursday, January 10, 2019 12:49 PM
  • Hi,
    Based on the complexity and the specifics of the situation, we need to do more research. If we have any updates or thoughts about this issue, we will keep you posted as soon as possible. Your kind understanding is appreciated. If you have further information during this period, you could post it on the forum, which will help us understand and analyze this issue comprehensively.
    Sorry for the inconvenience and thank you for your understanding and patience.
    Best Regards,

    Frank

    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com

    Friday, January 11, 2019 8:50 AM
    Moderator

  • We are still in a dispute with HPE over the StoreVirtual 3200 issues. Basically they do not want to admit the issue, requiring the latest HPE server firmware and Windows updates to be installed and the switch buffers to be increased by lowering the number of QoS queues on the switches from 8 to 2. We are in the process of doing so and testing whether the issue still persists.

    Regarding the runaway process in Windows 2012: could iScsiPrt and MPIO errors cause a runaway process that crashes VMs, failover cluster nodes and the failover cluster itself due to redirected I/O?

    Redirected I/O Uses SMB 3.0

    In W2008 R2, redirected I/O traffic passed over the cluster network with the lowest metric. That changed in WS2012. WS2012 uses SMB 3.0 for redirected I/O. This gives redirected I/O the best possible performance thanks to SMB Direct (if RDMA-capable NICs are used in the cluster) and, importantly, via SMB Multichannel.

    SMB Multichannel is potentially going to use any NIC it can find between the non-CSV owners and the CSV owner, flooding the network with unmanaged redirected I/O traffic. You can control which NICs are used by SMB Multichannel with the New-SmbMultichannelConstraint PowerShell cmdlet. Typically (though this depends on your network design) you will limit SMB Multichannel to the cluster's private networks.

    https://www.petri.com/redirected-io-windows-server-2012r2-cluster-shared-volumes
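    A minimal sketch of such a constraint, assuming node names SVR01-SVR03 and an interface alias of "Cluster" for the NIC that should carry redirected CSV traffic (both are placeholders, not our actual configuration):

        $nodes  = 'SVR01','SVR02','SVR03'   # placeholder node names
        $csvNic = 'Cluster'                 # placeholder interface alias of the cluster/CSV NIC

        # On each node, constrain SMB Multichannel traffic towards the other nodes to that NIC
        foreach ($server in ($nodes | Where-Object { $_ -ne $env:COMPUTERNAME })) {
            New-SmbMultichannelConstraint -ServerName $server -InterfaceAlias $csvNic
        }

        # Verify the constraints and which connections SMB Multichannel actually uses
        Get-SmbMultichannelConstraint
        Get-SmbMultichannelConnection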


    Monday, March 25, 2019 4:59 PM
  • Hi,

    >>could iScsiPrt and MPIO errors cause a runaway process that crashes VMs, failover cluster nodes and the failover cluster itself due to redirected I/O?

    In general, these errors alone should not cause VMs or nodes to crash.

    When the current coordinator node loses its connection to storage, the cluster selects the next node that still has a connection to storage as the new coordinator node. The old node then routes its traffic via redirected I/O (file-system or block-level redirected I/O); a small PowerShell sketch for checking and moving the coordinator follows after the link below.

    There is a related article for you, please refer to it.

    http://ramprasadtech.com/cluster-shared-volume/ 

    Please Note: Since the web site is not hosted by Microsoft, the link may change without notice. Microsoft does not guarantee the accuracy of this information.
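    As a rough illustration only (not a fix), the current coordinator (owner) node of each CSV can be checked, and moved, from PowerShell. The volume and node names below are examples taken from this thread and should be adjusted:

        Import-Module FailoverClusters

        # Show which node currently owns (coordinates) each Cluster Shared Volume
        Get-ClusterSharedVolume | Format-Table Name, OwnerNode, State

        # Move coordination of a CSV to another node (example names)
        Move-ClusterSharedVolume -Name "CL01_CSV03" -Node "SVR03"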

    Please understand, to solidly troubleshoot the root cause, we generally need to debug the crash dump files. Unfortunately, debugging is beyond what we can do in the forum. We can only provide some general suggestions here. 
    If the issue still occurs, a support call to our product service team is needed for the debugging service. We'd like to recommend that you contact Microsoft Customer Support Service (CSS) for assistance so that this problem can be resolved efficiently. To obtain the phone numbers for a specific technology request, please take a look at the web site listed below:

     https://support.microsoft.com/en-us/gp/customer-service-phone-numbers  

    Best Regards,

    Frank



    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com


    Wednesday, March 27, 2019 6:53 AM
    Moderator
  • For those interested in the cause: we were having massive port drops on the HP switch ports connected to the SV3200. Under a normal workload things would still function, but slowly. The HP switches were configured correctly; according to HP our servers were simply old (DL380 G7) and running W2008 R2. That configuration was supported, but it was also used as an excuse.

    During the nightly backup, or during a live migration of storage, the iSCSI ports in Windows 2008 R2 would come to a grinding halt because of the drops.

    We placed a separate switch (Huawei) just for the iSCSI network and no longer saw any port drops.
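    For anyone checking for the same symptom from the Windows side: on Windows Server 2012 or later, the host-level discard and error counters of the iSCSI-facing NICs can be read as sketched below (the interface name filter is only a placeholder and must match your own NIC naming):

        # Steadily rising discard/error counters point at drops somewhere on the iSCSI path
        Get-NetAdapterStatistics -Name "*iSCSI*" |
            Format-Table Name, ReceivedDiscardedPackets, ReceivedPacketErrors,
                         OutboundDiscardedPackets, OutboundPacketErrors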

    Wednesday, September 25, 2019 2:21 PM