iSCSIprt errors with persistent disconnections

    Question

  • What we need:

    ==========

    Determine why we have persistent iScsiPrt errors in a Windows Server 2012 Hyper-V cluster.

      

    Environment:
    ==========

    4x identical Dell R720 hosts running Windows Server 2012 Standard

    4x - iSCSI NICs: Intel(R) Gigabit 4P I350-t Adapter - Driver 12.1.76.0 - FW Family 13.0.0

    MTU set to 9000

    Chimney disabled, auto-tuning disabled, congestion provider set to None

    All offload features are disabled in the iSCSI NIC properties

    No teaming configuration is in place

    RSS/TOE/TOEv2, Virtual Machine Queue (VMQ) and energy saving options are disabled

    2x PCT6224 - FW 3.3.5.5

    2 stack cables interconnecting them

    STP port-fast enabled on all Ethernet ports

    MTU set to 9216 on all Ethernet ports

    Flow control is active on all ports

    Speed auto-negotiated at 1000 Mbps full duplex

    Multicast, broadcast and unicast storm control disabled

    No errors logged on any Ethernet or stack ports

    3x EQL storage arrays

    2x EQL PS6100X + 1x PS4100E, all running firmware 6.0.4

    Current situation:

    ===========

    The customer is experiencing timeouts and disconnections:

    * All four hosts are logging the following System events every minute:

    ID 9 / Source iScsiPrt / Target did not respond in time for a SCSI request. The CDB is given in the dump data.

    ID 39 / Source iScsiPrt / Initiator sent a task management command to reset the target. The target name is given in the dump data.

    ID 129 / Source iScsiPrt / No Description

    These events show up only in the Windows Server 2012 environment; the issue was isolated and does not occur with Windows Server 2008 on the same SAN.
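
To check whether the three events really recur about once a minute, they can be tallied per minute from an exported System log. The following is only a minimal sketch in Python: it assumes the events have already been exported as (timestamp, event ID) pairs, and the function name, the timestamp format and the sample rows are illustrative assumptions, not data from this environment.

```python
from collections import Counter
from datetime import datetime

# Event IDs reported by iScsiPrt in this thread.
ISCSIPRT_IDS = {9, 39, 129}

def tally_per_minute(events):
    """Count iScsiPrt events per (minute, event ID).

    `events` is an iterable of (timestamp_string, event_id) pairs.
    The timestamp format is an assumption; adjust it to the actual export.
    """
    counts = Counter()
    for stamp, event_id in events:
        if event_id not in ISCSIPRT_IDS:
            continue  # ignore events from other sources
        minute = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(second=0)
        counts[(minute, event_id)] += 1
    return counts

# Made-up sample rows, for illustration only.
sample = [
    ("2013-07-04 20:01:05", 9),
    ("2013-07-04 20:01:06", 39),
    ("2013-07-04 20:01:07", 129),
    ("2013-07-04 20:02:05", 9),
    ("2013-07-04 20:02:30", 7036),  # unrelated event, ignored
]

for (minute, event_id), n in sorted(tally_per_minute(sample).items()):
    print(minute.strftime("%H:%M"), event_id, n)
```

A steady count of all three IDs in every minute would point to a periodic trigger rather than load-dependent timeouts.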

    Troubleshooting steps so far:

    ===================

    • Disabled Windows Firewall and made sure the non-SAN subnets are excluded in the HIT Kit
    • Made sure the latest drivers from Dell D&D are installed for the iSCSI NICs
    • Jumbo frame size set up correctly
    • Disabled TOE, RSS and Large Send Offload; flow control set up
    • Ran the following commands:
      netsh interface tcp set global autotuninglevel=disabled
      netsh int tcp set global chimney=disabled
      netsh interface tcp set global rss=disabled
    • Followed some practices from here: http://en.community.dell.com/support-forums/storage/f/3775/p/19480319/20326067.aspx
    • RSS/TOE/TOEv2, Virtual Machine Queue (VMQ) and energy saving options are disabled
    • All nodes are updated with the latest rollup update.
    • Teaming not configured.
    • Changed binding order on NICs
    • Disabled NetBIOS on the iSCSI NICs
    • Made sure that each NIC is NOT set to register its connection in DNS
    • Removed File and Printer Sharing and Client for Microsoft Networks
    • The following updates were applied:

    http://support.microsoft.com/kb/2791465/en-US ( kb 2779768 )
    http://support.microsoft.com/kb/2795944/en-US
    http://support.microsoft.com/kb/2822241/en-US
    http://support.microsoft.com/kb/2808584/en-US
    http://support.microsoft.com/?id=2838669
    http://support.microsoft.com/?id=2813630 (superseded by KB2838669)
    http://support.microsoft.com/?id=2796000 
    http://support.microsoft.com/?id=2795997 
    http://support.microsoft.com/?id=2795993 
    http://support.microsoft.com/kb/2838669 

    • Disabled TCP delayed ACK on the servers
    • Switches: STP port-fast enabled on all Ethernet ports + MTU set to 9216 on all Ethernet ports + flow control active on all ports + unicast storm control disabled
    • Captured iSCSI traffic with Wireshark
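
One way to verify the jumbo frame item end to end is a don't-fragment ping sized to the MTU. The sketch below, in Python, only does the arithmetic and builds the Windows `ping` command line. It assumes IPv4 without IP options (20-byte IP header, 8-byte ICMP header); the helper names and the target address are placeholders, not values from this environment.

```python
IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header

def max_ping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER - ICMP_HEADER

def jumbo_ping_command(target_ip, mtu=9000):
    """Build a Windows ping command: -f sets Don't Fragment, -l sets the payload size."""
    return f"ping -f -l {max_ping_payload(mtu)} {target_ip}"

print(max_ping_payload(9000))            # 8972
print(jumbo_ping_command("192.0.2.10"))  # placeholder address, replace with a SAN IP
```

If a ping of that size fails from a host to the array while a smaller one succeeds, some hop in the path is probably not passing 9000-byte frames.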

    Can anyone help us?

    VP


    Thursday, July 04, 2013 8:39 PM

Answers

All replies

  • Did you ever fix this? We have exactly the same issue, although our situation may be a little different.

    Two sites (with different subnets) combined in one Hyper-V 2012 R2 Failover Cluster.
    On each site 3 x R720 and 1 x PS6100X.
    All 6 servers are connected to both EQL group IPs.

    We don't get these 3 events when we create an individual Hyper-V Failover Cluster for each site.
    Wednesday, June 18, 2014 6:47 PM
  • Yes, we fixed it. We disabled the AV on both nodes. Try that.
    • Marked as answer by Walter Cesar Wednesday, June 18, 2014 7:20 PM
    Wednesday, June 18, 2014 7:20 PM
  • What's the AV??
    Wednesday, February 18, 2015 5:54 PM
  • Hello,

    I have three Hyper-V 2012 (not R2) nodes and I'm experiencing similar issues.

    I get iScsiPrt errors right after a storage migration of a big VM ends: I launch the storage migration, it starts correctly and, roughly 60 seconds after it ends, iScsiPrt errors appear in the event log of the node. I lose iSCSI connectivity with the iSCSI target on the storage where the VM was located before the migration. The migrated VM keeps working because it has already moved to the "new" storage, but all the CSVs on the "old" storage become inaccessible (I see them as offline, or online with no access), so all the VMs on the old storage go offline.

    I noticed that it doesn't happen every time I storage migrate a VM, only sometimes and with big VMs.

    I tried both MPIO and MCS, since our storage supports both. I also tuned some iSCSI parameters on the initiator; the list of parameters I tried is at the end of this message.

    The only things I haven't done yet are: disable RSS, disable TCP auto-tuning and disable delayed ACK.

    Can someone help me understand what is causing these problems? What is the AV you are referring to?

    These are the parameters I tuned in order to increase timeouts, but so far I haven't solved the problem:

    HKLM\SYSTEM\CurrentControlSet\Services\mpio\Parameters
    UseCustomPathRecoveryInterval 0 -> 1
    PDORemovePeriod 20 -> 120
    PathRecoveryInterval 40 -> 40

    HKLM\SYSTEM\CurrentControlSet\Control\Class\{4D36E97B-E325-11CE-BFC1-08002BE10318}\<Instance Number>\Parameters
    EnableNOPOut 0 -> 1
    MaxRequestHoldTime 60 -> 90
    LinkDownTime 15 -> 35
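
The old and new values listed above can be compared mechanically to see which timers actually moved. This is a small sketch in Python that uses only the numbers from the list; the dictionary layout and the helper names are illustrative assumptions, and nothing here reads or writes the registry.

```python
# (old, new) values exactly as listed in the post.
MPIO_PARAMS = {
    "UseCustomPathRecoveryInterval": (0, 1),
    "PDORemovePeriod": (20, 120),
    "PathRecoveryInterval": (40, 40),
}
ISCSI_PARAMS = {
    "EnableNOPOut": (0, 1),
    "MaxRequestHoldTime": (60, 90),
    "LinkDownTime": (15, 35),
}

def changed(params):
    """Return only the parameters whose new value differs from the old one."""
    return {name: pair for name, pair in params.items() if pair[0] != pair[1]}

def unchanged(params):
    """Return the names of parameters that were listed but not modified."""
    return [name for name, (old, new) in params.items() if old == new]

for label, params in (("mpio", MPIO_PARAMS), ("iSCSI initiator", ISCSI_PARAMS)):
    for name, (old, new) in changed(params).items():
        print(f"{label}: {name} {old} -> {new}")
    for name in unchanged(params):
        print(f"{label}: {name} left at its listed value")
```

With the values as posted, `PathRecoveryInterval` stays at 40, so enabling `UseCustomPathRecoveryInterval` does not by itself lengthen that interval.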

    Thanks in Advance,

    Davide

    Thursday, March 05, 2015 6:21 PM