S2D disk performance problem - grinding to a halt.

    Question

  • Hi All,

    I've recently built a Windows Server 2016 S2D 4-node cluster and have run into major issues with disk performance:

    barely getting kb/s throughput (yep, kilo and a small b - dial-up modem speeds for disk access)

    VMs are unresponsive

    multiple other issues associated with disk access to the CSVs

    The hardware is all certified and built to Lenovo's most recent guidelines. Servers are ThinkSystem SR650, the networking is 100Gb/s with 2x Mellanox ConnectX-4 adapters per node and 2x Lenovo NE10032 switches, and there are 12x Intel SSDs and 2x Intel NVMe drives per node for the storage pool. RoCE/RDMA, DCB etc. are all configured as per the guidelines and verified (as far as I can diagnose). It should be absolutely flying along.
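
    For anyone following along, these are the sorts of per-node PowerShell checks used to verify that RDMA/DCB is actually in effect - a minimal sketch only; the priority value and adapter/policy names are assumptions and will differ per build:

        # Minimal per-node sanity checks (priority/names are assumptions, not this build's exact config)
        Get-NetAdapterRdma                 # RDMA should be enabled on the Mellanox pNICs and the SMB vNICs
        Get-NetQosFlowControl              # PFC should be enabled only on the SMB priority (commonly 3)
        Get-NetQosPolicy                   # an SMB Direct (TCP 445) policy should map SMB to that priority
        Get-SmbClientNetworkInterface      # the SMB interfaces should show 'RDMA Capable : True'
        Get-SmbMultichannelConnection      # shows whether live SMB connections are using RDMA-capable interfaces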

    I should point out that it was working OK (though with no thorough testing done) for approx. 1 week. The VMs (about 10 or so) were running fine, and any file transfers that were performed were limited by the Gb/s connectivity to the file share source (on older equipment served by a 10Gb/s switch uplink and 1Gb/s NIC connections at the source).

    About 3pm yesterday I decided to configure Cluster Aware Updating, and this may or may not have been a factor. The servers were already fully patched with the exception of 2 updates: KB4284833 and a definition update for Defender. These were installed and a manual reboot performed one node at a time. Ever since, I've had blue screens, nodes/pools/CSVs failing over and almost non-existent disk throughput. There are no other significant errors in the event logs; there have been cluster alerts as things go down, but nothing that has led to a google/bing search for a solution. The immediate thought is going to be "it was KB4284833 what done it", but I'm not certain that is the cause.

    Interestingly - when doing a file copy to/from the CSV volumes there is an initial spurt of disk throughput (but nowhere near as fast as it should be - say up to 100MB/s, but it could equally be as low as 7MB/s) and then it dies off to kB/s and effectively 0. So it looks like there is some sort of cache that is working to some extent and then nothing.

    I've been doing a lot of research for the past 24 hours or so - no smoking guns. I did find someone with similar issues that were traced back to the power plan settings - I've since set these to High Performance (rather than the default Balanced) but have seen no change (might be worth another reboot to double-check this though - will do that shortly).
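
    For reference, the plan can be confirmed and switched from an elevated prompt (SCHEME_MIN is the built-in alias for the High Performance plan):

        powercfg /getactivescheme          # confirm which plan is currently active
        powercfg /setactive SCHEME_MIN     # switch to the High Performance plan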

    Any suggestions or similar experience? 

    Thanks for any help.

    Tuesday, July 3, 2018 11:13

All replies

  • Hi,
    1. Please try to run the Cluster Validation Wizard during a maintenance window and check whether it reports any errors.
    2. Please try to uninstall the updates that were installed before the issue occurred.
    3. After uninstalling the updates, please run the PowerShell command "Repair-ClusterStorageSpacesDirect" and check whether that helps (a rough sketch of these steps follows below).
    https://docs.microsoft.com/en-us/powershell/module/failoverclusters/repair-clusterstoragespacesdirect?view=win10-ps
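
    In PowerShell, those steps look roughly like the following (the node names and the /quiet switches are illustrative only; note that Test-Cluster's storage tests can take resources offline, so run it in a maintenance window):

        # 1. Validate the cluster (node names below are placeholders)
        Test-Cluster -Node Node1,Node2,Node3,Node4

        # 2. Remove the suspect update from each node, then reboot it
        wusa /uninstall /kb:4284833 /quiet /norestart

        # 3. Ask the cluster to repair the Storage Spaces Direct storage subsystem
        Repair-ClusterStorageSpacesDirect
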
    Thanks for your time! If you have any concerns or questions, please feel free to let me know.
    Best Regards,

    Frank



    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com

    Wednesday, July 4, 2018 09:11
    Moderator
  • Thanks Frank,

    I've been working on this non-stop - it has become critical (one of the VMs is running a production RDS licensing workload, and because the disk speed is so diabolical I can't migrate it to somewhere that works. It was crawling along but still working, until the cluster decided to put that VM into a state where the virtual machine configuration was going offline but hung - long story, and 2 hours of users not being able to log in - but I eventually got it back after powering off the node that hosted that role).

    Cluster validation runs fine (a couple of warnings due to VMs being offline and Windows Defender definition updates not being in sync).

    Some progress today - lots of PowerShell testing of RDMA/RoCE, diskspd etc. The event logs for FailoverClustering-StorageBusClient have regular (several times a minute) entries for PathExceededLatencyLimit and DeviceExceededLatencyLimit with latency at 20,000+ ms. So something is obviously wrong with access to the storage.
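
    For anyone chasing the same events, they can be pulled out with Get-WinEvent; the exact channel name below is an assumption, so discover it first:

        # Discover the exact StorageBusClient log name, then filter for the latency events
        Get-WinEvent -ListLog *StorageBusClient*
        Get-WinEvent -LogName 'Microsoft-Windows-FailoverClustering-StorageBusClient/Operational' -MaxEvents 200 |
            Where-Object { $_.Message -match 'ExceededLatencyLimit' } |
            Select-Object TimeCreated, Id, Message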

    Copying files to/from the CSV volumes is unusable - a host that owns the CSV trying to copy from C:\ClusterStorage\Volume 1 to C:\Temp is getting effectively 0kB/s throughput (it ranges from 0 to 7kB/s or so) - and these drives are all SSD with NVMe for cache.
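
    File copies are a poor benchmark, so a diskspd run against the CSV gives a more repeatable baseline - something along these lines (the path, file size and I/O pattern are just examples):

        # 4K random reads, 8 threads, 32 outstanding I/Os, 60 seconds, caching disabled, latency stats
        diskspd.exe -b4K -d60 -o32 -t8 -r -Sh -L -c10G C:\ClusterStorage\Volume1\diskspd-test.dat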

    RDMA/RoCE validation tests are all OK - so switch and network config looks alright (but...).  

    In order to try and narrow down the possible areas to troubleshoot, I disabled the 2nd cluster network and it all came back to life. (For context: there are 2 physical 100Gb/s NICs per server, and a SET-enabled Hyper-V switch with 2 vNICs per server (SMB1 and SMB2), each on separate VLANs (100 and 102). The cluster was configured to use both SMB1 and SMB2 for Cluster and Client traffic.) I switched SMB2 to "Do not allow cluster communications on this network" and everything instantly started responding normally (when I say normally - it was working fine for a week or so after the initial build).
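
    That change can be made from PowerShell as well as from Failover Cluster Manager - a minimal sketch, assuming the cluster network is actually named SMB2:

        # Role values: 0 = do not allow cluster communication, 1 = cluster only, 3 = cluster and client
        Get-ClusterNetwork | Format-Table Name, Role, Address
        (Get-ClusterNetwork 'SMB2').Role = 0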

    So something is screwed up with the cluster communication when 2 networks are in use. Each of the nodes can communicate fine with the other nodes on both networks. RoCE has been tested on both networks and checks out fine. I've followed both the Lenovo guide and "Windows Server 2016 Converged NIC and Guest RDMA Deployment: A Step-by-Step Guide" in building it, so I don't think I've done anything too crazy. And it was working fine until I enabled CAU and applied the KB update. I'm not sure either of those is to blame - they just happened to be applied before the servers rebooted and the issue started.

    I can now move the critical loads off the cluster and will do further testing before moving anything back. The single SET network is still redundant in case of failure - just not getting the full goodness of 2x 100Gb/s.

    Cheers 

      

    Wednesday, July 4, 2018 11:25
  • Hi,

    Thank you for sharing the detailed troubleshooting process with us.

    Based on my understanding, after configuring the SMB2 network as "Do not allow cluster communications on this network", everything goes back to normal. For a cluster storage network, it's recommended to configure "Do not allow cluster communications on this network". Here is an article for your reference:

    https://blogs.technet.microsoft.com/askcore/2014/02/19/configuring-windows-failover-cluster-networks/

    Thanks for your time! If you make any new progress on this issue, you're welcome to share it with us. Thanks again.

    Best Regards,
    Frank


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com

    Thursday, July 5, 2018 06:28
    Moderator
  • Thanks Frank - but not quite answered/solved.

    Having a second network - either on the same VLAN (and IP range) or a separate VLAN - should work according to the documentation and configuration guides.

    I've made a few changes and at the moment it is working with both the SMB1 and SMB2 networks enabled for Cluster and Client communication. Though I'm a little concerned about long-term stability.

    I'm not 100% sure exactly what change I made that got it working but here are the main points in case someone else comes across this:

    1. I switched the 2nd network to "do not allow cluster..." and that brought things back to life.

    2. Not happy with the fact that half the available throughput was switched off, and since the documentation all suggests it should be working, I thought I'd have another crack (with critical workloads now migrated back to somewhere more stable).

    3. I reconfigured the SMB2 network - changed all the vNIC IPs and set the VLAN to the same network as SMB1 (Set-VMNetworkAdapterVlan -VMNetworkAdapterName SMB2 -VlanId 100 -Access -ManagementOS).

    4. That immediately caused disk issues (DeviceExceededLatencyLimit) picked up in the FailoverClustering-StorageBusClient log. So I changed it all back (quickly!).

    5. So it's back to VLAN 100 (SMB1) and VLAN 102 (SMB2) on 10.x.0.y and 10.x.2.y respectively. There were still disk performance issues as the cluster picked this change up again, and I needed to set SMB2 back to "do not allow ...".

    6. So the suspicion is that there is some sort of network communication issue between the 2 vNICs involved. The pNICs appear fine, as I can see the traffic flowing fairly evenly over them with the one cluster network enabled. I can ping quite happily between all vNIC IP addresses, so it's not just a simple config issue. RoCE testing using either vNIC is successful.

    7. Next thought was jumbo frames - on further investigation this is enabled at the pNIC and physical switch level, but I can't ping between nodes from SMB1 to SMB2 using "-l 8000 -f", so the vNICs are not happy. I enabled jumbo frames on each of the vNICs (see the sketch after this list) and then re-enabled the SMB2 network for "Cluster and Client" communication, and it's currently happy.

    8. So the problem is hopefully solved - but I haven't rebooted any of the nodes yet, I haven't re-enabled CAU, and I'm quite trepidatious about the long-term stability - very reluctant to move any production roles over until I can confidently explain why it was playing up.
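
    For reference, the vNIC jumbo frame change and the test from point 7 look roughly like this (the vEthernet names and the 9014 value are assumptions - check what the driver actually exposes):

        # Enable jumbo frames on the host vNICs (run per node)
        Set-NetAdapterAdvancedProperty -Name 'vEthernet (SMB1)' -RegistryKeyword '*JumboPacket' -RegistryValue 9014
        Set-NetAdapterAdvancedProperty -Name 'vEthernet (SMB2)' -RegistryKeyword '*JumboPacket' -RegistryValue 9014

        # An 8000-byte don't-fragment ping to a peer node's SMB2 address should now succeed
        ping 10.x.2.y -f -l 8000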

    If anyone else has had similar experiences or could reproduce the issue I'd love to hear about it.

    Cheers 

    Friday, July 6, 2018 12:14
  • Hi,

    Appreciate your sharing and support.

    Best Regards,
    Frank


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com

    Monday, July 9, 2018 02:56
    Moderator
  • Hopefully one last update...

    I've tracked down the issue to one of the switch uplinks in the vLAG not transmitting jumbo packets, even though it is configured to do so:

    RX

        2662363520 unicast packets  24649803 multicast packets  22867958 broadcast packets

        2712758319 input packets  2917938345400 bytes

        2876904 jumbo packets  0 storm suppression packets

        0 giants  2877018 input error  0 short frame  0 overrun  0 underrun

        0 watchdog  13440 if down drop

        0 input with dribble  13440 input discard(includes ACL drops)

        0 Rx pause

      TX

        1965281563 unicast packets  1564500 multicast packets  524976 broadcast packets

        1967371055 output packets  1267488164177 bytes

        0 jumbo packets

        0 output errors  0 collision  0 deferred  0 late collision

        0 lost carrier  0 no carrier  0 babble

        0 Tx pause

    It took a long time to find this, as everything is configured in a redundant way and I was focused on the S2D cluster side of the configuration. Each of the nodes could ping the others fine with jumbo packets - so I had assumed that the switch side of things was all working. Only when I was getting network issues on one of the test VMs did I see that it could ping two of the hosts OK but not the other 2 when using jumbo packets (but fine with a standard MTU), so I started investigating each and every possible path through the various network components.
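
    A quick way to sweep every jumbo path at once is a small loop like the one below (the addresses are placeholders standing in for this build's 10.x.0.y / 10.x.2.y SMB subnets):

        # Don't-fragment 8000-byte pings to every SMB address from this node; 'TTL=' in the reply means success
        $targets = '10.x.0.1','10.x.0.2','10.x.0.3','10.x.0.4','10.x.2.1','10.x.2.2','10.x.2.3','10.x.2.4'
        foreach ($t in $targets) {
            $ok = [bool](ping $t -f -l 8000 -n 2 | Select-String 'TTL=')
            '{0,-12} jumbo ping: {1}' -f $t, $(if ($ok) { 'OK' } else { 'FAILED' })
        }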

    Now I just need to sort out why that particular port doesn't want to play with jumbo packets.

    Cheers

    Monday, July 9, 2018 04:09