Decreasing disk performance on several VMs on 2012r2 Hyper-V cluster

    Question

  • For a couple of months now, we have been having some really frustrating issues with decreasing disk performance on several VMs running on a Windows Server 2012 R2 Hyper-V cluster. We frequently notice that some of our VMs run into a state where disk performance gets incredibly slow. As soon as this happens to any of our VMs within the cluster, the affected VM always shows the same characteristics:

    1. The overall system performance drops and the VM gets quite unresponsive (e.g. while working via RDP on the VM). Usually the intended tasks of the affected system (e.g. running automated software build processes) begin to show massive delays.
    2. The resource monitor / disk section on the affected VM almost consistently shows 100 % highest active time, and disk response times for most processes are in the range of several hundred milliseconds (a counter snippet for sampling these values follows this list).
    3. Running disk benchmark tools like “CrystalDiskMark” on the affected VM results in very poor throughput values. Often we get some sequential read/write values below 50 MB/s. Sometimes the values are even below 10 MB/s.
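
    For reference, the latency values from point 2 can also be sampled with PowerShell inside the affected VM; a minimal sketch (counter paths assume an English-language OS):

        # Sample the same disk latency / queue counters that Resource Monitor shows,
        # every 5 seconds for one minute, inside the affected VM.
        Get-Counter -Counter @(
            "\LogicalDisk(_Total)\Avg. Disk sec/Read",
            "\LogicalDisk(_Total)\Avg. Disk sec/Write",
            "\LogicalDisk(_Total)\Current Disk Queue Length"
        ) -SampleInterval 5 -MaxSamples 12 |
            ForEach-Object { $_.CounterSamples | Select-Object Path, CookedValue }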

    As soon as one of our VMs runs into this state, the only known workaround to resolve the issue so far (at least temporarily) is to use the live migration feature of Hyper-V and move the affected VM to another node within the cluster. After the VM has been live migrated, disk performance improves instantly. The system becomes responsive again immediately, and the highest active time / disk response times (in resource monitor) begin to normalize. Another run of “CrystalDiskMark” then shows much better throughput values, between 750-1200 MB/s read and 600-900 MB/s write. If we live migrate the affected VM a second time (i.e. back to the original node), disk performance still stays fine at first glance. There may then be a few days without any disk performance issues until suddenly the exact same VM (or any other random VM) starts to show the same issues again. At this point we are forced to use our workaround and live migrate the affected VM(s) across our cluster again to restore disk performance. This procedure repeats over and over again.
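
    For reference, the live migration workaround itself can be scripted with the failover cluster cmdlets; a minimal sketch (the VM and node names below are placeholders, not our real resource names):

        # Live migrate a clustered VM to another node; run on a cluster node with the
        # FailoverClusters module available. Names are examples only.
        Import-Module FailoverClusters
        Move-ClusterVirtualMachineRole -Name "VM01" -Node "HV-NODE2" -MigrationType Live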

    As mentioned above, it can happen to any random VM. It is not always the same VM or the same set of VMs that are affected. In addition, there seems to be no dependency on which node the affected VM is currently running on or on which CSV the VHDX file of the VM currently resides. To sum it up: it can happen to any VM within the cluster, no matter which node it is running on and no matter which CSV it uses.

    Here is some additional information about our environment:

    • SAN-based failover cluster with five nodes
    • Servers: Dell PowerEdge R720
    • 10 Gigabit iSCSI and live migration networks with QLogic BCM57810 NICs and Netgear XS712T switches
    • Storage: Dell EqualLogic group with 3 members (1 x 6210XS-SSD, 1 x 6210X-SAS, 1 x 4110E-NLS)

    I already logged a support case with Dell, and the storage system has been checked in depth by the EqualLogic technicians. They were not able to find any performance-related issues with the storage itself. From the storage point of view, the workloads can be handled easily and there is still a lot of unused capacity. Additionally, the network configuration has been discussed with and approved by the Dell support team. As a result of the case, Dell finally referred us to Microsoft because they assume the cause of the disk performance issues has to be searched for on the Hyper-V side.

    Of course, we also tried several different measures in order to improve the situation:

    • Firmware and driver updates on server nodes / switches / storage
    • Testing different best practice recommendations for the network settings of the iSCSI network
    • Installation of recommended hotfixes and Windows updates for Server 2012 R2 clusters
    • Doing a lot of research on the internet

    So far, absolutely no luck.

    Here is another thread that discusses the same issues we are experiencing:

    https://social.technet.microsoft.com/Forums/windows/en-US/0ba851de-1030-4cf1-8bf0-e158b95df776/slow-vm-disk-performance-on-ws2012-hyperv-cluster?forum=winserverhyperv

    Apparently, that issue could be resolved by a hotfix, but the patch named there applies to Windows Server 2012 (non-R2). I could not find an equivalent hotfix for Server 2012 R2 based systems.

    Any assistance or advice on this topic is highly appreciated, as we really want to get back to a state where we don't have to deal with such annoying performance issues constantly.

    Thanks a lot in advance for any help!

    Regards
    Dominik


    • Edited by Soloplan_DB Wednesday, April 12, 2017 12:12 PM
    Wednesday, April 12, 2017 12:01 PM

Answers

  • Looks like we have finally resolved this issue for our environment.

    We discovered by accident that there must be some connection between the performance issues and the backup software used for backing up the VMs.

    That's why we opened a ticket with the backup vendor's support team, and they confirmed that there are some known issues.

    "The behaviour is indeed something which we have seldomly encountered and it would seem to be caused by the CBT driver. We are working on a permanent solution and are expecting a released later on this quarter. For now, stopping the CBT driver would be the only way to circumnavigate this issue..."

    After following the provided steps for disabling the CBT driver, the issues are gone completely.

    Similar issues in the context of the CBT driver are also described here:

    https://social.technet.microsoft.com/Forums/en-US/9fc80785-6f9d-4175-9d85-b7d380c90cdb/cbt-driver-causing-high-disk-queues-in-vms?forum=winserverhyperv

    • Marked as answer by Soloplan_DB Friday, November 24, 2017 11:02 AM
    Friday, November 24, 2017 11:01 AM

All replies

  • Hi Sir,

    Is the iSCSI storage using separate NICs and a separate subnet for the Hyper-V cluster nodes?

    Is it a two-node cluster?

    Does the issue happen on a cluster node even when that node is the owner of the CSV disk?

    Were other VMs running on the same host unaffected when the first problematic VM occurred?

    If you have any further information, please feel free to let us know.

    Best Regards,

    Elton


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Sunday, April 16, 2017 6:52 AM
    Moderator
  • Dear Elton,

    thanks for your feedback! Please find the answers to your questions below:

    >> Is the iSCSI storage using separate NICs and a separate subnet for the Hyper-V cluster nodes?
    Yes, we are using dedicated NICs and a dedicated subnet exclusively for iSCSI traffic between the nodes and the storage system.
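
    For completeness, we verify on each node that the iSCSI sessions really use the dedicated interfaces; roughly like this (a sketch, output fields may vary slightly by OS version):

        # List active iSCSI sessions and connections with the initiator-side addresses,
        # to confirm traffic goes over the dedicated iSCSI NICs / subnet.
        Get-IscsiSession    | Select-Object InitiatorNodeAddress, TargetNodeAddress, IsConnected
        Get-IscsiConnection | Select-Object InitiatorAddress, TargetAddress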

    >> Is it a two-node cluster?

    No, it is a cluster with five nodes.

    >> Does the issue happen on a cluster node even when that node is the owner of the CSV disk?

    Yes, there were definitely cases where the issue appeared for VMs running on a node that was the current owner of the respective CSV on which the VM files resided.
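
    (For reference, we compare the VM's current host with the CSV owner using something like the following; a minimal sketch:)

        # Show which node currently owns each Cluster Shared Volume,
        # so it can be compared with the host the slow VM is running on.
        Get-ClusterSharedVolume | Select-Object Name, OwnerNode, State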

    >> Were other VMs running on the same host unaffected when the first problematic VM occurred?

    It can be more than one VM, but it doesn't have to be. Today, for example, we had two problematic VMs at the same time running on the same host, while all remaining VMs on this host were fine. In most cases, however, it is just one problematic VM at a time. By the way, the VM files (VHDX, config etc.) of the two problematic VMs today reside on different CSVs, and even on different members of the storage group.

    Regards
    Dominik


    Tuesday, April 18, 2017 1:25 PM
  • Were the VMs exhibiting the issue created from scratch or were they physical to virtual conversions?

    tim

    Tuesday, April 18, 2017 1:30 PM
  • @Tim: We don't run any VMs that are physical-to-virtual conversions. However, some of our VMs are VMware-to-Hyper-V conversions that were migrated with MVMC earlier. The issue appears both on migrated VMs and on VMs that were created from scratch on Hyper-V.
    Wednesday, April 19, 2017 5:37 AM
  • Hi Soloplan,

    Thanks for answering the questions.

    Unfortunately, I have no idea / explanation for this odd behavior.

    If possible, I'd suggest you open a case with Microsoft:

    https://www.microsoft.com/en-us

    Best Regards,

    Elton


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Sunday, April 23, 2017 10:09 AM
    Moderator
  • We have been experiencing the very same problems as you have described here. VM performance is terrible. Disk queue lengths are high on the VM, but drop instantly once the VM is migrated to a new host. Have you had any success in resolving the issue? 

    I have spoken with our hardware vendor's professional services team, and the only thing they could come up with was changing RSS and Dynamic VMQ to stop using the first core from the processor pool. I have yet to find a maintenance window to make the changes.

    http://windowsitpro.com/hyper-v/why-you-skip-first-core-when-configuring-rss-and-vmq
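
    The change they described boils down to something like the following; a sketch only (adapter names, base core and processor counts are example values, not vendor guidance):

        # Move RSS and VMQ processing off the first physical core (core 0).
        # Adapter names and processor numbers depend on the host's NIC / CPU layout.
        Set-NetAdapterRss -Name "10G-Port1" -BaseProcessorNumber 2 -MaxProcessors 8
        Set-NetAdapterVmq -Name "10G-Port2" -BaseProcessorNumber 2 -MaxProcessors 8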

     

    Tuesday, May 16, 2017 3:33 PM
  • Do you happen to be running Sophos antivirus?  I've seen reports where they have an issue (and a patch) that causes this.

    tim

    Wednesday, May 17, 2017 2:20 AM
  • @Tim Cerling:

    No, we are not running Sophos antivirus in our environment.

    @hyperv_admin:

    No success in resolving this issue yet. I opened a support case with MS today. If the support team is able to help us resolve this issue, I will post the results here.

    Wednesday, May 17, 2017 10:40 AM
  • @Soloplan_DB

    Any updates on your work with Microsoft on this issue? 

    Thursday, June 01, 2017 8:46 PM
  • I have the same problem in two different environments with different hardware. Most of the VMs show high disk activity / queue length and then dropping IO. We have re-designed one environment from the SAN up with no change. We are currently in the process of a rolling upgrade of one environment to Hyper-V 2016, and it doesn't seem to be as much of a problem with 2016 hosts.

    I have a case open with Microsoft and will post an update when I hear something tonight/tomorrow.

    Tuesday, June 06, 2017 10:06 AM
  • @hyperv_admin:

    The case is still under investigation. So far, some detailed information about our environment has been collected and some general settings have been checked / verified. I hope to make some progress this week and will let you know as soon as I have results worth posting here.

    Wednesday, June 07, 2017 8:23 AM
  • @Soloplan_DB

    I have MS looking at my environment at the moment. We have an ESXi cluster that uses the same blade chassis and SAN, and it has no issues. In fact, I moved VMs and their workload from the Hyper-V environment to the ESXi environment and the problem went away for those VMs.

    The only way I have made some servers work is to have one disk per CSV, but that will never work long term!

    Wednesday, June 07, 2017 9:12 AM
  • Update: we have seen that it is only a problem with VMs in a 2012 R2 cluster using CSVs. We have reproduced the problem for Microsoft. I have also provided them with the link to this TechNet post and a number of other posts, which they are looking into at the moment.

    Hopefully I will have more in the next few hours.

    Thursday, June 08, 2017 5:05 AM
  • @Soloplan_DB

    @wizedkyle

    Any updates from Microsoft on your Hypervisor environments?

    Tuesday, June 20, 2017 1:03 PM
  • I received some feedback from MS with a "to do" list (e.g. disable antivirus on the affected VM), which will be worked through internally this week. No solution yet.
    Tuesday, June 20, 2017 3:17 PM
  • @hyperv_admin

    We have applied the latest June security rollup and have seen a considerable performance increase. We have emailed Microsoft to ask whether there is anything in that patch that could produce this effect.

    On another note, a rolling upgrade to a 2016 cluster seems to fix the problem as well.
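
    (For context, the final step of such a rolling upgrade, once all nodes run Windows Server 2016, is raising the cluster functional level; a sketch, and note that this step cannot be reversed:)

        # Run once after the last node has been upgraded to Windows Server 2016.
        Update-ClusterFunctionalLevel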

    However, this is all from my own investigation and changes; there is no official word from MS.

    Wednesday, June 28, 2017 3:23 AM
  • Thanks for sharing this information, wizedkyle!

    We applied the latest June updates on all cluster nodes yesterday.

    If someone else is searching for the respective patches, please find them here:

    https://support.microsoft.com/en-us/help/2920151/recommended-hotfixes-and-updates-for-windows-server-2012-r2-based-fail

    The following updates have been installed in our environment (a quick verification snippet follows the list):

    1. KB4022717 (instead of KB4022726, as the KB4022726 installer reports that the update does not apply to the system once KB4022717 is installed. Looking at the details of both KB numbers, you'll find the same description, so I guess they contain the same or similar fixes)
    2. KB3137728
    3. KB3145384
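
    As referenced above, a quick way to confirm these updates are present on every node is something like the following (a sketch; the node names are placeholders for our five hosts):

        # Check the listed KBs on all cluster nodes via PowerShell remoting.
        $nodes = "HV-NODE1","HV-NODE2","HV-NODE3","HV-NODE4","HV-NODE5"
        Invoke-Command -ComputerName $nodes -ScriptBlock {
            Get-HotFix -Id KB4022717, KB3137728, KB3145384
        } | Select-Object PSComputerName, HotFixID, InstalledOn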

    So far everything looks fine, but it is way too early to say whether the issues are finally resolved. I will observe the behaviour over the next few days and post further feedback here.

    Monday, July 31, 2017 11:39 AM
  • Looks like we have finally resolved this issue for our environment.

    We discovered by accident that there must be some connection between the performance issues and the backup software used for backing up the VMs.

    That's why we opened a ticket with the backup vendor's support team, and they confirmed that there are some known issues.

    "The behaviour is indeed something which we have seldomly encountered and it would seem to be caused by the CBT driver. We are working on a permanent solution and are expecting a released later on this quarter. For now, stopping the CBT driver would be the only way to circumnavigate this issue..."

    After following the provided steps for disabling the CBT driver, the issues are gone completely.
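
    For anyone applying the same workaround: follow the exact steps from the backup vendor. Purely as an illustration (the driver name below is a placeholder, not the real name from our case), a file system minifilter such as a CBT driver is typically unloaded and disabled along these lines:

        # <CbtDriverName> is a placeholder - use the name and steps from the vendor's instructions.
        $cbt = "<CbtDriverName>"
        fltmc.exe unload $cbt                 # unload the minifilter for the running session
        sc.exe config $cbt start= disabled    # prevent it from loading at the next boot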

    Similar issues in the context of the CBT driver are also described here:

    https://social.technet.microsoft.com/Forums/en-US/9fc80785-6f9d-4175-9d85-b7d380c90cdb/cbt-driver-causing-high-disk-queues-in-vms?forum=winserverhyperv

    • Marked as answer by Soloplan_DB Friday, November 24, 2017 11:02 AM
    Friday, November 24, 2017 11:01 AM