Poor disk performance with Dell Compellent SAN, VM sees high queue but Hypervisor does not

    Question

  • Hi,

    We are migrating from another SAN to Dell Compellent. Now, some VMs have REALLY poor disk performance when their VHDX files are stored on Compellent. Of course this looks like a SAN issue, not a Hyper-V issue. But strangely enough, performance from the Hyper-V host against the Compellent volumes is fine and consistent - a benchmark with 512-byte blocks always returns something between 2100 and 2400 MB/s.

    The same benchmark on VMs, however, sometimes returns the same values but sometimes only 30 or 35 MB/s (which is horrible). Some VMs sit at almost constant 100 % disk utilization. Sometimes virtual servers take 20 minutes to boot (instead of 1 minute) - but only sometimes; at other times the VMs perform perfectly fine.
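
    For reference, a comparable test can be reproduced with DiskSpd (the tool, file paths and parameters here are only an illustrative sketch, not necessarily what was used), running the identical command once on the host against the CSV and once inside a guest:

        # 512-byte blocks, 30 s, 4 threads, 32 outstanding I/Os per thread,
        # 100 % reads, software/hardware caching disabled (-Sh), 1 GiB test file
        .\diskspd.exe -b512 -d30 -t4 -o32 -w0 -Sh -c1G C:\ClusterStorage\Volume1\disktest.dat

        # same parameters from inside the VM against its C: drive
        .\diskspd.exe -b512 -d30 -t4 -o32 -w0 -Sh -c1G C:\disktest.dat

    With identical parameters on both sides, any difference in throughput and latency percentiles comes from the virtualization/storage path rather than from the benchmark itself.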

    I was especially wondering about the disk queue. I used perfmon to monitor:

    • Average disk queue length for C: drive of a VM (red)
    • Queue length for Virtual Storage Device for the corresponding VHDX file on Hypervisor (green)

    I wonder how the disk queue length for SERVER's C: drive can be around 20 while the Hypervisor says the queue length for SERVER-C-VHDX is 0.1. Neither the Hypervisor nor the SAN sees any high latency or a queue higher than around 0.5, yet some VMs have absolutely horrible performance.
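
    For anyone watching the same counters, both can also be sampled from PowerShell with Get-Counter (the instance name for the virtual storage device is a placeholder and has to be adjusted to the VHDX in question):

        # inside the guest: queue length of the C: volume
        Get-Counter '\LogicalDisk(C:)\Avg. Disk Queue Length' -SampleInterval 5 -MaxSamples 12

        # on the Hyper-V host: queue length of the matching virtual storage device
        # (instance names contain the VHDX path, so a wildcard on the file name works)
        Get-Counter '\Hyper-V Virtual Storage Device(*server-c*)\Queue Length' -SampleInterval 5 -MaxSamples 12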

    All affected VMs are Gen2 VMs with Server 2012 R2; the Hypervisors are also running 2012 R2. SAN uses 8G FC with Brocade switches.

    There is no storage redirection in place, nor do the FC switches or adapters report any errors. I also tried different MPIO options (like using only a single path instead of all 4 available paths), but that did not change a thing.
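
    For completeness, the path count and load-balance policy can be inspected and changed with the in-box MPIO tools (a sketch only; whether the Microsoft DSM or a vendor DSM is claiming the Compellent LUNs may change what applies):

        # list MPIO-claimed disks and the number of paths per disk
        mpclaim.exe -s -d

        # current Microsoft DSM defaults (load-balance policy, timers)
        Get-MSDSMGlobalDefaultLoadBalancePolicy
        Get-MPIOSetting

        # example: switch the default policy to Round Robin (FOO = fail-over only)
        Set-MSDSMGlobalDefaultLoadBalancePolicy -Policy RR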

    One thing I noticed is that Compellent provides volumes with native 4K sectors. Of course this sounds like the root cause, but I did tests with native 4K VHDX files (New-VHD -PhysicalSectorSizeBytes 4096) which showed the same horrible performance.
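
    (For reference, a 4K-native test disk can be created and verified like this; path and size are placeholders:)

        # create a 4K-native (4K logical + 4K physical) dynamic VHDX for testing
        New-VHD -Path 'C:\ClusterStorage\Volume1\4k-test.vhdx' -SizeBytes 50GB -Dynamic -LogicalSectorSizeBytes 4096 -PhysicalSectorSizeBytes 4096

        # verify the sector sizes the guest will see
        Get-VHD -Path 'C:\ClusterStorage\Volume1\4k-test.vhdx' | Select-Object Path, VhdFormat, LogicalSectorSize, PhysicalSectorSize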

    Is there anything I could do, or anything where I could investigate more?


    • Edited by svhelden Wednesday, March 15, 2017 6:07 AM fixed
    Sunday, March 12, 2017 6:51 AM

Answers

  • It does not explain the different benchmark results from host and VM, but it seems that the Compellent disks just can't handle the number of IOPS caused by our cluster. (Our old system had more and faster disks.)

    There may be additional issues in the Hypervisor setup, but it does not make sense to investigate them further at this point.

    • Marked as answer by svhelden Wednesday, March 15, 2017 6:11 AM
    Wednesday, March 15, 2017 6:11 AM

All replies

  • Hi Svhelden,

    The phenomenon seems strange. I suggest you open a case with Microsoft so that a more in-depth investigation can be done and you get a more satisfying explanation and solution to this issue.
    Here is the link:
    https://support.microsoft.com/en-us/gp/contactus81?Audience=Commercial&wa=wsignin1.0

    Best Regards,
    Leo


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Monday, March 13, 2017 1:56 AM
    Moderator
  • We have the same type of issue with our Compellent SAN, did you investigate any further and if so did you find anything useful?
    Wednesday, October 11, 2017 7:19 PM
  • We have the same type of issue with our Compellent SAN, did you investigate any further and if so did you find anything useful?

    Actually it fixed itself, and no one really knows why.

    Not sure if it was related, but around that time we reduced the number of storage levels. Originally there were:

    • Tier 1 RAID 10
    • Tier 1 RAID 5
    • Tier 2 RAID 5
    • Tier 3 RAID 10
    • Tier 3 RAID 6

    The default storage profile used all tiers, so most blocks could be moved between 5 different storage levels. Copilot eliminated the Tier 3 RAID 10 and shrank the Tier 1 RAID 10 to the minimum (which is required for internal tasks). We also modified the storage profiles so that servers use a maximum of two storage levels, either

    • Tier 1 RAID 5
    • Tier 2 RAID 5

    or

    • Tier 1 RAID 5
    • Tier 3 RAID 6

    According to most Dell engineers, this should have a negative impact on performance, and they recommended reverting it. But still, around the time we made these changes, the Exchange issues went away. We can't tell whether that was a coincidence or not.

    Are you also running Exchange 2016 on Hyper-V 2016? And is your SAN connected by FC?



    • Edited by svhelden Thursday, October 12, 2017 4:29 AM
    Thursday, October 12, 2017 4:28 AM
  • We are having the same problem with our SC4020.

    When Veeam runs a full backup, it touches data on both the NL-SAS drives and the SSDs. We are not able to get more than 35 MB/s, and we see really bad speed and timeouts for our file server and any other VM that sits on the same CSV, even if we only try to read/write on the SSD tier.

    • Tier 1 - SSD x12
    • Tier 3 - NL-SAS x12
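
    For anyone debugging the same combination, one thing worth checking while the backup job runs is whether the CSV drops into redirected access during the snapshot, and what the per-VHDX queue looks like on the host at the same time (a sketch using the in-box cmdlets; counter instances are placeholders):

        # show whether each CSV is in Direct or Redirected access, and why
        Get-ClusterSharedVolumeState | Select-Object Name, Node, StateInfo, FileSystemRedirectedIOReason, BlockRedirectedIOReason

        # per-VHDX queue length on the host while the backup is running
        Get-Counter '\Hyper-V Virtual Storage Device(*)\Queue Length' -SampleInterval 5 -MaxSamples 6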

    Tuesday, December 05, 2017 4:55 PM