How to find the root cause of a DFS server performance issue that was resolved by vMotion to another host

  • Question

  • We are having a problem with one Windows Server 2016 DFS server.
    Out of the blue, the server starts responding slowly.
    CPU, memory and disk latency are all fine.
    No AV scans are running. AV runs as an ESX host-level appliance; there is no AV inside the Windows OS.
    On one of the occasions when it was slow, I saw that the SMB queue length was quite high.
    On the other occasion I did not have time to check the queue length.
    On both occasions backups were still running. Backups are agent based (the agent is installed on the Windows server).
    On both occasions I migrated the VM to another host and it started working fine.

    The VMware engineer says that vMotioning to another host freezes the VM, causing all TCP connections on the VM to either reset or reconnect.
    Is this correct?
    To my knowledge, VMware says that I/O is committed to a delta disk and then committed to the parent disk after vMotion. Is that right?

    How can I find out whether the VMware/storage/network layer or the OS is causing the issue?
    Since moving to another host solves the problem, it seems to be something in the underlying layers rather than the OS.
    But which logs can prove that?
    And if it happens again, what data should we gather to find the root cause?

    • Edited by M_C_7 Saturday, September 21, 2019 10:12 PM
    Saturday, September 21, 2019 10:11 PM

All replies

  • Hi,

    Thanks for posting in our forum.

    The VMware engineer says that vMotioning to another host freezes the VM, causing all TCP connections on the VM to either reset or reconnect.

    Is this correct?

    To my knowledge, VMware says that I/O is committed to a delta disk and then committed to the parent disk after vMotion. Is that right?

    I’m sorry that I can’t give you an exact answer because I’m not familiar with VMware; this should be confirmed with the VMware support team.

    How can I find out whether the VMware/storage/network layer or the OS is causing the issue?

    Since moving to another host solves the problem, it seems to be something in the underlying layers rather than the OS.

    But which logs can prove that?

    If the issue occurs again, we can check the event logs to see whether any disk-related error events were logged, and use Performance Monitor to monitor disk performance and latency.

    For your reference:

    https://blogs.technet.microsoft.com/askcore/2012/02/07/measuring-disk-latency-with-windows-performance-monitor-perfmon

    https://docs.microsoft.com/en-us/azure/monitoring/infrastructure-health/vmhealth-windows/winserver-disk-currqueuelength
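
    Once the counters have been logged, a short script can help pick out the slow intervals. The following is only a minimal sketch, assuming the Performance Monitor data was exported as a PDH-style CSV (timestamp in the first column, one column per counter). The server name DFS01, the sample values and the 25 ms threshold are hypothetical and should be adjusted to the environment.

```python
import csv
import io

# Assumed threshold: treat any average latency above 25 ms as slow.
LATENCY_THRESHOLD_S = 0.025

# Hypothetical sample in the shape of a Performance Monitor CSV export.
SAMPLE = r'''"(PDH-CSV 4.0)","\\DFS01\PhysicalDisk(_Total)\Avg. Disk sec/Read","\\DFS01\PhysicalDisk(_Total)\Current Disk Queue Length"
"09/21/2019 21:00:00.000","0.004","1"
"09/21/2019 21:00:15.000","0.180","42"
"09/21/2019 21:00:30.000"," ","3"
'''


def slow_samples(csv_text, threshold=LATENCY_THRESHOLD_S):
    """Return (timestamp, counter, seconds) for each latency sample above threshold."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Column 0 is the timestamp; keep only the "Avg. Disk sec/..." latency columns.
    latency_cols = [
        i for i, name in enumerate(header)
        if i > 0 and "Avg. Disk sec/" in name
    ]
    hits = []
    for row in reader:
        for i in latency_cols:
            if i >= len(row):
                continue  # short row, sample missing
            try:
                value = float(row[i])
            except ValueError:
                continue  # blank sample, skip it
            if value > threshold:
                hits.append((row[0], header[i], value))
    return hits


if __name__ == "__main__":
    for timestamp, counter, value in slow_samples(SAMPLE):
        print(f"{timestamp}  {counter}  {value * 1000:.0f} ms")
```

    If the timestamps flagged inside the guest line up with the slow periods, they can be compared against host and storage statistics for the same times; if the guest counters stay low while users see slowness, that points away from the disk path inside the OS.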

    Best Regards,

    William


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Monday, September 23, 2019 9:20 AM
    Moderator
  • Hi,

    Just checking in to see if the information provided was helpful. Please let us know if you would like further assistance.

    Best Regards,

    William

    Wednesday, September 25, 2019 7:04 AM
    Moderator
  • Hi,

    You are welcome to share an update on your current situation.

    Please feel free to let us know if you need further assistance.

    Best Regards,

    William



    Friday, September 27, 2019 9:37 AM
    Moderator