Windows Defender Realtime Monitoring Causes CSV/Live Migration Failures

  • Question

  • This is a straight copy/paste from the post I made in the Hyper-V section. I still think it's more likely to apply there, but was asked to post this here as well.

    https://social.technet.microsoft.com/Forums/windowsserver/en-US/b65e22c0-871a-4565-bdb5-2734ecd795f3/windows-defender-realtime-monitoring-causes-csvlive-migration-failures?forum=winserverhyperv

    We have two functioning Hyper-V clusters running Windows Server 2016 Datacenter. Each cluster is at a different site. Although the clusters use different server types and different numbers of cluster nodes, all other hardware (switches, storage, etc.) is the same.

    In June we performed normal updates on both clusters and all VMs hosted on those clusters. Shortly after, pausing a node to drain/stop it would fail, with another node going into isolation. This also caused CSVs to become unavailable for a time and the VM workload on that node to be stopped and restarted. After a month or so, the host isolation events disappeared; however, CSV volumes would still fail to go completely offline or come back online while drain/stopping a host node. Again, this caused the VM workload on the host to be stopped and restarted.

    It would seem that the process of VMs live migrating and the CSVs moving at the same time caused the CSVs to hang on occasion. This occurred randomly, and was not guaranteed to cause the problem each time. However, live migrating the VMs manually and then moving the CSVs or vice versa would allow the node to be paused without issue. As such, finding the cause of the problem proved to be very difficult.
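
    The manual workaround described above (live migrate the VMs first, then move the CSVs, then pause the node) could be scripted roughly as follows. This is only a sketch: the node names HV-NODE1 and HV-NODE2 are hypothetical, it assumes the FailoverClusters PowerShell module is available, and it sends everything to a single target node rather than balancing the load.

```powershell
# Sketch: drain a node in two separate phases instead of all at once.
# HV-NODE1 / HV-NODE2 are hypothetical names; replace with your own nodes.
Import-Module FailoverClusters

$source = 'HV-NODE1'   # node to be paused
$target = 'HV-NODE2'   # node that receives the VMs and CSVs

# Phase 1: live migrate every VM role currently owned by the source node.
Get-ClusterGroup |
    Where-Object { $_.GroupType -eq 'VirtualMachine' -and $_.OwnerNode.Name -eq $source } |
    ForEach-Object {
        Move-ClusterVirtualMachineRole -Name $_.Name -Node $target -MigrationType Live
    }

# Phase 2: only after the VMs have moved, move the CSVs owned by the source node.
Get-ClusterSharedVolume |
    Where-Object { $_.OwnerNode.Name -eq $source } |
    ForEach-Object {
        Move-ClusterSharedVolume -Name $_.Name -Node $target
    }

# Phase 3: pause the (now mostly empty) node and drain anything that is left.
Suspend-ClusterNode -Name $source -Drain
```

    Keeping the two moves in separate phases is the whole point here, since the hang only seemed to occur when live migration and CSV moves overlapped.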

    In the end, we built up a separate cluster on very different hardware, but again with the same supporting storage, switches, etc. We were able to reproduce our problem and make it occur more frequently by increasing the IO workload on the CSV volumes while pausing/draining a host node. When the failure happened, we often, but not always, got Microsoft-Windows-FailoverClustering Event ID 5142: Cluster Shared Volume 'CSV5' (CSV5) is no longer accessible from this cluster node because of error '(1460)'. For whatever reason, CSV volumes were not fully transitioning when being moved to a new host node while live migration was being performed.
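
    For anyone wanting to check their own nodes for the same event, something like the following should list recent occurrences. It is a sketch that assumes the event is written to the System log under the provider name quoted above; the seven-day look-back window is an arbitrary choice.

```powershell
# Sketch: list recent Event ID 5142 entries on the local cluster node.
# Assumption: the events are in the System log; the 7-day window is arbitrary.
$filter = @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 5142
    StartTime    = (Get-Date).AddDays(-7)
}

# Get-WinEvent raises an error when nothing matches, so suppress that case.
Get-WinEvent -FilterHashtable $filter -ErrorAction SilentlyContinue |
    Sort-Object TimeCreated |
    Select-Object TimeCreated, MachineName, Message |
    Format-List
```

    Since the event was not logged on every failure, an empty result does not rule the problem out.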

    After attempting to rebuild the cluster with an older update, resetting the switch firmware, and a number of other troubleshooting steps, we found the answer to be Windows Defender. It appears the Realtime Monitoring feature causes this problem, and disabling it on the host cluster nodes using either of the following two methods fixed our problem completely:

    1. Start -> Settings -> Update & Security -> Windows Defender -> Real-time protection (set to off)
    2. From an administrative PowerShell prompt: Set-MpPreference -DisableRealtimeMonitoring $true

    This can be verified with the Get-MpPreference command and checking the value of DisableRealtimeMonitoring. Or you can check the status of the WdNisSvc (Windows Defender Network Inspection Service) service. This service should go to the stopped state when Realtime Monitoring is disabled.
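
    To run both checks across every node at once, a sketch along these lines may help. It assumes it is run from a cluster node (or a machine with the FailoverClusters module) and that PowerShell remoting is enabled between the nodes.

```powershell
# Sketch: report the real-time monitoring setting and WdNisSvc state per node.
# Assumptions: FailoverClusters module is present and PS remoting works.
Import-Module FailoverClusters

$nodes = (Get-ClusterNode).Name

Invoke-Command -ComputerName $nodes -ScriptBlock {
    [pscustomobject]@{
        # $true here means Realtime Monitoring is turned off
        DisableRealtimeMonitoring = (Get-MpPreference).DisableRealtimeMonitoring
        # Expected to read 'Stopped' once Realtime Monitoring is disabled
        WdNisSvcStatus            = (Get-Service -Name WdNisSvc).Status.ToString()
    }
} | Select-Object PSComputerName, DisableRealtimeMonitoring, WdNisSvcStatus
```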

    This is extremely surprising, since Microsoft's own documentation states that when specific roles are installed, exclusions are automatically added to Defender.

    https://docs.microsoft.com/en-us/windows/security/threat-protection/windows-defender-antivirus/configure-server-exclusions-windows-defender-antivirus#list-of-automatic-exclusions

    Additionally, these automatic exclusions generally match the recommended exclusions for Hyper-V.

    https://support.microsoft.com/en-us/help/3105657/recommended-antivirus-exclusions-for-hyper-v-hosts

    Again, using the Get-MpPreference command, one can check the value of DisableAutomaticExclusions to see if they are being used. Even if you manually add the recommended exclusions, it appears as if Windows Defender ignores these. Only disabling Realtime Monitoring fixes the problem we were experiencing.
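
    The exclusion settings can be inspected as shown below. This is a sketch only; the commented-out Add-MpPreference line uses example values (the Hyper-V management and worker processes and the default CSV mount path), so check the linked KB article for the full recommended list. As noted above, adding exclusions did not resolve the problem in our case.

```powershell
# Sketch: show whether automatic exclusions are active and list manual ones.
$pref = Get-MpPreference

# $false means the automatic role-based exclusions are in use.
"Automatic exclusions disabled: $($pref.DisableAutomaticExclusions)"

"Path exclusions:"
$pref.ExclusionPath

"Process exclusions:"
$pref.ExclusionProcess

"Extension exclusions:"
$pref.ExclusionExtension

# Example of adding manual exclusions (values are illustrative, not a full list):
# Add-MpPreference -ExclusionProcess 'vmms.exe', 'vmwp.exe' -ExclusionPath 'C:\ClusterStorage'
```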

    I was surprised that I was unable to find anyone else experiencing this issue since we are not doing anything out of the norm for our clusters. Everything is pretty much a basic Windows installation and configuration. I cannot reproduce this issue on Windows Server 2019. Any additional help about what we might be doing wrong to cause this problem would be great.

    Thanks.

    • Edited by Joe Bandura Wednesday, November 13, 2019 6:05 PM
    Wednesday, November 13, 2019 6:05 PM