Hyper-V live migration fails when Hyper-V Replica is enabled on virtual machines

  • Question

  • Hi.

    INTRODUCTION

    We have a cluster (version 2016) in production with the following characteristics.

    • 4 HP ProLiant 380 G6 servers with 96 GB RAM each (with the latest drivers and Windows Server 2016 Datacenter updates).
    • 4 network cards: 2 for iSCSI, 1 for live migration + cluster, and 1 for the production + management network.
    • 5 512 GB iSCSI disks shared from a QNAP TS-879U-RP (with the latest firmware version).
    • 80 small virtual machines running Windows and Linux.

    Through testing, we have verified that live migrations work perfectly when the virtual machines do not have Hyper-V Replica enabled. Using a PowerShell script, we spent hours migrating machines from one server to another every minute without any problem.
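
    A test loop of the kind described above could be sketched as follows. This is only a minimal sketch, not the script actually used: it assumes it runs on a cluster node with the FailoverClusters module available, and the round count, interval, and random choice of target node are illustrative assumptions.

    ```powershell
    # Sketch only: repeatedly live-migrates every clustered VM to another node.
    # $Rounds and $IntervalSeconds are assumed values, not taken from the post.
    Import-Module FailoverClusters

    $Rounds          = 10
    $IntervalSeconds = 60   # "every minute"

    for ($round = 1; $round -le $Rounds; $round++) {
        # Re-read the list of healthy nodes each round
        $nodes = (Get-ClusterNode | Where-Object State -eq 'Up').Name

        foreach ($group in (Get-ClusterGroup | Where-Object GroupType -eq 'VirtualMachine')) {
            # Candidate targets: every healthy node except the current owner
            $candidates = @($nodes | Where-Object { $_ -ne $group.OwnerNode.Name })
            if ($candidates.Count -eq 0) { continue }
            $target = Get-Random -InputObject $candidates

            try {
                Move-ClusterVirtualMachineRole -Name $group.Name -Node $target -MigrationType Live -ErrorAction Stop | Out-Null
                Write-Output "$(Get-Date -Format s) OK   $($group.Name) -> $target"
            }
            catch {
                Write-Output "$(Get-Date -Format s) FAIL $($group.Name) -> $target : $_"
            }
            Start-Sleep -Seconds $IntervalSeconds
        }
    }
    ```

    Logging each attempt with a timestamp makes it easier to match a failed migration against the Hyper-V event logs afterwards.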

    DESCRIPTION OF THE PROBLEM

    We want all virtual machines replicated to a server, so we enabled Hyper-V Replica on all of them. Since then, live migrations fail at random. The virtual machine hangs in the "Stopping" state, totally inaccessible. The only way to recover it is to restart the host, and because some service is blocked, the restart itself takes several hours to complete. We tried to unlock the virtual machine by stopping the process linked to it, and by stopping the Hyper-V services or the cluster service, without success. Once the host finally restarts, the cluster automatically moves the machine to another host.

    As a point of interest, the cluster was upgraded to the 2016 version from a 2012 version. Before updating to Windows Server 2016 we never had these problems. Live migrations sometimes failed, but they never left the physical host "hung".

    Greetings and thanks.


    MCSA: Windows Server 2008

    Tuesday, January 17, 2017 10:11 PM

Answers

  • Hello everyone

    I spoke to Microsoft support. On all the hosts, they asked me to uninstall my antivirus, restart, enable Windows Defender and, once that's enabled, run the following in PowerShell (to add exclusions):

    Set-MpPreference -ExclusionPath c:\clusterstorage, %ProgramData%\Microsoft\Windows\Hyper-V, %ProgramFiles%\Hyper-V, %SystemDrive%\ProgramData\Microsoft\Windows\Hyper-V\Snapshots, "%Public%\Documents\Hyper-V\Virtual Hard Disks"

    Set-MpPreference -ExclusionProcess %systemroot%\System32\Vmwp.exe, %systemroot%\System32\Vmms.exe -Force

    Set-MpPreference -ExclusionExtension *.vhd, *.vhdx, *.avhd, *.avhdx, *.vsv, *.iso, *.rct, *.vmrs, *.vmcx


    I did this and initiated live migration which worked great between the nodes. I live migrated at least 10 times. Live migration with replication enabled is successful.
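
    To check that the exclusions were actually stored, the Defender preferences can be read back. This is a minimal sketch, assuming the Defender PowerShell module is present on the host; the `D:\Replica` path is hypothetical and only illustrates appending one more exclusion.

    ```powershell
    # Read back the exclusions Defender has actually stored
    $prefs = Get-MpPreference
    $prefs.ExclusionPath
    $prefs.ExclusionProcess
    $prefs.ExclusionExtension

    # Set-MpPreference replaces the stored list for the parameter it is given;
    # Add-MpPreference appends instead. 'D:\Replica' is a hypothetical path.
    Add-MpPreference -ExclusionPath 'D:\Replica'
    ```

    Because `Set-MpPreference` overwrites the list for each parameter it is given, re-running it later with a shorter list would drop earlier entries, so `Add-MpPreference` is the safer choice for later additions.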

    It looks like live migration works with Windows Defender enabled.

    Can someone try the same thing and let me know how you get on? Can Windows Defender really be the culprit?


    • Edited by TechTifa Monday, June 12, 2017 4:24 PM
    • Proposed as answer by Trevor TyeMVP Tuesday, June 13, 2017 12:58 AM
    • Marked as answer by Cimmerio Monday, June 19, 2017 3:49 PM
    • Unmarked as answer by Cimmerio Tuesday, June 20, 2017 2:44 PM
    • Marked as answer by Cimmerio Tuesday, August 13, 2019 7:50 AM
    Monday, June 12, 2017 4:22 PM

All replies

  • Hi Cimmerio,

    Have you configured Hyper-V replica broker?

    https://blogs.technet.microsoft.com/virtualization/2012/03/27/why-is-the-hyper-v-replica-broker-required/

    >>Before updating to Windows Server 2016 we never had these problems before.

    Do you mean the issue happened after the upgrade?

    Have you run cluster validation?

    Best Regards,

    Leo


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Wednesday, January 18, 2017 6:28 AM
    Moderator
  • Hello Leo Han.

    I did not mention it, but yes: the broker role is created and working.

    What I meant is that before upgrading the servers' operating system to Windows Server 2016, I never had this problem. Live migration sometimes failed, but the server never hung.

    The cluster validation passes. There are some warnings, but they are not important.

    The error that occurs when virtual machines are hung in the "Stopping" state is as follows:

    'VIRTUAL MACHINE NAME' could not perform the operation 'Clearing Obsolete Benchmarks'. Currently, the virtual machine is performing the following operation: 'Moving virtual machine'. (Virtual machine identifier F26EA46F-C481-4E46-8214-6F8594569D4)

    Source: Hyper-V-VMMS
    ID: 19060
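
    Events like this one can be pulled from a host with `Get-WinEvent`. This is a minimal sketch, assuming the event is written to the `Microsoft-Windows-Hyper-V-VMMS-Admin` channel (the usual location for the Hyper-V-VMMS source); the limit of 20 events is arbitrary.

    ```powershell
    # List the most recent occurrences of event 19060 from the VMMS admin log.
    # The channel name is an assumption; adjust it if the event is logged elsewhere.
    Get-WinEvent -FilterHashtable @{
        LogName = 'Microsoft-Windows-Hyper-V-VMMS-Admin'
        Id      = 19060
    } -MaxEvents 20 -ErrorAction SilentlyContinue |
        Select-Object TimeCreated, Id, Message |
        Format-List
    ```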


    Regards


    Wednesday, January 18, 2017 6:46 PM
  • Hi Cimmerio,

    >>I wanted to say that before upgrading the operating system from the servers to Windows Server 2016, I never had this problem.

    Probably some file got corrupted during the upgrade and caused the issue.

    I suggest trying a fresh install of Server 2016 rather than an upgrade.

    Best Regards,

    Leo


    Thursday, January 19, 2017 6:31 AM
    Moderator
  • Sorry Leo, I expressed myself badly. I did a fresh installation of Windows Server 2016 on each host, following these steps to upgrade the cluster:

    https://technet.microsoft.com/en-us/windows-server-docs/failover-clustering/cluster-operating-system-rolling-upgrade

    Regards.


    Thursday, January 19, 2017 7:07 AM
  • Hi Cimmerio,

    I was not able to find any official documentation about a conflict between live migration and Hyper-V Replica.

    I suggest you open a case with Microsoft so that a more in-depth investigation can be done and you can get a more satisfying explanation and solution to this issue.
    Here is the link:
    https://support.microsoft.com/en-us/gp/contactus81?Audience=Commercial&wa=wsignin1.0

    Best Regards,
    Leo


    Friday, January 20, 2017 2:38 AM
    Moderator
  • We are experiencing the same issue at the moment, IBM Bladecenter with 6 Blades and SAN Storage. Previously on Server 2012 R2 we did not have the problem either.

    We are replicating to a 2012 R2 host from the 2016 cluster. Could this be the issue?

    Thursday, February 2, 2017 10:35 AM
  • >>We are experiencing the same issue at the moment, IBM Bladecenter with 6 Blades and SAN Storage. Previously on Server 2012 R2 we did not have the problem either.

    >>We are replicating to a 2012 R2 Host, from the 2016 Cluster. Could this be the issue.

    Hello LeoDuPreez.

    Our replica server runs Windows Server 2016, so I don't think that's the cause.

    I've opened a support ticket with Microsoft and we're running tests. It is very likely a Windows Server 2016 bug.

    I will keep this post updated.



    Thursday, February 2, 2017 3:34 PM
  • Hi Cimmerio

    After extensive investigation I'm still stuck with the same issue.

    1. We upgraded the replica server to Windows Server 2016.

    2. Enabled Hyper-V replication between the cluster and the standalone host (all servers are on 2016).

    3. Enabled replication of the virtual machine.

    4. Moved the virtual machine using SCVMM 2016. The move gets to around 85%, and when I look at the status of the VM in Hyper-V Manager on the cluster node, the virtual machine is in a stopping state. The only way to resolve it is to hard reboot the node.

    5. Shutting down a replicated virtual machine from within SCVMM 2016 also leaves it stuck in a shutting-down state.

    6. While the machine is in the shutting-down state, I cannot ping it or use the Hyper-V tools to turn it off. I can remove replication, though. When replication is removed and I go to the CSV, there is one HRL file left and it is locked; I cannot delete it.

    7. Now for the oddest one of all: when a replicated VM is deleted from Hyper-V Manager on the replication host, it shows up in SCVMM 2016 as missing. Selecting that VM and deleting it results in SCVMM 2016 deleting all of the Hyper-V Replica VHD files. If I go into the Virtual Hard Disks folder and browse the GUID folders of the Hyper-V machines, everything is empty.
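
    To spot virtual machines stuck in the state described in the steps above, each node can be queried for VMs whose state is "Stopping". This is a minimal sketch, assuming it runs with administrative rights on a cluster node with the FailoverClusters and Hyper-V modules; a host that is already hung may not answer the query.

    ```powershell
    # List VMs that are stuck in the Stopping state on every cluster node
    Import-Module FailoverClusters, Hyper-V

    Get-ClusterNode | ForEach-Object {
        Get-VM -ComputerName $_.Name |
            Where-Object State -eq 'Stopping' |
            Select-Object ComputerName, Name, State, Status, ReplicationState
    }
    ```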

    Any updates are appreciated.

    Regards

    Leo

    Friday, March 24, 2017 1:48 PM
  • We have just migrated a customer from VMware to Server 2016, running Oracle Server X6-2 machines with VNX storage and Clarion storage. We are replicating from the (new) VNX storage to the old Clarion storage, and we are experiencing the same issues.

    I need these issues resolved, as the customer is now considering going back to VMware.

    Kind Regards

    Leo

    Friday, March 24, 2017 2:07 PM
  • Hello Leo.

    At the moment we have an open support case with Microsoft. We do not have a solution to the problem yet. I will update this post.

    Regards.


    Saturday, March 25, 2017 4:58 AM
  • Hi Cimmerio

    Is there any chance that the engineers could log in to my systems and gather logs, etc. for troubleshooting?

    Regards

    Leo

    Monday, March 27, 2017 7:44 AM
  • Hi Cimmerio

    I have recently implemented a Hyper-V 2016 environment and am experiencing an issue identical to the one in your post.

    Please could you advise if you've had any feedback or a solution from Microsoft?

    I did my own testing: NIC configuration changes did not resolve the issue. I even updated the BIOS, but the issue remains.

    Many thanks

    Trish

    Thursday, April 20, 2017 11:39 AM
  • Any feedback?

    Wednesday, April 26, 2017 2:24 AM
  • Hi all,

    I am running:

    1 x Dell SC4020 Compellent

    3 x PowerEdge R730 – Windows 2016 Datacentre

    1 x Cluster Hyper-V 2016

    55 x VM

    I have the same issue. I had to stop the replication to ASR (Azure) and I just logged a critical case with Microsoft. I expect Microsoft to call me sometime tonight. I logged a case with Azure support but they have no idea about what is happening.  It’s very painful and frustrating.

    I will keep you posted.

    Thanks,

    Charlie

    Wednesday, April 26, 2017 10:36 AM
  • I am having the same issue too... does anyone have an update on this?

    Regards,

    Dean


    Tuesday, May 2, 2017 1:09 PM
  • Hi Tcharlie6

    Any feedback from MS regarding the issue, and has yours been resolved?

    Kind Regards

    Leo

    Monday, May 8, 2017 12:26 PM
  • I have a similar problem on a Server 2012 R2 Hyper-V cluster, and I traced it to the switches. For whatever reason the switch becomes overloaded, and the virtual machine and my replica become unresponsive when replicating. It got to the point where the switch was causing issues with the cluster, with live migrations, and even with DNS queries. Do you have a switch on your network that may be overloaded? After moving most of the cluster off that switch, most of my issues were resolved. I had several issues, including the replica issue you describe, because of an overloaded switch.

    Just a thought; it was the primary cause of my Hyper-V cluster issue.

    Monday, May 22, 2017 3:52 PM
  • Hi Trevor

    Thanks for the input. The switches are not overloaded; in fact, the cluster is running on its own switch.

    We have created a brand-new cluster using two IBM servers. The storage is presented to the two Hyper-V hosts via Microsoft iSCSI shares and Microsoft's MPIO. The two servers each have six NICs, divided as follows:

    1. Virtual Switch - Management / Shared with Management OS

    2. Virtual Switch - Local Area Connection - Virtual Machine LAN / Not shared with Management OS

    3. Virtual Switch - Migration Network / Shared with Management OS

    4. Virtual Switch - iSCSi Network / Shared with Management OS

    5. Virtual Switch - Cluster Network / Shared with Management OS

    6. Virtual Switch - Replication Network / Shared with Management OS

    So each type of traffic runs on its own NIC. We built the cluster using MS iSCSI and MS MPIO to prove that it's not the SAS RAID controllers and the IBM MPIO software that are causing the issue. This cluster is experiencing the same issue.

    Cluster - Cluster Replication / Migration Fails in Cluster from one node to another of the VM

    Cluster - Standalone Replication Host / Migration Fails in Cluster from one node to another of the VM

    Live migration gets to about 85% and then the VM state is "Stopping". It just hangs and does not complete the migration. The only way to recover is to do a hard reboot of the node.

    Any other suggestions would be welcome.

    Kind Regards

    Leo



    • Edited by LeoDuPreez Wednesday, May 24, 2017 11:58 AM
    Wednesday, May 24, 2017 11:56 AM
  • Unfortunately, to this day I still do not have a solution from Microsoft.

    As soon as I know something I will let you know.


    Monday, May 29, 2017 8:47 PM
  • Hi Leo,

    Are you running multiple WANs on the switches? I know that can cause some issues; I have a setup very similar to yours. I'm using 2 FreeNAS boxes for our iSCSI targets, and WebSmart Allied Telesis switches (lagged). The switch that the cluster was on can only forward packets at 46 Gbps, and it was causing a massive issue when doing a live replica failover to a standalone server while running the cluster. The replication would crash every 3rd or 4th replication, and I was only doing 2 VMs. How well does it run when you run one? I am assuming you're running all 80 at once in a sequence or something like that. What if you reduced it to 10 and ran them that way, waiting until the batch is done before moving to the next 10, and so on? I know that with Veeam, when the VSS backup crashes, a restart of the Virtual Machine Management service lets you get back up and going without a reboot, but with your description

    "Randomly live migrations fail. The virtual machine hangs in stopping state, totally inaccessible". 

    My error log with Veeam shows something like this:

    The writer experienced a transient error.  If the backup process is retried,
    the error may not reoccur.
    --tr:Failed to verify writers state.
    --tr:Failed to perform pre-backup tasks.

    and the server would just hang, and any replication would be stopped until the machine was rebooted. To resolve the issue I literally had to give the backups more time, as they seemed to overwhelm the switch. I also had to delete the replicated system, as Veeam got stuck on some sort of byte check; since the VM hadn't transferred properly it would crash, and I had to create a new replicated VM.

    When I set up replication of a few servers on a different Hyper-V host (my cluster), I tried the replication built into Hyper-V and found I needed to allow more time for the live migration (I know this was definitely a switch bottleneck): the live migration would randomly crash and I would need to restart the replication service on the cluster. It seemed to be the same issue as with the VSS writer.

    I hope you find some of this information helpful. As for the live migration, I had to reduce the number of VMs and adjust the time between replications to ease the load on disk I/O and the network.

    You might find my Veeam post helpful with this problem. I have a feeling your VSS writer is randomly crashing.

    https://optionkey.blogspot.ca/2016/06/fixing-veeam-hyper-v-replication.html


    Tuesday, May 30, 2017 2:12 AM
  • I logged a call with Microsoft but still don't have a resolution. I replicate from a Hyper-V 2016 cluster to Azure. If I enable replication on a VM, it is not able to shut down: it gets stuck stopping at about 80%, sometimes 90%, and just stays there forever. After some time, all VMs start to misbehave and the cluster simply crashes. I had to hard reboot all the hosts to recover.

    I've been working with Microsoft support for weeks. After a convoluted process they finally managed to capture a dump file for analysis. They said they could see the issue but never got back to me. In the end, I decided to roll Hyper-V back to 2012 R2 and everything just works fine. This proves that there is no hardware or infrastructure issue; it is Windows Server 2016 itself that is causing the issue.

    Regards,

    Dean



    • Edited by DLMyriad Wednesday, May 31, 2017 2:09 PM
    Wednesday, May 31, 2017 2:07 PM
  • Hi Cimmerio

    I logged a call with Microsoft as well, but no luck yet. They now suspect the VM to be the issue and not the hosts. My VM is a print server.

    I'm starting to think it's the services or applications that we run on these VMs that are not replicating correctly.

    Do you mind if I ask what your VM is running?

    Also, everyone else on the forum: what do your VMs run? (Exchange, SQL, anything application-aware?)

    I just want to see if there is some sort of correlation between the types of VMs and the level of failure.

    Thank you all

    Trish

    Wednesday, June 7, 2017 2:32 PM
  • Hi TechTifa / all following this issue.

    It happens to all VMs; it does not matter what roles are installed. All my VMs are running either Server 2016 or Server 2012 R2, with configuration version 8.0, and all are generation 2 VMs.

    Each VM is dedicated to one role, and we are running about 60 VMs at the moment. We have 1 TB of memory, so memory is not an issue, and neither is CPU: CPU usage on the hosts is at 0% and the VMs are using between 1 and 5%. The CSV cache has been increased to 10 GB per blade, which gives us 60 GB of CSV cache for the VMs to use.

    We run all deployments according to Microsoft white papers: SharePoint servers with dedicated DB servers, application servers with dedicated DB servers, Exchange 2016 servers in a DAG (two DAGs and one witness server for the DAG), print servers, domain controllers, and IIS web servers.

    At the moment all replication is disabled. This morning I enabled replication to a standalone host using the DAG witness server. It migrated once to another blade with no problems. When migrating to the next blade, it ended up in the stopping state. The only way to get the virtual machine back online was to drop the power on the blade. I have since removed replication on the VM. It is a standard install of Server 2016. All VMs are patched and up to date. All hosts are running Server 2016 Datacenter Edition, fully patched and up to date.

    I did pick up the following event on the system log of the blade that coincided with the move of the Virtual machine.

    failed to perform the 'Cleaning up stale reference point(s)' operation. The virtual machine is currently performing the following operation: 'Moving Virtual Machine'.

    Kind Regards

    Leo

    Thursday, June 8, 2017 11:46 AM
  • Hi Leo

    That blows my theory out of the window, then. I also experienced the same behaviour in my environment. At first the VM was only moving from node1 to node2 but not the other way.

    Then I changed the performance option from Compression to TCP/IP. That worked for a while; now it doesn't.

    One thing I did today: on the replica server I added another disk and pointed replication to that extra drive, and live migration worked with replication enabled. I don't know what that means.

    Do replication files need to be on their own separate drive? At the moment replication is pointing to the C:\ drive that is holding everything else.

    Also check in the config whether you have any lingering HRL files. If so, as a test, Microsoft asked me to delete the files. The ones not connected to your VMs will delete without a problem; the remaining ones are the ones your VM is connected to. Try live migrating after that and see if it works. It worked in my lab but not when I tried it in live production.
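
    A read-only way to look for lingering HRL files is to list them first and decide afterwards. This is a minimal sketch, assuming the VMs live under the default CSV mount point `C:\ClusterStorage`; it only lists files and deletes nothing.

    ```powershell
    # List Hyper-V Replica log (.hrl) files under the CSV mount point, oldest first.
    # 'C:\ClusterStorage' is assumed; change it if your VM storage lives elsewhere.
    Get-ChildItem -Path 'C:\ClusterStorage' -Recurse -Filter '*.hrl' -ErrorAction SilentlyContinue |
        Sort-Object LastWriteTime |
        Select-Object FullName, Length, LastWriteTime
    ```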

    • Edited by TechTifa Thursday, June 8, 2017 4:37 PM
    Thursday, June 8, 2017 4:27 PM
  • Hi TT

    I have not tried taking out compression; will do that now. As for a dedicated disk for replication, I will try another disk. Also, live migration is not the only issue. Shut down a virtual machine that is replicated (I suggest using a test VM): you will probably see it go from shutting down into the stopping state. Can you confirm that you have this issue as well? I suspected BIOS settings, and last night we made sure that all the hosts have the exact same BIOS settings: processors and RAM running at max performance, capping disabled, VT-d enabled, DEP enabled, etc.

    Same issue. I will try uncompressed and report back.

    Kind Regards

    Leo


    • Edited by LeoDuPreez Friday, June 9, 2017 9:10 AM
    Friday, June 9, 2017 9:09 AM
  • Hi TT

    I changed the settings so that the data going over the network is not compressed, and moved the replication folder to another disk. Same issue. I have 30 years in the IT trade and am an avid Microsoft supporter. We are an enterprise data warehouse company that supplies Oracle Exadata and Microsoft solutions to our customers. We run big SQL clusters for our customers, and I have not experienced issues like this before.

    I guess in the end it's my bad decision. Server 2012 R2 was stable and mature; moving to 2016 was based on the benefits that it added, and it seems to have been more problematic than beneficial. I also upgraded all of my VMs to version 8.0. Is there any way of rolling them back? The only move here is to go back to 2012 R2 and wait for the issues in Server 2016 to be resolved. Setup cannot be the issue, as we have two clusters running and both exhibit the same problem. The other option is to invest in Veeam, or to start kicking out nodes and go back to VMware. We have the licenses and we are a VMware partner.

    The affected hardware does not seem to be brand specific: Dell, HP, IBM, Lenovo, Oracle.

    My biggest issue with Server 2016 is that when you experience problems or need advice, nothing turns up on the web. Searches return 2012 R2 issues, etc. I contacted the Microsoft call center to inform them of a flaw in one of their products and was told to haul out my credit card and pay first before they could log the call. After much moaning and complaining about SCVMM 2016 deleting the replicas of all virtual machines in the replica folder when you delete one virtual machine, they eventually issued a patch. I experienced this problem for 4 months before it was resolved.

    What concerns me is that several customers on this forum are experiencing the same issue. Some have logged calls, and there has been no response from Microsoft regarding a resolution. So let's try to figure this one out ourselves. Other software may be locking the replication process.

    The only additional software on my hosts is:

    1. Backup Exec 2016 Agent (both clusters)

    2. System Center Virtual Machine Manager 2016 Agent (both clusters)

    3. One cluster has the IBM SDDDSM software installed for multipathing and uses Microsoft MPIO as well.

    4. One cluster has MPIO and uses Microsoft iSCSI shares ("Vhdx") presented to the hosts.

    5. Every virtual machine has the Backup Exec 2016 Agent installed as well.

    Please indicate if you have the same config. Let's make Backup Exec 2015 / 2016 part of the same list.

    I'm thinking of removing all Backup Exec agents and SCVMM agents from the hosts and then checking the replication.

    Any suggestions, guys / Microsoft?

     

    Friday, June 9, 2017 10:03 AM
  • Can you check if the live migration NIC has the same name on all the nodes and also can you separate these NIC's (livemigration + cluster) and check?

    Cheers! Sachin Kumar Associate Consultant (Windows)

    Friday, June 9, 2017 11:08 AM
  • Hi Sachin, as posted before.

    1. Virtual Switch - Management / Shared with Management OS

    2. Virtual Switch - Local Area Connection - Virtual Machine LAN / Not shared with Management OS

    3. Virtual Switch - Migration Network / Shared with Management OS

    4. Virtual Switch - iSCSi Network / Shared with Management OS

    5. Virtual Switch - Cluster Network / Shared with Management OS

    6. Virtual Switch - Replication Network / Shared with Management OS


    Each one uses its own nic on its own subnet
    • Edited by LeoDuPreez Monday, June 12, 2017 6:41 AM
    Monday, June 12, 2017 6:40 AM
  • I would probably need to check the issue on your server to understand it better before proceeding further.

    Cheers! Sachin Kumar Associate Consultant (Windows)

    Monday, June 12, 2017 9:42 AM
  • Hi Cimmerio,

    This isn't the first issue of this kind I've heard of with Server 2016, and in the end the other system was rolled back to 2012 R2 as well. I don't have Backup Exec running, but I'm not running Server 2016 either; I am looking to upgrade to it for the ReFS features. Speaking of which, are you using ReFS, or are your volumes all NTFS/exFAT/something else? What kind of errors is your VMMS log throwing? Since this is all scripted, I wouldn't be surprised if you have to put in a manual pause of, say, 3-5 minutes after a big VM replication. It could be that the script is overloading the VMMS and the replication service by driving them faster than they can handle. Can you post some error logs from your replications and Hyper-V Manager? I am assuming that it runs fine for, say, 5 or 10 VMs and then crashes? Do you have any of that data? How about splitting the VMs up into smaller chunks, say 9 or 10 per script, run at specified intervals? I would be very interested in seeing if it crashes on the smaller script load. Do you have a lab you can replicate this in?

    Monday, June 12, 2017 2:34 PM
  • hi Leo and everyone

    Please check my last post; if you can test Windows Defender in your lab, that would be much appreciated.

    Windows Defender is enabled by default until you install your own antivirus.

    I just need someone else to confirm this.

    Thanks

    TT


    • Edited by TechTifa Monday, June 12, 2017 4:31 PM
    Monday, June 12, 2017 4:30 PM
  • I will confirm this tonight. I am 99.99% sure this is the case, but I'll confirm anyway. Glad it's fixed! Let me know if you get any errors overnight! That's the real test!
    Tuesday, June 13, 2017 1:00 AM
  • Server 2012 R2 doesn't come with Windows Defender, so it's not installed there and it's a moot point. However, I can totally see the AV halting the migration as being the issue. So if you were to go back to your AV and exclude those file types from scanning, you should be able to use your own AV (if you want). Mind if I blog about this?

    Thanks!

    Trevor

    • Proposed as answer by HUNTUNmzeri Tuesday, April 10, 2018 8:55 AM
    • Unproposed as answer by HUNTUNmzeri Tuesday, April 10, 2018 8:56 AM
    Tuesday, June 13, 2017 4:29 AM
  • Please reproduce the issue (give me the details of the replication: from which server to which server), note the exact time of the failure, then run the tool from the link below on any node, follow the instructions (select all the nodes) and upload the logs:

    https://diagnostics.support.microsoft.com/diagprov/provision/ASH.0B33E7C034353973666E495279350B.Run.exe?_tenant=ash&passkey=0B33E7C034353973666E495279350B&_ext=.exe

    I will check the logs and get back to you.


    Thanks! Sachin Kumar (Associate Consultant-Windows)

    Tuesday, June 13, 2017 6:28 AM
  • hi Trevor.

    This forum thread is mainly about migration problems on Windows Server 2016. On 2016, Windows Defender is enabled by default until you install your own AV.

    My AV already had the exclusions, but live migration still failed.

    Now I have enabled Windows Defender with the exclusions and also installed my AV on top. I made sure Windows Defender is running.

    Live migration is still successful. It's not ideal for 2 AVs to run simultaneously, but it looks like that's what needs to be done...

    ...until we find a workaround, or Microsoft provides a hotfix.

    I will keep you posted.

    Tuesday, June 13, 2017 9:56 AM
  • Hi Guys and Girls

    No AV installed on any of my hosts, but Windows Defender was uninstalled completely from each host. Installing Windows Defender now and adding the exclusions. Might just be that even if WD is uninstalled it may leave some kind of footprint behind, engine etc that causes this. Will report back.

    Kind Regards

    Leo

    Tuesday, June 13, 2017 12:49 PM
  • First cluster done. I can confirm that one test VM has replicated and that I can shut it down and migrate it within the cluster successfully. I re-installed Windows Defender on the nodes and on the standalone replica server, and added the exclusions on the Hyper-V cluster nodes and the standalone replica server.

    WD... let's wait for the other replicas to complete.

    Update:

    2 VMs replicated; I can shut them down successfully without them ending up in a stopping state. Waiting for the storage migrations back into the cluster to complete, and then live migrations will be tested on the other virtual machines. This is the best it's been up to now.

    Update:

    All VMs are back in our test cluster, and migrations are faster. There are also no stopping states during live migrations between nodes with replication enabled. Migrated 4 VMs from host to host with no problems. Live migration is fast. Migrating multiple machines at once from host to host is successful. Enabling more replicas and will test.

    Update:

    TEST cluster done... now a production cluster. All VMs are migrating with no problem. WHAT A PLEASURE. Moving on to the PRODUCTION clusters now.

    I will be updating this post as I move to the production clusters.

    Conclusion:

    1. Seems that uninstalling WD has the same effect as installing AV Software that disables WD.

    2. Why Microsoft... why include Windows Defender in a Server OS.... which breaks it.

    3. I still stand by Microsoft, what a nice product.. Windows Server 2016.

    4. Now... please explain to us why Windows Defender would do this or have you integrated it to such an extend that it becomes OS affecting.

    5. All our VM's running has AV installed, so WD has been disabled. What's the effect on the VM's and should we take any remedial action on the VM's that we deployed our own AV solutions to that we paid $$$$$ for.

    6. MICROSOFT RECOMMENDATIONS - DO NOT RUN ANY AV ON HYPER-V HOST UNLESS ABSOLUTELY REQUIRED - Explain this one to me :-)

    • Edited by LeoDuPreez Tuesday, June 13, 2017 2:00 PM
    Tuesday, June 13, 2017 1:19 PM
  • There must be a significant change in 2016 for the AV to act like that. I would get in touch with your AV provider and let them know about the issue, and I would see if MS can give you some free licensing for Windows Defender until your AV provider can get a fix for you; otherwise you are going to have to license both. I think you've got this licked though.
    Tuesday, June 13, 2017 1:47 PM
  • Hi TT

    Production Cluster has been done, one VM replicating and migrating without issues.

    It's a Windows Defender issue, no question. I have not seen migration working like this on Server 2016 before, only on 2012 R2.

    We did not install our AV on the nodes; we just uninstalled Windows Defender from Roles and Features. We do not run AV on our hosts, as Microsoft recommends. We harden the OS as well.

    TT will update you but thank you very much for posting and getting this issue solved.

    Kind Regards

    Leo

    Tuesday, June 13, 2017 3:40 PM
  • Update;

    Actions taken: no hosts had AV on them and we had removed Windows Defender. We reinstalled WD and added the exclusions on all hosts. All VMs are replicating successfully, migrating successfully between nodes and shutting down successfully.

     And Girls and Boys... that's the END

    Happy Virtualizing...... 

    Kind Regards

    Leo 

    Wednesday, June 14, 2017 12:27 PM
  • @ Sachin Kumar (Associate Consultant-Windows)

    Hi Sachin, I picked up another issue related to replication. This has to do with SCVMM 2016. Replicated VMs in SCVMM 2016 / 2012 R2 show a memory demand of 0.


    • Edited by LeoDuPreez Wednesday, June 14, 2017 2:31 PM
    Wednesday, June 14, 2017 2:25 PM
  • Months ago I disabled Windows Defender on all servers. I also created the exclusions, but it did not work. I had never tried uninstalling the Windows Defender feature, and it looks like it WORKS.

    For hours I have scheduled migrations every 2-3 minutes with the active replicas and there has been no failure.
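
    A repeated-migration test like the one described above could be sketched roughly as follows. This is only an illustration, not the script actually used in this thread: the VM name, pause and round count are hypothetical placeholders, and it assumes the clustered role has the same name as the VM and that the FailoverClusters and Hyper-V modules are available on the node where it runs.

    ```powershell
    # Sketch only: repeatedly live-migrate one clustered VM between nodes.
    # All values below are hypothetical placeholders, not taken from the thread.
    Import-Module FailoverClusters

    $vmName   = 'TestVM01'   # assumed: clustered role name equals the VM name
    $pauseSec = 120          # roughly one migration every 2 minutes
    $rounds   = 30           # number of migrations to attempt

    for ($i = 1; $i -le $rounds; $i++) {
        # Find the current owner and pick a different node that is up.
        $owner  = (Get-ClusterGroup -Name $vmName).OwnerNode.Name
        $target = Get-ClusterNode |
            Where-Object { $_.State -eq 'Up' -and $_.Name -ne $owner } |
            Get-Random

        Write-Host "Round ${i}: $owner -> $($target.Name)"
        Move-ClusterVirtualMachineRole -Name $vmName -Node $target.Name -MigrationType Live

        # Check the VM state on whichever node owns it after the move.
        $current = (Get-ClusterGroup -Name $vmName).OwnerNode.Name
        $state   = (Get-VM -Name $vmName -ComputerName $current).State
        if ($state -ne 'Running') {
            Write-Warning "VM is in state '$state' on $current, stopping the test."
            break
        }

        Start-Sleep -Seconds $pauseSec
    }
    ```

    If a VM really does hang in the stopping state, the move itself may block instead of returning, so a loop like this is best run against a test VM only.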

    It is a joy to see that it works.

    Thank you very much to all.

    MCSA: Windows Server 2008

    Monday, June 19, 2017 10:05 PM
  • Unfortunately I was wrong.

    After putting the servers into production with Windows Defender uninstalled and enabling the replica on 4 virtual machines, one of them ended up in the "stopping" state.

    If virtual machines do not have replication enabled, there are no problems. We are back where we started :-(

    MCSA: Windows Server 2008


    • Edited by Cimmerio Tuesday, June 20, 2017 3:05 PM
    Tuesday, June 20, 2017 3:04 PM
  • Is it random between the 4 vms that are stopping?
    Wednesday, June 21, 2017 4:28 AM
  • Hi Cimmerio

    The fix is to ENABLE Windows Defender and add the exclusions. Uninstalling or disabling it doesn't help. Server 2016 requires Windows Defender to be running for live migration to work with replication enabled.

    This will work

    Tuesday, June 27, 2017 9:07 AM
  • Yes, it is.

    MCSA: Windows Server 2008

    Thursday, July 6, 2017 8:50 AM
  • ok. I'll try it.

    MCSA: Windows Server 2008

    Thursday, July 6, 2017 8:51 AM
  • It's working fine :-)

    Thank you very much.


    MCSA: Windows Server 2008

    Monday, July 17, 2017 12:51 PM
  • Thank you everyone for the information provided here. It has been helpful to me as I'm experiencing this issue or something similar now: a VM ends up in the ‘stopping’ state when it should be live migrated. To get the VMs back online the host must be rebooted, but it won’t go down until I remotely force-stop VMMS.exe via the taskkill command.

    I'm currently running a 2-node S2D setup hosting 2 Server 2016 VMs with IIS services.

    For weeks, even months now, I've had replication enabled and have been proudly doing VM live migrations, planned failovers, planned failovers with reverse replication, and planned failbacks with replication. All good.

    At first this was all without virus protection; Defender's status is unknown, as I only just learned from this thread that it is a suspect. I then added another antivirus with the recommended exceptions and was able to do some migrations after the antivirus install. All working until it broke. I added the Storage Replica feature but didn’t enable Cluster Storage Replication as I had no storage available. I also added an SCVMM 2016 server to help manage my S2D servers and associated VMs. I'm not sure how SCVMM or Storage Replica could affect my S2D migrations, but it’s a change that was introduced so I felt the need to mention it.

    I just spent almost 4 hours on the phone with MS Support looking into what’s wrong. No idea. No resolution.

    We've removed all antivirus. Defender, which was listed in services but not running, has been removed from the nodes; I'm not sure whether it was ever running, even before the other antivirus was installed. After this thread I will pay more attention to Defender and its status. For now Defender has been removed from the Server Manager features and the other antivirus has been uninstalled.

    VM Replication was removed and replication broker role was stopped and removed. Bare bones -  no antivirus at all and no replication enabled. VM migration is now successful again.

    I’ve now successfully enabled VM Replication again, Cluster and VM’s are normal healthy status again.

    Live Migrations now fail with replication enabled (unless I’m missing something this same setup used to work). MS Support has transferred me to Network Team for netmon trace analysis to see what’s going on. Only other change I can think to mention is Windows Server 2016 updates we install as they are available.

    I’m strongly considering trying defender as TechTifa has mentioned here. There’s a lot that doesn’t make sense to me and this ‘solution from MS’ fits that description. Why would we need defender to run a live migration with replication enabled? Has an explanation been given?

    Thanks again for everyone's help here.

    Hello everyone

    I spoke to Microsoft support. On all the Hosts - They asked me to uninstall my Anti-virus, restart, enable windows defender and once that's enabled run the following in powershell (to add exclusions)

    Set-MpPreference -ExclusionPath c:\clusterstorage, %ProgramData%\Microsoft\Windows\Hyper-V, %ProgramFiles%\Hyper-V, %SystemDrive%\ProgramData\Microsoft\Windows\Hyper-V\Snapshots, "%Public%\Documents\Hyper-V\Virtual Hard Disks"

    Set-MpPreference -ExclusionProcess %systemroot%\System32\Vmwp.exe, %systemroot%\System32\Vmms.exe -Force

    Set-MpPreference -ExclusionExtension *.vhd, *.vhdx, *.avhd, *.avhdx, *.vsv, *.iso, *.rct, *.vmrs, *.vmcx

    I did this and initiated live migration which worked great between the nodes. I live migrated at least 10 times. Live migration with replication enabled is successful.

    It looks like live migration with windows defender enabled - works

    Can someone try the same thing and let me know how you get on? Can Windows Defender really be the culprit?
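
    For anyone repeating this, a quick way to confirm on each host that Defender is actually running and that the exclusions took effect might look like the sketch below. It is not part of the instructions from Microsoft support quoted above; it only reads the current state using the Defender cmdlets that ship with Windows Server 2016, and it assumes an elevated PowerShell session on the host.

    ```powershell
    # Sketch only: read-only checks, run elevated on each Hyper-V host.

    # Is the Defender service enabled and is real-time protection on?
    Get-MpComputerStatus |
        Select-Object AMServiceEnabled, AntivirusEnabled, RealTimeProtectionEnabled

    # List the exclusions currently configured on this host.
    $prefs = Get-MpPreference
    'Paths:';      $prefs.ExclusionPath
    'Processes:';  $prefs.ExclusionProcess
    'Extensions:'; $prefs.ExclusionExtension
    ```

    Note that Set-MpPreference replaces the existing exclusion list for the parameter it is given, while Add-MpPreference appends to it, so hosts that already carry other exclusions may be better served by the latter.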



    Wednesday, July 26, 2017 5:46 PM