I have a VMM 2012 SP1 environment consisting of a 2-node management server cluster and a 2-node SQL Server 2008 R2 SP2 cluster. The VMM servers run Windows Server 2012 and the database cluster runs Windows Server 2008 R2 SP1. I had previously configured this environment on our test network and was able to add and manage Citrix XenServer hosts without any issues; however, I am now having problems after setting up the same environment on our production network. A previous post of mine, which resolved some issues on the test environment, can be found here:
http://social.technet.microsoft.com/Forums/en-US/749522f6-4b1e-42c2-9109-a61e1126e147/error-2916-adding-xenserver-host-to-vmm-2012-sp1-on-windows-server-2012
I have gone through all the checks mentioned in that post (the main ones being regarding certificates, and that part of the process is now working fine), but I am now experiencing the following problem:
When I add any of our Citrix XenServer 6 pools to the VMM fabric, the 'Add Virtual Machine Host' jobs shown in the VMM console freeze at 66% complete, with the 'Refresh host' task stuck on 0%. A 'Refresh Host Cluster' job also appears, and that also freezes at 0%. Eventually these jobs fail, the VMM service stops, the VMM console crashes, and the VMM failover cluster role either restarts the service or - if the failover threshold for the day has been reached - fails over to the other node. The hosts appear in the VMM fabric but cannot be used and have the status 'needs attention'.
The registry key mentioned in the previous post linked to above has been added to these VMM servers, the Citrix XenServer hosts all have the integration pack installed, and their hostnames were all set to FQDNs at install time; the certificates reflect this and contain the correct FQDN. The hosts have entries in DNS in both forward and reverse lookup zones (using nslookup to check these provides positive results), and all the other checks in my previous post have been carried out. Clearly there is a communication problem between VMM and the XenServer hosts, but I do not know where. Checking the VMM logs on the VMM servers, I have found two events that correspond to what has happened. Here are the details (I haven't included the binary line in the first error's XML data as it is VERY long):

<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Virtual Machine Manager" />
    <EventID Qualifiers="0">1</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2013-06-20T09:39:28.000000000Z" />
    <EventRecordID>14625</EventRecordID>
    <Channel>VM Manager</Channel>
    <Computer>HOSTNAME FQDN HERE</Computer>
    <Security />
  </System>
  <EventData>
    <Data>System.NullReferenceException: Object reference not set to an instance of an object.
   at Microsoft.Carmine.XenImplementation.XenVirtualNetworkSwitch.get_DeviceID()
   at Microsoft.VirtualManager.Engine.Adhc.HostRefresher.<>c__DisplayClass29.<GatherVirtualNetworkData>b__27(IVirtualNetworkSwitch sw)
   at System.Collections.Generic.List`1.ConvertAll[TOutput](Converter`2 converter)
   at Microsoft.VirtualManager.Engine.Adhc.HostRefresher.GatherVirtualNetworkData(Host host, ITaskContext taskContext)
   at Microsoft.VirtualManager.Engine.Adhc.HostRefresher.GatherVirtualNetworkInformation(Host host, List`1 networkAdapters, List`1 vNics, ITaskContext taskContext)
   at Microsoft.VirtualManager.Engine.Adhc.HostRefresher.GatherAllInformation(Host host, Object agentRefreshSyncObj, Boolean checkIfClustered, Boolean refreshEventCapabilities, String& clusterName, Guid taskID, ITaskContext taskContext)
   at Microsoft.VirtualManager.Engine.Adhc.HostRefresher.RefreshLockedHost(Host host, Guid taskID, ITaskContext taskContext, Boolean checkClusterStatus)
   at Microsoft.VirtualManager.Engine.Adhc.RefreshVmHostSubtask.RunSubtask()
   at Microsoft.VirtualManager.Engine.TaskRepository.SubtaskBase.Run()
   at Microsoft.VirtualManager.Engine.Adhc.AddHostSubtask.RunSubtask()
   at Microsoft.VirtualManager.Engine.TaskRepository.SubtaskBase.Run()
   at Microsoft.VirtualManager.Engine.TaskRepository.Task`1.SubtaskRun(Object state)-2147467261</Data>
    <Binary> DATA HERE </Binary>
  </EventData>
</Event>

<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Virtual Machine Manager" />
    <EventID Qualifiers="0">19999</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2013-06-20T09:39:28.000000000Z" />
    <EventRecordID>14626</EventRecordID>
    <Channel>VM Manager</Channel>
    <Computer>HOSTNAME FQDN HERE</Computer>
    <Security />
  </System>
  <EventData>
    <Data>Virtual Machine Manager (vmmservice:1612) has encountered an error and needed to exit the process. Windows generated an error report with the following parameters: Event:VMM20 P1(appName):vmmservice P2(appVersion):3.1.6020.0 P3(assemblyName):XenImplementation P4(assemblyVer):3.1.6018.0 P5(methodName):M.C.X.XenVirtualNetworkSwitch.get_DeviceID P6(exceptionType):System.NullReferenceException P7(callstackHash):7aae .</Data>
    <Data>1612</Data>
    <Data>VMM20</Data>
    <Data>vmmservice</Data>
    <Data>3.1.6020.0</Data>
    <Data>XenImplementation</Data>
    <Data>3.1.6018.0</Data>
    <Data>M.C.X.XenVirtualNetworkSwitch.get_DeviceID</Data>
    <Data>System.NullReferenceException</Data>
    <Data>7aae</Data>
  </EventData>
</Event>
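For anyone who wants to pull the crash parameters out of the second event (ID 19999) rather than read them by eye, here is a minimal Python sketch. It assumes the event has been saved as XML text; the embedded sample is a trimmed copy of the event above with the long summary text shortened. The field labels in `FIELDS` are my own assumption, inferred from the P1..P7 names in the event's summary line, and `parse_crash_event` is a hypothetical helper name, not part of any VMM tooling.

```python
import xml.etree.ElementTree as ET

# Namespace used by Windows event XML, as shown in the events above.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copy of event 19999 from this post (summary text shortened).
EVENT_XML = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Virtual Machine Manager" />
    <EventID Qualifiers="0">19999</EventID>
    <Level>2</Level>
  </System>
  <EventData>
    <Data>summary text shortened</Data>
    <Data>1612</Data>
    <Data>VMM20</Data>
    <Data>vmmservice</Data>
    <Data>3.1.6020.0</Data>
    <Data>XenImplementation</Data>
    <Data>3.1.6018.0</Data>
    <Data>M.C.X.XenVirtualNetworkSwitch.get_DeviceID</Data>
    <Data>System.NullReferenceException</Data>
    <Data>7aae</Data>
  </EventData>
</Event>"""

# Assumed labels for the positional <Data> values (inferred from P1..P7).
FIELDS = [
    "summary", "pid", "event", "appName", "appVersion",
    "assemblyName", "assemblyVer", "methodName",
    "exceptionType", "callstackHash",
]


def parse_crash_event(xml_text):
    """Return (event_id, {label: value}) for one event XML document."""
    root = ET.fromstring(xml_text)
    event_id = int(root.findtext("e:System/e:EventID", namespaces=NS))
    values = [d.text or "" for d in root.findall("e:EventData/e:Data", NS)]
    return event_id, dict(zip(FIELDS, values))


if __name__ == "__main__":
    event_id, info = parse_crash_event(EVENT_XML)
    print(event_id)
    for label in ("appVersion", "assemblyName", "assemblyVer", "methodName"):
        print(label, "=", info[label])
```

Printing `appVersion` next to `assemblyVer` makes it easy to compare the vmmservice and XenImplementation versions reported by the event across both VMM nodes.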
These errors point to a problem retrieving the virtual network switch information, so, based on another post that I found describing a similar issue, I have tried disabling the option in VMM to automatically create logical networks and virtual switches when adding new clusters, but I get the same result. It should be noted that we have bonds set up between pairs of NICs on the XenServer hosts, which are running XenServer 6.0.201.
I have tried updating the VMM components to Update Rollup 2, but this has also not resolved the issue. Some Hyper-V hosts that I have running Windows Server 2012 can be added without any issues.
I would greatly appreciate any help that anyone can give me with this issue.
UPDATE: As an additional note, I have just tried adding another XenServer pool which does NOT use bonds on the NICs at all and is running on completely different hardware, and it does not get this issue. It therefore looks likely that the bonds/VLANs, or something else to do with the network hardware used by the original pool, is causing the problem.
I have sufficient capacity on the pool to be able to remove a host from the pool, reconfigure the networking, and then try again, so I will do that and report back.
UPDATE2: I have tried removing a host from the XenServer pool that I am having issues with, and can add that individual host without any problems. I then set up the same bonds and VLAN settings that it has when it is part of the pool, and it can STILL be added without any problems, so I'm not quite sure what the issue is. I have tried changing the MTU settings to ensure they match and have changed the 'automatically add to new virtual machines' setting, and yet when I try to add the pool using any of the hosts it still has the previous problem.
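Since the standalone host works and the pool does not, one way to look for the difference is to compare the network records of the two side by side. The following is only a sketch under stated assumptions: it works on plain dictionaries shaped like XenAPI `network` records (with `name_label`, `bridge`, `MTU` and `other_config` fields), the sample data is entirely made up for illustration, and `diff_networks` is a hypothetical helper. I am also assuming that the 'automatically add to new virtual machines' setting is stored under `other_config["automatic"]`.

```python
# Fields compared for each network; names are assumed to follow the
# XenAPI "network" class.
COMPARED_FIELDS = ("bridge", "MTU", "other_config")


def index_by_name(records):
    """Map name_label -> record for a list of network records (dicts)."""
    return {rec["name_label"]: rec for rec in records}


def diff_networks(pool_records, standalone_records):
    """Return a list of (network, what, pool_value, standalone_value)."""
    pool = index_by_name(pool_records)
    alone = index_by_name(standalone_records)
    report = []
    for name in sorted(set(pool) | set(alone)):
        if name not in alone:
            report.append((name, "only in pool", None, None))
        elif name not in pool:
            report.append((name, "only on standalone host", None, None))
        else:
            for field in COMPARED_FIELDS:
                a, b = pool[name].get(field), alone[name].get(field)
                if a != b:
                    report.append((name, field, a, b))
    return report


# Hypothetical sample data, NOT taken from the real pool.
POOL = [
    {"name_label": "Bond 0+1", "bridge": "xapi1", "MTU": "9000",
     "other_config": {"automatic": "false"}},
    {"name_label": "VLAN 20", "bridge": "xapi2", "MTU": "1500",
     "other_config": {"automatic": "false"}},
    {"name_label": "Internal only", "bridge": "xapi3", "MTU": "1500",
     "other_config": {}},
]
STANDALONE = [
    {"name_label": "Bond 0+1", "bridge": "xapi1", "MTU": "1500",
     "other_config": {"automatic": "false"}},
    {"name_label": "VLAN 20", "bridge": "xapi2", "MTU": "1500",
     "other_config": {"automatic": "false"}},
]

if __name__ == "__main__":
    for row in diff_networks(POOL, STANDALONE):
        print(row)
```

Any network that exists only in the pool, or whose settings differ from the standalone host, would be a candidate to look at first, though that is a guess on my part rather than a confirmed cause.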
I may try changing the pool master in case there is a problem with that machine, but if anyone has any other ideas, please let me know!