Live migration gets slower the longer a cluster node runs
-
Thursday, April 12, 2012 12:58 PM
2-node cluster, Windows Server 2008 R2 Enterprise with SP1 and all updates
48GB memory, maximum 45% used
Node with problem: DELL R710 with 2 Intel Xeon E5540@2.53GHz and 48GB RAM
Node without problem: DELL R710 with 2 Intel Xeon E5506@2.13GHz and 48GB RAM
When the node with the E5540 has been running for longer than about one week, live migration gets very slow.
There is a dedicated 1 Gbit NIC for migration only.
At the moment I see these resources used (resource monitor):
svchost (termsvcs) uses about 4000 b/sec (sent)
clussvc uses 2000 sent and 800 received
The network overview shows constant spikes between 20kbps and ~1300kbps
So it's only about 1 Mbps!!!
The whole server is nearly idle: 0-100kb/sec hard disk, almost no other network traffic, 0 memory hard faults/sec, total CPU 0%-12%
All virtual machines reside on a SAN connected with FC.
When I reboot the server, live migration goes super fast again.
I have no idea why it gets this slow over time :(
- Edited by Morgenstern72 Thursday, April 26, 2012 8:18 AM more details
All Replies
-
Thursday, April 12, 2012 3:32 PM
Just to give you numbers:
to transfer 9 servers with these amounts of memory (MB)
Server 2008 R2: 5120, 2560, 10240
Win XP: 1024, 4096, 768, 768, 768, 768, 768
it took one node about 2 hours to live migrate them to the second node before the reboot and about 5 minutes to live migrate them back after the reboot.
Only one of the two nodes was rebooted!
- Edited by Morgenstern72 Thursday, April 12, 2012 3:32 PM
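To put the 2 hours versus 5 minutes into network terms, here is a rough sketch that converts the listed memory sizes into an average transfer rate. It assumes the values are in MB, uses all ten values exactly as listed, and assumes each VM's memory crosses the wire exactly once. Live migration can re-send pages that change during the copy, so the real traffic may differ. The function name `effective_mbps` is made up for this illustration.

```python
# Memory sizes in MB as listed in the post (Server 2008 R2 and Win XP guests).
vm_memory_mb = [5120, 2560, 10240, 1024, 4096, 768, 768, 768, 768, 768]


def effective_mbps(total_mb, seconds):
    """Average rate in megabits per second if total_mb (MiB) is sent once."""
    megabits = total_mb * 1024 * 1024 * 8 / 1_000_000
    return megabits / seconds


total_mb = sum(vm_memory_mb)
before_reboot = effective_mbps(total_mb, 2 * 60 * 60)  # about 2 hours
after_reboot = effective_mbps(total_mb, 5 * 60)        # about 5 minutes

print(f"total memory: {total_mb} MB")
print(f"before reboot: {before_reboot:.0f} Mbps")
print(f"after reboot:  {after_reboot:.0f} Mbps")
print(f"slowdown factor: {after_reboot / before_reboot:.0f}x")
```

Under these assumptions the slow run averages roughly 30 Mbps and the fast run roughly 750 Mbps, which is close to what a healthy 1 Gbit link can deliver.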
-
Friday, April 13, 2012 3:10 PM
Try this patch:
http://support.microsoft.com/kb/2517329
E5540 should be nehalem/westmere-ep so this likely applies to you.
J
-
Friday, April 13, 2012 9:46 PM
Hi John,
thank you for this tip.
The thing is: according to http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)#Westmere an E5540 is not a Westmere.
And we experience none of the symptoms in the article:
- The CPU usage is high -> no, never
- and the server responds slowly when you copy large files on the computer. For example, you copy a 10-GB file. -> no, it's blazing fast
- The disk I/O performance of the virtual machines (VMs) is slow. -> no, they are very fast
- Windows takes a long time to start. -> we experienced this about 2-3 times after many updates
Only live migration is slow.
Since MS advises you to only use a hotfix when you have the described situation, and we don't even have a Westmere (or I am misinformed), it does not seem to be a good idea to use this hotfix?
-
Friday, April 13, 2012 10:09 PM
E5540 is Gainestown, which is a Westmere (nehalem-c) variant
http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)
Check the charts in link.
J
-
Friday, April 13, 2012 10:23 PM
Maybe I am getting this all wrong, but I can't see this in your link or in this link: http://en.wikipedia.org/wiki/Xeon#5500-series_.22Gainestown.22
5500-series "Gainestown"
Gainestown: produced from 2008 to present; max. CPU clock rate 1866 MHz to 3333 MHz; min. feature size 45 nm; instruction set x86; microarchitecture Nehalem; CPUID code 106Ax; product code 80602; cores 4; L2 cache 4x256 KB; L3 cache 8 MB; application DP Server; brand name Xeon 55xx
Gainestown or Nehalem-EP, the successor to the Xeon Core microarchitecture, is based on the Nehalem microarchitecture and uses the same 45 nm manufacturing methods as Intel's Penryn. The first processor released with the Nehalem microarchitecture is the desktop Intel Core i7, which was released in November 2008. Server processors of the Xeon 55xx range were first supplied to testers in December 2008. [21]
3600/5600-series "Gulftown"
Gulftown or Westmere-EP, a six-core 32 nm Westmere-based processor, is the basis for the Xeon 36xx and 56xx series and the Core i7-980X. It launched in the first quarter of 2010. The 36xx-series follows the 35xx-series Bloomfield uni-processor model while the 56xx-series follows the 55xx-series Gainestown dual-processor model and both are socket compatible to their predecessors.
Intel tells me that my CPU is 45 nm (http://ark.intel.com/products/37104/Intel-Xeon-Processor-E5540-(8M-Cache-2_53-GHz-5_86-GTs-Intel-QPI)), and according to your own link, "Westmere (formerly Nehalem-C) is the name given to the 32 nm die shrink of Nehalem."
Now I am really confused about why my CPU should be a Westmere.
- Edited by Morgenstern72 Friday, April 13, 2012 10:23 PM
-
Saturday, April 14, 2012 1:45 AM
Westmere is a die shrink of Nehalem-C
-
Saturday, April 14, 2012 2:56 AM
Figuring out which features are on which processor in the Nehalem family (of which Westmere is a subset) can be very confusing. What isn't confusing is the fact that I built a Fast Track submission last year, and part of the blade selection process involved Cisco B200 M1 and M2 blades. The M1 is based on 5500 series Intel processors, and the M2 on the 5600 series. We had this specific problem with both. This is the link to the NetApp and Cisco Fast Track White Paper: http://media.netapp.com/documents/wp-7132.pdf
- Edited by John Fullbright Saturday, April 14, 2012 2:56 AM
-
Saturday, April 14, 2012 2:16 PM
Hi,
Can you tell us which NICs are being used on the servers?
Sanket. J
Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread.
-
Monday, April 16, 2012 7:19 AM
Node 1 (main problem): both nodes are DELL R710 with 1 Intel Gigabit ET Quad Port onboard and one BCM5709C NetXtreme II PCIe card.
Nothing is teamed. Live migration has its own NIC port (the two nodes are directly connected to each other).
-
Thursday, April 19, 2012 11:11 AM
After installing the fix (and rebooting the server node) live migration was (like always) very fast.
I waited some days and tried again. Just as slow as before :(
So the hotfix did nothing for us.
-
Thursday, April 26, 2012 8:16 AM
Does anyone have any other ideas?
-
Thursday, April 26, 2012 9:27 AM
No, it's even worse:
'Virtual Machine cemmsrv119 Ex = no online Snapshot' live migration did not succeed at the source.
live migration of a server with 10GB RAM failed
Live migration did not succeed: The operation timed out.
ID 21502
Source: Hyper-V-High-Availability
-
Friday, April 27, 2012 2:16 PM
Had a consultant here that pointed to this post: http://social.technet.microsoft.com/Forums/en-AU/winserverhyperv/thread/a6063ff0-38b9-46ae-8e98-6d017c0c0e75
I have done these things:
- Installed the Westmere hotfix on the second node too
- Windows Power Options -> High Performance http://support.microsoft.com/kb/2207548/en-us
BIOS:
- Both BIOS updated
- C states and C1E disabled
- Power Management to "OS Controlled"
NICs:
- all drivers updated
- enabled "virtual machine queues" on all NICs with VM activity (does TCP offload from VM to host)
- Jumbo Frame to 9000 on the CSV network (live migration)
- all NICs: Flow Control & 8 Receive scaling queues, Power Management disabled
Now I have up to 965 Mbit/s on a 1 Gbit NIC while live migrating.
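One way to check that the jumbo frame setting works end to end is an unfragmented ping between the two nodes. The sketch below is only an illustration. It assumes an MTU of 9000 bytes for the IP packet and uses the Windows `ping` options `-f` (don't fragment), `-l` (payload size) and `-n` (count). The helper names and the address 192.168.100.2 are made-up placeholders. Depending on the NIC driver, the configured jumbo size may include the Ethernet header, so a slightly smaller payload may be needed.

```python
import subprocess

IP_HEADER_BYTES = 20
ICMP_HEADER_BYTES = 8


def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits into one IPv4 packet of size mtu."""
    return mtu - IP_HEADER_BYTES - ICMP_HEADER_BYTES


def jumbo_ping_command(peer_ip, mtu=9000, count=2):
    """Build a Windows ping command that forbids fragmentation (-f)."""
    payload = str(max_ping_payload(mtu))
    return ["ping", "-f", "-l", payload, "-n", str(count), peer_ip]


def jumbo_frames_work(peer_ip, mtu=9000):
    """Run the ping and report whether any echo reply came back.

    Successful replies contain "TTL="; a "needs to be fragmented" error
    does not, so the text check is used instead of the exit code.
    """
    result = subprocess.run(
        jumbo_ping_command(peer_ip, mtu), capture_output=True, text=True
    )
    return "TTL=" in result.stdout


if __name__ == "__main__":
    # Placeholder for the other node's live migration address (assumption).
    print(" ".join(jumbo_ping_command("192.168.100.2")))
```

Calling `jumbo_frames_work` with the real address of the other node's migration NIC runs the actual test. If it returns False while a normal ping succeeds, some part of the path is not passing 9000-byte packets.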
Let's see if it lasts.
-
Monday, May 07, 2012 1:37 PM
It did not last, but it's still better than before:
An 8GB RAM server now transfers in about 8 min at ~200Mbps.
This is still better than what I started with, but only about 1/5 of the speed seen in the tests directly after restarting the servers.
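As a sanity check on these figures, here is a small sketch that compares the ~200 Mbps rate with the 965 Mbit/s peak reported earlier and estimates the minimum time to move 8 GB at each rate. It assumes the memory is sent exactly once, which is a lower bound because live migration can re-send pages that change during the copy. The function name is made up for this illustration.

```python
def min_transfer_seconds(memory_gb, rate_mbps):
    """Lower bound on transfer time if memory_gb (GiB) is sent exactly once."""
    megabits = memory_gb * 1024**3 * 8 / 1_000_000
    return megabits / rate_mbps


observed_rate = 200  # Mbps, from this post
fresh_rate = 965     # Mbps, peak rate reported after the tuning changes

print(f"rate ratio: {observed_rate / fresh_rate:.2f}")
for rate in (observed_rate, fresh_rate):
    minutes = min_transfer_seconds(8, rate) / 60
    print(f"8 GB at {rate} Mbps: at least {minutes:.1f} min")
```

The ratio comes out at about 0.21, which matches the "1/5" estimate. The lower bound at 200 Mbps is just under 6 minutes, so the observed 8 minutes suggests either re-sent pages or an average rate somewhat below 200 Mbps.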
-
Monday, May 07, 2012 7:14 PM

