Live migration gets slower the longer a cluster node runs
-
Thursday, 12 April 2012 12:58
2-node cluster, Windows Server 2008 R2 Enterprise with SP1 and all updates
48 GB memory, at most 45% used
Node with problem: Dell R710 with 2 Intel Xeon E5540 @ 2.53 GHz and 48 GB RAM
Node without problem: Dell R710 with 2 Intel Xeon E5506 @ 2.13 GHz and 48 GB RAM
When the node with the E5540 has been running for more than about one week, live migration gets very slow.
There is a dedicated 1 Gbit NIC for migration only.
At the moment I see these resources used (Resource Monitor):
svchost (termsvcs) uses about 4000 B/sec (sent)
clussvc uses 2000 sent and 800 received
The network overview shows constant spikes between 20 kbps and ~1300 kbps.
So it's only about 1 Mbps!!!
The whole server has nothing else to do: 0-100 KB/sec hard disk, practically no other network traffic, 0 memory hard faults/sec, total CPU 0%-12%.
All virtual machines reside on a SAN connected via FC.
When I reboot the server, live migration is super fast again.
I have no idea why it gets this slow over time :(
- Edited by Morgenstern72 Thursday, 12 April 2012 12:58
- Edited by Morgenstern72 Thursday, 12 April 2012 13:02
- Edited by Morgenstern72 Thursday, 12 April 2012 13:12
- Edited by Morgenstern72 Monday, 16 April 2012 08:35 clarifying problem
- Edited by Morgenstern72 Thursday, 26 April 2012 08:18 more details
All replies
-
Thursday, 12 April 2012 15:32
Just to give you numbers:
to transfer 9 servers with these amounts of memory
Server 2008 R2: 5120, 2560, 10240
Win XP: 1024, 4096, 768, 768, 768, 768, 768
it took one node about 2 hours to live migrate them to the second node before the reboot, and about 5 minutes to live migrate them back after the reboot.
Only one of the two nodes was rebooted!
- Edited by Morgenstern72 Thursday, 12 April 2012 15:32
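To put the 2 hours versus 5 minutes into network terms, here is a rough sketch of the average rate each run implies. It assumes the listed memory sizes are in MB (the post gives no unit), that each VM's memory is copied exactly once (live migration also re-sends pages that change during the copy, so real traffic is higher), and it uses the ten values as listed, although the post speaks of nine servers. The function name is made up for this illustration.

```python
# Memory sizes as listed in the post; the unit (MB, treated as MiB) is an assumption.
vm_memory_mb = [5120, 2560, 10240, 1024, 4096, 768, 768, 768, 768, 768]


def effective_mbps(total_mb: float, seconds: float) -> float:
    """Average rate in Mbit/s needed to copy total_mb (MiB) once in `seconds`."""
    bits = total_mb * 1024 * 1024 * 8
    return bits / seconds / 1_000_000


total_mb = sum(vm_memory_mb)

# About 2 hours before the reboot, about 5 minutes after it.
slow_mbps = effective_mbps(total_mb, 2 * 60 * 60)
fast_mbps = effective_mbps(total_mb, 5 * 60)

print(f"total memory: {total_mb} MB")
print(f"before reboot: {slow_mbps:.1f} Mbit/s")
print(f"after reboot:  {fast_mbps:.1f} Mbit/s")
print(f"slowdown factor: {fast_mbps / slow_mbps:.0f}x")
```

Under these assumptions the slow run averages roughly 31 Mbit/s and the fast run roughly 750 Mbit/s, so the slow node was using only a few percent of the dedicated 1 Gbit link.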
-
Friday, 13 April 2012 15:10
Try this patch:
http://support.microsoft.com/kb/2517329
The E5540 should be Nehalem/Westmere-EP, so this likely applies to you.
J
-
Friday, 13 April 2012 21:46
Hi John,
thank you for this tip.
The thing is: according to http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)#Westmere an E5540 is not a Westmere.
And: we experience none of the symptoms in the article
- The CPU usage is high -> no, never
- and the server responds slowly when you copy large files on the computer. For example, you copy a 10-GB file. -> no, it's blazing fast
- The disk I/O performance of the virtual machines (VMs) is slow. -> no, they are very fast
- Windows takes a long time to start. -> we experienced this about 2-3 times after many updates
Only live migration is slow.
Since MS advises using a hotfix only when you have the described situation, and we don't even have a Westmere (unless I am misinformed), it does not seem to be a good idea to use this hotfix?
-
Friday, 13 April 2012 22:09
E5540 is Gainestown, which is a Westmere (Nehalem-C) variant
http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)
Check the charts in link.
J
-
Friday, 13 April 2012 22:23
Maybe I am getting this all wrong, but I can't see this in your link or in this link: http://en.wikipedia.org/wiki/Xeon#5500-series_.22Gainestown.22
5500-series "Gainestown"
Gainestown
- Produced: from 2008 to present
- Max. CPU clock rate: 1866 MHz to 3333 MHz
- Min. feature size: 45 nm
- Instruction set: x86
- Microarchitecture: Nehalem
- CPUID code: 106Ax
- Product code: 80602
- Cores: 4
- L2 cache: 4x256 KB
- L3 cache: 8 MB
- Application: DP Server
- Package(s):
- Brand name(s): Xeon 55xx
Gainestown or Nehalem-EP, the successor to the Xeon Core microarchitecture, is based on the Nehalem microarchitecture and uses the same 45 nm manufacturing methods as Intel's Penryn. The first processor released with the Nehalem microarchitecture is the desktop Intel Core i7, which was released in November 2008. Server processors of the Xeon 55xx range were first supplied to testers in December 2008. [21]
3600/5600-series "Gulftown"
Gulftown or Westmere-EP, a six-core 32 nm Westmere-based processor, is the basis for the Xeon 36xx and 56xx series and the Core i7-980X. It launched in the first quarter of 2010. The 36xx-series follows the 35xx-series Bloomfield uni-processor model while the 56xx-series follows the 55xx-series Gainestown dual-processor model and both are socket compatible to their predecessors.
Intel tells me that my CPU is a 45 nm part (http://ark.intel.com/products/37104/Intel-Xeon-Processor-E5540-(8M-Cache-2_53-GHz-5_86-GTs-Intel-QPI)), and according to your own link, "Westmere (formerly Nehalem-C) is the name given to the 32 nm die shrink of Nehalem."
Now I am really confused: why should my CPU be a Westmere?
- Edited by Morgenstern72 Friday, 13 April 2012 22:23
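The series-to-codename reasoning from the quoted Wikipedia text can be written down as a small lookup. This is only a sketch: the table contains just the two dual-processor series named in the quote, and the function name and the model-number pattern are made up for this illustration, not taken from any Intel or Microsoft tool.

```python
import re

# Mapping taken from the Wikipedia text quoted in this thread:
# Xeon 55xx = Gainestown (Nehalem-EP, 45 nm), Xeon 56xx = Gulftown (Westmere-EP, 32 nm).
XEON_SERIES = {
    "55": ("Gainestown", "Nehalem-EP", 45),
    "56": ("Gulftown", "Westmere-EP", 32),
}


def classify_xeon(model: str):
    """Return (codename, family, process_nm) for a 55xx/56xx model, else None."""
    # Hypothetical pattern: optional letter prefix, then a four-digit model number.
    match = re.fullmatch(r"[A-Z]?(\d{2})\d{2}", model.strip().upper())
    if match is None:
        return None
    return XEON_SERIES.get(match.group(1))


for model in ("E5540", "E5506", "X5650"):
    print(model, classify_xeon(model))
```

By this reading both nodes in the cluster (E5540 and E5506) are 45 nm Gainestown parts, and only 56xx models would be Westmere-EP.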
-
Saturday, 14 April 2012 01:45
Westmere is a die shrink of Nehalem-C
-
Saturday, 14 April 2012 02:56
Figuring out which features are on which processor in the Nehalem family (of which Westmere is a subset) can be very confusing. What isn't confusing is the fact that I built a Fast Track submission last year, and part of the blade selection process involved Cisco B200 M1 and M2 blades. The M1 is based on 5500-series Intel processors, and the M2 on the 5600 series. We had this specific problem with both. This is the link to the NetApp and Cisco Fast Track white paper: http://media.netapp.com/documents/wp-7132.pdf
- Edited by John Fullbright Saturday, 14 April 2012 02:56
-
Saturday, 14 April 2012 14:16
Hi,
Can you tell us which NICs are being used on the servers?
Sanket. J
Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread.
-
Monday, 16 April 2012 07:19
Node 1 (main problem): both nodes are Dell R710 with 1 Intel Gigabit ET Quad Port onboard and one BCM5709C NetXtreme II PCIe card.
Nothing is teamed. Live migration has its own NIC port (the two nodes are directly connected to each other).
-
Thursday, 19 April 2012 11:11
After installing the fix (and rebooting the server node), live migration was (as always) very fast.
I waited a few days and tried again. Just as slow as before :(
So the hotfix did nothing for us.
-
Thursday, 26 April 2012 08:16
Does anyone have any other ideas?
-
Thursday, 26 April 2012 09:27
No, it's even worse:
'Virtual Machine cemmsrv119 Ex = no online Snapshot' live migration did not succeed at the source.
Live migration of a server with 10 GB RAM failed:
Live migration did not succeed: The operation timed out.
ID 21502
Source: Hyper-V-High-Availability
-
Friday, 27 April 2012 14:16
Had a consultant here that pointed to this post: http://social.technet.microsoft.com/Forums/en-AU/winserverhyperv/thread/a6063ff0-38b9-46ae-8e98-6d017c0c0e75
I did these things:
- Installed the Westmere hotfix on the second node too
- Windows Power Options -> High Performance http://support.microsoft.com/kb/2207548/en-us
BIOS
- Both BIOS updated
- C-states and C1E disabled
- Power Management set to "OS Controlled"
NICs
- All drivers updated
- Enabled "virtual machine queues" on all NICs with VM activity (does TCP offload from VM to host)
- Jumbo frames set to 9000 on the CSV network (live migration)
- All NICs: Flow Control & 8 Receive Scaling queues, Power Management disabled
Now I get up to 965 Mbit/s on a 1 Gbit NIC while live migrating.
Let's see if it lasts.
-
Monday, 7 May 2012 13:37
It did not last, but it's still better than before:
An 8 GB RAM server now transfers in about 8 min at ~200 Mbps.
This is still better than what I started with, but only 1/5 of what the tests showed directly after restarting the servers.
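As a quick sanity check on these figures, the sketch below relates memory size, transfer time and link rate. It assumes 8 GB means 8 GiB and that memory is copied exactly once (live migration also re-sends pages that change during the copy); the function names are made up for this illustration.

```python
def average_mbps(memory_gib: float, seconds: float) -> float:
    """Average Mbit/s implied by copying memory_gib once in `seconds`."""
    return memory_gib * 1024**3 * 8 / seconds / 1_000_000


def transfer_seconds(memory_gib: float, rate_mbps: float) -> float:
    """Time in seconds to copy memory_gib once at a steady rate_mbps."""
    return memory_gib * 1024**3 * 8 / (rate_mbps * 1_000_000)


observed_mbps = average_mbps(8, 8 * 60)      # 8 GB in about 8 minutes
seconds_at_200 = transfer_seconds(8, 200)    # at the reported ~200 Mbps
seconds_at_965 = transfer_seconds(8, 965)    # at the 965 Mbit/s seen after the tuning
ratio = 200 / 965                            # degraded rate relative to the peak

print(f"average rate for 8 GB in 8 min: {observed_mbps:.0f} Mbit/s")
print(f"time at 200 Mbit/s: {seconds_at_200 / 60:.1f} min")
print(f"time at 965 Mbit/s: {seconds_at_965 / 60:.1f} min")
print(f"200 / 965 = {ratio:.2f}")
```

The ratio of about 0.2 matches the "1/5" in the post. A single pass over 8 GB in 8 minutes averages somewhat below 200 Mbps, which would be consistent with either re-sent dirty pages or 200 Mbps being a peak rather than an average; the post does not say which.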
-
Monday, 7 May 2012 19:14

