Live migration gets slower the longer a cluster node runs

  • Question

  • 2-node cluster running Windows Server 2008 R2 Enterprise with SP1 and all updates
    48 GB memory, maximum 45% used
    Node with problem: Dell R710 with 2 Intel Xeon E5540 @ 2.53 GHz and 48 GB RAM
    Node without problem: Dell R710 with 2 Intel Xeon E5506 @ 2.13 GHz and 48 GB RAM

    When the node with the E5540 has been running for longer than about one week, live migration gets very slow.
    There is a dedicated 1 Gbit NIC for migration only.

    At the moment I see these resources used (Resource Monitor):
    svchost (termsvcs) sends about 4000 B/sec
    clussvc uses 2000 sent and 800 received

    The network overview shows constant spikes between 20 kbps and ~1300 kbps.

    So it's only about 1 Mbps!

    The server as a whole has almost nothing to do: 0-100 KB/sec hard disk, virtually no other network traffic, 0 memory hard faults/sec, total CPU 0%-12%.
    All virtual machines reside on a SAN connected via FC.

    When I reboot the server, live migration is super fast again.

    I have no idea why it gets this slow over time :(






    • Edited by Joaquin72 Thursday, April 26, 2012 8:18 AM more details
    Thursday, April 12, 2012 12:58 PM

All replies

  • Just to give you numbers:

    To transfer 9 servers with these amounts of memory (in MB)

    Server 2008 R2: 5120, 2560, 10240
    Win XP: 1024, 4096, 768, 768, 768, 768, 768

    it took one node about 2 hours to live migrate them to the second node before the reboot, and about 5 minutes to live migrate them back after the reboot.

    Only one of the two nodes was rebooted!
    • Edited by Joaquin72 Thursday, April 12, 2012 3:32 PM
    Thursday, April 12, 2012 3:32 PM
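
To put the numbers above in perspective, here is a rough sketch of the effective throughput they imply. It assumes the memory sizes are in MB (1 MB = 1024 * 1024 bytes) and that each VM's memory crosses the wire exactly once; live migration re-sends pages that change during the copy, so the real traffic is somewhat higher. The post says 9 servers but lists 10 values; the sketch uses the values as listed. The function name is made up for illustration.

```python
# Memory sizes in MB, exactly as listed in the post (assumed to be MB).
VM_MEMORY_MB = [5120, 2560, 10240, 1024, 4096, 768, 768, 768, 768, 768]


def effective_mbit_per_s(total_mb, seconds):
    """Average rate in Mbit/s if total_mb of memory is copied once in `seconds`."""
    bits = total_mb * 1024 * 1024 * 8
    return bits / seconds / 1_000_000


total_mb = sum(VM_MEMORY_MB)
before_reboot = effective_mbit_per_s(total_mb, 2 * 60 * 60)  # about 2 hours
after_reboot = effective_mbit_per_s(total_mb, 5 * 60)        # about 5 minutes

print(f"total memory:  {total_mb} MB")
print(f"before reboot: {before_reboot:.1f} Mbit/s")
print(f"after reboot:  {after_reboot:.1f} Mbit/s")
```

Under these assumptions the slow run averages only a few percent of what a 1 Gbit link can carry, while the run after the reboot uses most of it.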
  • Try this patch:

    http://support.microsoft.com/kb/2517329

    The E5540 should be Nehalem/Westmere-EP, so this likely applies to you.

    J

    Friday, April 13, 2012 3:10 PM
  • Hi John,

    Thank you for this tip.

    The thing is: according to http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)#Westmere an E5540 is not a Westmere.
    And we experience none of the symptoms in the article:

    • The CPU usage is high -> no, never
    • and the server responds slowly when you copy large files on the computer. For example, you copy a 10-GB file. -> no, it's blazing fast
    • The disk I/O performance of the virtual machines (VMs) is slow. -> no, they are very fast
    • Windows takes a long time to start. -> we experienced this about 2-3 times after many updates

    Only live migration is slow.

    Since MS advises using a hotfix only when you have the described situation, and we don't even have a Westmere (unless I am misinformed), it does not seem to be a good idea to apply this hotfix?

    Friday, April 13, 2012 9:46 PM
  • E5540 is Gainestown, which is a Westmere (Nehalem-C) variant.

    http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)

    Check the charts in the link.

    J

    Friday, April 13, 2012 10:09 PM
  • Maybe I am getting this all wrong, but I can't see this in your link or in this link: http://en.wikipedia.org/wiki/Xeon#5500-series_.22Gainestown.22


    5500-series "Gainestown"

    Produced: from 2008 to present
    Max. CPU clock rate: 1866 MHz to 3333 MHz
    Min. feature size: 45 nm
    Instruction set: x86
    Microarchitecture: Nehalem
    CPUID code: 106Ax
    Product code: 80602
    Cores: 4
    L2 cache: 4x256 KB
    L3 cache: 8 MB
    Application: DP Server
    Brand name(s): Xeon 55xx

    Gainestown or Nehalem-EP, the successor to the Xeon Core microarchitecture, is based on the Nehalem microarchitecture and uses the same 45 nm manufacturing methods as Intel's Penryn. The first processor released with the Nehalem microarchitecture is the desktop Intel Core i7, which was released in November 2008. Server processors of the Xeon 55xx range were first supplied to testers in December 2008.[21]


    3600/5600-series "Gulftown"

    Gulftown or Westmere-EP, a six-core 32 nm Westmere-based processor, is the basis for the Xeon 36xx and 56xx series and the Core i7-980X. It launched in the first quarter of 2010. The 36xx-series follows the 35xx-series Bloomfield uni-processor model while the 56xx-series follows the 55xx-series Gainestown dual-processor model and both are socket compatible to their predecessors.

    Intel tells me that my CPU is 45 nm (http://ark.intel.com/products/37104/Intel-Xeon-Processor-E5540-(8M-Cache-2_53-GHz-5_86-GTs-Intel-QPI)), and according to your own link, "Westmere (formerly Nehalem-C) is the name given to the 32 nm die shrink of Nehalem."

    Now I am really confused why my CPU should be a Westmere?


    • Edited by Joaquin72 Friday, April 13, 2012 10:23 PM
    Friday, April 13, 2012 10:23 PM
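
The series-to-codename mapping from the quoted Wikipedia text can be written down as a small lookup. This is only a sketch of what the quotes above say (55xx = Gainestown/Nehalem-EP at 45 nm, 36xx and 56xx = Gulftown/Westmere-EP at 32 nm); the function name and the model-number pattern are assumptions for illustration, and other Xeon series are deliberately left out.

```python
import re

# Series prefix -> (codename, feature size in nm), taken from the quoted text only.
SERIES = {
    "55": ("Gainestown (Nehalem-EP)", 45),
    "56": ("Gulftown (Westmere-EP)", 32),
    "36": ("Gulftown (Westmere-EP)", 32),
}


def classify_xeon(model):
    """Return (codename, nm) for a model such as 'E5540', or None if unknown."""
    match = re.fullmatch(r"[A-Z]?(\d{2})\d{2}", model.strip().upper())
    if match is None:
        return None
    return SERIES.get(match.group(1))


for model in ("E5540", "E5506", "X5650"):
    print(model, classify_xeon(model))
```

By this reading both cluster nodes (E5540 and E5506) fall in the 45 nm 55xx series, which is the point the post above is making.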
  • Westmere is a die shrink of Nehalem-C

    Saturday, April 14, 2012 1:45 AM
  • Figuring out which features are on which processor in the Nehalem family (of which Westmere is a subset) can be very confusing.  What isn't confusing is the fact that I built a Fast Track submission last year, and part of the blade selection process involved Cisco B200 M1 and M2 blades.  The M1 is based on the Intel 5500 series, and the M2 on the 5600 series.  We had this specific problem with both.  This is the link to the NetApp and Cisco Fast Track white paper:  http://media.netapp.com/documents/wp-7132.pdf


    Saturday, April 14, 2012 2:56 AM
  • Hi,

    Can you tell us which NICs are being used on the servers?


    Sanket. J Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread.

    Saturday, April 14, 2012 2:16 PM
  • Node 1 (main problem): both nodes are Dell R710s with 1 Intel Gigabit ET Quad Port onboard and one BCM5709C NetXtreme II PCIe card.
    Nothing is teamed. Live migration has its own NIC port (the nodes are directly connected to each other).
    Monday, April 16, 2012 7:19 AM
  • After installing the fix (and rebooting the server node) live migration was (like always) very fast.

    I waited some days and tried again. Just as slow as before :(
    So the hotfix did nothing for us.

    Thursday, April 19, 2012 11:11 AM
  • Does anyone have any other ideas?
    Thursday, April 26, 2012 8:16 AM
  • No, it's even worse:

    Live migration of a server with 10 GB RAM failed:

    'Virtual Machine cemmsrv119 Ex = no online Snapshot' live migration did not succeed at the source.
    Live migration did not succeed: The operation timed out.


    ID 21502
    Source: Hyper-V-High-Availability

    Thursday, April 26, 2012 9:27 AM
  • Had a consultant here that pointed to this post: http://social.technet.microsoft.com/Forums/en-AU/winserverhyperv/thread/a6063ff0-38b9-46ae-8e98-6d017c0c0e75

    We did the following:

    Installed the Westmere hotfix on the second node too

    Windows Power Options -> High Performance http://support.microsoft.com/kb/2207548/en-us

    BIOS
    Both BIOSes updated
    C-states and C1E disabled
    Power Management set to "OS Controlled"

    NICs
    All drivers updated
    Enabled "virtual machine queues" on all NICs with VM activity (does TCP offload from VM to host)
    Jumbo frames set to 9000 on the CSV network (live migration)
    All NICs: Flow Control and 8 receive scaling queues; Power Management disabled

    Now I get up to 965 Mbit/s on a 1 Gbit NIC while live migrating.
    Let's see if it lasts.

    Friday, April 27, 2012 2:16 PM
  • It did not last, but it's still better than before:

    An 8 GB RAM server now transfers in about 8 minutes at ~200 Mbps.

    This is still better than what I started with, but only about 1/5 of the rate seen in the tests directly after restarting the servers.

    Monday, May 7, 2012 1:37 PM
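
As a sanity check on the figures in the last two posts, this sketch computes how long a single pass over a VM's memory would take at a given line rate. It assumes 1 GB = 1024^3 bytes and ignores protocol overhead and re-sent dirty pages, so it gives a lower bound rather than a prediction; the function name is hypothetical.

```python
def ideal_transfer_minutes(ram_gb, mbit_per_s):
    """Minutes needed to copy ram_gb of memory once at a steady mbit_per_s."""
    bits = ram_gb * 1024 ** 3 * 8
    return bits / (mbit_per_s * 1_000_000) / 60


at_200 = ideal_transfer_minutes(8, 200)   # rate reported after some days of uptime
at_965 = ideal_transfer_minutes(8, 965)   # peak rate reported right after the changes
ratio = 200 / 965

print(f"8 GB at 200 Mbit/s: {at_200:.1f} min")
print(f"8 GB at 965 Mbit/s: {at_965:.1f} min")
print(f"200 / 965 = {ratio:.2f}")
```

The ratio agrees with the "1/5" estimate. The single-pass time at 200 Mbit/s comes out shorter than the roughly 8 minutes observed, which would be consistent with pages being re-sent during the copy or with the rate not being sustained for the whole transfer.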
  • You might try this hotfix as well.

    http://support.microsoft.com/kb/2675785

    J

    Monday, May 7, 2012 7:14 PM