Live Migration on 10GbE only 16%

    Question

  • Hi,

    I’m having a weird and annoying issue with my 3-node Windows Server 2008 R2 SP1 Hyper-V cluster. Live Migrations of VMs are successful (only 1 ping lost), but they only use a maximum of 16% of the 10GbE Intel(R) Ethernet Server Adapter X520-2 NIC that is dedicated to the Live Migration network.

    The cluster has the following setup:

    ·         Windows Server 2008 R2 SP1 with KB2517329: http://support.microsoft.com/kb/2517329

    ·         3 x Dell R810 (with the most current BIOS, 2.2.3)

    ·         CPU: 2 x Intel Xeon E7-2860 @ 2.27 GHz, 10 cores

    ·         RAM: 256 GB

    ·         Local disk: 2 x 146 GB SAS 6Gbps 15K in RAID 1

    ·         NICs: onboard 4 x Broadcom BCM5709C NetXtreme II GigE (Broadcom driver 14.4.8.4), 6 x Intel(R) Ethernet Server Adapter X520-2 (Intel driver 16.5)

    ·         Page file: 4-6 GB

    ·         Switches: 4 x Dell PowerConnect 8024F

    I have tried a number of things, like adjusting jumbo frames, RSS, TOE, receive and send buffers, different drivers, putting everything on one switch, small VMs (2 GB RAM) and large VMs (32 GB RAM). Nothing seems to work. Cluster network metrics, binding order and bindings are set as they should be. A different NIC on a different card in a different riser also does not seem to help.

    It looks like something is capping the network throughput. There are no QoS or IPsec policies applied. Switch and NIC are configured to auto-negotiate speed. A simple file copy of 40 GB over the LM network gets a throughput of about 60-70%. I know a file copy is not the same as a Live Migration, but it is a simple check for a network bottleneck. The LM network is separated using VLANs.
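
    Just to put rough numbers on that file-copy check, here is a quick back-of-the-envelope calculation (only a sketch; the 65% figure is simply the middle of the 60-70% I observed):

        # Rough sanity check of the 40 GB file copy over the 10GbE LM network.
        # Purely illustrative arithmetic based on the figures mentioned above.
        LINK_GBPS = 10.0       # nominal 10GbE line rate
        FILE_GB = 40.0         # size of the test file copy
        UTILIZATION = 0.65     # ~60-70% seen during the copy

        effective_gbps = LINK_GBPS * UTILIZATION     # ~6.5 Gbit/s on the wire
        seconds = FILE_GB * 8 / effective_gbps       # GB -> Gbit, then divide by rate
        print(f"~{effective_gbps:.1f} Gbit/s, ~{seconds:.0f} s for {FILE_GB:.0f} GB")
        # roughly 6.5 Gbit/s and ~49 s, versus ~200 s if the copy were capped at 16%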

    Is there a Live Migration (memory/CPU/NIC) issue with the Intel Xeon E7-2860 @ 2.27 GHz (Westmere)?

    Does anyone have an idea?

     

     


    Just a guy who thanks Bill for paying his bills! :-)
    Wednesday, September 21, 2011 10:12 AM

Answers

  • Martius,

     

    I once had a similar problem. I was using a poor man's 1 GbE network for my Live Migration. File copies (SMB) were quick, using 100% of the capacity, but Live Migration was stuck at 70%. After some tweaking I found out that disabling the C-states in the BIOS helped. After I disabled them, my Live Migration was using 99+% of my gigabit network.

    A small difference is that I was using HP ProLiant DL380 G7 servers instead of Dell. The CPU was a Xeon X5660.

     

     

    Wednesday, September 21, 2011 1:40 PM

All replies

  • Martius,

     

    I once had a similar problem. I was using a poor man's 1 GbE network for my Live Migration. File copies (SMB) were quick, using 100% of the capacity, but Live Migration was stuck at 70%. After some tweaking I found out that disabling the C-states in the BIOS helped. After I disabled them, my Live Migration was using 99+% of my gigabit network.

    A small difference is that I was using HP ProLiant DL380 G7 servers instead of Dell. The CPU was a Xeon X5660.

     

     

    Wednesday, September 21, 2011 1:40 PM
  • Disabling the C-states in the BIOS indeed solved the slow Live Migration issue: from 16% to 60-65% for small VMs (2 GB RAM), 60% for medium VMs (8 GB RAM) and 40% for larger VMs (32 GB RAM).

    Thanks!


    Just a guy who thanks Bill for paying his bills! :-)
    Wednesday, September 21, 2011 2:00 PM
  • That's what we see. On average, evacuating an entire host runs at 50%. Our largest-memory VMs are right on the 50% mark; for example, we can evacuate a 50 GB SQL Server VM in 94-97 seconds at 50%. Good thing we have room left for failover of the CSV NIC and for simultaneous Live Migrations with Windows Server 8 :-)

    We got a minimum of 20% and on average 35% to 40% with all power options set to maximum performance, but disabling C-states helps (and not only here; also see http://workinghardinit.wordpress.com/2011/06/20/consider-cpu-power-optimization-versus-performance-when-virtualizing/).

    Jumbo frames do help in our case, but tweaking the receive/send buffers does nothing except drag the speed back down to where it was before the jumbo frame optimization. TOE does neither good nor harm.

    Nice info for you: Same switches, same NICs & same server model.

    HTH

    Didier Van Hoye

    http://workinghardinit.wordpress.com

     

     

    Wednesday, September 21, 2011 3:14 PM
  • I work at Dell on a Hyper-V Solutions team and we've noted this exact scenario with the C-states... Disable them to improve the performance of Live Migration. 1GbE and 10GbE have shown similar results.

    We have also seen a bump in the performance of Live Migrations by setting the Power Management settings in the BIOS to Max Performance.  It is of course at the expense of power consumption. Unless power is a major concern I would generally recommend setting all 3 (C, C1, and Power Management) to improve performance.

    And unless you've got a physical network controller port dedicated to LM, I would seriously consider capping the throughput of the LM using a group policy. I'd hate to see an LM stomp on the available bandwidth of a real workload.

    -Brian


    • Edited by gautreau Wednesday, September 21, 2011 9:58 PM spelling
    Wednesday, September 21, 2011 3:47 PM
  • Good to see & hear. What's the top bandwidth use you guys get out of your systems? And what's the average? Don't be shy to publish some whitepapers on this ;-)
    Wednesday, September 21, 2011 7:43 PM
  • For the record, here are the results:

    Don't be fooled by the scale of the graphs.

    The first graph has a scale of 0-25% and displays the Live Migration of 2 VMs, each with 32 GB of RAM. Each VM took about 4 minutes to Live Migrate, about 8 minutes in total.

    The second graph shows 8 VMs ranging from 2 GB of RAM to 32 GB of RAM.

    And finally, the third graph shows the same VMs as the second graph, but with jumbo frames enabled. All 8 VMs (about 124 GB of RAM) were Live Migrated in about 2.5 minutes! Pretty cool!
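
    A rough back-calculation of that last run (approximate, since a Live Migration actually transfers a bit more than the configured RAM because of re-copied dirty pages):

        # Effective throughput of the jumbo-frame run: 8 VMs, ~124 GB of RAM, ~2.5 minutes.
        TOTAL_RAM_GB = 124.0
        DURATION_S = 150.0
        gbps = TOTAL_RAM_GB * 8 / DURATION_S                  # GB -> Gbit, per second
        print(f"~{gbps:.1f} Gbit/s, ~{gbps / 10 * 100:.0f}% of the 10GbE link")
        # roughly 6.6 Gbit/s, i.e. about 66% of line rate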

     


    Just a guy who thanks Bill for paying his bills! :-)
    Thursday, September 22, 2011 11:15 AM
  • I see the same with HP DL180s on 10GbE.

     

    J

     

    Thursday, September 22, 2011 5:34 PM
  • I had the same issue with HP ProLiant DL380 G7 servers. Disabling the C-states in the BIOS solved the problem.

    Live Migration now uses the full 100% of the bandwidth. Thanks a lot :).

     

    I still have a problem with teaming two 1 Gbit NICs. The two NICs are teamed with the HP software in SLB mode.

    All cables are in one gigabit switch.

    When I start a large file copy from one server to the other, the total bandwidth is 1 Gbit (512 Mbit per NIC), not 2 Gbit (1 Gbit per NIC).

    Does anyone know how I need to configure the NICs to send and receive at 2 Gbit?

    Or is it not possible to send and receive at 2 Gbit because my switch (HP ProCurve V1910) can only operate at 1 Gbit speed?

     

     

    Monday, September 26, 2011 12:46 PM
  • Rogier,

    Let me start by saying that your teaming questions are probably best answered in an HP forum. But I'll give you my thoughts...

    It sounds like a session-, MAC- or IP-hash-based load balancing algorithm on your team. During the file copy, only one session is established to one MAC or IP address, so the traffic is bound to one interface, thereby limiting you to 1 Gbit of throughput. Depending on your network architecture, you could possibly use an 802.3ad team... but this is still better answered in an HP forum.
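
    To illustrate the effect, here is a toy sketch (not the actual HP teaming algorithm, just the general idea of address-hash distribution; the MAC addresses are made up):

        # Toy illustration of address-hash load balancing on a 2-port team.
        # Every frame of a given (source, destination) pair hashes to the same port,
        # so a single file-copy session can never use more than one 1 Gbit link.
        def pick_port(src_mac: str, dst_mac: str, num_ports: int = 2) -> int:
            # Real teams hash MAC/IP/port fields; a plain hash of the pair shows the idea.
            return hash((src_mac, dst_mac)) % num_ports

        # One big file copy = one src/dst pair = one port, no matter how large the copy:
        print(pick_port("00:1B:78:AA:AA:AA", "00:1B:78:BB:BB:BB"))  # same port every time

        # Many different clients tend to spread across both ports:
        clients = [f"00:1B:78:CC:CC:{i:02X}" for i in range(10)]
        print([pick_port(c, "00:1B:78:BB:BB:BB") for c in clients])  # typically a mix of 0s and 1s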

    --bg

    Monday, September 26, 2011 1:31 PM
  • I had a very slow live migration too. Details are explained here: http://social.technet.microsoft.com/Forums/en-US/winserverhyperv/thread/2c9394c7-e618-4a0c-a7e9-fedd0166c63a/

    Solution (Server 2008 R2, 2 nodes)

    Westmere Hotfix
    ) Installed on both nodes: http://support.microsoft.com/kb/2517329


    Windows Power Options
    ) High Performance: http://support.microsoft.com/kb/2207548/en-us


    BIOS
    ) Both BIOS updated
    ) C-states and C1E disabled
    ) Power Management set to "OS Controlled"


    NICs
    ) all drivers updated
    ) enabled "virtual machine queues" on all NICs with VM activity (does TCP offload from VM to host)
    ) Jumbo frames set to 9000 on the CSV network (live migration); a quick verification sketch follows below
    ) all NICs: Flow Control and 8 receive-side scaling queues, Power Management disabled

    Now I get up to 965 Mbit/s on a 1 GbE NIC while live migrating!
    Let's see if it lasts.
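
    In case it helps anyone checking the same settings, here is a rough verification sketch for the jumbo-frame and power-plan items above (run on one node; the 10.0.1.2 address is only a placeholder for the other node's live migration IP):

        # Rough check of two items from the list above, run from one cluster node.
        import subprocess

        HOST = "10.0.1.2"   # placeholder: the other node's live migration / CSV address

        # 1) Jumbo frames end to end: 8972 bytes of ICMP payload + 28 bytes of headers = 9000 MTU.
        #    "-f" sets Don't Fragment, so this only succeeds if the whole path allows 9000.
        subprocess.run(["ping", "-f", "-l", "8972", "-n", "2", HOST], check=False)

        # 2) Confirm the Windows power plan really is High Performance.
        subprocess.run(["powercfg", "-getactivescheme"], check=False)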

    • Proposed as answer by Korzh Ilya Friday, May 31, 2013 3:07 PM
    Friday, April 27, 2012 2:20 PM
  • Westmere fix and disable deep C-states. I also came across http://support.microsoft.com/default.aspx?scid=kb;EN-US;2564236, which may also help in some cases.

    J

    Friday, April 27, 2012 7:59 PM
  • It did not last, but it's still better than before:

    An 8 GB RAM server now transfers in about 8 minutes at ~200 Mbps.

    This is still better than what I started with, but only 1/5 of what I got in the tests directly after restarting the servers.

    Monday, May 07, 2012 1:36 PM
  • Why don't you put a link to the other thread you are posting in into this one, and combine the two? It gets confusing. There are also two hotfixes out for slow file copies. I posted one of the links in your other thread.

    J

    Tuesday, May 08, 2012 1:08 AM