Why use vNICs for a converged fabric design with Hyper-V?

    Question

  • Hi all.

    When it comes to designing converged fabrics for Hyper-V, the material I've seen so far recommends creating an LBFO team for everything except storage.

    To this team we connect a virtual switch and then connect virtual NICs for Management, Cluster/CSV, Live Migration and what have you.

    My understanding is that this is so that we can manage the bandwidth properly.

    This leads me to a question though. Why not create team NICs (tNICs) for Management, Cluster/CSV, Live Migration and so on instead?

    We can set a QoS policy on these as well (unless I'm misreading the documentation for New-NetQosPolicy completely), and we get the benefit of RSS while avoiding the limitations of VMQ.

    The virtual switch is then connected to its own tNIC.
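
    As a quick illustration of the RSS point above (not something I've run against this exact setup, and the output columns are just the ones I'd expect), the in-box cmdlet shows whether RSS is active on the team NICs; an adapter feeding a virtual switch would show up under VMQ instead:

        # RSS state on the adapters; a tNIC that is not bound to a virtual switch can use RSS.
        Get-NetAdapterRss | Format-Table Name, Enabled, NumberOfReceiveQueues, Profile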

    Granted, it's late Friday afternoon and I've been reading and writing about this the entire day, so I might've just confused everything horribly, but it seems to me that my suggestion should work.

    Any thoughts and ideas?

    Friday, October 04, 2013 3:26 PM


All replies

  • You didn't mention what you have been reading, so it is a little difficult to comment on the documented recommendations, or to determine why the person might have made that recommendation. But you are absolutely correct - you can create multiple teams and define QoS on them. One reason the person may have been suggesting a single large team is the number of NICs available on a server. If you cannot put a lot of physical NICs into a server, it's pretty hard to create a lot of teams. It also means you have more NICs to manage and a higher probability of one failing (law of averages).

    My guess is that the documentation was making recommendations based on hardware they were working with. I work with a totally converged environment allowing me as many NICs as I want. Others may have slot limits on their servers which limits the number of teams they could create.


    .:|:.:|:. tim

    Friday, October 04, 2013 7:26 PM
  • You can configure it any way that you want to. Nothing that you're reading represents a required configuration.

    The issue with dividing up your pNICs into discrete teams is bandwidth utilization. Sure, you'll get RSS on that Cluster/CSV team, but those NICs are going to be <1% utilization 99% of the time. The Management team will spike when you're copying files to/from the host or performing hypervisor-level backups, but the rest of the time it will sit at <1% utilization. If you're using 2012 in a 2-node cluster, one pNIC in the Live Migration network will be tapped out during a Live Migration while the other sits idle (2012 R2 can use multiple paths). The rest of the time, they'll both sit idle.

    So, you've got a team left over for your VMs' vNICs. Depending on how many VMs you have and their traffic patterns, they may want more bandwidth than your small virtual switch team has available. In the meantime, you've got two or more pNICs sitting idle in other teams. Then again, your VMs may be more than satisfied with what they've got, so it may not matter. Another consideration is that if you combine them all into a single converged network, a Live Migration may get placed on the same pNIC as another bandwidth-intensive operation, causing both of them to slow down.

    Unfortunately, there is no "right" answer.


    Eric Siron Altaro Hyper-V Blog
    I am an independent blog contributor, not an Altaro employee. I am solely responsible for the content of my posts.
    "Every relationship you have is in worse shape than you think."

    Friday, October 04, 2013 7:35 PM
  • Thanks for the replies, much appreciated.

    The material I've been reading has been from various non-MS blogs (Hyper-V.nu is one that comes to mind) and also accumulated over time from numerous Microsoft blogs. Unfortunately I can't point to one specific source.

    Allow me to elaborate on my original post though as it seems I was just as unclear as I feared. :)

    I'm not talking about creating one team per type of traffic (LM, Cluster/CSV and so on), or using one pNIC per type of traffic. I'm talking about creating a team and then carving it up into one team NIC (tNIC) per type of traffic.

    For example: I create a team out of 2x10 GBit pNICs using LACP and Hyper-V port load balancing.

    From this team I then create one tNIC per type of traffic: Live Migration, Cluster/CSV, Management, Backup (if network based) and VMs.

    Using New-NetQosPolicy I set appropriate QoS policies on these tNICs, including minimum guaranteed bandwidth, in order to utilize all the available team bandwidth.

    I connect a vSwitch to the tNIC named "VMs".
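
    Roughly, this is the sort of thing I have in mind. It's only a sketch I haven't run in the lab yet; the team, switch and policy names, the VLAN IDs and the weights are all placeholders:

        # Team two 10 Gbit pNICs with LACP and Hyper-V Port load balancing.
        New-NetLbfoTeam -Name "ConvergedTeam" -TeamMembers "pNIC1","pNIC2" `
            -TeamingMode Lacp -LoadBalancingAlgorithm HyperVPort

        # One team NIC (tNIC) per type of traffic, separated by VLAN.
        # (New tNICs get default names along the lines of "ConvergedTeam - VLAN 20"; rename for clarity.)
        Add-NetLbfoTeamNic -Team "ConvergedTeam" -VlanID 20   # Live Migration
        Add-NetLbfoTeamNic -Team "ConvergedTeam" -VlanID 30   # Cluster/CSV
        Add-NetLbfoTeamNic -Team "ConvergedTeam" -VlanID 40   # VMs
        Rename-NetAdapter -Name "ConvergedTeam - VLAN 40" -NewName "VMs"

        # Windows QoS with minimum bandwidth weights (example classifications only).
        New-NetQosPolicy -Name "Live Migration" -LiveMigration -MinBandwidthWeightAction 30
        New-NetQosPolicy -Name "Cluster" -IPDstPortMatchCondition 3343 -MinBandwidthWeightAction 10
        New-NetQosPolicy -Name "Default" -Default -MinBandwidthWeightAction 20

        # The virtual switch binds only to the "VMs" tNIC.
        New-VMSwitch -Name "vSwitch-VMs" -NetAdapterName "VMs" -AllowManagementOS $false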

    Is this possible to do (I haven't had time to test in my lab) and, if so, what would the potential drawbacks of this setup be?

    Thanks.


    • Edited by Martin Edelius (Atea), Saturday, October 05, 2013 12:20 PM: Used the wrong load balancing method for the team in my example.
    Saturday, October 05, 2013 8:43 AM
  • If you're going to create a virtual switch on a team, then don't create additional team adapters beyond the one that will host the virtual switch. Make the adapters on the virtual switch instead, not on the team. Then use Set-VMNetworkAdapter for their QoS instead of Windows QoS.
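
    Something along these lines is what I mean. Just a sketch; the switch, team and vNIC names and the weights are placeholders, so adjust them for your environment:

        # The team hosts only the virtual switch; the switch carries the QoS (weight mode).
        New-VMSwitch -Name "ConvergedSwitch" -NetAdapterName "ConvergedTeam" -MinimumBandwidthMode Weight -AllowManagementOS $false

        # Management OS vNICs go on the virtual switch, not on the team.
        Add-VMNetworkAdapter -ManagementOS -Name "Management" -SwitchName "ConvergedSwitch"
        Add-VMNetworkAdapter -ManagementOS -Name "LiveMigration" -SwitchName "ConvergedSwitch"
        Add-VMNetworkAdapter -ManagementOS -Name "Cluster" -SwitchName "ConvergedSwitch"

        # QoS handled by the virtual switch via Set-VMNetworkAdapter, not by Windows QoS.
        Set-VMNetworkAdapter -ManagementOS -Name "Management" -MinimumBandwidthWeight 10
        Set-VMNetworkAdapter -ManagementOS -Name "LiveMigration" -MinimumBandwidthWeight 30
        Set-VMNetworkAdapter -ManagementOS -Name "Cluster" -MinimumBandwidthWeight 10

        # Optional: VLAN isolation per vNIC.
        Set-VMNetworkAdapterVlan -ManagementOS -VMNetworkAdapterName "LiveMigration" -Access -VlanId 20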


    Eric Siron Altaro Hyper-V Blog
    I am an independent blog contributor, not an Altaro employee. I am solely responsible for the content of my posts.
    "Every relationship you have is in worse shape than you think."

    Sunday, October 06, 2013 2:51 AM
  • But why?

    What are the benefits of this approach?

    Sunday, October 06, 2013 5:50 AM
  • The first reason is just that you gain nothing by mixing them together. It's more complicated to build, monitor, and document with no return on your effort.

    The second is that you'll have two different QoS systems running simultaneously and they're not terribly aware of each other. The virtual switch's QoS operates at the hypervisor level, which is above your management operating system. The QoS on the team exists within the management operating system only. To be honest, I'm not certain just how much authority it can exert over the virtual switch.
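
    If you want to see which layer is actually doing what on a given host, a quick check like this shows both sides (nothing fancy, and the property names are from memory):

        # Hypervisor level: the virtual switch's bandwidth reservation mode and default weight.
        Get-VMSwitch | Format-List Name, BandwidthReservationMode, DefaultFlowMinimumBandwidthWeight
        # Management operating system level: any Windows QoS policies currently defined.
        Get-NetQosPolicy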

    With 10G cards, I don't even know how much QoS is worth your effort. We're just in the early phases of our roll-out and the physical design is like yours. So far, I'm absolutely certain we spent way more time worrying about QoS than it was worth.

    I recall reading a couple of early articles from MVPs that were very clear that these should not be mixed, but I don't remember their explanations. Of course, I can't find those articles anymore.


    Eric Siron Altaro Hyper-V Blog
    I am an independent blog contributor, not an Altaro employee. I am solely responsible for the content of my posts.
    "Every relationship you have is in worse shape than you think."

    Monday, October 07, 2013 3:39 AM
  • I'm not using any QoS on the virtual level, only on the tNICs. So the virtual switch, and thus the VMs, gets all the bandwidth that the tNIC has.

    As for why I want to use QoS at the physical level, or the tNIC level in my case, and not in the virtual switch: it's because of the limitations of VMQ.

    In a recent blog post, Gabriel Silva points out that VMQ, because it is bound to one CPU core, is limited to a throughput of roughly 3.5 Gbit/second with current hardware.

    I know of one large customer that has run into this limitation when doing backups and Live Migration through a virtual switch.

    In other words, it is not a problem to saturate a 10 Gbit infrastructure, and even less of a problem if you start using SMB for storage in a true converged fabric.

    Without having done any testing with 2012 R2, I can't say how feasible it is to use the same team for LAN and SAN, though; in 2012 RTM I'd separate the LAN and the SAN.

    But that's a discussion for another time. :)

    I'll try and do some lab tests with my design to see how everything fits together.

    Monday, October 07, 2013 5:43 AM
  • I did some testing on my systems.

    First, I did a basic Live Migration across vNICs. The network topped out at 5.5 Gbps throughput and was fairly consistent. Lower than I expected but above the article's indicated 3.5 Gbps. I then reconfigured Live Migration to use tNICs. That topped out at 8.0 Gbps but was very inconsistent, dipping down to the 4 Gbps range and jerking up to the 7 Gbps range very often. But, I timed both of them and the tNIC transfer is faster overall. A VM with 48GB of RAM took about 1 minute 4 seconds to transfer across the tNICs and about 1 minute 20 seconds over the vNICs.

    But, the more important thing for me is the claim that VMQ makes all incoming traffic use a single core, which has not been my understanding of how VMQ works. So, I started a Live Migration over dedicated Live Migration vNICs and a file copy over the Management vNICs, same host pair. Two cores ran at 100% on the receiving host during the respective transfers. This would seem to contradict the statement in the blog post: "The downside of VMQ is that the host and every guest on that system is now limited to a single queue and therefore one CPU to do their network processing in the host."

    I then realized that I had only been watching CPU usage, so I tried to duplicate the test and watch network usage as well. The network usage was capped at 5.5 Gbps, the way he said. However, the CPU behavior was different. Two cores were still being used, but they were zigzagging between 0% and 100% in an alternating pattern between the two cores. I'm assuming that there were two transmissions going but they were being placed on the same pNIC. I tried again a few times, but never got the results of that same test again. Still, it doesn't appear that VMQ really forces a single core in any case. Maybe once the queues are exhausted, the remaining traffic is dumped on a single core. My manufacturer doesn't publish how many VMQs their adapters support, so I don't know of a reliable way to test.

    Personally, I don't care enough about speed to stop using vNICs. The only thing that will be affected on our systems is Live Migration and we just don't do enough for it to matter. Our systems use Fibre Channel for storage so there won't be any contention between that and the network. Of course, I'll dig down and see if there aren't some switches to flip or knobs to turn to speed up vNICs so they behave more like the tNICs.

    If you are going to use tNICs, then I would recommend that you create your virtual switches with "-MinimumBandwidthMode None" to ensure you don't have conflicting QoS. In its default configuration, it does use QoS. I can't say for sure whether it would actually conflict or not, but no sense tempting fate.
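
    In cmdlet terms that's just something like this (the switch and adapter names are placeholders):

        # Create the external switch with the virtual switch's own minimum-bandwidth QoS disabled,
        # so only the Windows QoS policies on the tNICs are in play.
        New-VMSwitch -Name "vSwitch-VMs" -NetAdapterName "VMs" -MinimumBandwidthMode None -AllowManagementOS $false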


    Eric Siron Altaro Hyper-V Blog
    I am an independent blog contributor, not an Altaro employee. I am solely responsible for the content of my posts.
    "Every relationship you have is in worse shape than you think."

    Monday, October 07, 2013 3:11 PM
  • Excellent testing Eric, very much appreciated.

    I'm also surprised at the 5.5 Gbit/second throughput, especially as I know of a customer who hits a 3 Gbit/second limit.

    I now realise that I mixed up core and CPU in my previous post, apologies for that.

    Did you apply any QoS policies on the tNICs? If not, then perhaps that could help you shape the bandwidth to something more consistent.

    As for how many VMQs an adapter supports, I only know of two from HP. The 560SFP supports 64 VMQs and the 331FLR supports 16 VMQs.

    It's my understanding (I'm going off information given to me) that this is dictated by the chipset, so any adapter using, for instance, the BCM5719 controller from Broadcom (the one in HP's 331FLR) should support 16 VMQs.

    If you have the time to test this even more I'd love to hear the results but as I said, I'm still very grateful for your feedback.

    Tuesday, October 08, 2013 6:49 AM
  • I cleared all QoS that I could for the test. I also evacuated the nodes that were being tested. There should have been only minimal interference. I can try it again at some point.

    The driver for these adapters indicates that they support 16 VMQs. If I get the opportunity, I'll try to build a test for them. I think I have an approach in mind, but not sure when, or even if, I'll be able to get to it.
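
    For anyone wanting to check their own adapters, something along these lines should show the driver-advertised queue count and where each active queue lands (a sketch only; the exact columns vary by driver):

        # Queues the driver exposes per adapter.
        Get-NetAdapterVmq | Format-Table Name, Enabled, BaseVmqProcessor, MaxProcessors, NumberOfReceiveQueues
        # Which processor each currently allocated queue is bound to, and for which VM or vNIC.
        Get-NetAdapterVmqQueue | Format-Table Name, QueueID, Processor, VmFriendlyName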


    Eric Siron Altaro Hyper-V Blog
    I am an independent blog contributor, not an Altaro employee. I am solely responsible for the content of my posts.
    "Every relationship you have is in worse shape than you think."

    Tuesday, October 08, 2013 12:59 PM
  • Hi Eric.

    I'm having a conversation over e-mail with Microsoft regarding this but due to time constraints the whole thing has sort of dragged on a bit.

    One answer I got from Gabriel Silva was that it is in fact possible to see the network speeds you reached on a single core, so I'm guessing that the numbers in his article were based on an average server CPU rather than a high-end CPU.

    Other than that I'm still hoping for some clarifications and there's also a possibility that I can get hold of a proper 10 Gbit lab.

    When/if this happens I'll blog about my findings.

    Thanks again for your help.

    Thursday, October 31, 2013 1:59 PM