SMB 3.0 (WS2012) vs. iSCSI (NexentaStor) I/O performance

    Question

  • Hi there,

    I recently set up a virtual SAN (NexentaStor) on my workstation (i3-2100, 24GB RAM, Intel 320 SSD) as the iSCSI target for my Hyper-V environment. NexentaStor uses one 250GB 7.2k SATA HDD as storage and one Corsair X32 SSD as ZIL (although it's not really used, since 16GB of RAM is enough for the ARC and write cache).

    The virtual SAN is connected via 1Gbit Ethernet to the Hyper-V server (Server 2012 RC with the Hyper-V role enabled); the iSCSI initiator does the rest.

    First tests show 50MB/s read/write with 4KB random I/O (12,500 IOPS), which makes the VMs (WS2012 Essentials beta and WS2012 RC with Exchange 2013) responsive as... just as I like it :P
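
    For anyone checking that number, it's just throughput divided by block size; a quick sketch (using decimal units - with 4,096-byte blocks it comes out slightly lower):

    ```python
    # Rough IOPS estimate from measured throughput and block size.
    # Assumes decimal units (1 MB = 10^6 bytes, 4 KB = 4,000 bytes);
    # with binary 4,096-byte blocks the result is closer to ~12,200.
    throughput_bytes_per_s = 50 * 10**6   # ~50 MB/s measured with 4K random I/O
    block_size_bytes = 4 * 10**3          # 4 KB per request

    iops = throughput_bytes_per_s / block_size_bytes
    print(f"Estimated IOPS: {iops:,.0f}")  # -> Estimated IOPS: 12,500
    ```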

    My question is: will I see similar I/O performance if I use a WS2012 SMB 3.0 share as the host for the VM disks? I have heard a lot about the SMB 3.0 performance increase, especially for Hyper-V disk access. On the other hand, threads like "Slow SMB3 and iSCSI as Hyper-V VM storage because of unbuffered I/O" report poor performance for this scenario.


    • Edited by BlackLiner Thursday, August 30, 2012 12:39 PM
    Thursday, August 30, 2012 12:38 PM

All replies

  • You have an easy way to check. Please build a virtual machine with Server 2012 and try it. 

    Don't forget to report back! I'm really curious to hear how it performs on the same hardware.

    I expect a decrease in performance because you'll be missing the 16GB write cache. SMB 3 isn't faster than the underlying hardware.

    Thursday, August 30, 2012 2:00 PM
  • I'm using SMB3 with RDMA (I had to upgrade my InfiniBand cards to get RDMA support) and I'm seeing about 3GB/s transfers over SMB3. That's with my cards running at half speed (still need to work on that). I'm running 24GB of RAM on the SMB server, so I'm sure that helped. When I ran an ATTO benchmark in a VM over SMB3, I actually saw a performance increase over running the benchmark locally, so there is a nice bump in speed via SMB3. 

    Of course this was with a striped set... now I'm trying to work out the parity/mirroring bottleneck. 
    • Edited by Dustyny1 Thursday, August 30, 2012 2:14 PM
    Thursday, August 30, 2012 2:13 PM
  • @hans: that's exactly what I am going to do later on. Any preferred benchmark tool - CrystalDiskMark, ATTO, AS SSD, or IOMeter?
    Thursday, August 30, 2012 2:17 PM
  • I suspect your 3GB/sec limit with your 54Gb InfiniBand/RDMA cards may be because you only have PCIe 2.0 x8 in your server. You'll need a Romley platform to get PCIe 3.0 x8, which will allow you to get to 6GB/sec.

    I'm looking into doing this for a high-performance SQL cluster, to get HA. I already get 8GB/sec using many local DAS arrays and many controllers on my DL980 G7, but I need to move to HA.

    See here at Jose Barreto's blog: http://blogs.technet.com/b/josebda/archive/2012/07/31/deploying-windows-server-2012-with-smb-direct-smb-over-rdma-and-the-mellanox-connectx-2-connectx-3-using-infiniband-step-by-step.aspx

    Please note that you will need a system with PCIe Gen3 slots to achieve the rated speed of this card. These slots are available on newer systems like the ones equipped with an Intel Romley motherboard. If you use an older system, the card will be limited by the speed of the older PCIe Gen2 bus.
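
    Rough back-of-the-envelope numbers behind those limits (nominal per-lane rates and line-encoding overhead only; protocol overhead pushes real throughput lower):

    ```python
    # Nominal bandwidth ceilings for the buses/links discussed above.
    # These are raw data rates after line encoding; protocol overhead
    # (PCIe packet headers, SMB/IP framing) shaves off a further chunk.
    GT = 10**9  # transfers per second

    def pcie_bandwidth_gbs(gt_per_s, lanes, encoding_efficiency):
        """Usable bytes/s for a PCIe link, expressed in GB/s (decimal)."""
        return gt_per_s * lanes * encoding_efficiency / 8 / 10**9

    # PCIe 2.0: 5 GT/s per lane, 8b/10b encoding (80% efficient)
    pcie2_x8 = pcie_bandwidth_gbs(5 * GT, 8, 8 / 10)     # ~4.0 GB/s raw
    # PCIe 3.0: 8 GT/s per lane, 128b/130b encoding (~98.5% efficient)
    pcie3_x8 = pcie_bandwidth_gbs(8 * GT, 8, 128 / 130)  # ~7.9 GB/s raw

    # FDR InfiniBand 4x: 14.0625 Gb/s per lane, 64b/66b encoding -> ~54.5 Gb/s data
    ib_fdr = 4 * 14.0625 * (64 / 66) / 8                 # ~6.8 GB/s raw

    print(f"PCIe 2.0 x8 : ~{pcie2_x8:.1f} GB/s (roughly 3 GB/s after protocol overhead)")
    print(f"PCIe 3.0 x8 : ~{pcie3_x8:.1f} GB/s")
    print(f"FDR IB 4x   : ~{ib_fdr:.1f} GB/s (so ~6 GB/s is about the card's practical ceiling)")
    ```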

    • Proposed as answer by Dustyny1 Thursday, August 30, 2012 6:55 PM
    • Unproposed as answer by Dustyny1 Thursday, August 30, 2012 6:56 PM
    Thursday, August 30, 2012 3:38 PM
  • You are absolutely correct, I just realized what's going on, thank you.

    When I use netperf between the node and the SMB server I see transfer rates of 2GB/s. I figured that since I was 1GB/s below the PCIe bus limitation I could tune some network settings and get a bit closer to the 3GB/s limit... now I'm thinking that gap could actually be the overhead that RDMA is meant to address. 

    Though the storage system on my SMB server can read and write to the disks at about 2.5GB/s, when I ran the benchmark in the VM I actually got a 500MB/s bump to 3GB/s. So it looks to me like there is some sort of read/write caching going on, but I'm not 100% sure where it is taking place. 

    Since the PCIe bus limits me to 3GB/s, I bet I could figure out how much of a boost I'm getting by using just one disk instead of a stripe set. If I take one disk, benchmark it locally as a baseline, and then benchmark it across SMB (inside a VM), the difference should be the caching benefit: VM benchmark - local benchmark = cache benefit. I'll try to do some testing once I get my storage sorted. 
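
    Spelled out with the approximate figures above as placeholders:

    ```python
    # Single-disk methodology sketch: benchmark the same disk locally,
    # then from inside a VM over SMB3; the delta is attributed to caching.
    # The figures below are just the rough numbers mentioned above.
    local_read_gbs = 2.5    # raw storage read speed measured on the SMB server
    vm_over_smb_gbs = 3.0   # same workload run inside a VM across SMB3

    cache_benefit = vm_over_smb_gbs - local_read_gbs
    print(f"Apparent caching benefit: ~{cache_benefit:.1f} GB/s")  # ~0.5 GB/s
    ```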

    I'm pretty amazed by how much they've improved SMB; to me they've completely destroyed the need for iSCSI or NFS. It's fast as hell and has zero learning curve - it just works. 

    • Edited by Dustyny1 Thursday, August 30, 2012 6:55 PM
    Thursday, August 30, 2012 6:42 PM
  • I'd be interested in any performance results you have with IB cards. Are you using Mellanox ConnectX? I don't have any experience with IB, just what I've read.

    Also, how does the IB network connection appear in the Hyper-V VM? As a Hyper-V network adapter (limit 10Gb) or something else?

    I'm investigating the following setup (kind of like cluster-in-a-box, but a much more scaled up custom version for SQL):

    * 2x DL980 G7 hosts, each with 4+ cluster-aware HA RAID SAS controllers. These should be coming out soon, from LSI: http://www.lsi.com/solutions/Pages/HA-DAS.aspx. Presumably HP and others will OEM them.

    * 16+ dual-port SAS enclosures, connected to each server. 400+ 10k SAS disks shared between the two servers.

    * SOFS running in 2 VMs on the two hosts. Need SOFS to run in VMs, due to the loopback issue (app data can't loop back to a SOFS running on the same server, but you can get around this by putting SOFS in VMs - see http://technet.microsoft.com/en-us/library/jj134181 where it says "Accessing a continuously available file share as a loopback share is not supported. For example, Microsoft SQL Server or Hyper-V storing their data files on SMB file shares must run on computers that are not a member of the file server cluster for the SMB file shares")

    * A set of SQL clusters running directly on the two hosts. All SQL data stored on the SOFS SMB3 share for HA/failover. Normally different SQL instances running on the different nodes.

    Ideally each SQL instance would only normally connect to the SOFS VM running on the same host. I don't know if I can force this or express a preference for a SOFS node, as the connection seems to be controlled by round-robin DNS. That way I might not actually need IB/RDMA cards, as the primary connection would be internal using the Hyper-V network adapter. I'm not sure if this would be limited to 10Gb (per the Hyper-V network adapter), or could actually go faster (as it is all internal to the host hardware).

    Rebooting the local SOFS VM would temporarily redirect traffic to the other node over GbE - you'd want to keep this period short. Failure of a host wouldn't be a problem, as both the SQL instance and the SOFS would fail over to the other host, so all SQL to SOFS data transfer would be kept local on that other machine.

    But I may yet need IB cards (either directly connected, or via 1-2 IB switches) to make this work. Any thoughts?


    Thursday, August 30, 2012 7:05 PM
  • Local storage with HA/clustering is the holy grail, but in my experience we might not be there yet. I didn't use a shared SAS enclosure, so that might be what I was missing; I'd love to hear your experience when you do your testing.  

    I'm using the Mellanox ConnectX-2 40Gb dual-port cards. I got them on eBay for a really reasonable price, especially considering what they do. Setup was really simple (in Windows; Linux is not fun and Unix is a nightmare). The only downside is the vendor makes you pay for a support contract to get any support - I had to freak out on a tech just to get them to tell me the card I originally purchased wasn't supported in Windows 2012. But with that in mind, no one else comes anywhere close to them in price. 

    I actually just spent the last 4 months testing something similar. I originally intended to run 2 SANs in VMs (1 per node), passing the storage controller and NIC through to the VM in KVM, and then using the HA features to keep the two SANs in sync. That's what led me to purchase the IB cards, because I was being bottlenecked by the 1Gb NICs I had (10Gb is too $$ for what you get), so storage performance was limited to ~100MB/s. KVM is so labor intensive, I just couldn't see putting it in production. 

    I tried a ton of different configs of software and hardware - just about every software SAN I could find - and I just couldn't get the storage to deliver usable read/write speeds. So I finally gave in and added a 3rd server, which I'm now using as the SMB server. 

    You'll want to do your own testing, but for me the virtual NIC was the problem. I'm not sure if that's a limitation of my config or not, so maybe you'll see something different. In KVM there is a driver that gives the virtual machine direct access to the host OS and it's very fast, but in Hyper-V the vNIC seemed to be limited to 10Gb. I tried bridging some virtual NICs, but that didn't solve the problem either (I forget if it didn't work or if it just didn't go beyond 10Gb). I'm going to be dealing with a lot of large disk transfers and I didn't want to lose out on the speed of the SSDs and the IB cards, but if 1Gb works for you then you should be fine. 

    I think PCIe pass-through would be the solution to your problem, but unfortunately Hyper-V doesn't support it. If you're good with Linux and you get stuck, it might be worth checking out KVM, but you'll give up a lot of polish and the documentation is pretty terrible IMO. 

    I did notice earlier today that the Mellanox cards can do FCoE, which I was considering experimenting with; I thought maybe I could use that and get at least 2GB/s, which wouldn't be so bad. 

    Thursday, August 30, 2012 11:53 PM
  • Thanks for the info. You say you found that the Hyper-V virtual NIC limited you to 10Gb. But you said earlier you get 3GB/s when running from a VM on one host to another over IB. So how did you configure the virtual NICs?

    Perhaps you can configure multiple virtual NICs for the VM and use SMB3 multi-channel with different IP addresses, or alternatively a guest team, to get more than 10Gb? You said you tried bridging but it didn't work for you. But it seems Jose Barreto was getting 5+GB/sec using 2x 54Gb IB cards to his VMs, which dropped to 3GB/s when one was disconnected.

    I need to stick with Hyper-V (vs. KVM or some other virtualization layer) as we run SQL Server and a few other things in our hosts. Even 32 vCPUs and 512GB to a VM with Server 2012 / Hyper-V 3 isn't enough when you have a DL980 G7 with 1TB and 64 cores - principally for SQL Server.

    Did you try Starwind for software HA iSCSI SAN on two servers with local DAS? They have some benchmark that shows 1 million+ IOPS, so it seems to scale. I use it for HA/DR shared storage across two data centers for my VM storage - direct MPIO iSCSI for the 2 VMs that provide the SOFS, which then serve up their CSV storage as a SOFS SMB3 share for all other VMs.
    Friday, August 31, 2012 12:17 AM
  • Hi there,

    I recently set up a virtual SAN (NexentaStor) on my workstation (i3-2100, 24GB RAM, Intel 320 SSD) as the iSCSI target for my Hyper-V environment. NexentaStor uses one 250GB 7.2k SATA HDD as storage and one Corsair X32 SSD as ZIL (although it's not really used, since 16GB of RAM is enough for the ARC and write cache).

    The virtual SAN is connected via 1Gbit Ethernet to the Hyper-V server (Server 2012 RC with the Hyper-V role enabled); the iSCSI initiator does the rest.

    First tests show 50MB/s read/write with 4KB random I/O (12,500 IOPS), which makes the VMs (WS2012 Essentials beta and WS2012 RC with Exchange 2013) responsive as... just as I like it :P

    My question is: will I see similar I/O performance if I use a WS2012 SMB 3.0 share as the host for the VM disks? I have heard a lot about the SMB 3.0 performance increase, especially for Hyper-V disk access. On the other hand, threads like "Slow SMB3 and iSCSI as Hyper-V VM storage because of unbuffered I/O" report poor performance for this scenario.


    It's a very bad idea to run storage hypervisor software remotely, for a simple reason - you put very fast caching behind a slow network. So find a solution that clusters DAS and converts it into a SAN directly on Hyper-V, without any external hardware - that's the way to go for performance. Better still with the ability to do SSD caching (not a big deal if it can't, as every PCIe flash card vendor has their own stuff). Everything else is a half-baked solution - network latency kills everything.

    -nismo

    • Proposed as answer by VR38DETTMVP Friday, August 31, 2012 6:02 AM
    Friday, August 31, 2012 6:02 AM

  • I'm pretty amazed by how much they've improved SMB; to me they've completely destroyed the need for iSCSI or NFS. It's fast as hell and has zero learning curve - it just works. 

    I don't think so...

    1) SOFS requires shared storage to run. You cannot cluster it AS IS. iSCSI and NFS can be clustered without it (cheaper and faster, as you have a shorter I/O route).

    2) SMB does not allow splitting requests between multiple servers (no load balancing). MPIO with iSCSI and pNFS with NFS do. 

    3) You still need iSCSI to run a guest VM cluster.

    So MS killed iSCSI and NFS for "test and development" but not for production. IMHO of course.

    -nismo

    Friday, August 31, 2012 6:06 AM
  • I tried a ton of different configs of software and hardware - just about every software SAN I could find - and I just couldn't get the storage to deliver usable read/write speeds. So I finally gave in and added a 3rd server, which I'm now using as the SMB server. 


    Did you try Native SAN for Hyper-V? You would need DAS only (and cheap DAS - no need to go for multi-port SAS) and only 2 nodes for an all-redundant config (you have three, which is more expensive, and you still have the SMB server as a single point of failure). Just make sure you put enough RAM cache on both nodes to "spoof" I/O. Your feedback and flash-related config (do you want flash cache as level 2 in addition to RAM as level 1?) would be appreciated.

    -nismo

    Friday, August 31, 2012 6:10 AM
  • 1 - true

    2 - true, particularly so, given the multiple options for MPIO (preferred nodes, etc. - helpful when you are running a cross-site storage target). I don't see any options to prefer a particular SMB3 node for a particular client - it just uses random round robin. Preferred nodes would be very helpful with differences in network quality (bandwidth, latency, congestion, charging, etc.) across the cluster.

    3 - I'm not sure about that. I think you can set up a two-node cluster (in a host or in a guest) with no shared storage, and just use a file share as the witness instead of shared storage. See http://www.sqlskills.com/blogs/jonathan/post/Failover-Clustering-without-a-SAN-SQL-Server-2012-and-SMB-for-Shared-Storage.aspx. If this is an SMB3 SOFS file share, you still have HA.

    However, there are several reasons you might want to have a SOFS SMB3 share hosting your production VMs instead of an iSCSI target.

    * Easier setup of new cluster nodes (don't need to install MPIO iSCSI client and set up connections for each cluster node)

    * Live migration for VM storage mobility, to/from non-shared storage and other SOFS targets

    * Space efficiency across multiple clusters - don't need to dedicate separate fixed-size shared disk targets to each cluster

    * Ability to use differencing VHDXs with a base Server 2012 VM image in a single VHDX, to save storage across clusters

    Of course, I still like Starwind to provide the HA software iSCSI target that the SOFS CSV sits on top of. It just seems more convenient to serve it up as a SOFS to other VMs, instead of iSCSI. It seems there are pros and cons.

    Friday, August 31, 2012 9:35 AM
  • @David 

    The 3GB/s transfer speed was from a Hyper-V node to an SMB server; there were no VMs involved. I followed Jose's blog post regarding using the ConnectX-2 cards, but that's for hosting the VMs, not for transferring data in and out of them. I actually misspoke about bridging vNICs in Hyper-V; I tested it a while ago, so I forgot why I decided it wasn't right for my setup. I just created 2 internal v-switches, bridged the vNICs in the host, did the same for the VM, ran netperf, and saw a 20Gb/s transfer. So bridged connections are a go - very easy to do. Now that I've looked through my notes, the reason we chose not to use this was SAN software costs; we felt it would be better for our setup to put the money into hardware, because we really don't need most of the SAN features anyway. 

    @nismo 

    Hey Nismo, always a pleasure to get your input. 

    Running SANs in VMs wasn't a good option for me, but my project has very specific needs and limitations. I always try to be clear that my results and experience are specific to my project and its parameters. We're building a number-crunching cloud, so our storage needs are much simpler than those of someone building something to support a SQL cluster. What we really need is fast networking and storage performance. We deal with very large files that will only be stored for a very short time (a week or two, depending on the size of our clients' projects). We are building a traditional-style cloud using inexpensive off-the-shelf components, so the $2-3k we'd spend on SAN software will buy us a lot of hardware. I'm actually really interested in the results David gets, because his setup looks to be the next step up for us, but I'm still undecided on whether we really even need it - Hyper-V Replica might actually be good enough for us. We can lose 10 minutes of data, since we can just reprocess it. 

    I can't comment on clustering and multipath. But I will say SMB used to be one of the slowest ways to get data on and off a server, and now it's blazing fast. In my specific setup using InfiniBand, SMB 3 is much faster than iSCSI, and RDMA worked with no configuration needed on my part. Setup took less than 30 seconds (share a folder and go); comparatively, iSCSI and NFS take a good deal more effort to set up. So it's free, I can run VMs from it, and they get a speed increase due to some sort of caching going on. For a simple setup like mine it's a no-brainer. The Hyper-V cluster runs perfectly from SMB shares, so I have live migration working. 

    It may not be ready for every use case, and I'd say Microsoft left plenty of room for 3rd-party storage vendors to add value, but SMB itself is extremely fast. 


    • Edited by Dustyny1 Friday, August 31, 2012 2:40 PM
    Friday, August 31, 2012 2:30 PM
  • Running SANs in VMs wasn't a good option for me ... [ project description skipped to save space ]

    We don't run the SAN inside VMs, so there's no guest VM overhead. Your config was pretty much classic; that's why I wanted to hear performance-related feedback for two-node vs. three-node installations. The key point with clustered DAS is that you have all caches sitting on the memory bus or PCIe.

    -nismo

    Friday, August 31, 2012 5:25 PM

  • It may not be ready for every use case, and I'd say Microsoft left plenty of room for 3rd-party storage vendors to add value, but SMB itself is extremely fast. 


    This is VERY true. Definitely a huge step forward compared to what we've had before.

    -nismo

    Friday, August 31, 2012 5:26 PM
  • 3 - I'm not sure about that. I think you can set up a two-node cluster (in a host or in a guest) with no shared storage, and just use a file share as the witness instead of shared storage. See http://www.sqlskills.com/blogs/jonathan/post/Failover-Clustering-without-a-SAN-SQL-Server-2012-and-SMB-for-Shared-Storage.aspx. If this is an SMB3 SOFS file share, you still have HA.

    Missed that one. You're right. Good shot!

    -nismo

    Friday, August 31, 2012 5:27 PM

  • However, there are several reasons you might want to have a SOFS SMB3 share hosting your production VMs instead of an iSCSI target.

    * Easier setup of new cluster nodes (don't need to install MPIO iSCSI client and set up connections for each cluster node)

    * Live migration for VM storage mobility, to/from non-shared storage and other SOFS targets

    * Space efficiency across multiple clusters - don't need to dedicate separate fixed-size shared disk targets to each cluster

    * Ability to use differencing VHDXs with a base Server 2012 VM image in a single VHDX, to save storage across clusters

    Of course, I still like Starwind to provide the HA software iSCSI target that the SOFS CSV sits on top of. It just seems more convenient to serve it up as a SOFS to other VMs, instead of iSCSI. It seems there are pros and cons.

    1) MPIO is built in. Very few use their own client-side MPIO stack these days. I personally see no point.

    2) Not sure how live migration is different when going SAN -> NAS.

    3) and 4) Space efficiency and "golden images" - deduplication takes care of that kind of stuff :)

    "Of course..." good point. Thanks for feedback!

    -nismo

    Friday, August 31, 2012 5:31 PM
  • On 1, yes, I'm using the Windows MPIO iSCSI client. But it still requires several steps the first time you set it up on each node:
    * Add MPIO feature
    * Set up your first iSCSI connection
    * Add support for iSCSI in the MPIO control panel. The required check box is grayed out until you add your first iSCSI connection - it's not clear why this is so
    * Reboot to enable MPIO for iSCSI
    * Add second MPIO iSCSI connection

    All this is required before adding the new node to the cluster using iSCSI storage, whereas nothing is required for a network share. So using a SOFS (for a possible witness, and for storage) is much easier - particularly if you have many nodes to set up.
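
    For what it's worth, those per-node steps can be scripted rather than clicked through. Here's a rough sketch that shells out to the in-box Server 2012 MPIO/iSCSI cmdlets (the portal address and target IQN are placeholders, and the reboot in the middle still applies):

    ```python
    # Rough sketch: drive the Server 2012 MPIO/iSCSI setup steps from a script
    # instead of the GUI. Assumes the in-box Install-WindowsFeature, MPIO and
    # iSCSI cmdlets; PORTAL and TARGET_IQN below are placeholders.
    import subprocess

    PORTAL = "192.168.1.10"                      # placeholder iSCSI portal address
    TARGET_IQN = "iqn.2012-08.example:target0"   # placeholder target IQN

    def ps(command: str) -> None:
        """Run one PowerShell command and fail loudly if it errors."""
        subprocess.run(["powershell", "-NoProfile", "-Command", command], check=True)

    ps("Install-WindowsFeature Multipath-IO")                    # add the MPIO feature
    ps(f"New-IscsiTargetPortal -TargetPortalAddress {PORTAL}")   # first iSCSI connection
    ps("Enable-MSDSMAutomaticClaim -BusType iSCSI")              # claim iSCSI devices for MPIO
    # A reboot is still needed at this point before MPIO takes effect for iSCSI.
    ps(f"Connect-IscsiTarget -NodeAddress {TARGET_IQN} "
       "-IsMultipathEnabled $true -IsPersistent $true")          # multipathed, persistent session
    ```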

    Also, my experience is that the Microsoft client doesn't always auto-reconnect all the iSCSI targets after target maintenance (e.g. switching HA partners). So it seems you may end up in a non-redundant state, with very little ability to monitor it, apart from an admin manually going into each host directly and checking the current iSCSI status.

    On 2, you're right. You should be able to live migrate storage between any combination of NAS/SMB (\\SOFS), SAN (c:\ClusterStorage), and local disk.

    On 3 and 4, I didn't see any option to use deduplication on my Starwind HA target. I'm using the latest 5.8.2059. Right now it seems you can choose either HA or deduplication - but not both.


    Friday, August 31, 2012 6:04 PM
  • I see... So simplicity is the way to go. Sounds promising :)

    V6 is out and it can do both. You may, however, want to wait for the upcoming release with log-structuring to fight the I/O blender effect, plus a flash-level cache. For now dedupe slows down I/O, but with that it will increase IOPS.

    How do you configure "golden images" with MS now? Do you run VDI?

    -nismo

    Friday, August 31, 2012 6:51 PM
  • Very helpful - thanks. I see Starwind v6 was released 3 days ago :-) I now have it installed and have an HA deduplicated iSCSI test target.

    I like the new option to have a 3-way active-active-active setup - so I can have 2 active in the primary DC, and a failover active in the secondary DC (connected via a GbE metro fibre connection). It also looks like there is async DR replication on top of that, which I can use over WAN to a third DC. I also like the option to mark a node as having better or worse underlying disk performance, so as to better balance iSCSI requests when you have asymmetric disk performance. Finally, the UI is better, with fewer manual configuration steps.

    We don't use VDI; we use a traditional Remote Desktop Session Host for hundreds of users in a few large VMs. I think that is more stable and gives better performance, but you need to watch security, as there isn't VM-level isolation.

    I was following Jose Barreto's steps here (http://blogs.technet.com/b/josebda/archive/2012/08/23/windows-server-2012-scale-out-file-server-for-sql-server-2012-step-by-step-installation.aspx#5) to configure a base Server 2012 VHDX and build differencing VHDXs off that.

    But it doesn't work so well if you need to install the same large software packages (SQL, Office, etc.) in each VHDX, as you duplicate data in each child VHDX. Maybe I should install all software in the base VHDX first, then SYSPREP afterwards, and then reinstall all apps in the children to fix any configuration problems caused by SYSPREP? Of course, a deduplicated iSCSI target could be used instead or at the same time.
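
    To put rough numbers on that duplication, here's a back-of-the-envelope sketch - all the sizes below are made up for illustration, not measured:

    ```python
    # Back-of-the-envelope storage comparison: full VHDX copies vs. a shared
    # base image plus differencing VHDXs. All sizes are illustrative guesses.
    num_vms = 20
    base_os_gb = 15           # sysprepped Server 2012 base image
    apps_per_vm_gb = 10       # SQL/Office etc. installed separately in each child
    unique_data_per_vm_gb = 5 # per-VM configuration and data

    full_copies = num_vms * (base_os_gb + apps_per_vm_gb + unique_data_per_vm_gb)
    base_plus_diffs = base_os_gb + num_vms * (apps_per_vm_gb + unique_data_per_vm_gb)
    apps_in_base = (base_os_gb + apps_per_vm_gb) + num_vms * unique_data_per_vm_gb

    print(f"Full VHDX copy per VM              : {full_copies} GB total")
    print(f"Base + diffs, apps in each child   : {base_plus_diffs} GB total")
    print(f"Base (apps pre-sysprep) + diffs    : {apps_in_base} GB total")
    ```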

    Do you have a perspective on the trade-offs of deduplicated storage vs. differencing VHDXs?

    Friday, August 31, 2012 8:57 PM
  • In-line dedupe is obviously more "expensive" on the CPU side and it's a real memory pig. So "golden images" and diffs could be a cheaper and more effective way to go for VDI.

    -nismo

    Monday, September 03, 2012 2:50 AM
  • You are absolutely correct, I just realized what's going on, thank you.

    When I use netperf between the node and the SMB server I see transfer rates of 2GB/s. I figured that since I was 1GB/s below the PCIe bus limitation I could tune some network settings and get a bit closer to the 3GB/s limit... now I'm thinking that gap could actually be the overhead that RDMA is meant to address. 

    Though the storage system on my SMB server can read and write to the disks at about 2.5GB/s, when I ran the benchmark in the VM I actually got a 500MB/s bump to 3GB/s. So it looks to me like there is some sort of read/write caching going on, but I'm not 100% sure where it is taking place. 

    Since the PCIe bus limits me to 3GB/s, I bet I could figure out how much of a boost I'm getting by using just one disk instead of a stripe set. If I take one disk, benchmark it locally as a baseline, and then benchmark it across SMB (inside a VM), the difference should be the caching benefit: VM benchmark - local benchmark = cache benefit. I'll try to do some testing once I get my storage sorted. 

    I'm pretty amazed by how much they've improved SMB; to me they've completely destroyed the need for iSCSI or NFS. It's fast as hell and has zero learning curve - it just works. 

    Were the speeds you mention above achieved over the network on a single 1G Ethernet card, or did you use some NIC bonding technique?
    Thursday, March 27, 2014 5:03 AM