Windows Server 2012 - Hyper-V - iSCSI SAN - All Hyper-V Guests Stop Responding, with Extensive Disk Read/Write

  • Question

  • We have a problem with one of our deployments of Windows Server 2012 Hyper-V: a 2-node cluster connected to an iSCSI SAN.

    Our setup:

    Hosts - Both run Windows Server 2012 Standard and are clustered.

    • HP ProLiant G7, 24 GB RAM, 2 teamed NICs dedicated to Virtual Machines and Management, 2 teamed NICs dedicated to iSCSI storage. - This is the primary host, and normally all VMs run on this host.
    • HP ProLiant G5, 20 GB RAM, 1 NIC dedicated to Virtual Machines and Management, 2 teamed NICs dedicated to iSCSI storage. - This is the secondary host and is intended to be used in case of failure of the primary host.
    • We have no antivirus on the hosts, and the scheduled Shadow Copies (Previous Versions of files) are switched off.

    iSCSI SAN:

    • QNAP NAS TS-869 Pro, 8 Intel SSDSA2CW160G3 160 GB SSDs in a RAID 5 with a hot spare. 2 teamed NICs.

    Switch:

    • D-Link DGS-1210-16 - The hosts' storage-dedicated NICs and the SAN itself are connected to this switch, and nothing else is connected to it.

    Virtual Machines:

    • 3 Windows Server 2012 Standard - 1 DC, 1 FileServer, 1 Application Server.
    • 1 Windows Server 2008 Standard Exchange Server.
    • All VMs are using dynamic disks (as recommended by Microsoft).

    Updates

    • We applied the most recent updates to the hosts, VMs and iSCSI SAN about 3 weeks ago with no change in our problem, and we continually update the setup.

    Normal operation

    • Normally this setup works just fine, and we see no real difference in startup speed, file copy speed or processing speed in LoB applications compared to a single host with 2 10,000 RPM disks. Normal network speed is 10-200 Mbit/s, but occasionally we see speeds up to 400 Mbit/s of combined read/write, for instance during a file repair.

    Our Problem

    • Our problem is that for some reason all of the VMs stop responding or respond very slowly; you cannot, for instance, send CTRL-ALT-DEL to a VM in the Hyper-V console, or start Task Manager when already logged in.

    Symptoms (i.e. things that happen, or do not happen, at the same time)

    • If we look at Resource Monitor on the host, we often see an extensive read from the VHDX of one of the VMs (40-60 MByte/s) and a matching combined write speed to many files in \HarddiskVolume5\System Volume Information\{<someguid and no file extension>}. See image below.
    • The combined network speed to the iSCSI SAN is about 500-600 Mbit/s.
    • When this happens it is usually during or after a VSS ShadowCopy backup, but it has also happened during hours when no backup should be running (i.e. during daytime, when the backup finished hours ago according to the log files). The writes to the backup file created on an external hard drive are, however, not that extensive, and this does not seem to happen during every backup (we have checked manually a few times, but it is hard to say, since this error does not seem to leave any traces in Event Viewer).
    • We cannot find any indication that the VMs themselves detect any problem, and we see no increase in errors (for example storage-related errors) in the event log inside the VMs.
    • The QNAP uses about 50% processing power on all cores.
    • We see no dropped packets on the switch.

    (I have split the image to save horizontal space).

    Unable to recreate the problem / find a definitive trigger

    • We have not succeeded in recreating the problem manually by, for instance, running chkdsk or defrag in VMs and hosts, copying large files to and removing them from VMs, or running CPU- and disk-intensive operations inside a VM (for instance scanning and repairing a database file).

    Questions

    • Why do all VMs stop responding, and why are there such intensive reads/writes to the iSCSI SAN?
    • Could it be that something in our setup cannot handle all the read/write requests - for instance the iSCSI SAN, the hosts, etc.?
    • What can we do about this? Should we use MultiPath IO instead of NIC teaming to the SAN, limit bandwidth to the SAN, etc.?
    Wednesday, January 30, 2013 1:26 PM

All replies

  • I see a few points where you might be able to improve things, but nothing that directly points to your problem. I wouldn't run Exchange databases on a dynamic disk and I'm pretty sure Microsoft doesn't recommend it, either. It should be fine for everything else, though. I would definitely say to stop teaming the iSCSI links and use MPIO instead. For a QNAP, that means the SAN links stay teamed but the server team is broken and MPIO is enabled. However, both of these changes are just going to net you some more performance and in normal operations you probably won't notice the difference. I highly doubt they'll meaningfully address this problem.

    Your screenshot indicates to me that at least one of the Hyper-V hosts is performing something on the CSV. You've already ruled out Previous Versions and AV. I don't have a Server 2012 computer on hand to provide direct guidance, but you can use Task Manager and/or Performance Monitor to track I/O usage on individual processes. That might help you narrow down the suspects list.
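
    For example, something like this from an elevated PowerShell prompt will sample the per-process I/O counters and list the top talkers (a rough sketch; the \Process(*) counter set is standard, but double-check the counter names on your build):

        # Sample total per-process I/O (disk plus other I/O) five times, 5 seconds apart,
        # and show the five busiest processes in each pass.
        Get-Counter -Counter '\Process(*)\IO Data Bytes/sec' -SampleInterval 5 -MaxSamples 5 |
            ForEach-Object {
                $_.CounterSamples |
                    Sort-Object CookedValue -Descending |
                    Select-Object -First 5 InstanceName, @{ n = 'MB/sec'; e = { [math]::Round($_.CookedValue / 1MB, 2) } }
            }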


    Eric Siron http://www.altaro.com/hyper-v-backup/


    • Edited by Eric Siron MVP Wednesday, January 30, 2013 3:20 PM typo
    Wednesday, January 30, 2013 3:19 PM
  • Hello Eric!

    I agree that we would get a bit better performance if we used static virtual disks, but it is unclear whether MPIO will increase performance (I have read many articles about this, and there seems to be an ongoing debate about which is better on Windows Server 2012). I have also come to the conclusion that none of these changes will solve the problem, since normally performance is not an issue.

    But, here is a question:

    • Would MPIO change the behaviour of accessing the iSCSI SAN so that one virtual machine may not use more bandwidth than one NIC allows, leaving the rest for other VMs and other traffic, or will this be roughly the same?

    We have of course looked in Task Manager and Resource Monitor (the image is from Resource Monitor) on both the host and, when possible, inside the VMs. As you see in our images, this does not reveal anything, since the "image" (the application that performs the writes to disk) is only called "System". The same goes for the network traffic. The question is what "System" is doing and why, and whether this might be what causes the VMs to stop responding.

    Is there any way to determine which volume "HarddiskVolume5" is (C:, D:, cluster storage, a temporary VHD, or something else)?

    Do you or anyone else (preferably someone that runs Windows Server 2012 and iSCSI) have any thoughts about this?

    Wednesday, January 30, 2013 4:32 PM
  • As far as I'm concerned, using NIC teaming for iSCSI is pretty much like using a flathead screwdriver to turn a Phillips screw. What pushes it over the edge for me is that the MPIO wheel is far rounder and much more all-terrain than any form of iSCSI-over-NIC-team mechanism. It wouldn't be something I'd spend a bunch of time debating, though.

    My language here is going to be imprecise for the sake of expediency: MPIO can split communications transmissions in a way that teaming cannot. In practice, the end results are typically difficult to distinguish. A VM could theoretically saturate the bonded link and starve the others out more easily on MPIO than on teaming. However, the host is in charge of I/O scheduling for anything but a pass-through situation so I wouldn't spend a lot of time on that hunt.

    I did misinterpret your screenshot to be something from the SAN perspective because I was sitting here looking at a Windows 7 version and my blood-caffeine content wasn't high enough to make the translation properly. Sorry about that.

    Try going into DISKPART and running LIST VOLUME.
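
    If LIST VOLUME alone doesn't settle it, the sketch below maps each volume's NT device name (e.g. \Device\HarddiskVolume5) to its mount point. QueryDosDevice is a documented Win32 API, but the P/Invoke wrapper and the rest of the plumbing here are my own, untested assumption:

        # Translate every Win32_Volume GUID path into its underlying \Device\HarddiskVolumeN name.
        Add-Type -Namespace Win32 -Name NativeMethods -MemberDefinition '[DllImport("kernel32.dll")] public static extern uint QueryDosDevice(string name, System.Text.StringBuilder target, int max);'
        Get-WmiObject Win32_Volume | ForEach-Object {
            $dosName = $_.DeviceID.Substring(4).TrimEnd('\')     # \\?\Volume{guid}\ -> Volume{guid}
            $buf = New-Object System.Text.StringBuilder 512
            [void][Win32.NativeMethods]::QueryDosDevice($dosName, $buf, $buf.Capacity)
            '{0,-32} -> {1}' -f $buf.ToString(), $_.Name         # e.g. \Device\HarddiskVolume5 -> C:\ClusterStorage\Volume1\
        }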


    Eric Siron http://www.altaro.com/hyper-v-backup/

    Wednesday, January 30, 2013 6:25 PM
  • iSCSI SAN:

    • QNAP NAS TS-869 Pro, 8 Intel SSDSA2CW160G3 160 GB SSDs in a RAID 5 with a hot spare. 2 teamed NICs.

    That's the problem. Never use NIC teaming with iSCSI. Configure MPIO and Round Robin.

    Also, you run dedicated wiring for management and SAN traffic. That is not the way to go either. Keep them mixed; just create VLANs and apply QoS to minor traffic sources (like management).

    Reconfigure and run it for a while. If you continue to have the same issues, contact QNAP support. To me it sounds like the I/O queue length is being exceeded on the QNAP, so the device flushes its caches while keeping incoming requests frozen.
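
    With the inbox tools that is roughly (a sketch; the cmdlet names are from the Server 2012 MPIO module, verify on your build, and the automatic claim takes effect after a reboot):

        Install-WindowsFeature Multipath-IO                    # add the MPIO feature
        Enable-MSDSMAutomaticClaim -BusType iSCSI              # let MSDSM claim iSCSI LUNs (reboot to apply)
        Set-MSDSMGlobalDefaultLoadBalancePolicy -Policy RR     # Round Robin as the default policy

    Then break the host-side team, give each iSCSI NIC its own IP, and log in to the target once per NIC.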


    StarWind iSCSI SAN & NAS

    Wednesday, January 30, 2013 9:39 PM
  • Hello Eric and VR38DETT!

    Thanks for your replies!

    Eric,

    • "list volume" in diskpart does not reveal what the "HarddiskVolume5" is.
    • Do you think the problem might get worse if we use MPIO instead of NIC Teaming on the Host?

    VR38DETT

    • Do you suggest not using NIC teaming on the QNAP as well? That is, running both NICs standalone, each with its own IP address? Will MPIO still work?
    • Many MPIO scenarios that I have seen suggest using different subnets. Is this a requirement for MPIO, or just a way to make sure that you do not run out of IP addresses?
    • Why should it be better not to have dedicated wiring for iSCSI and management? The hosts have 3-4 1 Gbit NICs, but the QNAP only has 2, so using more NICs for iSCSI would not improve anything in this scenario.

    We also think that it is the QNAP that cannot handle all the I/O requests, but unfortunately we still do not know why this happens. Trying to tune performance in such a way that the QNAP receives an even greater load does not seem like a good idea, but we could of course test this and see if it makes any difference - but only if you believe it would fix the problem!

    Please do not provide an answer that you cannot/will not give a good reason for!

    Thursday, January 31, 2013 1:47 AM
  • We have had some success in identifying the source of the extensive read/write operations.

    • HarddiskVolume5 must be the ClusterStorage, since it is the fifth volume displayed in DiskPart (listed as "Volume 4", because DiskPart numbers volumes from 0).

    I therefore checked the "C:\ClusterStorage\Volume1\System Volume Information" folder during backup; large files were created there at the same time as something was reading from the VHDX file of another VM (i.e. not the VM that was currently being backed up) at the same speed.

    The backup might trigger this behavior, but the backup itself was, as expected, reading from the \Device\HardDiskVolumeShadowCopy# path at a much slower rate, so this might not be connected.

    I have read several articles suggesting that the Volume Shadow Copy service (Previous Versions) might be used even when it is disabled, as stated in the last comment on this post:

    • http://social.technet.microsoft.com/Forums/en-US/winserverhyperv/thread/fea72e78-a544-417a-9187-05f0ceda347d/

    There is, as far as I know, no need for our simple VSS backup to create a file in the "System Volume Information" folder to successfully create a backup, and nothing ever read any data from this file. One strange thing is that this very large file was deleted when the backup of another VM (i.e. not the one that was copied) had finished.

    The above post suggests activating Shadow Copies on the hosts and setting the limit as low as possible to get rid of this behaviour.
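
    If we try that suggestion, it would presumably look something like this (my own sketch; 320 MB is the documented minimum for /maxsize, and for the CSV the \\?\Volume{GUID}\ path may be needed instead of a drive letter):

        vssadmin resize shadowstorage /for=C: /on=C: /maxsize=320MB
        vssadmin list shadowstorage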

    Any thoughts on this? Will this solve the problem or create new ones?


    • Edited by Thomas N_ Thursday, January 31, 2013 2:15 AM
    Thursday, January 31, 2013 2:14 AM
  • Hi,

    > All VMs are using dynamic disks (as recommended by Microsoft).

    If this is a testing environment, it’s okay, but if this is a production environment, it’s not recommended. Fixed VHDs are recommended for production instead of dynamically expanding or differencing VHDs.

    Hyper-V: Dynamic virtual hard disks are not recommended for virtual machines that run server workloads in a production environment
    http://technet.microsoft.com/en-us/library/ee941151(v=WS.10).aspx
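
    If you decide to convert, the inbox Hyper-V PowerShell module can do it; a minimal sketch (the path is a placeholder, the VM must be shut down first, and the volume needs free space for the full-size copy):

        # Convert a dynamically expanding VHDX to fixed format (offline operation).
        Convert-VHD -Path 'C:\ClusterStorage\Volume1\VMs\EX01\Disk0.vhdx' -DestinationPath 'C:\ClusterStorage\Volume1\VMs\EX01\Disk0-fixed.vhdx' -VHDType Fixed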

    > This is the primary host and normally all VMs run on this host.

    According to your posting, we know that you have Cluster Shared Volumes in the Hyper-V cluster, but why not distribute your VMs across the two Hyper-V hosts?

    Use Cluster Shared Volumes in a Windows Server 2012 Failover Cluster
    http://technet.microsoft.com/en-us/library/jj612868.aspx

    > 2 teamed NICs dedicated to iSCSI storage.

    Use Microsoft MultiPath IO (MPIO) to manage multiple paths to iSCSI storage. Microsoft does not support teaming on network adapters that are used to connect to iSCSI-based storage devices. (At least, it was not supported up to Windows Server 2008 R2. Windows Server 2012 has a built-in NIC teaming feature, but I have not found an article declaring that Windows Server 2012 NIC teaming supports iSCSI connections.)

    Understanding Requirements for Failover Clusters
    http://technet.microsoft.com/en-us/library/cc771404.aspx
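
    For reference, an MPIO-style login looks roughly like this with the inbox iSCSI cmdlets (a sketch; all addresses are placeholders, and you establish one session per initiator NIC against the same portal):

        New-IscsiTargetPortal -TargetPortalAddress 192.168.10.10
        # One session per initiator-side NIC; MPIO then combines the paths.
        Get-IscsiTarget | Connect-IscsiTarget -IsPersistent $true -IsMultipathEnabled $true -TargetPortalAddress 192.168.10.10 -InitiatorPortalAddress 192.168.10.21
        Get-IscsiTarget | Connect-IscsiTarget -IsPersistent $true -IsMultipathEnabled $true -TargetPortalAddress 192.168.10.10 -InitiatorPortalAddress 192.168.10.22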

    > Many MPIO scenarios that I have seen suggest using different subnets. Is this a requirement for MPIO,
    > or just a way to make sure that you do not run out of IP addresses?

    What I found is: if possible, isolate the iSCSI and data networks that reside on the same switch infrastructure through the use of VLANs and separate subnets. Redundant network paths from the server to the storage system via MPIO will maximize availability and performance. You can of course put these two NICs in separate subnets, but I don’t think it is necessary.

    > Why should it be better not to have dedicated wiring for iSCSI and Management?

    It is recommended that the iSCSI SAN network be separated (logically or physically) from the data network workloads. This ‘best practice’ network configuration optimizes performance and reliability.

    Check that, modify the cluster configuration, monitor it, and give us feedback for further troubleshooting.

    For more information please refer to following MS articles:

    Volume Shadow Copy Service
    http://technet.microsoft.com/en-us/library/ee923636(WS.10).aspx
    Support for Multipath I/O (MPIO)
    http://technet.microsoft.com/en-us/library/cc770294.aspx
    Deployments and Tests in an iSCSI SAN
    http://technet.microsoft.com/en-US/library/bb649502(v=SQL.90).aspx

    Hope this helps!

    TechNet Subscriber Support

    If you are a TechNet Subscription user and have any feedback on our support quality, please send your feedback here.


    Lawrence

    TechNet Community Support

    Thursday, January 31, 2013 7:06 AM
    Moderator
  • Hello Lawrence!

    Thank you for taking the time to provide a detailed answer, but unfortunately most of your links point to rather old documentation and older versions of Windows Server.

    We use, as stated in my original post, Windows Server 2012 Hyper-V:

    • Dynamic disks are the default recommendation, as stated in the "Performance Tuning Guidelines for Windows Server 2012" that can be downloaded from this link: http://download.microsoft.com/download/0/0/B/00BE76AF-D340-4759-8ECD-C80BC53B6231/performance-tuning-guidelines-windows-server-2012.docx - and dynamic is also the default format both when creating new VMs and when creating new virtual disks in Hyper-V Manager.
    • Our configuration has passed the Failover Cluster Validation Wizard and is therefore supported by Microsoft.
    • We could of course distribute the "load" over both hosts, but the primary host only uses about 20-30% of its capacity, so this is not the problem.

    MPIO might improve performance, but probably not much since we only have one LUN.

    http://social.technet.microsoft.com/Forums/en-US/winserver8gen/thread/a4ec7e18-9200-4137-a1cf-c171ec2cc79b/


    The question remains:

    • Why is a VHDX file being copied by "System" to the "System Volume Information" folder inside the cluster at speeds that surpass the top speed seen during normal operation? This is evidently what causes the problem, but WHY is it happening?
    • Shadow Copy (Previous Versions) is disabled, so what else could be causing this behavior?

    We manually ran backups of the 4 virtual servers, 1 or 2 at a time, and I monitored the process; all of a sudden the VHDX of the file server, which was not currently being backed up, was copied to "System Volume Information", which caused all the virtual machines to stop responding.

    The reason we ran backups of two virtual servers at a time was to try to trigger the behavior, and it worked, but as I said, we had not started the backup of the file server (it only contains file and print services, no applications) when its system disk (that is, the entire 25 GB system drive, which does not have any shares) was copied to the System Volume Information folder of the cluster as fast as possible.
    Just to clarify: we back up using child-partition snapshot backup, and during normal backup operation we see a read of about 5-15 MByte/s from a temporary volume called \HarddiskVolumeShadowCopy####\VMs\<VM name>\Virtual Hard Disks\<VHD-name>.vhdx (for instance \HarddiskVolumeShadowCopy24017\VMs\TF-DC01\Virtual Hard Disks\HD01.vhdx, as you see in the image provided in my original post).

    As far as I understand, no VHDX of a virtual machine should ever be copied to the "System Volume Information" folder of the cluster!

    Can anyone provide some insight on this? What could possibly be causing a VHDX to be copied to System Volume Information when that particular VM is not even backed up?
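
    One thing we could do is watch the folder, so the copy gets timestamped against the backup log. A sketch (untested; reading "System Volume Information" may require running as SYSTEM, e.g. via PsExec -s):

        # Log every file created or deleted in the CSV's System Volume Information folder.
        $fsw = New-Object System.IO.FileSystemWatcher 'C:\ClusterStorage\Volume1\System Volume Information'
        $fsw.EnableRaisingEvents = $true
        Register-ObjectEvent $fsw Created -Action { '{0:o} CREATED {1}' -f (Get-Date), $Event.SourceEventArgs.FullPath | Add-Content C:\svi-watch.log } | Out-Null
        Register-ObjectEvent $fsw Deleted -Action { '{0:o} DELETED {1}' -f (Get-Date), $Event.SourceEventArgs.FullPath | Add-Content C:\svi-watch.log } | Out-Null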

    Thursday, January 31, 2013 8:34 AM
  • Thomas,

    I wouldn't recommend spending a lot of time on the MPIO vs. teaming thing. Yes, MPIO is superior, more supported, etc., but it's not going to significantly address your issue. Revisit that question once you solve the problem. I don't know what the problem is, but it's very clear that the host is requesting high I/O and all of your systems are providing it to the best of their abilities. MPIO might squeeze a few more MB/s out of the whole process and reduce its duration but that's it.

    To answer one of your other questions, no, you don't break the team on the QNAP to enable it for MPIO -- at least, that's the QNAP-supported method. For another, you don't want to use different subnets for MPIO in your situation. The reason is that the QNAP team will stay in a single subnet and once you start routing iSCSI, you erase MPIO's benefits over teaming. I have used MPIO in multi-subnet and single-subnet environments and was never able to reliably measure a difference unless routing was involved. However, I would at least consider putting iSCSI traffic in its own network away from others. That said, the largest effect of segregating this traffic will be to reduce the impact of broadcast traffic and will not address your problem.

    Also, the "only use fixed in production" recommendation is just one of those things that people come up with by reading whitepapers. The fixed format is a little faster but unless you're using a high I/O VM, you have to run a benchmark to tell the difference. The biggest concern with dynamic disks is and always be the inherent risk in thin-provisioning any resource. This is also something I would recommend that you revisit later if you're actually concerned. There is absolutely no way that converting your VHDs to fixed format will address your problem. In fact, since it seems to be duplicating the entire .VHD, it will probably make the problem worse.

    I hate to say it because I know you've already looked at this, but everything about this is screaming "Previous Versions" or something of the kind. Have you checked Scheduled Tasks? Anything installed on the system, maybe even an orphaned component leftover from testing, that would trigger anything like this? Anything suspicious in "vssadmin list writers"? Any hints in "vssadmin list shadows"?


    Eric Siron http://www.altaro.com/hyper-v-backup/

    Thursday, January 31, 2013 2:59 PM
  • Hallo Eric!

    Thank you for your answer. I agree that a fixed VHD would in fact make the problem worse and that MPIO will not solve the problem, although we will probably change the configuration once the real cause of this problem has been found.

    Our setup is a standard install of Windows Server 2012 with the Hyper-V role and the cluster role added. I have not found any scheduled task that could cause this behavior, but it is of course possible.

    Here is the list of VSS writers:

    vssadmin 1.1 - Volume Shadow Copy Service administrative command-line tool
    (C) Copyright 2001-2012 Microsoft Corp.

    Writer name: 'Task Scheduler Writer'
       Writer Id: {d61d61c8-d73a-4eee-8cdd-f6f9786b7124}
       Writer Instance Id: {1bddd48e-5052-49db-9b07-b96f96727e6b}
       State: [1] Stable
       Last error: No error

    Writer name: 'VSS Metadata Store Writer'
       Writer Id: {75dfb225-e2e4-4d39-9ac9-ffaff65ddf06}
       Writer Instance Id: {088e7a7d-09a8-4cc6-a609-ad90e75ddc93}
       State: [1] Stable
       Last error: No error

    Writer name: 'Performance Counters Writer'
       Writer Id: {0bada1de-01a9-4625-8278-69e735f39dd2}
       Writer Instance Id: {f0086dda-9efc-47c5-8eb6-a944c3d09381}
       State: [1] Stable
       Last error: No error

    Writer name: 'System Writer'
       Writer Id: {e8132975-6f93-4464-a53e-1050253ae220}
       Writer Instance Id: {7848396d-00b1-47cd-8ba9-769b7ce402d2}
       State: [1] Stable
       Last error: No error

    Writer name: 'Microsoft Hyper-V VSS Writer'
       Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
       Writer Instance Id: {8b6c534a-18dd-4fff-b14e-1d4aebd1db74}
       State: [5] Waiting for completion
       Last error: No error

    Writer name: 'Cluster Shared Volume VSS Writer'
       Writer Id: {1072ae1c-e5a7-4ea1-9e4a-6f7964656570}
       Writer Instance Id: {d46c6a69-8b4a-4307-afcf-ca3611c7f680}
       State: [1] Stable
       Last error: No error

    Writer name: 'ASR Writer'
       Writer Id: {be000cbe-11fe-4426-9c58-531aa6355fc4}
       Writer Instance Id: {fc530484-71db-48c3-af5f-ef398070373e}
       State: [1] Stable
       Last error: No error

    Writer name: 'WMI Writer'
       Writer Id: {a6ad56c2-b509-4e6c-bb19-49d8f43532f0}
       Writer Instance Id: {3792e26e-c0d0-4901-b799-2e8d9ffe2085}
       State: [1] Stable
       Last error: No error

    Writer name: 'Registry Writer'
       Writer Id: {afbab4a2-367d-4d15-a586-71dbb18f8485}
       Writer Instance Id: {6ea65f92-e3fd-4a23-9e5f-b23de43bc756}
       State: [1] Stable
       Last error: No error

    Writer name: 'BITS Writer'
       Writer Id: {4969d978-be47-48b0-b100-f328f07ac1e0}
       Writer Instance Id: {71dc7876-2089-472c-8fed-4b8862037528}
       State: [1] Stable
       Last error: No error

    Writer name: 'Shadow Copy Optimization Writer'
       Writer Id: {4dc3bdd4-ab48-4d07-adb0-3bee2926fd7f}
       Writer Instance Id: {cb0c7fd8-1f5c-41bb-b2cc-82fabbdc466e}
       State: [1] Stable
       Last error: No error

    Writer name: 'Cluster Database'
       Writer Id: {41e12264-35d8-479b-8e5c-9b23d1dad37e}
       Writer Instance Id: {23320f7e-f165-409d-8456-5d7d8fbaefed}
       State: [1] Stable
       Last error: No error

    Writer name: 'COM+ REGDB Writer'
       Writer Id: {542da469-d3e1-473c-9f4f-7847f01fc64f}
       Writer Instance Id: {f23d0208-e569-48b0-ad30-1addb1a044af}
       State: [1] Stable
       Last error: No error

    As expected "vssadmin list shadows" returns "No items found that satisfy the query.", that is no shadowcopies has been found.

    Please note that I ran that command while the VMs were working and the "System Volume Information" folder was empty, so I might get another result when the problem occurs.

    Does anyone know what could possibly cause a VHDX to be copied to "System Volume Information" using a normal file copy with such high priority that nothing else can access the SAN while the file is being copied?
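
    To get some kind of trace the next time it happens, we could leave a polling loop like this running on the host (my own sketch; paths and interval are arbitrary):

        # Once a minute, record the VSS shadow state and the contents of the suspect folder.
        while ($true) {
            "===== $(Get-Date -Format o) =====" | Add-Content C:\vss-trace.log
            vssadmin list shadows | Add-Content C:\vss-trace.log
            Get-ChildItem 'C:\ClusterStorage\Volume1\System Volume Information' -Force -ErrorAction SilentlyContinue |
                ForEach-Object { '{0}  {1:N0} bytes' -f $_.Name, $_.Length | Add-Content C:\vss-trace.log }
            Start-Sleep -Seconds 60
        }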

    Sunday, February 3, 2013 4:41 PM
  • Hi Thomas, 

    Could you please provide the System and Application event logs from when the issue occurred? Thanks.

    Kevin Ni


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread.

    Friday, June 7, 2013 2:46 AM
  • I know this is an old post, but I just wanted to comment that I stumbled upon this thread because I am having issues with my DGS-1210-16 switch.

    When copying very large files (several GB) using standard SMB traffic between a Windows 2008 R2 server and a workstation, at speeds of about 50-100 MB/sec, the switch eventually disconnects my workstation. Well, actually, the workstation is still connected, but no traffic passes in either direction. I can't ping anything until I either wait for an extended amount of time or disable/enable the network card on the workstation. The server doesn't seem to suffer from this; it's using a dedicated HP NIC with offloading.

    The workstation used was first a Lenovo ThinkPad W510 with a Plextor 256 GB SSD. It has since been upgraded to a ThinkPad W530 with the same disk. The problem remains. If I place the server and the workstation on another switch, it copies the whole lot without an issue.

    I am running the latest firmware on the switch, and if the replacement switch hadn't been DOA, I would be running a different brand of switch today :)

    Regards
    Christian

    Thursday, February 20, 2014 10:14 AM