Tuesday, December 20, 2011 3:47 PM
We are in the process of creating a new Hyper-V Failover Cluster with several VM's on top of that. I have the following configuration:
- HP P4500 G2 14.4TB 2 node Lefthand storage, latest patches on SAN/iQ v9.5, 3 Thin Volumes of 1 TB used for Virtual Machine storage;
- Jumbo frames are enabled on the iSCSI network;
- DPM 2010 server, connected to LAN and iSCSI network with dedicated NIC's, latest HP DSM (MPIO) and Application Aware Snapshot Manager installed;
- Hyper-V Failover Cluster consisting of 5 Hyper-V Server 2008 R2 servers, latest HP DSM (MPIO) and Application Aware Snapshot Manager installed;
- 28 Virtual Machines (DC, Exchange 2010 servers, Citrix XenApp and XenDesktop servers, File/print, SQL and Provision services).
The following networks are configured on the Hyper-V servers, all separate NIC's:
- Heartbeat (private VLAN between Hyper-V servers)
- CSV/Live Migration (private VLAN between Hyper-V servers)
- Virtual Machines
Metrics are also set and confirmed; the CSV/Live Migration NIC's are used when redirected mode is enabled for the CSV's.
HP has confirmed that hardware VSS is supported for DPM 2010 since SAN/iQ 9.0. The CSV volumes are also presented to the DPM server; I have connected to the volumes using the iSCSI IP address as the source (the volumes are offline). I've created a Protection Group including all the VM's. When a recovery point is made or a consistency check is run, DPM puts the CSV's (depending on which VM is backed up) in redirected mode, creates a snapshot of the volume and presents it to the DPM server as a volume (not connected). I also found out that it makes a snapshot for every VM, so sometimes there are 4 or 5 snapshots per volume. After that, some CSV's stayed in redirected mode or even went offline, and the VM's residing on them eventually crashed.
To work around this problem I enabled serialized backup: http://robertanddpm.blogspot.com/2010/07/enabling-serialized-backup-of-hyper-v.html I know this is used when you don't have the option for hardware VSS, but this seems to take the 'load' off the Hyper-V Cluster.
The effect is now that when a certain VM is backed up, the volume where the VM is situated goes into redirected mode, a snapshot is made (not always directly, but mostly after a minute or two), the backup process starts and the CSV goes back to online. Only 3 VM's (1 per CSV) are backed up simultaneously. After the backup has completed for one VM, the snapshot is deleted and the process (redirected mode, snapshot, etc.) starts all over for the next VM. This seems to work OK and is also satisfactory in terms of performance and backup window. But for every VM that's being backed up, this error appears in the Failover Cluster Manager:
Cluster Shared Volume 'Volume1' ('CSV01') is no longer directly accessible from this cluster node. I/O access will be redirected to the storage device over the network through the node that owns the volume. This may result in degraded performance. If redirected access is turned on for this volume, please turn it off. If redirected access is turned off, please troubleshoot this node's connectivity to the storage device and I/O will resume to a healthy state once connectivity to the storage device is reestablished.
Also, it seems that the CSV is not available while it is in redirected mode. The Failover Manager of the P4500 storage (a Hyper-V virtual appliance, also residing on CSV storage but not made highly available) reports not being online for about a minute and then comes back.
When I manually put the CSV in redirected mode, no errors are shown and all works fine.
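The serialized flow described above (at most one VM per CSV backed up at a time, with each CSV working through its own queue) can be sketched roughly as follows. This is only a simplified model for reasoning about the backup window: the function name, the VM names and the VM-to-CSV placement are hypothetical, and grouping the work into synchronized "rounds" is an assumption made for readability; DPM itself does not expose anything like this.

```python
from collections import defaultdict, deque


def serialized_backup_rounds(vm_to_csv):
    """Group VMs into rounds so that each round holds at most one VM per CSV."""
    queues = defaultdict(deque)
    for vm, csv in vm_to_csv.items():
        queues[csv].append(vm)

    rounds = []
    while any(queues.values()):
        current = {}
        for csv in sorted(queues):
            if queues[csv]:
                # Take the next waiting VM on this CSV.
                current[csv] = queues[csv].popleft()
        rounds.append(current)
    return rounds


# Hypothetical placement of a few VMs on the three CSV's.
placement = {
    "DC01": "CSV01",
    "EXCH01": "CSV01",
    "SQL01": "CSV02",
    "FILE01": "CSV03",
    "PVS01": "CSV03",
}

rounds = serialized_backup_rounds(placement)
for number, batch in enumerate(rounds, start=1):
    for csv, vm in batch.items():
        print(f"round {number}: {csv} redirected -> snapshot -> back up {vm} -> snapshot deleted -> online")
```

With three CSV's, no round can contain more than three VM's, which matches the "only 3 VM's (1 per CSV)" behaviour seen here.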
I have a couple of questions:
1. Is it normal behaviour for the DPM to first put the CSV in redirected mode and after that create the snapshot and return the CSV to online again?
2. If so, is there somehow a way for DPM to wait for the snap to be created and then start to backup?
3. Can I force 1 particular NIC to be used for the backup? I want to use the iSCSI NIC (Jumbo frames are enabled for that NIC, and it has an IP address in the range of the storage). Right now all traffic goes through the Management NIC of the DPM server (the iSCSI network is still accessible from the LAN for testing purposes at the moment).
Thanks in advance.
Wednesday, December 21, 2011 9:56 AM
Small update: I disabled the trunk between the LAN and the iSCSI network so the DPM server is forced to use the iSCSI NIC to access the P4500 storage. But I still see all traffic routed through the Management NIC?
What I can see is that the Hyper-V host server hosting the VM that is being backed up has high traffic on the iSCSI network while performing the consistency check, and 'routes' the changes via the Management network to the DPM server. Am I right, and is this by design? That kind of ruins the point of mounting volume snapshots to the DPM server.
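One way to check which local address (and therefore which NIC) the operating system would pick for a new connection to a given target is to ask the routing table through a UDP socket. The sketch below is a generic diagnostic, not something DPM provides: the function name is made up, 127.0.0.1 is only a placeholder target, and 3260 is the default iSCSI port. It shows what the OS would choose for that destination; it does not show which path the DPM agent actually uses.

```python
import socket


def local_ip_for(target_ip, port=3260):
    """Return the local IP the OS would use as source address to reach target_ip."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        # connect() on a UDP socket sends no packets; it only makes the OS
        # select a route and bind a matching local address.
        s.connect((target_ip, port))
        return s.getsockname()[0]


# Placeholder target: replace with the storage VIP or the DPM server address.
print(local_ip_for("127.0.0.1"))
```

Running this on the Hyper-V host against the DPM server address, and on the DPM server against the storage address, would show whether the Management or the iSCSI address is selected for each destination.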
Saturday, December 24, 2011 4:24 PM
Uhm, anyone?
Wednesday, December 28, 2011 8:46 PM
I have been struggling with the same thing. The most important point is that when you use a hardware provider, the CSV should NEVER go into redirected mode. If it does, the installed hardware provider is not running correctly. Please make sure the hardware provider is also installed on the DPM server. If it's not, it will use Microsoft VSS to make the backups and the CSV will be in a redirected state for as long as the backup runs. If the CSV goes into redirected mode, the traffic that should go over the iSCSI adapter will be redirected over a network that is dedicated to cluster traffic.
Hope this helps,
Tuesday, January 03, 2012 7:40 AM
Hi Marthijn, thank you for your reply!
What I understand of your reply is that the CSV should never go in redirected mode before creating a snapshot? VSS Provider (Application Aware Snapshot Manager that is) is installed on all Hyper-V nodes as well as the DPM server, else I would not be able to create snapshots.
When running a full backup of all VM's in the Protection Group I now also see various other errors in Failover Cluster Manager:
Cluster resource 'CSV01' in clustered service or application 'e0405784-d36f-4803-9fff-49387aed559e' failed.
Cluster physical disk resource 'CSV01' cannot be brought online because the associated disk could not be found. The expected signature of the disk was '34E4BAAD'. If the disk was replaced or restored, in the Failover Cluster Manager snap-in, you can use the Repair function (in the properties sheet for the disk) to repair the new or restored disk. If the disk will not be replaced, delete the associated disk resource.
The Cluster service failed to bring clustered service or application 'e0405784-d36f-4803-9fff-49387aed559e' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.
Cluster Shared Volume 'Volume3' ('CSV02') is no longer available on this node because of 'STATUS_VOLUME_DISMOUNTED(c000026e)'. All I/O will temporarily be queued until a path to the volume is reestablished.
Cluster resource 'CSV01' (resource type '', DLL 'clusres.dll') either crashed or deadlocked. The Resource Hosting Subsystem (RHS) process will now attempt to terminate, and the resource will be marked to run in a separate monitor.
The VM's kept running, maybe because the downtime of the CSV was not long enough. But this is not right.
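To see how often each CSV is affected during a backup run, the event texts can be tallied per volume. The sketch below is a minimal example that only recognises the two message shapes quoted in this thread; the function name and the idea of feeding it a list of exported message strings are assumptions, and how the events are exported from the cluster is left out.

```python
import re
from collections import Counter

# Patterns built from the two CSV messages quoted in this thread.
REDIRECTED = re.compile(
    r"Cluster Shared Volume '(?P<volume>[^']+)' \('(?P<name>[^']+)'\) "
    r"is no longer directly accessible"
)
UNAVAILABLE = re.compile(
    r"Cluster Shared Volume '(?P<volume>[^']+)' \('(?P<name>[^']+)'\) "
    r"is no longer available on this node because of '(?P<status>[^']+)'"
)


def classify(message):
    """Return (kind, csv_name, status) for a known CSV message, else None."""
    match = UNAVAILABLE.search(message)
    if match:
        return ("unavailable", match.group("name"), match.group("status"))
    match = REDIRECTED.search(message)
    if match:
        return ("redirected", match.group("name"), None)
    return None


# Message beginnings copied from the errors above.
messages = [
    "Cluster Shared Volume 'Volume1' ('CSV01') is no longer directly accessible from this cluster node.",
    "Cluster Shared Volume 'Volume3' ('CSV02') is no longer available on this node because of 'STATUS_VOLUME_DISMOUNTED(c000026e)'.",
    "Cluster resource 'CSV01' in clustered service or application failed.",
]

tally = Counter()
for text in messages:
    result = classify(text)
    if result is not None:
        kind, name, _status = result
        tally[(kind, name)] += 1

for (kind, name), count in sorted(tally.items()):
    print(f"{name}: {kind} x{count}")
```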
- Edited by Marco Schilder Tuesday, January 03, 2012 7:53 AM
Tuesday, January 03, 2012 3:12 PM
OK, first things first: when you use DPM 2010 with an agent installed on the Hyper-V hosts, you don't need any additional software to make snapshots of your VM's. On the Hyper-V hosts the Hyper-V VSS writer is used for this.
When a CSV is used, multiple Hyper-V hosts can access the volume at the same time, and therefore each host can run a VM that is stored on that CSV. The CSV does have an owner; check your cluster admin console. If DPM creates a snapshot of a VM that is on a host other than the CSV owner, ownership is moved to that specific host and the other hosts cannot communicate directly with the CSV anymore. Instead, a cluster network is used to re-route the traffic. The CSV will show up as redirected access in the admin console.
If you have any problem on the cluster network that is used for the redirection, this behavior can be expected.
When a VSS hardware provider is used, these things will not happen.
I am not sure if I am on the right track here in trying to answer your question. Please let me know if I'm not.
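The ownership move described in this reply can be pictured with a small model. This is purely illustrative: the function and the node names HV01 to HV05 are hypothetical, nothing here talks to the cluster, and it only encodes the rule as stated above (the CSV is owned by the node hosting the VM being backed up, and every other node falls back to redirected access).

```python
def backup_access_state(nodes, csv_owner, vm_host):
    """Model who owns the CSV and which nodes are redirected during a backup."""
    if vm_host not in nodes or csv_owner not in nodes:
        raise ValueError("csv_owner and vm_host must both be cluster nodes")

    # Ownership follows the node that hosts the VM being backed up.
    owner_during_backup = vm_host
    redirected = [node for node in nodes if node != owner_during_backup]
    return {
        "owner_before": csv_owner,
        "owner_during_backup": owner_during_backup,
        "ownership_moved": csv_owner != owner_during_backup,
        "redirected_nodes": redirected,
    }


# Hypothetical names for the five Hyper-V nodes.
nodes = ["HV01", "HV02", "HV03", "HV04", "HV05"]
state = backup_access_state(nodes, csv_owner="HV01", vm_host="HV03")
print(state)
```

In this model four of the five nodes send their CSV I/O over the cluster network for as long as the redirected state lasts, which is why the health of that network matters.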
Monday, January 09, 2012 12:56 PM
OK, after a lot of research and some trial and error, it seems like the XenApp servers are keeping a lock on the CSV's, even though the XenApp servers are not included in the protection group! This is probably because they are provisioned from another virtual machine (the Citrix Provisioning Server); the parent VHD for the XenApp servers is located on that server. I will do another test tonight with all XenApp servers shut down, but the first tests seem promising.
Of course I can schedule a stop/start at night for the XenApp servers, but isn't there another way to get around this problem? Is this a known issue, that when a virtual machine doesn't "own" all its VHD disks, it keeps a lock on the CSV?
- Edited by Marco Schilder Monday, January 09, 2012 12:56 PM
Wednesday, January 11, 2012 11:48 PM (Moderator)
The first thing I'd like to address is the statement that "The CSV should never go into redirected mode when using a hardware VSS provider." When the backup agent takes a backup using a software snapshot, the CSV volume remains pinned to a single node not only for the entire duration of the snapshot but also for the duration of the actual backup. Hardware snapshots, on the other hand, are ideal for the CSV environment. They allow the CSV to resume direct I/O mode as soon as the hardware snapshot has been taken. This duration is typically very short, about 2 minutes. As a result, more VMs can be backed up in parallel with hardware snapshots than with software snapshots.
This information was taken from a blog titled "Snapshot Provider Considerations while backing up a CSV Cluster" at http://blogs.technet.com/b/asim_mitra/archive/2009/12/11/snapshot-provider-considerations-while-backing-up-a-csv-cluster.aspx.
As for your latest findings with the XenApp servers, I assume no testing has been done to determine the expected behavior when using this solution. We don't put a lock on the CSV; more accurately, the CSV is made local to the node where the VHD resides when a VM is being backed up, and all I/O for VM's on other nodes is routed over the network (redirected mode access) through the CSV filter on the node that stores the VHD being backed up. The hardware VSS provider allows redirected mode access to be used for a much shorter period than the software VSS provider does.
Another good source of further clarification is available in Understanding Protection for CSV at http://technet.microsoft.com/en-us/library/ff634189.aspx
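To put rough numbers on the difference described in this reply, the time a CSV spends in redirected mode can be estimated for both provider types. This is a back-of-the-envelope sketch under stated assumptions: the VM's on one CSV are backed up one after another, each snapshot takes about 2 minutes (the figure quoted above), and the per-VM backup durations are invented for the example.

```python
def redirected_minutes(vm_backup_minutes, provider, snapshot_minutes=2.0):
    """Estimate total minutes one CSV spends in redirected mode.

    hardware: redirected only while each snapshot is taken.
    software: redirected for each snapshot plus the backup that follows it.
    """
    if provider == "hardware":
        return snapshot_minutes * len(vm_backup_minutes)
    if provider == "software":
        return sum(snapshot_minutes + minutes for minutes in vm_backup_minutes)
    raise ValueError("provider must be 'hardware' or 'software'")


# Hypothetical backup durations (minutes) for three VM's on the same CSV.
backups = [30, 45, 20]
print("hardware:", redirected_minutes(backups, "hardware"), "minutes redirected")
print("software:", redirected_minutes(backups, "software"), "minutes redirected")
```

Under these assumptions the hardware provider keeps the CSV redirected for 6 minutes in total, against 101 minutes for the software provider.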
Thursday, January 12, 2012 9:10 PM
Thanks for clearing up the redirection on a CSV. This is some useful info and it changed my perception of this item instantly.