Server R2 CSV frequently hangs
- I've set up an R2 cluster consisting of two HP DL360 G6 nodes with QLogic 8G HBAs and an EVA 4400 as storage. A 2 TB volume is presented to both hosts and configured as a Cluster Shared Volume. Basically this configuration works, and it runs for some hours without errors.
Now I've tried to migrate two VMs from a 2008 RTM cluster (one LUN per VM), from two different hosts, one to each of my R2 cluster nodes. The result was severe problems accessing the disk. Two other VMs that were already running became inaccessible as a result.
I had a look at the critical events and found these two, though not quite as often as the hangs:
1: Cluster Shared Volume 'Volume1' ('ClusterSharedVol1') is no longer available on this node because of 'STATUS_VOLUME_DISMOUNTED(c000026e)'. All I/O will temporarily be queued until a path to the volume is reestablished.
2: Cluster resource 'ClusterSharedVol1' in clustered service or application 'bca76872-09d5-4b19-9b3e-0ed0bb7c1baa' failed.
Both errors were raised on only one of the two nodes.
Both nodes run the same OS with the same drivers and the same firmware on the same SAN. MS MPIO without any extension is used because HP drivers for the EVA on 2008 R2 are simply missing.
I've also tried redirected access to bypass the host that raised the events, with the same result.
Has anyone set up a configuration like this? Does it work? Has anyone heard of these events? Any hints?
Thank you,
Stefan Heinz
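For anyone comparing how often the first event occurs against the number of hangs, here is a minimal Python sketch that pulls the volume name, resource name and status code out of messages shaped like event 1 above. It assumes the event messages have already been exported as plain text by some other means; the names `parse_csv_event` and `count_statuses` are made up for this illustration and are not part of any Microsoft tool.

```python
import re
from collections import Counter

# Pattern modelled only on the event text quoted in the question (assumption:
# other events of this type use the same wording).
CSV_EVENT = re.compile(
    r"Cluster Shared Volume '(?P<volume>[^']+)' \('(?P<resource>[^']+)'\) "
    r"is no longer available on this node because of "
    r"'(?P<status>[A-Z_]+)\((?P<code>[0-9a-fA-F]{8})\)'"
)


def parse_csv_event(message):
    """Return the fields of a 'no longer available' event, or None if no match."""
    match = CSV_EVENT.search(message)
    if match is None:
        return None
    return {
        "volume": match.group("volume"),
        "resource": match.group("resource"),
        "status": match.group("status"),
        "code": int(match.group("code"), 16),  # status code as an integer
    }


def count_statuses(messages):
    """Count how often each status name appears in a list of event messages."""
    parsed = (parse_csv_event(m) for m in messages)
    return Counter(p["status"] for p in parsed if p is not None)
```

Messages that do not match the pattern (such as event 2 above) are simply skipped, so the count reflects only the "no longer available" events.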
All Replies
- Hi Stefan,
We currently need more information to investigate. Please let us know how you migrated the VMs to the R2 cluster nodes. Detailed steps are helpful, including where you store the VMs and how you added them to the cluster.
Please refer to "5) Create VMs on CSV Disks" and "6) Make your CSV VMs Highly-Available" in the following article to set up a new test VM and verify your cluster configuration.
Deploying Cluster Shared Volumes (CSV) in Windows Server 2008 R2 Failover Clustering
http://blogs.msdn.com/clustering/archive/2009/02/19/9433146.aspx
You may also follow other steps to check your settings.
Thanks.
This posting is provided "AS IS" with no warranties, and confers no rights.
- Hi Mervyn,
I'm using System Center Virtual Machine Manager (SC VMM) 2008 R2 and its 'Move VM' wizard. I selected C:\ClusterStorage\Volume1, and it created C:\ClusterStorage\Volume1\VM-Name\VM.vhd and the subfolders Snapshots and Virtual Machines.
The source of the VMs was a 2008 cluster with an exclusive LUN per VM, so the VM was and is highly available.
I also know this article, but it lacks a description of how to do this with SC VMM, which is an important tool in my opinion.
After the migration the system runs stably again... Strange thing...
Thanks,
Stefan
- Marked As Answer by Mervyn Zhang MSFT, Moderator, Tuesday, September 29, 2009 7:59 AM
- Hi,
Glad to hear the system works stably again. I suggest we monitor it for some days. If you have more questions in the future, you're welcome to post in this forum.
Thanks.
- Hi Mervyn,
...stable again, as long as I do not make any configuration change. As soon as a BITS transfer is started, the same problem comes back :-(
Once again:
- Put two or more VMs on the cluster while all VMs are _shut_down_
- Start all VMs
-> Works well
- Migrate a VM to the cluster while a second VM (or more) is running or a second BITS transfer is in progress
-> Frequent disk access interruptions, one to five minutes long
-> All Cluster Shared Volume consumers on the cluster are affected
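To put numbers on the interruptions described in the steps above, something like the following Python sketch could be run inside a VM or on a node during the migration. It is only an illustration, not a tool used in this thread: it repeatedly writes a small block to a temporary file, forces it to disk, and records every write that takes longer than a threshold. The function name `probe_write_stalls` and all parameter defaults are assumptions.

```python
import os
import tempfile
import time


def probe_write_stalls(directory, writes=50, block_size=64 * 1024,
                       stall_threshold_s=1.0, pause_s=0.0):
    """Write small blocks with fsync; return (index, seconds) for slow writes."""
    stalls = []
    payload = b"\0" * block_size
    # The probe file is created in the directory under test and removed afterwards.
    fd, path = tempfile.mkstemp(dir=directory, prefix="stall_probe_")
    try:
        with os.fdopen(fd, "wb") as handle:
            for index in range(writes):
                started = time.perf_counter()
                handle.write(payload)
                handle.flush()
                os.fsync(handle.fileno())  # force the block out to the disk
                elapsed = time.perf_counter() - started
                if elapsed >= stall_threshold_s:
                    stalls.append((index, elapsed))
                time.sleep(pause_s)
    finally:
        os.remove(path)
    return stalls
```

Pointing `directory` at a folder on the affected volume and raising `writes` and `pause_s` would cover a longer window; any entry in the result marks a write that was held up for at least `stall_threshold_s` seconds.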
Thank you,
Stefan
- Hi Stefan,
Based on your description, this seems normal. A running VM consumes system resources, and a BITS transfer also consumes a lot of them; it competes with the running VMs for resources and can cause hangs.
Please shut down the running VMs while migrating new VMs to reduce system load.
Thanks.
- Hi,
Do you need any other assistance? If there is anything we can do for you, please let us know.
Thanks.
- Hi Mervyn,
I currently have a case open with Microsoft Germany. We've collected some performance logs, and it seems the storage is overloaded.
Because our SAN system is not really in use and none of the other systems has problems like this, I guess it's the server hardware itself. I have another Live Meeting right now...
I also tried running the R2 cluster without CSV, and it felt better. But the performance logs show the same signs of overload. Maybe I will also install it as 2008 RTM and log again to have something to compare...
Thank you,
Stefan
- Hi Mervyn,
I guess I have it.
In the performance logs we found an issue in the storage path. To solve this, I installed the QLogic SANsurfer tool and used it to configure the QLogic 8G QLE2562 Fibre Channel HBAs. <update> It also does not work with the current QLogic driver 9.1.8.17, so I had to go back to the Windows-included driver version 9.1.8.6. </update>
After this everything worked as expected.
Now I have up to 450 MB/s of uninterrupted access to my 'hard disk' from inside a VM during a BITS transfer, all on a single Cluster Shared Volume.
Thank you for helping me!
Stefan
- Edited by Stefan Heinz, Friday, October 16, 2009 7:51 AM
- Marked As Answer by Mervyn Zhang MSFT, Moderator, Friday, October 16, 2009 1:12 AM
- Hi,
Glad to hear you have resolved the problem. If you have more questions in the future, you're welcome to post in this forum.
Thanks.
OK, for all who have the same problem in the future:
- Open perfmon
- Add a counter (green plus)
- Select the counter PhysicalDisk, % Idle Time, All Instances
The idle time should be near 100% when there is no disk access and higher than 70% on average during normal disk access.
Then copy a 2 GB file to the SAN disk and watch the counter for that disk.
If the counter now drops below 10% on average and the transfer rate is not absolutely poor, it seems to be the same problem as mine. In other words: you have a problem with the HBA, the HBA drivers and settings, or the Fibre Channel switch, and not with the SAN storage.
We've now found this problem on a SQL server as well...
Hope this helps,
Stefan
- Marked As Answer by Mervyn Zhang MSFT, Moderator, Friday, October 16, 2009 11:20 AM
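The rule of thumb from the last post can be written down as a small check. The following Python sketch assumes the % Idle Time values for the SAN disk have already been collected as plain numbers, one list while the disk is under normal load and one during the 2 GB copy; how they are exported from perfmon is left open. The function name `assess_idle_time` and its parameter names are hypothetical, while the 70% and 10% thresholds are the ones given in the post.

```python
from statistics import mean


def assess_idle_time(normal_samples, copy_samples,
                     normal_min=70.0, suspect_max=10.0):
    """Compare average % Idle Time under normal load and during a file copy."""
    normal_avg = mean(normal_samples)
    copy_avg = mean(copy_samples)
    return {
        "normal_avg": normal_avg,
        "copy_avg": copy_avg,
        # Post: idle time should average above 70% during normal disk access.
        "normal_ok": normal_avg > normal_min,
        # Post: an average below 10% during the copy points at the storage path
        # (HBA, HBA drivers and settings, FC switch) rather than the SAN itself.
        "path_suspect": copy_avg < suspect_max,
    }
```

A result with `path_suspect` set to `True` is only a hint to look at the HBA side first, as described above; it does not prove where the fault is.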

