Answered: Hardware failover behavior

  • Tuesday, April 24, 2012 1:31 AM
     
     

    Hi All,

    I'm hoping someone can help me out with this question.

    I've set up a two-node test cluster with a CSV (iSCSI) volume and created two VMs: one running Windows 7 and the other a W2K server with the Remote Desktop role installed. I've been testing the failover features to understand and document them before we go to production. I have no problems when I do a quick migration. I set up a terminal session, logged into the Remote Desktop machine, opened Excel and started working in it. When I perform a quick migration, I do not experience any interruption; the VM migrates from one node to the other. If I simulate a hardware failure (I pull the power cord) on the node the VM is running on, my session ends. The VM does move to the other node and boots up. Is this the normal behavior for a hardware failure, or have I not configured HA correctly in Failover Cluster Manager? I have run the cluster validation tests and, with the exception of a warning about the number of MPIO paths, the cluster validates.

    Thanks,

    Kevin

All Replies

  • Tuesday, April 24, 2012 4:27 AM
    Moderator
     
     
    Hi,

    First of all, a Quick Migration saves the VM's state before moving it, so you would notice the downtime. I suspect that you performed a Live Migration rather than a Quick Migration.

    By the way, Live Migration is designed for planned migrations, not for a power or hardware failure. If a node loses power or suffers a hardware failure, all of its virtual machines restart on another node or nodes.
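
    To make the distinction concrete, here is a minimal sketch in Python. It is a toy model, not Hyper-V's actual mechanism: the `VM` class, the function names, and the idea of treating in-memory state as a simple list are all assumptions made for illustration. It only shows why a planned migration keeps a user session alive while an unplanned host failure does not.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VM:
    """Hypothetical stand-in for a clustered virtual machine."""
    name: str
    host: str
    # Stands in for everything held in RAM, e.g. open Remote Desktop sessions.
    memory: List[str] = field(default_factory=list)
    boots: int = 1


def live_migrate(vm: VM, target: str) -> None:
    # Planned move: memory is copied to the target while the VM keeps running.
    vm.host = target


def quick_migrate(vm: VM, target: str) -> None:
    # Planned move: state is saved to shared storage, then restored on the
    # target. There is a pause, but the saved state survives.
    saved = list(vm.memory)
    vm.memory = []
    vm.host = target
    vm.memory = saved


def host_failure(vm: VM, surviving_host: str) -> None:
    # Unplanned: the failed host's RAM is gone, so the cluster can only
    # cold-boot the VM on a surviving node.
    vm.memory = []
    vm.host = surviving_host
    vm.boots += 1


vm = VM("rds01", "node1", ["kevin: Excel session"])

live_migrate(vm, "node2")
after_live = list(vm.memory)      # session still present

host_failure(vm, "node1")
after_failure = list(vm.memory)   # session lost, VM rebooted
```

    In this model both planned migration types preserve the session; only the simulated power pull loses it, which matches the behaviour described in the question.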


  • Tuesday, April 24, 2012 10:17 AM
     
     Answered

    Kevin,

    Everything works as expected. HA does not protect against downtime when a host fails; it only restarts the VM on a surviving node. What you should do is configure a guest cluster between a pair of VMs hosted on two (or more) different physical Hyper-V hosts. Then, if one Hyper-V host goes down unexpectedly, the other VM should take over almost immediately, with little or no downtime noticed by users. If you cannot go this way for some reason (say, your application of choice is not cluster-aware), you would need to switch hypervisors to a paid (Essentials and up) version of ESX, as those have Fault Tolerance. In a nutshell, FT is what you expected HA to do for you :) It has its own drawbacks, though (hardware load, inability to take a VM snapshot and therefore no real VM backup, etc.).
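
    The guest-cluster idea can be sketched roughly as follows. This is an illustrative Python model only: the `pick_active` function, the node names, and the 5-second timeout are hypothetical and do not reflect how Windows Failover Clustering implements heartbeats or quorum. It shows the general pattern of a standby guest taking over once the active guest stops reporting.

```python
from typing import Dict, Optional


def pick_active(
    last_heartbeat: Dict[str, float],
    now: float,
    timeout: float,
    preferred: str,
) -> Optional[str]:
    """Return the guest node that should own the clustered service.

    A node counts as alive if its last heartbeat is no older than `timeout`.
    The preferred owner keeps the service while it is alive; otherwise the
    first surviving node (by name) takes over.
    """
    alive = [node for node, seen in last_heartbeat.items() if now - seen <= timeout]
    if preferred in alive:
        return preferred
    return sorted(alive)[0] if alive else None


# Both guest VMs (on different physical hosts) are reporting in.
heartbeats = {"vm-a": 100.0, "vm-b": 100.0}
owner_before = pick_active(heartbeats, now=101.0, timeout=5.0, preferred="vm-a")

# The host under vm-a loses power: only vm-b keeps sending heartbeats.
heartbeats["vm-b"] = 110.0
owner_after = pick_active(heartbeats, now=110.0, timeout=5.0, preferred="vm-a")
```

    Because the standby guest is already booted, the takeover delay in such a design is bounded by the detection timeout rather than by a full VM boot.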

    Hope this helped :)

    -nismo

  • Tuesday, April 24, 2012 12:58 PM
     
     

    Thanks for the clarification and proper terminology. I wanted to make sure I haven't missed a setup feature for proper hardware failover setup.

    -Kevin

  • Thursday, April 26, 2012 6:38 PM
     
     

    I hope somebody can help me out with the failover behaviour of geo-clusters (multi-site) with SQL Server 2008 R2 and SQL Server 2005 on Windows 2008 R2:

    Here is what we are planning to build:

    Cluster #1 --> SQL Server 2008 R2 Geo-Cluster

    ================================

    Data Centre #1

    ------------------

    Node 1 & Node 2 (with Instance 1 and Instance 2 respectively)

    Data Centre #2

    ------------------

    Node 3 & Node 4 (with Instance 3 and Instance 4 respectively)

    Cluster #2 --> SQL Server 2005 Geo-Cluster

    ================================

    Data Centre #1

    ------------------

    Node 1 (with Instance 5)

    Data Centre #2

    -----------------

    Node 2 (with Instance 6)

     

    So we would have 6 instances altogether: 4 in the SQL 2008 R2 geo-cluster and the remaining 2 in the SQL 2005 geo-cluster. We would be using EMC RecoverPoint for our SAN replication. I do not have much experience with SAN replication, but I guess we would have to rely on it to achieve geo-cluster failover (correct me if I am wrong).

    I am stuck, as I have neither experience with SAN replication nor hands-on experience with geo-clustering, which is a problem. Even so, I have to document the steps needed to build both geo-clusters, as well as the mechanism for failover and failback, along with the DR strategy. Phew!

    1) I do not know whether SQL 2005 has issues installing on a multi-site cluster, or whether it has to be installed differently. I ask because, for Windows 2008 R2 multi-site, we need to use the /SKIPRULE switch to skip cluster validation. Do we have to apply any such switch while installing a SQL 2005 geo-cluster, or can it be installed normally?

    2) I do not know the steps for failover/failback, or whether they happen automatically in case of:

    (a) Node failure in a Data Centre

    (b) Entire Data Centre failure. Do you have this documented in your environment? Could you please share the steps or guide me here? What extra steps would I have to take to handle failover/failback with SAN replication?

    Please assist ..

  • Friday, April 27, 2012 11:17 AM
     
     

    Live migration does not restart the VM because it migrates the VM together with its user state, while failover always restarts the services or the VM, since the user state cannot be recovered from the failed node's memory. This is normal.


    Virgo