Hyper-V Disaster Recovery Questions

  • Question

  • Hello Experts,

    I have a 3-node cluster and am writing a "disaster recovery" document for it, to be used in case of any emergency. I need answers to these questions.

    1. What exactly is the heartbeat? Does it check using ping responses? If so, how many missed pings make the cluster decide that a node is down?

    2. What should be the first remedy when most nodes are down (for example, the nodes' OS is corrupted) and the cluster breaks?

    3. What are the restore steps when the cluster was broken and most nodes are back online?

    4. At what time is the cluster validation wizard run?

    5. Can the cluster break if I reboot two nodes at the same time and they become unresponsive, for example stuck somewhere in the boot process?

    Please help me with the scenario-based questions above, and suggest anything else that would help complete the document.

    Wednesday, October 5, 2016 7:13 AM

Answers

  • Hello,

    1) Some reading about heartbeats: https://blogs.msdn.microsoft.com/clustering/2012/11/21/tuning-failover-cluster-network-thresholds/

    4) Cluster validation should be run after every change in your cluster environment. You should also run cluster validation when deploying a new cluster. Otherwise your cluster will not be supported by Microsoft (non-validated clusters are not supported).

    Other questions: I am not sure. But you should always have a backup solution deployed for those events when everything goes wrong. A cluster is just for high availability; disaster recovery plans are mostly for when you have to start from scratch :)

    Radek

    • Proposed as answer by Leo Han Wednesday, October 12, 2016 5:39 AM
    • Marked as answer by Osama-Mansoor Thursday, October 13, 2016 5:43 AM
    Wednesday, October 5, 2016 12:18 PM
  • I have a 3-node cluster and am writing a "disaster recovery" document for it, to be used in case of any emergency. I need answers to these questions.

    Clustering is for high availability, not disaster recovery. If you are working on your disaster recovery documentation, it should have much more content on backup and restore. If you're using it, Hyper-V Replica should also appear prominently in DR documentation. Clusters are trivial to rebuild.

    1. What exactly is the heartbeat? Does it check using ping responses? If so, how many missed pings make the cluster decide that a node is down?

    Heartbeats are not pings (ICMP); they are sent between nodes as UDP unicast packets on port 3343. By default, they are transmitted once per second. How many missed heartbeats the cluster tolerates before it treats a node as down is configurable, as described in the previously linked article.
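
    As a rough illustration of how the heartbeat interval and the missed-heartbeat threshold combine, the sketch below simply multiplies the two. The function name is made up for this example, and the threshold of 5 is only an illustrative input; check the linked article and your own cluster's settings for the real values.

```python
def seconds_until_node_considered_down(delay_ms: int, threshold: int) -> float:
    """Approximate time before a silent node is treated as down.

    delay_ms  -- interval between heartbeats, in milliseconds
    threshold -- number of consecutive missed heartbeats tolerated
    """
    if delay_ms <= 0 or threshold <= 0:
        raise ValueError("delay_ms and threshold must be positive")
    return delay_ms * threshold / 1000.0


# One heartbeat per second (the default interval mentioned above) and an
# illustrative threshold of 5 missed heartbeats (an assumption, not a
# value read from any real cluster).
print(seconds_until_node_considered_down(1000, 5))
```

    Raising either value makes the cluster more tolerant of brief network hiccups, at the cost of reacting more slowly to a node that really is down.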

    2. What should be the first remedy when most nodes are down (for example, the nodes' OS is corrupted) and the cluster breaks?

    This question is very vague and difficult to answer correctly. If you lose one node, get it back online ASAP. If you don't have enough nodes to start the cluster, my general recommendation is to reconfigure the cluster to accept a reduced quorum requirement because that's non-destructive. If you're just in a hurry, then "Start-ClusterNode -FixQuorum" from an elevated PowerShell prompt will cause the node that you run it from to start up immediately regardless of quorum, and it will overrule any other nodes' treatment of quorum. A last resort in a worst-case scenario is to evict downed nodes and then rejoin them when they're available.
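
    To make "enough nodes to start the cluster" concrete, here is a minimal sketch of plain majority voting. It is a simplified model with made-up function names: it ignores dynamic quorum adjustments and forced quorum, and it only assumes that each node (and an optional witness) carries one vote.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    if total_votes <= 0:
        raise ValueError("total_votes must be positive")
    return total_votes // 2 + 1


def has_quorum(total_votes: int, votes_online: int) -> bool:
    """True when the online votes form a majority of all configured votes."""
    return votes_online >= votes_needed(total_votes)


# Three nodes, no witness: two votes are required.
print(votes_needed(3))
print(has_quorum(3, 2))
print(has_quorum(3, 1))
```

    Under this model, a three-node cluster with only one node online has no majority, which is the situation where reducing the quorum requirement or forcing quorum comes into play.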

    3. What are the restore steps when the cluster was broken and most nodes are back online?

    Clusters will recover on their own when they have sufficient nodes to reach a quorum. If this ever doesn't happen, then just manually start the nodes in Failover Cluster Manager or use "Start-Cluster".

    4. At what time is the cluster validation wizard run?

    If you're asking how you know when a cluster validation was run, then all results are contained in C:\Windows\Cluster\Reports on every node. If you're asking when you're supposed to run it, you should absolutely run validation after building the cluster and you should run it again when making any hardware changes, including adding nodes. I have never been asked by Microsoft support (PSS) to produce a validation report, probably because they get so much more information out of MATS. If PSS does need one, you can run it during the course of a case. Be advised that a complete validation causes interruption to storage, which will impact virtual machines.
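
    If you want to record in your documentation when validation was last run, a small script can list the files in that folder by modification time. This is only a sketch: the function name is hypothetical, and the "Validation*" file name pattern is an assumption, so adjust it to whatever the report files on your nodes are actually called.

```python
from datetime import datetime
from pathlib import Path


def list_reports(report_dir, pattern="Validation*"):
    """Return (modified time, file name) pairs, newest first.

    pattern is an assumed naming convention, not a documented one.
    """
    files = [p for p in Path(report_dir).glob(pattern) if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return [(datetime.fromtimestamp(p.stat().st_mtime), p.name) for p in files]


if __name__ == "__main__":
    # Folder named in the answer above; run this on a cluster node.
    for modified, name in list_reports(r"C:\Windows\Cluster\Reports"):
        print(modified.isoformat(timespec="seconds"), name)
```

    The modification time of the newest file gives a reasonable estimate of when the last validation finished.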

    5. Can the cluster break if I reboot two nodes at the same time and they become unresponsive, for example stuck somewhere in the boot process?

    The cluster will go down if all nodes are offline, of course, but bringing enough of them back online will return the cluster to normal operative state. My test cluster has two nodes, and each node runs a local, non-HA domain controller. Any time I leave for vacation, I fully power down the entire environment. When I return, I turn everything back on. By the time I get the car unloaded, my environment is completely operational.


    Eric Siron
    Altaro Hyper-V Blog
    I am an independent contributor, not an Altaro employee. I accept all responsibility for the content of my posts. You accept all responsibility for any actions that you take based on the content of my posts.

    • Proposed as answer by Leo Han Wednesday, October 12, 2016 5:38 AM
    • Marked as answer by Osama-Mansoor Thursday, October 13, 2016 5:43 AM
    Wednesday, October 5, 2016 1:19 PM

All replies

  • In addition to what Radek says, #2 is a very broad question.  It relates to the types of applications you are running, what sort of backup solution you are running, what recovery time objectives you need, and a host of other things.  It is much too complex a question to try to address in a technical forum.  Use your favorite search engine to do some research.  There are lots of blogs and articles about disaster recovery that will help you formulate your own plan.  Or, hire a DR consultant to assist you.

    #3 - what do you mean by 'cluster was broke'?  You have not provided enough background information for us to offer a possible answer.  It could be as simple as rebooting the cluster, or as complex as restoring from backup.

    #5 - again, what do you mean by 'cluster break'?  If you have a three node cluster and two nodes are down, the cluster is most likely unavailable, unless you have forced quorum manually.  So the other two nodes getting stuck in a boot process would not have any effect on the existing node.  More details needed to provide any reasonable answer.


    . : | : . : | : . tim

    • Proposed as answer by Leo Han Wednesday, October 12, 2016 5:39 AM
    Wednesday, October 5, 2016 12:57 PM
  • Hi Osama,

    You could mark the reply as answer if it is helpful.

    Best Regards,

    Leo


    Wednesday, October 12, 2016 5:39 AM