
Is clustering suitable for this scenario

  • 27 June 2009 1:51 peter jones
     

    Hi,

    Not having any experience with clustering I thought I would ask this basic question before getting on to, what looks like, a steep learning curve.

    We have a specialised industrial application developed in VB.net running on Server 2008 and SQL Server 2008 (Standard Edition). All users run under Terminal Services and we generally run about 40 simultaneous sessions, most of which are fairly high-activity data capture sessions in the factory. The system creates about 40 new transactions per minute in the database. We currently use two HP ProLiant servers (App Server and Data Server), each with two dual-core 3.00 GHz Xeon processors and 4 GB memory. One server supports the application and the other is dedicated to the database; however, the database itself uses transactional replication to maintain a backup copy on App Server. In the event of a catastrophic server failure we can run on one server and may lose a few transactions, and the changeover time is about 45 minutes.

    What we would like to do is offer a dual node active / passive cluster in which we could say there is no single point of failure and, in the event of the active node going down, the passive node would cut in without the users being aware of the failure. We would also want the same level of confidence with the database, i.e. a second instance that is always up-to-date with automatic failover to the second instance should the primary instance fail.

    So, the questions:

    1) Does clustering offer the capability to do what we want?

    2) If it does, what are the key configuration issues we need to focus on?

    3) Can we head down this path knowing that we don't have to change our application or SQL code?


    TIA, Peter


All replies

  • 27 June 2009 8:53 Fouad Buhawia
     Proposed answer
    Hi Peter,

    I think so.
    You had better read this first:
    http://msdn.microsoft.com/en-us/library/ms952401.aspx

    Then try this:
    http://msdn.microsoft.com/en-us/library/ms179530.aspx


    But you have to be careful with this, since you say you have no experience with clustering.

    You may also find step-by-step guides through Google.
    Hope it helps.
  • 27 June 2009 15:32 David Bermingham
     Proposed answer

    Failover clustering will do the job, but there are a few things to consider.  If you want to maintain redundant copies of the SQL data, you will need to deploy host based replication or array based replication.  MSCS/WSFC does not support transactional replication as a replication mechanism.  There are a few options here, including the one from my company, SteelEye DataKeeper.  Here is a video which illustrates the SQL configuration.

    http://www.steeleye.com/downloads/resource/videos/datakeeper-for-sql/index.html

    To protect the App server without making changes to your code, you will need to use a Generic Application resource.  Have a look at this link for some more information on Generic Application resources.

    http://blogs.msdn.com/clustering/archive/2009/04/10/9542115.aspx

    Let me know if you have any questions, I'd be glad to help.


    David A. Bermingham Director of Product Management http://www.steeleye.com
  • 28 June 2009 6:41 peter jones
     
    Hi Fouad,

    Thanks for the response. I had already noticed both these links in earlier research and I ignored the first one because the paper was seven years old and refers to Windows 2003. I presume that clustering has moved along quite a bit since then and I should only be looking at Windows 2008 papers. Is this correct?

    Cheers, Peter
  • 28 June 2009 7:00 peter jones
     
    Hi David,

    Thanks for taking the time to respond. Actually I had already noticed your company from other posts and bookmarked it for further investigation should we go ahead with clustering. I was aware that transactional replication wouldn't be suitable - I only mentioned it in terms of what we do now. Your second link was most useful thank you and it actually led me to this link: http://blogs.msdn.com/clustering/archive/2009/05/07/9593050.aspx which contains lots of information I hadn't previously seen.

    From the extra I know now, I see there are beasts called "cluster aware" and "cluster unaware" applications, which I hadn't realised. This is certainly something I need to look into in more detail, as we certainly fall into the "cluster unaware" category. I have just finished looking at a webcast that was interesting, but one thing was very confusing and you may be able to help me understand what was said. Essentially, the presenter said (I think): "If Node 1 fails, the workload packages that were executing are gracefully shut down and restarted on Node 2."

    Assuming Node 1 failed with a fried CPU, how can the workload packages be shut down? Node 1 effectively no longer exists to close anything down.

    Cheers, Peter
  • 28 June 2009 15:10 David Bermingham
     
    You are absolutely right, Peter.  If your primary node has a catastrophic failure, there is no graceful shutdown.  What presentation were you looking at?
    David A. Bermingham Director of Product Management http://www.steeleye.com
  • 28 June 2009 22:16 peter jones
     
    Hi David,

    This one: Failover Clustering 101 (about my level I thought)

    http://msevents.microsoft.com/CUI/WebCastEventDetails.aspx?EventID=1032364830&EventCategory=5&culture=en-US&CountryCode=US

    Maybe I've started off with the wrong expectation of clustering. I was thinking it gave Automatic Teller Machine level failure support, i.e. no matter what the failure, from the user's perspective everything carries on working as normal. From further reading last night of the 2003 link that Fouad provided, I'm now left with the understanding that clustering (in 2003 anyway) is targeted at "High Availability", i.e. no single point of hardware failure and the ability to get up and going again on another node reasonably quickly but not instantaneously. Is this correct?

    Cheers, Peter
  • 29 June 2009 2:03 David Bermingham
     
    MSCS and WSFC will protect you against hardware and software failure, however, as you mentioned switchover will generally take a few seconds.  Depending on the application, the client may or may not notice the switchover.  The alternative is fault tolerant systems, such as the HP blades described here...

    http://searchdatacenter.techtarget.com/news/article/0,289142,sid80_gci1317646,00.html

    You will pay a price for fault tolerant hardware, and I'm not exactly sure what fault tolerant hardware does for you if the application itself fails or if you need to schedule maintenance.  I think both of those scenarios mean your application is unavailable.
    David A. Bermingham Director of Product Management http://www.steeleye.com
  • 29 June 2009 13:35 WPJB
     Proposed answer
    Peter

    You might actually be better off without clustering in this case, using SQL mirroring and two terminal servers in Hyper-V instead.

    I would tackle this with SQL 1 as the primary, SQL 2 as the mirror backup, SQL 3 (Workgroup) as the mirror monitor (the witness), and TS 1 and TS 2 so that you have redundancy on the TS side as well.

    Just a thought, but this will give you a lot better SQL redundancy than SQL in a cluster.

    will

    William Bressette, Network Architect, Horn IT Solutions, http://wpjsplace.spaces.live.com/blog
    • Proposed as answer by WPJB 29 June 2009 13:35
  • 29 June 2009 21:59 peter jones
     
    Hi Will, Thanks for the input. I certainly see what you mean re SQL but I'm not sure what you mean re TS1 and TS2. I presume it's something along the lines of setting up our application server in a virtual machine environment and copying the image of that machine onto another server that would be a cold/warm standby. Is this correct? Cheers, Peter
  • 30 June 2009 2:43 WPJB
     Proposed answer
    Peter

    By TS1 and TS2 I mean Terminal Server 1 and 2. Place each on a different Hyper-V parent and use NLB for failover between them.
    William Bressette, Network Architect, Horn IT Solutions, http://wpjsplace.spaces.live.com/blog
    • Proposed as answer by WPJB 30 June 2009 2:43
  • 30 June 2009 3:06 peter jones
     
    Hi Will,
    Yes - I presumed that's what you meant. Can I just flesh this out a little bit please:

    1) By "different hyper-v parent" are you referring to two different physical servers?

    My other questions assume the answer to 1) is yes in that we need hardware redundancy. The application itself is very stable and doesn't fail - it is only catastrophic hardware failure that is a concern.

    2) I presume NLB is "network load balancing" - does this mean NLB can sense the failure of TS1 and somehow bring TS2 online?

    3) Would we need a full set of Windows Server and Terminal Services licences for both TS1 and TS2 or could we use the same licences for each machine given that only one is 'active' at any one time. 

    Cheers, Peter

  • 30 June 2009 3:12 WPJB
     Proposed answer
    Peter

    1) Yes, we put everything on Hyper-V.

    2) NLB can run active/active, or you can just use different icons to connect to the servers, so TS1 and TS2 are usually both online and sharing the session load.

    3) Licensing all depends on your agreements, but TS CALs are per device or per user, so it does not matter how many servers you have. Server licences are another story. For SQL, though, the passive node usually does not require a licence. It is best to check with your local partner.

    We can go into more detail if you like.

    will

    William Bressette, Network Architect, Horn IT Solutions, http://wpjsplace.spaces.live.com/blog
    • Proposed as answer by WPJB 30 June 2009 3:13
  • 30 June 2009 3:33 peter jones
     
    Thanks Will,

    Let me just get the scenarios clear in my head then:

    1) So we have a SQL backend that is mirrored.

    2) We have two physical 'application servers' (TS1 and TS2) sharing the load but both are using the same SQL instance and sharing the same database.

    3) Catastrophe 1 - the SQL backend goes down. I presume the mirroring will transparently handle rerouting database requests to the mirror and the application doesn't know this has happened. Is this correct?

    4) Catastrophe 2 - TS1 goes down. All terminal service sessions on TS1 are lost and the users must log in again, and this time they will be supported by TS2. Is this correct?

    5) When users log in they have no idea, nor do they care, if it's TS1 or TS2 that is hosting their session. Is this correct?

    Cheers, Peter 
  • 30 June 2009 3:38 WPJB
     Proposed answer
    Yep, pretty much. You can scale it to suit your needs; for example, you could have SQL1 and TS1 on the same server, and SQL2 with TS2 sharing the other hardware.

    You will need a small third server running SQL Workgroup for the mirroring monitor.
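
The reason for the third server can be sketched as a simple agreement rule. This is only an illustration under assumptions: the function name and its three boolean inputs are made up for this example, and real database mirroring also requires the mirror database to be synchronized before it can fail over automatically.

```python
def mirror_may_take_over(mirror_sees_principal,
                         witness_sees_principal,
                         mirror_sees_witness):
    """Return True if the mirror may promote itself automatically.

    Simplified rule (hypothetical helper, not a SQL Server API): the
    mirror takes over only when both it and the monitor server have
    lost the principal and the two can still talk to each other, so
    that two of the three servers agree the principal is gone.
    """
    return (not mirror_sees_principal
            and not witness_sees_principal
            and mirror_sees_witness)


# Principal server has died: mirror and monitor agree, so fail over.
print(mirror_may_take_over(False, False, True))
# Only the link between principal and mirror is down: no failover.
print(mirror_may_take_over(False, True, True))
```

Without the third server, the mirror could not tell a dead principal from a broken network link, which is why it has nobody to agree with and will not fail over on its own.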

    will

    William Bressette, Network Architect, Horn IT Solutions, http://wpjsplace.spaces.live.com/blog
    • Proposed as answer by WPJB 30 June 2009 3:54
  • 30 June 2009 3:51 peter jones
     
    Hi Will,

    Thanks for taking the time to help. A last question regarding the network:

    The way we handle things at the moment is that all our users in the plant are connected to a dedicated 'factory' NIC in App Server and all the 'office' users come in on another NIC. Is there some type of device we need that takes these network connections and plugs them into TS1 and TS2 simultaneously? Just trying to understand how the logins are distributed between TS1 and TS2.

    Cheers, Peter
  • 30 June 2009 3:54 WPJB
     Answer
    Peter

    You would need to create an NLB NIC on each server, and you can simulate the same connections you have today. Then in NLB you set the affinity to Single, meaning a single client IP address will stay on the same host, and then you should be good.
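
A rough way to picture Single affinity is a mapping from the client IP address to one of the live hosts. The sketch below is an illustration only: `pick_host` and the hash it uses are assumptions made for this example, and real NLB uses its own distributed filtering algorithm rather than this one.

```python
import hashlib


def pick_host(client_ip, live_hosts):
    """Map a client IP address to one of the live hosts.

    Illustration only (hypothetical helper): the same IP always lands
    on the same host as long as the set of live hosts is unchanged.
    """
    if not live_hosts:
        raise ValueError("no live hosts")
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    index = int.from_bytes(digest[:4], "big") % len(live_hosts)
    return sorted(live_hosts)[index]


# Both terminal servers up: the client keeps returning to one host.
print(pick_host("192.168.10.25", ["TS1", "TS2"]))
# TS1 has failed: every client is now sent to TS2.
print(pick_host("192.168.10.25", ["TS2"]))
```

Sessions that were on the failed host are still lost; the mapping only decides where the next login goes.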
    William Bressette, Network Architect, Horn IT Solutions, http://wpjsplace.spaces.live.com/blog
    • Proposed as answer by WPJB 30 June 2009 3:54
    • Marked as answer by peter jones 30 June 2009 4:02
  • 30 June 2009 4:04 WPJB
     
    Peter, if you need anything else, make sure you ping me.
    will

    William Bressette, Network Architect, Horn IT Solutions, http://wpjsplace.spaces.live.com/blog
  • 30 June 2009 4:05 peter jones
     

    Ah - I've just looked up NLB on MSDN and I see now it's a technology within clustering - I hadn't realised that before.

    Thanks for all your help, Will - I'm sure this has really got me along the right path.

    Cheers, Peter

  • 2 July 2009 10:29 Edwin vMierlo (MVP, Moderator)
     

    Let's break down the questions here.

    1) Can I make SQL highly available?

    Yes, you can. SQL can be clustered with Windows 2008 Enterprise, which will give you a highly available database.
    However, there are some caveats:
    - You need some "shared storage": storage connected to both nodes of the cluster, where SQL stores its database files. Usually you will see iSCSI or FC SANs deployed.
    - Clustering is not "always on" or fault tolerant. In case of a node crash, it will not gently shut down, but it will restart on the other node with the same IP/name so clients can reconnect. There is a period (seconds to 1-2 minutes) where the SQL service is not running.
    - Based on the last point about reconnecting, it will require your client application to have "retry logic" after a connection "timeout", so that your client can reconnect to the restarted SQL Server on the other node. The code also needs to be transactional, because after a crash SQL will roll transactions forward and back once it restarts on the surviving node.
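
The kind of retry logic meant in the last caveat could look roughly like the sketch below. It is a minimal illustration under assumptions, not anyone's actual data access layer: `TransientConnectionError`, `run_with_retry` and the default timings are invented for this example, and the `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.

```python
import time


class TransientConnectionError(Exception):
    """Hypothetical error raised when the database connection drops."""


def run_with_retry(operation, max_wait_seconds=120.0, delay_seconds=5.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Call operation() until it succeeds or max_wait_seconds has elapsed.

    operation should open its own connection and run one complete
    transaction, so that a retry never repeats half-finished work.
    """
    deadline = clock() + max_wait_seconds
    while True:
        try:
            return operation()
        except TransientConnectionError:
            if clock() + delay_seconds > deadline:
                raise  # the failover took longer than we will wait
            sleep(delay_seconds)
```

The two-minute default matches the worst-case restart window mentioned above. Wrapping a whole transaction, rather than a single statement, is what makes it safe to run the operation again after the connection is lost.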

    2) Can I make my client highly available ?

    Yes, you can, as mentioned before you can run your Terminal services in a Virtual Machine. And you can cluster your VMs, which now makes them highly available. Still when a physical server crashes or for whatever reason the VMs need to move from one node to the other, it is not done instantly, there is a small outage before your users can connect to the RDP sessions again.

    3) Can I do this without code changes ?

    Well, that depends. Does your client application have "retry" logic to reconnect after a connection fails? If so, is this configurable, e.g. can you set it to 2 minutes of retrying? And is the SQL code in your client fully transactional, e.g. can it survive a sudden reboot of the SQL server without corrupting your data?
    Bottom line: I would not assume that you can do this without code changes. I would carefully do a code review for this.

    And after considering this, there are more options:
    - Database mirroring
    - Third-party applications which can cluster/mirror (you can always count on David B to put a SteelEye pitch in this forum :-)

     

    Hope this helps

    Rgds,

    edwin.

     

  • 2 July 2009 21:44 peter jones
     
    Hi Edwin,

    Thanks for the further input. All our apps use a central data access layer (DAL) for database I/O which already has retry logic, so we are OK there, and all updates run in transactions, so we are OK there too. Now that I understand that clustering is not designed to give instant failover, I'm becoming more comfortable with what we can and can't achieve. One thing that is presenting a conceptual problem is the use of virtual machines. What I would like to do is really understand your comment:

    "Yes, you can, as mentioned before you can run your Terminal services in a Virtual Machine. And you can cluster your VMs, which now makes them highly available"

    Are you able to point me to further reading that explains the advantages of clustering the app in VMs?

    Cheers, Peter
  • 3 July 2009 8:30 Edwin vMierlo (MVP, Moderator)
     
    Peter,

    First let's dwell on this "instant" failover. The couple of seconds it takes to stop the service on one node and start it on the other node (moving the app between nodes) is generally acceptable even for the most critical applications. Because it is still down for a couple of seconds, this is called "high availability". In even more critical applications, e.g. life/death systems, you probably want "fault tolerance". This can be achieved with fault tolerant servers, an example of which you will find here: http://www.stratus.com/products/ftserver/
    Be sure you have your heart pills ready and sit down when they hand you the quotation! ;-)

    Now on to your last question: "Are you able to point me to further reading that explains the advantages of clustering the app in VMs?"

    First off, I was not talking about clustering the application. It seems to me that this is a client application, which is probably not suitable for clustering.
    I was talking about clustering the server where you are running Terminal Services, and where multiple instances of this application are running.

    Currently your users log on to a terminal server, and then run the client application.
    So, why not virtualize this server?

    Once virtualized, cluster the virtual machine running Terminal Services, so you have high availability for this.

    Does that answer your question ?

    rgds,
    Edwin.

    no time to search for documentation at the moment will post in the next couple of days or so. Meanwhile do a search for "Hyper-V cluster".
  • 3 July 2009 8:49 peter jones
     
    Hi Edwin,

    Do I take it you are saying we have one physical machine that has two (or more) cluster nodes, each running in a VM, with Terminal Services running in the nodes (or something similar)? If so, then it doesn't give us what we need - the application itself is very stable and basically doesn't fail - what we are trying to cover is the catastrophic failure of a physical server, so our two cluster nodes need to be on different physical servers.

    Cheers, Peter
  • 3 July 2009 8:54 Edwin vMierlo (MVP, Moderator)
     
    That is not what I was saying

    I mean

    2 physical machines - this is where the cluster is

    running multiple VM's capable of moving between nodes

    each VM running a full OS, and Terminal services - this is where your users connect and run the app.


    If one of the physical machines crashes, all VMs will move to the surviving physical machine (after a couple of seconds of downtime; remember, it is a cluster).
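
The layout above can be pictured as a table of which VM runs on which node. This is a toy sketch under assumptions: the VM and node names and the `fail_over` helper are made up for illustration, and a real cluster restarts the VMs itself rather than just editing a table.

```python
def fail_over(placement, failed_node, surviving_node):
    """Return a new VM-to-node placement after failed_node goes down.

    Hypothetical helper: VMs on the failed node move to the surviving
    node, and VMs already running elsewhere stay where they are.
    """
    return {
        vm: surviving_node if node == failed_node else node
        for vm, node in placement.items()
    }


before = {"MyTS-1": "node1", "MyTS-2": "node2"}
after = fail_over(before, "node1", "node2")
print(after)
```

The VM names are unchanged after the move, so users keep connecting to the same terminal server name whichever physical machine happens to host it.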

    rgds,
    edwin.

  • 3 July 2009 9:10 peter jones
     
    Hi Edwin,

    Thanks for the clarification. I guess what I'm not understanding is the benefit of running in VMs - is there something intrinsically beneficial about VMs, or is it just the fact that it's the VM that fails over and that makes the whole process simpler?

    Cheers, Peter
  • 3 July 2009 13:03 Edwin vMierlo (MVP, Moderator)
     
    Now there is a question which you can write a book about...

    Both from Admin and User perspective

    If the user always connects to "MyTS-1", then with clustered VMs this is always up, regardless of which node it is running on.
    If that were a physical server, it could be down if you had a physical problem. You would need to tell the users to connect somewhere else, with all the questions that brings.

    Also, the "recovery" from an outage is automatic: it fails over to the surviving node without human intervention (and without you getting the phone calls on your day off).
    Again, from a user perspective the "server" is running. They don't know where it is running, just that it is running!

    From an admin perspective you can do physical maintenance on a server without having to come in on the weekend. You move the VM to node2 (users will not know where it is running) and you can power down physical node1, do your maintenance on a weekday, and go sailing at the weekend ;-)

    probably good whitepapers discussing all this somewhere

    rgds,
    edwin.

  • 4 July 2009 2:33 peter jones
     
    Hi Edwin,

    Thanks for all your comments - very educational.

    It looks as though I'm going to get aboard the good ship "steep learning curve", as I'm now pretty confident clustering will deliver what we want from a business perspective.

    Cheers, Peter
  • 4 July 2009 11:48 Edwin vMierlo (MVP, Moderator)
     
    Peter,

    you're welcome !

    One last thought: High Availability is more than just "slapping a cluster together".
    Proper High Availability includes people, processes, disaster recovery, support contracts, and last but not least training!

    Consider all of those. Although they all require an investment of time and/or budget, it will pay off in the long run!

    Good Luck with your project !
    Rgds,
    Edwin.