RDS NLB cluster problems

  • Question

  • We had an issue with NLB in combination with our physical servers. Let me explain our environment and the issue we are experiencing.

     

    We have the following setup:

    - 3 systems with the combined roles of RD Web Access and RD Gateway: 2 virtual and 1 physical.

    - 1 RD Connection Broker (RDCB), virtual.

    - 1 farm of 3 systems with the RD Session Host role: 2 physical machines and 1 virtual. I will call this Farm1.

    - 1 farm of 2 systems with the RD Session Host role, both virtual machines. I will call this Farm2.

    - The farm names correspond with the DNS name of the NLB cluster (with a corresponding DNS record).

    - The servers and our users don't have direct Internet access.

    - The servers are not cloned; all were created with a full manual install.

    Users log in to RD Web Access, which queries the RDCB, which in turn queries the farms. The farms populate RD Web Access, and from there users start the RemoteApps.

     

    This configuration has been in place for 6 months and worked fine until last week, when we had the following issue:

    • One of the physical RDS servers stopped responding and we couldn't even log in on the console. We rebooted this machine, and after that reboot the boot process appeared to stop at "applying group policy settings". This is the point in the boot process where we see NLB start and converge (on our TMG arrays and on the 2008 TS servers).

    • Normally, when a server restarts, NLB reconfigures itself and connections stop being redirected to that server. This time, however, users were still being forwarded to the (re)booting server, even though NLB didn't work on that machine and none of the RDS services had even started. We also noticed that the NLB cluster itself, which had 3 hosts, now consisted of only 1 host: the other physical RDS server. The virtual server (which was in drain mode, so it had no active users) had suddenly been removed from the cluster.

    • Since this is a production environment, we turned the server (RDS01) off. But the RDCB was still directing new logon sessions to RDS01, even while it was turned off.

    • Rebooting the RDCB did not help. Normally only active RDS servers (the ones that are on) join the RDCB, but RDS01 was still present in the RDCB after the reboot, even though it was turned off.

    • We fixed the issue by shutting down the server, removing its entry from the registry on the RDCB, and restarting the service.

    We wanted to know what went wrong, so we disabled the NICs in the BIOS and booted the server, which was the only way we could boot it without disrupting NLB. The only way to boot into the OS was safe mode; normal mode would still hang at the applying group policies screen. There were no useful messages in the event logs. Only after removing NLB with: ocsetup NetworkLoadBalancingFullServer /uninstall would the server boot in normal mode.

    We needed the server back as soon as possible, so there wasn't much time to dig into this problem, and we didn't trust the installation anymore. We reverted RDS01 to a backup made right after the installation of that server and rejoined it to the cluster. Everything worked again as expected.

    That lasted about 2 weeks; then the other physical server suddenly started to deny logins. No new logins were possible. The cause was that many of the services configured to start under a network service account began to crash.

    Since the load (number of sessions) was low on that server, it got all the new sessions directed to it.

    We were unable to put the server in drain mode because of the crashed services, so we rebooted it to keep it from getting new sessions and to see if that fixed the issue. It also didn't get any further than “applying computer settings”. The terminal services hadn't started yet, but the RDCB redirected users to it anyway. We turned the server off, and the RDCB still kept redirecting to the powered-down server. This time a reboot of the RDCB server was enough to stop this.
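    The behaviour described above (a broken server with few sessions attracting every new logon) can be illustrated with a toy model. This is only a sketch under stated assumptions: it is not the RDCB's actual algorithm, the `healthy` flag and the host names RDS02/RDS03 are hypothetical, and the session counts are made up for illustration. It shows that a pure "fewest sessions wins" rule keeps choosing a dead host, because failed logons never raise that host's session count.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    sessions: int
    healthy: bool  # hypothetical flag: can this host actually accept logons?


def pick_least_sessions(hosts):
    """Pick the host with the fewest sessions, ignoring health entirely."""
    return min(hosts, key=lambda h: h.sessions)


def pick_least_sessions_healthy(hosts):
    """Same rule, but only among hosts that pass a health check."""
    candidates = [h for h in hosts if h.healthy]
    return min(candidates, key=lambda h: h.sessions) if candidates else None


def simulate(hosts, new_logons, picker):
    """Route new_logons one by one; return how many of them fail."""
    failed = 0
    for _ in range(new_logons):
        target = picker(hosts)
        if target is None or not target.healthy:
            # The logon fails, so the dead host's session count never grows
            # and it stays the "least loaded" choice for the next logon.
            failed += 1
        else:
            target.sessions += 1
    return failed


if __name__ == "__main__":
    # Made-up numbers: RDS01 is down, the other two hosts are busy.
    farm = [Host("RDS01", 0, False), Host("RDS02", 40, True), Host("RDS03", 35, True)]
    print("failed without health check:", simulate(farm, 10, pick_least_sessions))
    print("failed with health check:", simulate(farm, 10, pick_least_sessions_healthy))
```

    In this model every logon fails under the first rule and none fail under the second, which matches what we saw: the only thing that stopped the redirection was getting the dead server out of the broker's list.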

    This time we saw event log messages, of which we only have screenshots. They are in Dutch, but the first one is:

    source: Group Policy Folder Options.

    ID: 8196

    User: system

     

    (remember, this is my own translation)

    The client-side extension caught the unhandled exception 'filter group expand' in:

    'Access violation (0xc0000005) occurred at 0x76f41da0; the memory at 0x01a86018 could not be read.' See the trace file for more information.

     

    The second one:

    Source: Application error

    ID: 1000

    user: no user

     

    Message:

    Name of the faulting application: svchost.exe_ProfSvc, version: 6.1.7600.16385, timestamp: 0x4a5bc1

    Name of the faulting module: ntdll.dll

    Exception code: 0xc0000374

    Error offset: 0x000000000c6cd2

    Path to the faulting application: c:\windows\system32\svchost.exe

    Path to the faulting module: c:\windows\system32\ntdll.dll
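    Both exception codes in these events are standard NTSTATUS values: 0xc0000005 is an access violation (as the first message itself says) and 0xc0000374 is heap corruption. A small Python sketch for decoding them follows. The lookup table is deliberately limited to the two codes seen here, and the function name `describe_ntstatus` is just an illustrative choice, not part of any Windows tool.

```python
# Names as defined in the Windows SDK header ntstatus.h.
# Only the two codes from the events above are listed.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC0000374: "STATUS_HEAP_CORRUPTION",
}

# The top two bits of an NTSTATUS value encode its severity.
SEVERITIES = ["success", "informational", "warning", "error"]


def describe_ntstatus(code_text):
    """Turn a hex string such as '0xc0000374' into a readable description."""
    code = int(code_text, 16)
    severity = SEVERITIES[(code >> 30) & 0x3]
    name = NTSTATUS_NAMES.get(code, "unknown (not in this small table)")
    return f"0x{code:08X}: {name}, severity={severity}"


if __name__ == "__main__":
    for text in ("0xc0000005", "0xc0000374"):
        print(describe_ntstatus(text))
```

    Decoded this way, the second event says that heap corruption was detected in the svchost.exe process hosting ProfSvc; it does not by itself say which component corrupted the heap.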

     

    Removing NLB didn't work this time: the server hung at “configuring windows” after the required reboot.

     

    We also have 8 physical TS servers (Windows 2008, not R2) in the same kind of config (TS Web, TS SB, TS GW), and that config did not experience this issue. So we investigated further and now think it has to do with the Broadcom BACS NIC teaming:

    - We use BACS NIC teaming on the 2008 TS farm as well as on the 2008 RDS farm.

    - Dell (and Broadcom) tell us that the TOE (TCP Offload Engine) capabilities don't work with NLB, but on our RDS servers TOE was turned on.

     

    We turned TOE off on the Broadcom NICs after the first incident, but this evidently didn't help.

    Also, Farm2, which has the exact same configuration, doesn't have these crashes. However, users are being migrated from Farm2 to Farm1, and the load on Farm2 has never been as high as the load on Farm1.

    On a side note, the 2008 TS farms (32-bit, not W2K8R2) have fewer users per server, but the total number of users on a single farm is a lot bigger than on the 2008 R2 RDS farm. The load (memory, CPU and network bandwidth) on the servers isn't an issue. Perhaps the number of connections per server could be an issue?

     

    Does anybody have any ideas on this matter?

     

    We're currently investigating the BACS versions to see if that could be the cause. I'll post updates as we find them.

    Thursday, May 19, 2011 7:41 AM

All replies

  • Hi,

    MS NLB is not ideal. I would recommend using a third-party application delivery controller (ADC) such as KEMP or F5. KEMP's ADCs are the more affordable range and are simple to set up. Have a look at the following link:

    http://ryanmangansitblog.wordpress.com/2013/09/05/load-balance-rds2012-rdwa-and-rdgw-using-sub-interfaces-on-kemps-loadmaster/

    Best regards,

    Saturday, November 2, 2013 11:26 PM