DirectAccess Clients Not Establishing Intranet SA's In A Timely Manner

  • Question

  • Hello,

    Our fresh UAG 2010 SP1 deployment is showing strange behavior. DA clients (we're still testing, and have only about 5-6 of them) establish the infrastructure NTLM tunnels with UAG, but the user-based intranet Kerberos tunnels are not established in time for running logon scripts, mapping drives, or accessing shares. In some cases, waiting several minutes allows the intranet tunnels to form, at which point I can manually kick off scripts to map drives, etc. However, the user logon process can take a long time, and after the user logs on, SA formation can take 1 minute, 5 minutes, or sometimes never happen (the NTLM tunnels are always present under Main Mode in wf.msc, though). After a reboot or a fresh logon, DA clients successfully reach their drives and shares no more than one or two times out of five.

    I've opened a case with MS, and we've been chasing our tails gathering captures, traces and logs. There is nothing special about our UAG setup (2 NICs): no third-party software and no third-party services are installed on UAG except for VMware Tools, and our organization as a whole is pretty textbook. Support has told me that our UAG installation is 'clean', and we've moved the test user and PC account to an OU with the bare minimum of GPOs applied: enough for UAG, plus a test logon script for mapping drives.

    The other day, I reverted to a snapshot of the UAG VM (which had previously shown the "bad behavior" with DA clients) after uninstalling VMware Tools in an attempt to rule them out as a problem, and voila! DA clients were suddenly able to log in reliably and repeatedly (I tested literally about 40 times) and access network resources through both the intranet and infrastructure tunnels. After 24 hours of testing 'good', I thought we were making progress. However, after installing UAG 2010 SP1 Update 1 (we need to get Lync 2010 working over DA), DA clients began breaking again. A reboot of the UAG server didn't help. Finally, rolling back to the magical snapshot that had worked once before made no difference. Clients were back to connecting slowly to UAG and haphazardly establishing user-side connections.

    I've played with LogonServers, thinking the problem might be linked to one of our three DCs. No change. I've submitted Event ID 4653 entries (IPsec Main Mode negotiation failures) and other logs to MS, and I'm still having issues. I'm not sure what is wrong, nor why, for one brief moment in time, everything worked properly.

    Thanks! 

    Monday, October 24, 2011 8:08 PM

All replies

  • Hi Warren,

    Sounds like a really strange situation you are in.
    I will throw out some ideas,

    Check the CAPI2 log to see if there are any problems related to the machine certificates and their verification.

    Are all domain controllers listed in the infrastructure tunnel so they can be reached to verify user logons and such?

    Check the internal DNS servers to make sure you don't have stale DNS records for the domain controllers.

    Best wishes,
    Jonas Blom

    Tuesday, October 25, 2011 9:37 AM
  • Jonas, thanks for replying.

    It is a strange issue; it's the luck of the draw whether intranet resources (Kerberos SAs) are mapped and available or not. The infrastructure SAs always appear in wf.msc, however. Even if the intranet tunnel isn't established, I can still ping everything defined in the infrastructure tunnel AND can ping servers outside of the infrastructure tunnel, but drive mappings and share access fail in this state. I wish I could establish a pattern.

    I'm embarrassed to say I do not know what or where the CAPI2 log is. Is it found on the UAG server? Within TMG?

    All three DCs are within the infrastructure tunnel. DNS records are accurate (A, AAAA and PTR records exist for all three DCs), and I wasn't able to find stale records.

    For what it's worth, we've tested several DA clients on several ISPs and experienced the same results. Our IP-HTTPS cert is from DigiCert, so the CRL is externally available, and the internal PKI servers used for IPsec certs are also used for RADIUS/wireless clients, where it all appears to work fine, so I'm not sure where to poke and prod if PKI is suspect. If my certs were an issue, wouldn't I receive consistently bad test results?

    Any other ideas?

    Tuesday, October 25, 2011 4:56 PM
  • Hi Again,

    Sorry about that, I should have written some info on how to enable it.

    CAPI2 is a log found in Event Viewer.
    Look under Applications and Services Logs -> Microsoft -> Windows -> CAPI2.
    Right-click "Operational" and select "Enable Log".


    I suggested this because it is an easy way to find errors related to certificates and CRL checking.

    My thought was that you would do this on your DA client to see if anything there relates to the establishment of the SAs and why they time out.


    Best wishes,
    Jonas Blom

    Tuesday, October 25, 2011 7:58 PM
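
For anyone who prefers the command line, the Event Viewer steps in the reply above can be sketched as a small script. This is only a sketch: it assumes the `wevtutil` tool that ships with Windows and the channel name `Microsoft-Windows-CAPI2/Operational`, the function names are made up for illustration, and it has to be run from an elevated prompt on the DA client.

```python
import subprocess
import sys

# Channel behind Event Viewer > Applications and Services Logs >
# Microsoft > Windows > CAPI2 > Operational.
CAPI2_CHANNEL = "Microsoft-Windows-CAPI2/Operational"


def capi2_log_command(enable=True):
    """Build the wevtutil command line that turns the CAPI2 log on or off."""
    flag = "/e:true" if enable else "/e:false"
    return ["wevtutil", "sl", CAPI2_CHANNEL, flag]


def set_capi2_log(enable=True):
    """Run the command and return its exit code (Windows, elevated prompt)."""
    return subprocess.run(capi2_log_command(enable), check=False).returncode


if __name__ == "__main__" and sys.platform == "win32":
    print("wevtutil exit code:", set_capi2_log(True))
```

Calling `set_capi2_log(False)` afterwards turns the log back off once the test logons have been captured.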
  • Jonas, thank you for the follow up. I may have enabled this log and not known it, since MS support had me enable many logs and perform several captures when my case was opened (It's still open). I'll take a look, and post my findings.
    Tuesday, October 25, 2011 8:14 PM
  • Hi again,

    The only odd item I'm seeing in the CAPI2 log is an Event ID 81; everything else pertaining to trust verification and chaining is Informational. Because the process name referenced in this Event ID 81 is 'eventvwr.exe', the message doesn't seem relevant. Should I be looking at a particular block of code in the event? Thanks!

    Tuesday, October 25, 2011 10:30 PM
  • Hi,

    We have seen similar issues before, and in the end they all cleared up.

    Assuming the UAG setup itself is correct, since it does work at times, you should verify that your NICs are configured correctly on both the UAG end and the switch end.

    I prefer to fix these at 100 Mbps full duplex on both ends. Also make sure you check your NIC drivers, as these have been the problem in my case.

    If you are using all the tunnel types, verify that the required ports and ICMPv6 traffic are allowed on your external firewall.

    The reason you can ping all the servers is that ICMP does not go through the tunnels; it is exempt from IPsec. If you want to know whether a tunnel is working, you can use: net view \\<servername> Do this first for a machine in the infrastructure tunnel and then for one in the corp tunnel.

    Arjan

    Thursday, October 27, 2011 6:48 AM
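
The `net view` check suggested in the reply above can be sketched as a script that probes one host per tunnel. The host names below are hypothetical placeholders, the function names are illustrative, and the sketch assumes that `net view` returns a nonzero exit code when the server cannot be reached.

```python
import subprocess
import sys


def net_view_command(server):
    """Build the net view command for one server (UNC prefix added here)."""
    return ["net", "view", "\\\\" + server]


def tunnel_report(results):
    """Turn {label: exit_code} into readable lines; 0 is treated as success."""
    return [
        "%s: %s" % (label, "OK" if code == 0 else "FAILED (exit %d)" % code)
        for label, code in results.items()
    ]


def probe(servers):
    """Run net view against each server; servers maps a label to a host name."""
    results = {}
    for label, server in servers.items():
        proc = subprocess.run(net_view_command(server),
                              capture_output=True, text=True)
        results[label] = proc.returncode
    return results


if __name__ == "__main__" and sys.platform == "win32":
    # Hypothetical names: use a DC from the infrastructure tunnel and a
    # server that is only reachable through the intranet (corp) tunnel.
    hosts = {"infrastructure tunnel": "dc01", "intranet tunnel": "fileserver01"}
    print("\n".join(tunnel_report(probe(hosts))))
```

Unlike ping, this goes over SMB, so a failure on the second host while the first succeeds points at the intranet tunnel rather than at basic connectivity.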
  • Arjan.

    The UAG server is a VM running on a VMware ESXi 4 host. There shouldn't be a speed/duplex issue: everything is auto-negotiating 1 Gbps full duplex on each NIC within the UAG server and in the vSwitch. (The UAG NICs are displayed as Intel Pro 1000/MT, using Microsoft's driver version 8.4.1.0, which was picked up from the Windows installation.)

    The ESXi 4 host is well connected, with a team of four 1 Gbps full-duplex connections. (It should be noted that no other guest VM in the cluster is having connectivity issues, but nothing is quite like UAG ;) )

    The fact that our results are hit-or-miss doesn't have me scrambling to look at our firewall, and we double-checked our open ports early on in troubleshooting. However, we'll go over them again. Good point on the net view \\<servername>; we've used this command to access shares while working with Microsoft. Our case is still open with them.

    Thanks!



    Thursday, October 27, 2011 3:47 PM
    I built a new UAG server and meticulously walked through the configuration of UAG 2010 SP1 with Update 1. I don't think I missed anything in the prerequisites or the installation, and there were zero issues with the GPO creation and DA activation. Everything went smoothly. However, once policy was applied to the test clients (we have a couple of them), we were back to facing the same results as before: intermittent drive mappings and unpredictable access to network resources through the Corp tunnels (and Events 4653 and 4984 in the Security log).

    wf.msc shows 2-4 NTLM SAs established each time I log on to the test DA client; however, the Kerberos SAs (I usually see 2) are conspicuously missing. When things are working properly, I (usually) see 2 NTLM and 2 Kerberos SAs on the DA client, at which point I can access resources, map drives, etc. All that separates a "good" and a "bad" connection on DA clients is a reboot. It's very odd, and seemingly random. Logon times for "good" connections are about 1.5-2.5 minutes. When it takes 2+ minutes to log on, I know there is going to be an issue with accessing resources and drives.


    Thursday, October 27, 2011 4:43 PM
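
The NTLM versus Kerberos SA check described above can be roughly automated instead of eyeballing wf.msc after every reboot. This is a sketch under assumptions: it relies on `netsh advfirewall monitor show mmsa` listing the main mode SAs, and because the exact layout of that output is not assumed here, it only counts lines that mention NTLM or Kerberos. The function names are illustrative.

```python
import subprocess
import sys


def count_sa_auth(text):
    """Count lines mentioning NTLM and lines mentioning Kerberos.

    Matching is case-insensitive and substring-based, so it does not
    depend on the exact layout of the netsh output.
    """
    counts = {"ntlm": 0, "kerberos": 0}
    for line in text.splitlines():
        lowered = line.lower()
        if "ntlm" in lowered:
            counts["ntlm"] += 1
        if "kerb" in lowered:
            counts["kerberos"] += 1
    return counts


def diagnose(counts):
    """Map the counts to the good and bad states described in the post."""
    if counts["kerberos"] > 0:
        return "intranet (Kerberos) SAs present"
    if counts["ntlm"] > 0:
        return "infrastructure (NTLM) SAs only; intranet SAs missing"
    return "no NTLM or Kerberos main mode SAs found"


if __name__ == "__main__" and sys.platform == "win32":
    # Run on the DA client from an elevated prompt.
    out = subprocess.run(
        ["netsh", "advfirewall", "monitor", "show", "mmsa"],
        capture_output=True, text=True,
    ).stdout
    counts = count_sa_auth(out)
    print(counts, "->", diagnose(counts))
```

Logging the result together with the logon time after each reboot might help establish whether the slow logons and the missing Kerberos SAs really do go together.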
  • Hi Warren,

    Did you ever get to a solution with MS?


    Regards,
    Anders
    Monday, December 12, 2011 12:13 PM
  • Hi Anders,

    Microsoft has been dedicated to helping us through our issues. We ended up taking the UAG server out of the virtual environment and moving it to a 1U IBM System x server, where performance has been rock solid for weeks, just as one would expect. I am certain many will have positive experiences running UAG in a Hyper-V or VMware guest VM, but we needed to make the jump back to a physical server. The VM was great for conceptual testing, and we could cajole it into performing well with DA clients, but we couldn't live with the flaky behavior that could present itself after a reboot or two. It may be our version of VMware, or other factors, but we simply ran low on time trying to figure it out.

    I've since taken the single server and set up an NLB. It's working, but I can't seem to access a network share from either UAG server (or DA clients) now that we're a cluster. This wasn't a problem with single server mode. I can ping the host, but can't map the share (via name or IP). I'm certain this is a minor issue to iron out, and expect to be completely fine within a day or so.

    Wednesday, February 15, 2012 7:02 PM