Sporadic issues with network location detection on clients.

  • Question

  • Hi all,

    We seem to be having an issue with network location detection.  Basically, clients come out of sleep on site and sometimes identify themselves as outside the corporate network (this can be seen in the output of netsh dns show state).  Troubleshooting we've done so far:

    • We do not appear to be having any availability problems with our NLS servers.  They've been up every time checked, and we've checked a lot.
    • The clients that experience this problem will not identify themselves as on the corporate network until rebooted.  You can disconnect every network adapter, or put them back to sleep (or hibernate) and wake, and they still will not identify themselves as on the corp network.
    • When testing the above, I took a network capture while disconnecting and reconnecting the adapters.  I do not see any attempt to communicate with, or even resolve, the NLS.
    • DA troubleshooting tools (DCA and Windows troubleshooting) appear somewhat useless as they simply identify that the client can't connect to the DA servers on Teredo & HTTPS, when the root problem is really that they should not even be trying to.

    Two things that are peculiar in our environment off the top of my head: laptops have somewhat recently been given a power policy that uses hybrid sleep mode (unusual as this is designed for desktops).  Also, the central group has been deploying the 32 bit DCA to our 64 bit machines recently, where we used to run the 64 bit DCA.  Both of these seem unlikely to be the culprit to me, but my next step will be to test some machines without hybrid sleep as a factor.

    This also seems to have come on recently... I'm wondering if any recent Windows updates or Forefront EP updates could be a factor... but that too, seems like it might be a long shot.

    Has anyone seen anything like this?

    Thanks in advance,

    Ross


    • Edited by RossJG Friday, August 10, 2012 1:51 PM
    Friday, August 10, 2012 1:46 PM

Answers

  • Miraculously, I think we've solved it.  The issue seems to boil down to a race condition between NCSI location checks and registry policy processing, and I believe it is resolved by the hotfix linked in KB2680464.  Basically, if the NLA service goes to do an inside/outside check while registry policy processing is updating the registry values for NCSI, it attempts to read those values as they're being written and cannot read them successfully.  It then states it will skip inside/outside checks, goes into what it calls a "suspect" state, and never goes back to normal unless you do something that resets the NCSI process (restart the NLA service, go into local resolution mode and pull group policy, or reboot the machine).  Screenshot of a custom trace log I generated below:

    I believe the reason most customers do not encounter the problem nearly so frequently is that we had "Process even if the group policy objects have not changed" set to "Enabled" within Registry Policy Processing, which is not the Windows default and which most customers do not have set.  In this scenario, every ~90 minutes when group policy is scheduled to refresh, you're rewriting those registry values.  Pair this with the fact that very often your machine will be due to refresh policy right when it comes out of sleep, which is exactly when NCSI checks are happening at a fast pace (they start at one-second intervals per adapter and double over time until they reach 1024 seconds, and the transition tunnels are among the adapters making checks if you're offsite), and you can figure that you'll have occasional collisions.  My understanding of registry reads/writes is that there are no locks, so it makes sense that it calls the NCSI configuration "invalid" (this is what "Reason: 2" means), as it's probably reading during a change.
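
    The timing argument above can be sketched roughly in Python.  This is only an illustration of the doubling schedule described in the post, not the actual NCSI implementation: the function names are made up, the assumption that the first check fires one second after wake is mine, and the 5 to 20 second policy-write window is purely hypothetical.

```python
from itertools import accumulate


def ncsi_check_intervals(start=1, cap=1024):
    """Intervals as described in the post: start at 1 s and double up to 1024 s."""
    intervals = []
    current = start
    while current <= cap:
        intervals.append(current)
        current *= 2
    return intervals


def check_times(intervals):
    """Seconds after wake at which each check would fire (running total)."""
    return list(accumulate(intervals))


def checks_in_window(times, window_start, window_end):
    """Checks that land inside a hypothetical registry-policy write window."""
    return [t for t in times if window_start <= t <= window_end]


intervals = ncsi_check_intervals()
times = check_times(intervals)
print("intervals:", intervals)
print("check times:", times)

# Hypothetical window: policy rewrites the NCSI values 5-20 s after wake.
print("checks inside window:", checks_in_window(times, 5, 20))
```

    Most of the checks are packed into the first minute after wake, which is also when a pending policy refresh tends to run, so the two are more likely to overlap then than at any other time.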

    The sad thing about this issue is that we've had the case open with Microsoft since late September, and I told them in my original case email that there was NCSI stoppage, yet no one on the network team (to whom the case was originally assigned) suggested this hotfix.  Over a month into the case, after I did my own experimentation, generated the custom trace log (Group Policy + NCSI), and sent them the above screenshot, a member of the UAG team sent me the link to the hotfix shortly after he was looped in.  I don't mean to gripe, they were all nice enough, but it seems like the network team might have put 2 and 2 together sooner.  In our environment, we've both applied the hotfix and changed the Registry Policy Processing setting because we simply need to have this issue 100% resolved quickly (it's gotten to be pretty high profile as our number of DirectAccess machines has expanded).  So far, so good.






    • Marked as answer by RossJG Wednesday, November 7, 2012 8:46 PM
    • Edited by RossJG Wednesday, November 7, 2012 9:14 PM
    Wednesday, November 7, 2012 8:46 PM

All replies

  • Update: I checked a few more things:

    Using on-site wireless:
    Switching to "local resolution" in DCA fixes problem (as one might expect).
    Selecting "local resolution" again (toggling it back off) results in the names accessed through DA being unreachable on site.

    Switching over to cellular tether:
    Local resolution mode: nothing pings within my domain.  Even things listed as exceptions (bizarre).  External sites (e.g. google.com) ping fine.
    Corporate resolution mode:  sites accessed thru DA are reachable, others in my domain set as exceptions are not.  They seem to resolve correctly, though.  Bizarre.  So I'll get "pinging [valid IPv4 address] with 32 bytes of data:" and no reply.

    I've taken Wireshark and Process Monitor captures and may work with MS if no one here has seen this.


    • Edited by RossJG Tuesday, August 14, 2012 2:14 PM
    Tuesday, August 14, 2012 2:10 PM
  • The only thing similar I have seen (though this is usually discovered during implementation, not long after the fact) - if you are using the default IIS 7 splash screen for the NLS website it can cause intermittent problems verifying the NLS server. So for all of my installations now I use a default.htm that says something simple like "This is the NLS server" and the issues go away.
    Thursday, August 16, 2012 8:47 PM
  • I do the same thing (the simple index pages).  I've got our NLS behind a load balancer, so it's even more helpful b/c I use an index page on each load balanced NLS server that identifies the exact server connected to (e.g. "This is NLS-srv1").

    What's really interesting is that this seems to tie to group policy processing on the clients somehow.  I checked client behavior, and like I said, they don't seem to even try to hit NLS at all.  But, I had a workstation exhibiting the problem the other day and put it in a workaround "use local resolution" mode for a while.  To my surprise, I saw it communicate with our NLS server about 90 minutes after that.  I went to the machine and took it out of the local resolution mode, and sure enough, it was in a "cured" state.  I cross-referenced the time I saw it communicate with the NLS server with the system log on the machine, and the NLS communication happened about 1 second after a group policy pull.
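
    For anyone wanting to do the same cross-referencing across several machines, here is a minimal Python sketch.  It assumes you have already pulled the two sets of timestamps out of the logs yourself; the function name and the timestamps below are hypothetical and only there to show the idea.

```python
from datetime import datetime, timedelta


def events_following(triggers, events, within=timedelta(seconds=5)):
    """Pair each event with the latest trigger that precedes it by at most `within`."""
    pairs = []
    for event in events:
        preceding = [t for t in triggers if timedelta(0) <= event - t <= within]
        if preceding:
            pairs.append((max(preceding), event))
    return pairs


# Hypothetical timestamps, for illustration only.
gp_refreshes = [
    datetime(2012, 8, 20, 9, 0, 0),
    datetime(2012, 8, 20, 10, 31, 12),
]
nls_contacts = [datetime(2012, 8, 20, 10, 31, 13)]

matches = events_following(gp_refreshes, nls_contacts)
for refresh, contact in matches:
    gap = (contact - refresh).total_seconds()
    print(f"NLS contact at {contact} came {gap:.0f} s after policy refresh at {refresh}")
```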

    We're looking into group policy processing settings that may be affecting us.

    We do have a new workstation image and some new group policies on our machines lately, so I'm wondering if there's a connection there.

    Monday, August 20, 2012 4:23 PM
  • On this subject, does anyone know if there's a good way to drill down into NLA problems?  When I look at a machine that had the problem yesterday morning, I see that NCSI events stopped logging right after the machine was booted and pulled policy the previous afternoon.  Unfortunately, I don't know what to make of this.

    The computer behaved normally that afternoon (probably just stayed in the domain profile it was in) and did not exhibit the problem until it was taken out of sleep mode the following morning.  At that time, the firewall was using the private profile.

    I do see that our desktop group has been pushing multiple Windows Firewall settings GPOs (accidentally), but I'm a little doubtful that's related because it's just some individual rules they're pushing and nothing that conflicts with the DirectAccess policy.

    As usual, switching to local resolution and pulling policy cured the machine.  Only then did NCSI events start logging again.

    Is there any way to turn on a higher level of logging for NCSI/NLA/Firewall events?



    • Edited by RossJG Wednesday, September 19, 2012 6:30 PM
    Wednesday, September 19, 2012 1:13 PM
  • Hi Ross, thanks so much for your research and post on this. We are having exactly the same problem with Windows 7 Enterprise with SP1 clients and finding your KB link is very helpful!

    For others, this event log message was the only distinctive message we found around this network location detection failure.

    Log Name:      Microsoft-Windows-NCSI/Operational
    Source:        Microsoft-Windows-NCSI
    Date:          2/27/2013 11:05:50 AM
    Event ID:      4028
    Task Category: Inside/Outside detection verification
    Level:         Warning
    Keywords:      (1)
    ...

    Description:
    Inside/Outside detection is suspect
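
    If you copy events out of Event Viewer as text like the one above, a small Python sketch along these lines could flag the "suspect" warning.  The function names are hypothetical, and the parser only assumes the simple "Key: value" layout shown in this post.

```python
def parse_event_text(text):
    """Parse 'Key: value' lines from a copied Event Viewer entry into a dict."""
    fields = {}
    lines = [line.strip() for line in text.splitlines()]
    for i, line in enumerate(lines):
        if line == "Description:":
            # Everything after the Description header is the message body.
            fields["Description"] = " ".join(l for l in lines[i + 1:] if l)
            break
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def is_suspect_event(fields):
    """True for the NCSI warning (Event ID 4028) quoted in this thread."""
    return fields.get("Event ID") == "4028" and "NCSI" in fields.get("Source", "")


sample = """
Log Name:      Microsoft-Windows-NCSI/Operational
Source:        Microsoft-Windows-NCSI
Date:          2/27/2013 11:05:50 AM
Event ID:      4028
Task Category: Inside/Outside detection verification
Level:         Warning
Keywords:      (1)
...

Description:
Inside/Outside detection is suspect
"""

event = parse_event_text(sample)
print(event["Date"], event["Description"])
print("suspect:", is_suspect_event(event))
```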


    Tim Miller Dyck PeaceWorks Computer Consulting Waterloo, ON, Canada


    Thursday, February 28, 2013 1:05 AM