Port Exhaustion with Dnscache

    Question

  • On two of our Windows Server 2012 servers, we've seen port exhaustion for the DNS Client (Dnscache) service. Our architect noticed that one of our servers couldn't resolve addresses. Checking the event logs, I worked my way backwards to an event in the System log, Source: Tcpip, Event ID: 4266, Description: "A request to allocate an ephemeral port number from the global UDP port space has failed due to all such ports being in use."

    I found the blog post Port Exhaustion and You, which says “99% of the time when someone has this problem, it happens because an application has been grabbing those ports and not releasing them properly” and “At this point you really should be contacting us, since finding and fixing it is going to require some debugging”, which we've done. We have an open support case. The blog also says to run netstat -anob, which showed 32,748 UDP ports used by Dnscache. tasklist /SVC|find /i "dns" showed me the PID to kill, and TASKKILL /PID ProcessID /F did the trick (stopping the service times out, presumably because of how long it takes to close that many ports). Then the server was back to normal.

    But I soon saw Tcpip Warning 4266 again, on a second server, three more times. When we captured the netstat output on that server, 22,702 UDP ports were being used by Dnscache. Some of the ports appear to be released eventually. (A sketch of scripting this per-PID port count follows at the end of this post.)

    We already encountered the 2008 R2 DNS service bug that keeps it from running on servers with sixty-four CPUs. Is this a problem of scale? These are relatively large servers, and the server where this has happened three times is more than three times the scale of the first server. I'm waiting for escalation to an engineer who can help debug this. In the meantime, I hope posting will help someone else looking for an answer, and in gathering data for their Microsoft Support case. I'll follow up as I know more. I expect Microsoft to have an answer or hotfix soon.
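    [Editor's note: not part of the original post. A rough PowerShell sketch of the same check done above by hand with netstat -anob and tasklist /SVC: count UDP endpoints per owning PID and list the services sharing each PID. It assumes the standard netstat -ano column layout, where the PID is the last column of each UDP row.]

        # Count UDP endpoints per owning PID (netstat -ano is faster than -anob
        # and does not need the name-resolution pass)
        $udpByPid = netstat -ano |
            Where-Object { $_ -match '^\s*UDP' } |
            ForEach-Object { ($_ -split '\s+')[-1] } |
            Group-Object |
            Sort-Object Count -Descending

        # Show the heaviest consumers and which services share each svchost PID
        $udpByPid | Select-Object -First 5 | ForEach-Object {
            '{0,6} UDP ports held by PID {1}' -f $_.Count, $_.Name
            tasklist /SVC /FI "PID eq $($_.Name)"
        }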

    Friday, November 09, 2012 3:13 AM

All replies

  • Hi,

    Thank you for your question.
    I am trying to involve someone familiar with this topic to look further into this issue. There may be some delay; we appreciate your patience.

    Thank you for your understanding and support.

    Best Regards,
    Aiden


    Aiden Cao

    TechNet Community Support

    Monday, November 12, 2012 3:06 AM
    Moderator
  • Hi,

    Based on my knowledge, we acknowledge that the Windows DNS Server cannot start with 64 processor cores on Windows Server 2008. If your scenario matches this, I'm sorry, but there is currently no solution and no fix available.

    The fix for that bug has to come from the product team, and they have confirmed that it will be fixed in the next version. However, we cannot confirm whether the port exhaustion on Server 2012 is caused by this bug or not.

    I think you can get a conclusion from Professional Email Support, since this analysis needs to be transferred to the debugging team. Here is the link for your reference:

    http://support.microsoft.com/

    I hope you understand, and if you have any concerns, please feel free to let me know.

    Thanks.

    Best Regards,

    Annie

    Monday, November 12, 2012 9:24 AM
  • I work with Greg.  Thanks for your reply.  We identified that bug with dns.exe on Windows 2008 R2 servers several years ago.  We upgraded to Windows 2012 in part to fix that bug.  

    On our Windows 2012 server, we now have another problem that I believe is unrelated.  The dnscache service uses too many ports until we get port exhaustion - for example, this morning it had 22,726 ports in use, and I could not open things like domain.msc due to odd errors saying it couldn't contact the domain for information.  After killing the Dnscache PID, I can open domain.msc without problems.  Of course the symptoms are highly varied and widespread.  (A note on checking the configured UDP port range follows after this post.)

    Our support case has been escalated to a "debug engineer", who has not contacted us yet.  Wish us luck!
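    [Editor's note: not part of the original reply. A quick way to see how many ephemeral UDP ports the box actually has before Tcpip Event 4266 fires, and, as a stopgap only, to enlarge that range; the values below are illustrative. The default dynamic range on Server 2012 is 49152-65535, i.e. 16,384 ports per protocol. Enlarging the range does not fix whatever is leaking ports.]

        # Show the configured ephemeral (dynamic) UDP port range
        netsh int ipv4 show dynamicport udp
        netsh int ipv6 show dynamicport udp

        # Stopgap only: enlarge the IPv4 UDP range so exhaustion takes longer to hit
        netsh int ipv4 set dynamicport udp start=10000 num=55535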

    Wednesday, November 14, 2012 2:08 PM
  • Update on this: It may not be DnsCache causing the port exhaustion.  When we have the port exhaustion symptoms, "netstat -anob" shows a ton of UDP ports (20k+) used by DnsCache, but that PID is actually shared with several other services.  I've now split them into their own services and am pending a reboot this weekend so I can start monitoring which service is truly at fault.
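    [Editor's note: not in the original post. One way to do the split described above is to mark each suspect service as its own process so it stops sharing a svchost PID; the change takes effect after the service (or the box) is restarted. Dnscache is used as the example name here; if the OS refuses the change for a particular service, running the same command against the other services sharing the PID still narrows things down.]

        # Run the service in its own svchost process so netstat's PID column
        # points at exactly one service (elevated prompt; sc.exe, not the
        # PowerShell "sc" alias for Set-Content)
        sc.exe config Dnscache type= own

        # Verify: TYPE should now report WIN32_OWN_PROCESS
        sc.exe qc Dnscache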
    Friday, November 30, 2012 2:54 PM
  • Update: I split out all the services into their own PIDs, and the next time port exhaustion occurred I could definitively pinpoint it to DnsCache.  I took a Task Manager dump of the service, a netstat -anob, and a DNS ETL trace, and sent them off to MS for analysis.  1.5 weeks later, still waiting to hear back, but when I sent an earlier set of these files they were beginning to suggest I configure reverse lookup zones.
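    [Editor's note: not from the thread. For anyone gathering the same data for a support case, a sketch of capturing a DNS client ETL trace with logman, assuming the Microsoft-Windows-DNS-Client ETW provider; the exact provider or tool Microsoft asks for may differ.]

        # Start an ETW session against the DNS client provider
        # (elevated prompt; assumes C:\temp exists)
        logman start dnscli-trace -p "Microsoft-Windows-DNS-Client" -o C:\temp\dnscli.etl -ets

        # ... reproduce the port build-up, and also capture
        #     netstat -anob > C:\temp\netstat.txt ...

        # Stop the session; send the .etl along with the netstat output and process dump
        logman stop dnscli-trace -ets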
    Saturday, December 22, 2012 10:45 PM
  • Any news regarding this issue?
    I have just experienced the same issue on two out of our three Hyper-V hosts.

    Best Regards
    Karsten Hedemann
    Thursday, March 21, 2013 12:26 PM
  • The server in question was suffering from a SQL 2012 bug that maxed out all CPUs on NUMA node 0, which caused problems with other things like Resource Monitor / perfmon, Disk Management, etc.  When we got NUMA node 0 CPU down to a reasonable level, those other problems went away, and so did our port exhaustion problem.  My hypothesis is that DnsCache fails to release ports when NUMA node 0 is under high CPU load, but does fine under normal loads.

    Or maybe it was coincidence, and some Windows Update fixed it for us around the time our CPU got under control.
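    [Editor's note: not from the original post. A way to keep an eye on the NUMA node 0 load described in the hypothesis above without opening Resource Monitor, using the Processor Information counter set (present on Server 2012), whose instances are named node,processor, so 0,_Total is the node 0 aggregate. A monitoring sketch only, not anything Microsoft prescribed in this case.]

        # Sample NUMA node 0 aggregate CPU every 5 seconds, 12 samples (~1 minute)
        Get-Counter -Counter '\Processor Information(0,_Total)\% Processor Time' `
                    -SampleInterval 5 -MaxSamples 12 |
            ForEach-Object {
                $v = $_.CounterSamples[0].CookedValue
                '{0:u}  node 0 CPU = {1,6:N1} %' -f $_.Timestamp, $v
            }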

    Thursday, March 21, 2013 1:14 PM
    Have you got any answers on this bug? I'm experiencing the same bug as you describe, and your "fix" worked to resolve the problem.

    /jonas

    Sunday, August 25, 2013 9:56 PM
  • No answers, though it still seems most likely to be a bug in the dnscache service under high CPU conditions.  The MS escalation engineer wanted us to run an "iDNA trace" using a tool called TimeTravelTracing (tttracer.exe).  That tool can only handle < 32 CPUs, so it wouldn't run on our 64-core box.  We were then going to try a handle trace, but by then our high NUMA node 0 CPU was gone and we couldn't reproduce the problem.

    If you can reproduce the problem, I would be happy to help you get in touch with the engineer who worked this case.  I'll reach out via LinkedIn.

    Monday, August 26, 2013 2:37 AM
  • Hi Ethan.

    I got this answer from Microsoft Support

    After doing some research, I find the most likely cause of this issue is a bug in the LLMNR listener. To resolve this issue, we can turn off the LLMNR listener by editing the following registry values:

    Note:

    The following contains information about modifying the registry. Before you modify the registry, make sure to back it up and make sure that you understand how to restore the registry if a problem occurs. For information about how to back up, restore, and edit the registry, click the following article number to view the article in the Microsoft Knowledge Base:

    Description of the Microsoft Windows Registry

    http://support.microsoft.com/kb/256986

    1.      Go to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters\MulticastResponderFlags

            and

            HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters\MulticastSenderFlags

    2.      Set the values to 0x1

    After rebooting the server, this issue should be resolved.
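    [Editor's note: not part of the support reply. The same change can be scripted; this assumes, as the reply implies, that MulticastResponderFlags and MulticastSenderFlags are REG_DWORD values under the Dnscache\Parameters key.]

        # Set the two LLMNR-related values per the support suggestion above
        # (back up the registry first; a reboot is required afterwards)
        reg add "HKLM\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters" /v MulticastResponderFlags /t REG_DWORD /d 1 /f
        reg add "HKLM\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters" /v MulticastSenderFlags /t REG_DWORD /d 1 /f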


    /jonas

    Monday, August 26, 2013 9:59 AM