Hello folks! In this article I am going to explain a situation that I faced on my domain controllers running Windows Server 2003 SP2 / Windows Server 2003 R2.

 

Situation:

We were facing slow interactive logons on our domain controllers. A logon was taking approximately 20 to 40 minutes, which was a clear sign of a problem on the domain controllers.

Investigation:

The first thing most of us do is check the event logs for any sign of an issue, but I was discouraged to find nothing relevant there. I checked the health of my domain controllers using all the popular tools, including system performance monitoring, but everything seemed OK. Luckily, another day I noticed that my monitoring solution (HP OVO) was having trouble fetching some of its data from the domain controllers; for example, it could not get the output of the “tasklist” command. I ran this command manually on one of the domain controllers and, to my surprise, got the error “WMI memory quota violation”. I then ran the “systeminfo” command, which also uses WMI to fetch information from the system, and got the same error. This meant that every command or tool that uses WMI to fetch information would fail.

Error:

The following query was run from the command prompt:

C:\> wmic csproduct get name

The error returned was:

0x8004106C
Description: Quota violation.

Now I understood that WMI was exhausting its memory quota, which could be the cause of the slow logon to the system.
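If you want your monitoring scripts to recognize this condition, a minimal sketch like the one below can scan captured command output for the error code. The code 0x8004106C corresponds to WBEM_E_QUOTA_VIOLATION; the helper name find_wmi_error and the sample text are my own illustration, not part of any tool.

```python
import re

# Only the code seen in this article is listed; add others as needed.
KNOWN_WMI_ERRORS = {
    0x8004106C: "WBEM_E_QUOTA_VIOLATION",
}


def find_wmi_error(output):
    """Return (code, name) for the first known WMI HRESULT in output, else None."""
    for match in re.finditer(r"0x[0-9A-Fa-f]{8}", output):
        code = int(match.group(0), 16)
        if code in KNOWN_WMI_ERRORS:
            return code, KNOWN_WMI_ERRORS[code]
    return None


# Error text as quoted in this article (hypothetical capture of wmic output).
sample = "0x8004106C\nDescription: Quota violation."
result = find_wmi_error(sample)
if result is not None:
    print("WMI error %#x (%s)" % result)
```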

Troubleshooting:

WMI enforces its own memory quota, so hitting that limit does not show up as memory exhaustion in the system performance logs. On Windows XP and Windows Server 2003 systems, the WMI memory quota per host is 128 MB. On Windows Vista and Windows Server 2008 systems, the memory quota per host is 512 MB.

I referred to the following article to increase the default quota:

http://blogs.technet.com/b/askperf/archive/2008/09/16/memory-and-handle-quotas-in-the-wmi-provider-service.aspx

After I increased the WMI memory quota to 512 MB, the slow logon issue disappeared along with the quota violation.
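As far as I understand from the article above, the quota is stored in bytes in the MemoryPerHost property of the __ProviderHostQuotaConfiguration class in the root namespace. The small sketch below only does the megabyte-to-byte arithmetic so you know which number to enter; the function and variable names are my own.

```python
MIB = 1024 * 1024  # bytes in one megabyte, as used for the quota value


def quota_bytes(megabytes):
    """Convert a quota given in MB to the byte value stored in WMI."""
    return megabytes * MIB


DEFAULT_2003_QUOTA = quota_bytes(128)  # default on Windows XP / Server 2003
RAISED_QUOTA = quota_bytes(512)        # value I raised it to (Vista / 2008 default)

print("MemoryPerHost default:", DEFAULT_2003_QUOTA)
print("MemoryPerHost raised: ", RAISED_QUOTA)
```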

Well, my happiness didn’t last long :( The quota violation reappeared a couple of hours after the system restart. Logon was still working fine with no lag, which meant the slow logon issue was resolved, but something was still eating the WMI memory quota.

The next job was to find the reason behind the memory exhaustion, and the best way to trace it is to take dumps of the WMI processes “winmgmt” and “wmiprvse”. As the winmgmt service runs as a shared service under svchost.exe, we first had to isolate the service.

[Screenshot: winmgmt running in the shared svchost.exe before isolation]

To isolate the WMI service, run the following command (the change takes effect the next time the service is restarted):

sc config winmgmt type= own

 

You can also refer to the following article for svchost troubleshooting:

http://blogs.technet.com/b/askperf/archive/2008/01/11/getting-started-with-svchost-exe-troubleshooting.aspx

After the service had been isolated, I took process dumps using Process Explorer while the server was in the problem state. One dump was of the "winmgmt" service, which we had separated from the shared svchost, and the other was of the "wmiprvse" process.

NOTE: Process Explorer does not show a winmgmt process, because the service runs inside svchost; you will only see svchost processes. Before taking the winmgmt dump, look up the PID of the svchost that hosts winmgmt in the task list (tasklist /svc), find the svchost with the same PID in Process Explorer, and take the dump of that process.
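The PID lookup described in the note can also be scripted. Below is a minimal sketch that parses saved "tasklist /svc" output and returns the PID of the process hosting a given service. The sample listing, its PIDs, and the function name pid_for_service are made up for illustration; on a healthy system you would feed it the real command output.

```python
def pid_for_service(tasklist_output, service):
    """Return the PID of the process hosting `service`, or None if not found."""
    wanted = service.lower()
    current_pid = None
    for line in tasklist_output.splitlines():
        if not line.strip() or line.startswith("="):
            continue  # skip blank lines and the ==== separator
        parts = line.split()
        if line[0].isspace():
            # Continuation line: more services of the previous process.
            services_text = line
        elif len(parts) >= 2 and parts[1].isdigit():
            current_pid = int(parts[1])
            services_text = " ".join(parts[2:])
        else:
            continue  # header row or image names containing spaces
        names = [name.strip().lower() for name in services_text.split(",")]
        if wanted in names and current_pid is not None:
            return current_pid
    return None


# Hypothetical listing in the layout printed by "tasklist /svc".
SAMPLE = """
Image Name                   PID Services
========================= ====== =============================================
System Idle Process            0 N/A
lsass.exe                    512 Netlogon, NTDS, SamSs
svchost.exe                  884 DcomLaunch
svchost.exe                 1040 AeLookupSvc, Browser, CryptSvc, Schedule,
                                 seclogon, ShellHWDetection
svchost.exe                 2316 winmgmt
"""

print("winmgmt is hosted by PID", pid_for_service(SAMPLE, "winmgmt"))
```

Once winmgmt is isolated, the svchost returned here hosts only that service, so it is the one to dump.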

After debugging, I found that most of the memory usage in winmgmt came from the fastprox heap, which WMI uses to marshal data between WMI clients and the WMI service. Looking further into what WMI was doing, there was 435 MB of data for an event notification query from process ID 3900. The query asks to be notified of every event that gets logged to the security log, which can be extremely expensive on a domain controller. Apparently the events were arriving faster than process 3900 was able to process the data.

I found that the process was dllhost.exe, running under an account used for log collection, and that it was making the query "SELECT * FROM __InstanceCreationEvent  WHERE TargetInstance ISA 'Win32_NTLogEvent' AND (TargetInstance.Logfile='Security')". Such a query is going to fire very frequently on a domain controller.


Resolution:

I traced that account and found that our central security team was using it to collect security logs from each domain controller to a central server. Now it made sense: the central log collector was using a WMI notification query to collect the logs from every domain controller, and with a security log size of 400 MB, the query was causing the WMI memory bottleneck. I confirmed this by asking the security team to remove one of the servers from their log collection; as soon as they did, the WMI quota violation issue was resolved.

Luckily they were implementing another log collection system, so I didn’t have to take extra measures to fine-tune WMI. But you do have to fine-tune your WMI notification queries to avoid such nightmares :)
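One way to tune such a query, which I did not need to try myself, is to subscribe only to the event IDs you actually care about instead of the whole security log. The sketch below just builds the WQL text, assuming a filter on the EventCode property of Win32_NTLogEvent; the function name and the event IDs 528 and 529 (logon success and failure on Windows Server 2003) are only examples.

```python
def security_event_query(event_codes):
    """Build a WQL notification query limited to the given security event IDs."""
    codes = sorted({int(code) for code in event_codes})
    if not codes:
        raise ValueError("at least one event code is required")
    code_filter = " OR ".join(
        "TargetInstance.EventCode = %d" % code for code in codes
    )
    return (
        "SELECT * FROM __InstanceCreationEvent "
        "WHERE TargetInstance ISA 'Win32_NTLogEvent' "
        "AND TargetInstance.Logfile = 'Security' "
        "AND (" + code_filter + ")"
    )


# Example IDs chosen for illustration only.
print(security_event_query([529, 528]))
```

A narrower filter should mean fewer events are delivered to the collector, but test it against your own event volume before relying on it.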

 

Cheers for success!!