Health Service Crashing - Event ID 4000

  • Question

  • I am getting this error on our SCE installation:

    Event Type:    Error
    Event Source:    HealthService
    Event Category:    Health Service
    Event ID:    4000
    Description:
    A monitoring host is unresponsive or has crashed.  The status code for the host failure was 2164195371.

    Followed with this Warning:
    Event Type:    Warning
    Event Source:    HealthService
    Event Category:    Health Service
    Event ID:    1103
    Description:
    Summary: 1787 rule(s)/monitor(s) failed and got unloaded, 0 of them reached the failure limit that prevents automatic reload. Management group "RSM01_MG". This is summary only event, please see other events with descriptions of unloaded rule(s)/monitor(s).

    and this one:
    Event Type:    Warning
    Event Source:    HealthService
    Event Category:    Health Service
    Event ID:    1103
    Description:
    Summary: 1 rule(s)/monitor(s) failed and got unloaded, 0 of them reached the failure limit that prevents automatic reload. Management group "RSM01_MG". This is summary only event, please see other events with descriptions of unloaded rule(s)/monitor(s).

    --------------------------------------
    I assumed that this was related to the SNMP network devices I added, so I applied this hotfix KB951526
    (per Clive Eastwood) with no joy.  Has anyone experienced this issue?  It seems like I have to restart the health service for 'things' to return to normal.


    Tuesday, August 5, 2008 3:29 AM

Answers

  • Eric, thank you for your help.  We have opened a case with MS Support on the issue. 
    When resolved, I will post the real problem/resolution on this thread... If I remember :)

    Neale
      
    Tuesday, August 12, 2008 1:08 PM

All replies

  • Hi Neale,

    Did you import any third-party Management Packs before?

    Please open the SCE console, navigate to the Administration space, and choose "Management Pack".
    Review the dependencies on the SNMP library MP to check whether any third-party Management Packs depend on it.

    If there are any, try deleting those third-party MPs and check whether the problem still exists.

    --------------------
    Regards,
    Eric Zhang



    Thursday, August 7, 2008 9:21 AM
    Moderator
  •  

    Only one third-party MP had a dependency on the SNMP Library MP. I have removed it and the issue still exists.
    Friday, August 8, 2008 4:20 AM
  • Hi Neale,

    We need to turn on Watson reporting for your SCE server. Please run "regedit" on your SCE server and navigate to


    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HealthService\Parameters

    Change the value of "Error Reports Enabled" from 0 to 1 to enable Watson.

    Once Watson reporting is enabled, there should be some other events logged giving the bucket ID of the crash report.

    Please post them in this thread.
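
For anyone who would rather import the change than edit the registry by hand, here is a minimal sketch that builds the text of a .reg file for the key and value named above. It assumes the value is stored as a REG_DWORD, which the post does not state, and the function name `build_reg_file` is purely illustrative.

```python
# Sketch: build the text of a .reg file that sets "Error Reports Enabled".
# Assumption: the value is a REG_DWORD; the thread does not state its type.

KEY_PATH = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HealthService\Parameters"
VALUE_NAME = "Error Reports Enabled"


def build_reg_file(enabled: bool) -> str:
    """Return .reg file text that turns Watson error reporting on or off."""
    dword = 1 if enabled else 0
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{KEY_PATH}]",
        f'"{VALUE_NAME}"=dword:{dword:08x}',
        "",
    ]
    # .reg files conventionally use CRLF line endings.
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(build_reg_file(True))
```

Save the output to a file with a .reg extension and review it before importing; back up the key first.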

    --------------------
    Regards,
    Eric Zhang






    Monday, August 11, 2008 10:34 AM
    Moderator
  • Microsoft issued a hotfix for Event ID 4000 not too long ago. The page is here: http://support.microsoft.com/kb/951526

    Not sure if this will help. I haven't applied it yet.

     

    Monday, August 11, 2008 3:47 PM
  • Hi Neale,

    Have you solved this problem? I have the same one. I monitor six HP switches and that works fine. When I add one IBM SAN 16B switch, I immediately get this error and the Health Service crashes. I tried applying hotfix KB951526, but the result is the same.

    I have to remove this SAN switch. I use the Quest solution for VMware and Linux monitoring, and that MP depends on the SNMP device in the SNMP library MP.

    So I need a solution.

     

    Thanks Jan

    Tuesday, August 26, 2008 8:23 AM
  •  

    Well, anyone? Quote: "I will reply when I have the true solution"? That was almost 4 months ago. Yeah, I could call MS, but why? To be further disillusioned with them, put my hands up and say @$#%^^*&* it?

     

    I did a bunch of SNMP additions (8 switches) and this *** hit the fan.

     

    So they screwed us with this so called hotfix and now it hotstops? Surprised if you've been like this for last 4 months?

     

    Bueller, Bueller???? Only if you know Ferris.

     

     

     

    Event ID 5300, The local health service is not healthy. Entity state change flow is stalled with pending acknowledgment blah blah blah.......

     

     

     

    Friday, November 28, 2008 2:50 PM
  • We are aware of an issue where Essentials is unable to correctly monitor network devices that report an interface speed of 2Gb/s or greater.

     

    One of the symptoms is that the Health Service will unexpectedly exit, logging Event ID 4000 in the Event Log.

     

    The cause of this issue is separate from the issue resolved with KB951526, which is why applying that KB doesn’t resolve this issue.

     

    There are 2 possible workarounds, which will not help in every situation:

    1)     If your network device does not have interface speeds > 2Gb/s, check for updated firmware from your device manufacturer

    2)     Exclude network devices reporting interface speeds > 2Gb/s from monitoring.
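
As a rough illustration of the second workaround, the sketch below splits a device inventory into devices to keep monitoring and devices to exclude. The inventory layout, the device names, and the function `split_by_interface_speed` are all hypothetical; Essentials does not expose this as an API, and you would have to gather the reported interface speeds yourself (for example from an SNMP walk). The threshold treats 2Gb/s itself as unsafe, matching the "2Gb/s or greater" wording above.

```python
# Sketch: decide which network devices to exclude from monitoring.
# The inventory structure and device names are hypothetical examples.

THRESHOLD_BPS = 2_000_000_000  # 2 Gb/s, the limit mentioned in this post


def split_by_interface_speed(inventory, threshold_bps=THRESHOLD_BPS):
    """Return (keep, exclude) lists of device names.

    A device is excluded if any of its interfaces reports a speed
    at or above the threshold (in bits per second).
    """
    keep, exclude = [], []
    for device, speeds_bps in inventory.items():
        if any(speed >= threshold_bps for speed in speeds_bps):
            exclude.append(device)
        else:
            keep.append(device)
    return keep, exclude


if __name__ == "__main__":
    inventory = {
        "edge-switch-1": [100_000_000, 1_000_000_000],
        "san-switch-1": [4_000_000_000],
        "core-switch-1": [1_000_000_000, 10_000_000_000],
    }
    keep, exclude = split_by_interface_speed(inventory)
    print("monitor:", keep)
    print("exclude:", exclude)
```

Pass a different `threshold_bps` if you want a strict "greater than 2Gb/s" cut-off instead.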

     

    We are working on a fix, although I don’t have a release date yet. The fix was initiated after investigation of the case Neale opened with Microsoft Support. We haven’t provided a fix for Neale yet, which is why he hasn’t been able to post an update.

     

    I will post an update to this thread once the fix is available for download.

    Monday, December 1, 2008 11:01 PM
  • Richard,

     

    All I ask is for a little communication on these forums. You don't just leave a thread open ended for 4 months, particularly with the teething issues the System Center products have been through, otherwise it just appears to be one annoyance after another. Neale did say 'If I remember' so it seemed he didn't.

     

    Well, I went ahead and opened a case with Microsoft anyway, as I could not assume there has been no fix after 4 months and I need this product to work so I can move on. Who is putting up with these problems for this long? This is a monitoring system after all! A company could lose a lot of clients, or have to write a lot of refund cheques, particularly in the SMB market SCE is aimed at, when SLAs cannot be met because its monitoring system is not up to the task.

     

    Now, the issue at hand. Hang on, greater than 2Gb? Your first line states it has problems with speeds of 2Gb/s or greater. You then suggest firmware for devices that do not have speeds above 2Gb/s. I don't get it.

     

    All our devices are less than 2Gb and all have the latest stable firmware.

     

     

    How is this a separate issue from KB951526? The errors are identical to the letter, though our issue does not involve any MPs, only device discovery and the default-enabled SCE reporting from there.

     

    thanks

     

     

    Monday, December 1, 2008 11:54 PM
  • Hi Hittin,

     

    We have seen that in some cases network devices will incorrectly report an interface speed >2Gb/s. Sometimes there is a firmware update that results in the devices reporting the correct interface speed. It isn't always an option, but has helped others. It won't help if you have devices with >2Gb/s interfaces because Essentials isn't correctly handling these faster devices. 

     

    The Event ID 4000 messages don't uniquely identify a single problem - just that the Health Service has crashed or stopped responding.  I agree the symptoms are similar, but looking at the stack traces the issues are different and require different fixes.

     

    Ask the support engineer responsible for your Microsoft Support case to contact me and if their research hasn't already found it, I can point them to an internal article that will help them confirm whether the issue you're seeing will be fixed by the hotfix we're working on.

     

    Thanks.  

     

    Tuesday, December 2, 2008 4:31 PM
  • Request an escalation to second level if you haven't already done this... good luck with MS.

    Tuesday, December 2, 2008 9:55 PM
  • Neale,

     

    While running a trace and performing an SNMP discovery, the tech found the following in TracingGUIDsNative.log.

    Cause unknown at this stage.

     

    0              00000000             [0]1448.7528::12/04/2008-15:50:17.752 [HealthServiceCommon]  Error EventLogUtil::LogEvent(EventLogUtil_cpp272)Logging error event with args 2164195371
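
The decimal value in that log line is the same status code shown in the Event ID 4000 description, and support tools usually display such codes in hexadecimal. The sketch below converts it and splits it into fields, assuming the value follows the standard HRESULT layout (severity bit, facility, code). That layout is an assumption on my part; nothing in this thread confirms what the individual fields mean for this error.

```python
# Sketch: show a 32-bit status code in hex and split it into HRESULT-style fields.
# Assumption: the status follows the HRESULT bit layout; the thread does not confirm this.


def describe_status(status: int) -> dict:
    """Split an unsigned 32-bit status value into HRESULT-style fields."""
    if not 0 <= status <= 0xFFFFFFFF:
        raise ValueError("status must fit in an unsigned 32-bit integer")
    return {
        "hex": f"0x{status:08X}",
        "failure": bool(status >> 31),       # top bit set means failure in an HRESULT
        "facility": (status >> 16) & 0x7FF,  # 11-bit facility field
        "code": status & 0xFFFF,             # low 16 bits
    }


if __name__ == "__main__":
    # Status code taken from the Event ID 4000 description in this thread.
    print(describe_status(2164195371))
```

For 2164195371 this gives 0x80FF002B, which is the form worth quoting when searching or talking to support.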

     

     

    Richard,

     

    I gave the tech a link to this thread; he said he would be in contact with you.

     

    thanks

     

    Tuesday, December 9, 2008 1:38 AM
  • Hi all,

    I am having the same issue in my SCE SP1 environment. I have 25 SCE-managed servers and a couple of clients. As soon as I start installing the SCE agent on a client machine, I get Event ID 4000 errors and the HealthService starts using very high CPU.

    I have installed the MS hotfix, but it didn't solve the problem: http://support.microsoft.com/?kbid=951526

    If any of you have an update on this thread, I would really appreciate it.

    Thanks,
    Shn
    Monday, February 2, 2009 3:59 PM
  • Hi,

    I'm also waiting for an update to this thread. Richard referred me here after I posted a thread with similar problems. At the moment I can only discover one of our SNMP devices without causing the Health Service to crash.

    If I can provide any information that might help, I'm happy to do so.

    Regards,
    Andy.

    Tuesday, February 3, 2009 12:02 AM
  • Thanks for responding Andy,

    I had a few network devices that were creating SNMP-related error messages in Event Viewer every time, right before the HealthService crashed. I stopped monitoring those devices and rebooted the server, but it didn't fix the problem.

    Did you get SNMP error messages in Event Viewer before the Health Service crashed when you started experiencing the problem?

    Thanks,
    Shn
    Tuesday, February 3, 2009 2:40 PM
  • Here is an update from my case. I stopped monitoring all the network devices in my SCE environment and rebooted the server. After the reboot, CPU usage is back to normal and the HealthService is pretty healthy :). Now I need to figure out which network device(s) cause this issue and keep those devices unmonitored until the next update/service pack/hotfix becomes available to fix the root of the problem.

    Shn
    Tuesday, February 3, 2009 4:15 PM
  •  A short update - we have created a hotfix and are currently going through the release process to make it available to you as a download on Microsoft.com with an accompanying knowledge base article. 

    I expect we'll have the update and KB article available this month (the exact timing is difficult to predict). 

    Thanks for your patience - I realize this fix is taking a long time to appear.
    This posting is provided "AS IS" with no warranties, and confers no rights.
    Tuesday, February 3, 2009 7:07 PM
  • Hi,

    Currently have a case open with Microsoft. After providing more crash dumps on Event ID 4000 on MonitoringHost.exe than I care to remember, they have provided me with an updated System Center Essentials 2007 Network Device Monitoring Library MP.msi to test.

    I've had it installed for the last 3 days and it has not generated Event ID 4000 and therefore does not unload all of my rules.
    It does, however, flood my Ops Manager Event Log with these events:

    ID 11052
    Module was unable to convert parameter to a double value Original parameter: '$data/SnmpVarBinds/SnmpVarBind[1]/Value$' Parameter after $Data replacement: '' Error: 0x80020005 Details: Type mismatch. Instance name: 53.VLANxx

    ID 21405
    The process started at 9:10:36 AM failed to create System.PropertyBagData, no errors detected in the output. The process exited with 1 Command executed: "C:\WINDOWS\system32\cscript.exe" /nologo "UtilizationCalc.vbs" 0 0. 300 true Working Directory: C:\Program Files\System Center Essentials 2007\Health Service State\Monitoring Host Temporary Files 21\615\ One or more workflows were affected by this. Workflow name: Microsoft.SystemCenter.NetworkDevice.Interface.OutboundUtilizationPercentPerf Instance name: 11.FastEthernet11


    These have been reported back to Microsoft and am awaiting a reply.

    Thursday, February 5, 2009 1:16 AM
  • And also heaps of Event ID 101 constantly.

    UtilizationCalc.vbs : Script received a speed less than or equal to zero, and can not calculate utilization
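
The source of UtilizationCalc.vbs is not shown in this thread, so the following is only a sketch of the textbook interface-utilization formula, to illustrate why a reported speed of zero or less makes the calculation impossible: the speed is the divisor. The function name and parameters are illustrative, not taken from the Management Pack.

```python
# Sketch: textbook interface utilization, NOT the actual UtilizationCalc.vbs logic.
# utilization % = (octets transferred * 8 * 100) / (interval in seconds * speed in bits/s)


def utilization_percent(delta_octets, interval_seconds, speed_bps):
    """Return interface utilization as a percentage of the reported speed."""
    if speed_bps <= 0:
        # Mirrors the Event ID 101 message: no utilization without a positive speed.
        raise ValueError("speed must be greater than zero to calculate utilization")
    if interval_seconds <= 0:
        raise ValueError("interval must be greater than zero")
    return (delta_octets * 8 * 100) / (interval_seconds * speed_bps)


if __name__ == "__main__":
    # 3.75 GB transferred over 300 seconds on a 1 Gb/s interface.
    print(utilization_percent(3_750_000_000, 300, 1_000_000_000))
    try:
        utilization_percent(0, 300, 0)
    except ValueError as error:
        print("cannot calculate:", error)
```

If a device reports an interface speed of 0 (or the value is lost in conversion, as the Event ID 11052 type mismatch above suggests), any calculation of this shape has nothing to divide by.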

    Thursday, February 5, 2009 1:24 AM
  • Thanks for the updates.

    Shn, I don't get any SNMP errors in the event log before the health service crashes. I also find that the 4000 event ID appears almost immediately after discovering the device, if this helps you to work out which devices are causing the problem.

    Regards,
    Andy.

    Thursday, February 5, 2009 10:19 PM
  • Any update on how this hotfix is progressing?

    Thanks,
    Andy.
    Tuesday, March 10, 2009 10:02 PM
  • Hi

    Is there any hotfix available by now? I am also observing the same problem on many agents. It would be great if you could help with this issue.

    Thanks
    Obul
    obula
    Monday, March 16, 2009 11:55 PM
  • The updated Network Device Monitoring Library Management Pack is available to download from:
    http://www.microsoft.com/downloads/details.aspx?FamilyID=8200e405-f871-4f19-a991-0411285fcbe5&displaylang=en

    The related KB article (KB960569) and a listing in the System Center Essentials Management Pack catalog will appear in the next week or so.

    The new Management Pack is not upgrade compatible, which means that you will need to delete the existing Network Device Monitoring Library Management Pack before importing the new version. Information on how to delete the Management Pack is listed at the bottom of the download page.

    I did delay the release of this Management Pack since after installing it Hittin was still seeing errors and I wanted to know if it was due to the changes made in the Management Pack, or something else. It looks like the errors fall into the "something else" category and it is still being investigated. 

    The new Management Pack does stop the Health Service crashing with an Error 4000 event if Network Devices are returning interface speeds >2Gb/s.

    Thanks

    Richard
    Tuesday, March 24, 2009 3:18 PM
  • Hi,

    So now my custom MPs seem to be an issue, crashing the MonitoringHost.exe processes. It is related to OLEDB and Web Application custom monitoring created via Authoring/System Center Templates/.


    One DUMP appears to be looking for a Certificate, even though there is no certificate involved between the RMS Essentials server (watcher node) and the SQL Servers OLE connection strings because they are all in the same AD domain. They fail at this point.

    However yes, at the same time, the RMS is monitoring Workgroup Servers for which it is a CA, and distributed certificates to these servers for Mutual Authentication.

    Also, Web Application Templates freeze as well. No certificates apart from the local domain CA generated and that info comes through fine to the console from the installed agent. There's no SSL involved to complicate matters.

    Cause unknown, answer appears to be an X-File. Seriously, Mulder and Scully couldn't solve this one, not that they've solved much prior to this.






    Friday, April 3, 2009 2:05 PM
  • The problem was that having User Dump Process and SCE tracing enabled together caused MonitoringHost.exe to crash dump.
    There was nothing wrong with the custom MPs, OLEDB, or Web Application monitoring.
    Wednesday, July 22, 2009 3:56 AM
  • I had an error event DHCP event 1003. I ran the above fix which I extracted but received no response. Should I go to the extraction and redo it?

    Charlene


    Charlene

    Thursday, July 5, 2012 6:36 AM