HealthServices.exe consuming high CPU (40-60%) on cyclic peaks

    Question

  • Greetings,

    There are already quite a lot of messages on this matter, including on these forums (see this topic, where I posted a reply). I decided to start a new thread because I think my issue actually has nothing to do with cscript and seems to be slightly different. Here we go:

    Since we upgraded to SCOM 2007 R2, a new critical error has popped up on a few of our servers:
    Entity Health -> Performance -> Windows Local Application Health Rollup -> Performance - [server FQDN] -> Health Service Performance - [server FQDN] -> Agent processor utilization - [server FQDN]

    Related alert name is: The Operations Manager agent processes are using too much processor time.

    I think it is not related to 2007 R2 itself, but rather to the fact that the upgraded Management Packs now detect this condition with their new settings.

    Our environment is a Windows 2008 domain (not R2 yet) with SCOM 2007 R2. The servers experiencing this issue are SharePoint 2010 servers (Application and Web Front Ends, not the SQL ones) and a remote RODC. They are all running Windows Server 2008 R2 (but as I said, the domain is still "R1").

    I followed all the instructions & suggestions on the Knowledge tab of the error, but it didn't help me much:

    • KB968967 (MSXML 6.0) is auto-updated with Windows Update and it appears that Win 2008R2 already has it fixed. I tried applying the fix anyway (the one downloaded from the official site, as well as the one I found on SCOM 2007 R2 media) but it didn't let me ("cannot be applied").
    • Cscript: we no longer have any Win2K3 or earlier servers, so our systems have no issue with the cscript version.
    • Agent version is 6.1.7221.0 which seems to be up to date.
    • Effective Configuration Viewer: everything seems to be quite normal. Of course, the SharePoint 2010 servers have a lot of installed SCOM components, but nothing out of the ordinary, as far as I can tell.
    • Review actual CPU usage: there is a link that brings me to the Agent Performance View. There, I compared the faulty servers with others, and I noticed a huge difference in the graphs:

    Example 1: SharePoint 2010 Application server and 1 Web Front End. Note the units on the left: CPU often goes to 40-60%, on a quite regular basis. If the rule's criteria are met (6 consecutive samples in Critical state, >20%), which happens a LOT, the alert fires.

     Example 2: an Exchange 2007 server and a MOSS 2007 Web Front End. Everything seems OK and acceptable: never higher than 17%, and most of the time quite low. I picked these servers on purpose: we have a very active Exchange environment, and the MOSS 2007 environment runs a website visited by thousands of visitors.

     I would guess that the SharePoint 2010 Management Pack is the problem, but as I said earlier, I have the exact same problem on a remote RODC, which has nothing to do with SharePoint.
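    As a rough illustration of the rule described in Example 1 (6 consecutive samples above 20%), here is a minimal Python sketch. The function name, its defaults and the sample values are made up for the example; it is not how the management pack itself implements the monitor.

```python
from typing import Iterable


def alert_fires(samples: Iterable[float],
                threshold: float = 20.0,
                consecutive: int = 6) -> bool:
    """Return True if `consecutive` samples in a row exceed `threshold`."""
    run = 0
    for value in samples:
        # Extend the current run of high samples, or reset it.
        run = run + 1 if value > threshold else 0
        if run >= consecutive:
            return True
    return False


# Hypothetical agent CPU samples in percent (not read from the graphs).
samples = [45, 52, 38, 41, 60, 47, 12, 8]

print(alert_fires(samples))                  # with the 20% threshold
print(alert_fires(samples, threshold=50.0))  # with a raised threshold
```

    Raising the threshold or the number of consecutive samples makes the alert fire less often, but it does nothing about the CPU peaks themselves.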

    So my questions are:
    - Maybe these values are normal and acceptable under certain circumstances. Should I override the settings for these servers and set the threshold to something like 50%, or require more consecutive samples before alerting?
    - If not, how can I push the investigation further and reduce this CPU consumption?

    Thanks for reading,

     


    Monday, March 28, 2011 10:13 AM

Answers

  • Spin up procmon on the affected agents, and filter cscript.exe.  Correlate the high cpu with the scripts that are running in that timeframe.  Find the workflows executing those scripts and offset the interval by at least 30 seconds on each of them.  Try to stagger the intervals as much as possible.  If there are no other apparent issues on the agent, then this is the best way to effectively spread these script-based workflows over time so the computer can handle it better.
    HTH, Jonathan Almquist - MSFT
    Tuesday, April 05, 2011 3:00 AM
    Moderator
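    The staggering idea in the answer above can be sketched in Python as follows. This is only an illustration under stated assumptions: the script names and intervals are hypothetical, and how an offset is actually applied (for example through overrides) depends on what each workflow exposes.

```python
def stagger(workflows: dict[str, int], step: int = 30) -> dict[str, int]:
    """Give each workflow a start offset (in seconds), `step` seconds apart.

    `workflows` maps a workflow name to its run interval in seconds.
    """
    offsets = {}
    for i, (name, interval) in enumerate(sorted(workflows.items())):
        # Wrap around so that an offset never exceeds the workflow's own
        # interval. With many workflows this can produce repeated offsets.
        offsets[name] = (i * step) % interval
    return offsets


# Hypothetical script-based workflows and their intervals in seconds.
workflows = {
    "DiscoverSites.vbs": 3600,
    "CheckServices.vbs": 300,
    "CollectPerf.vbs": 300,
}

for name, offset in stagger(workflows).items():
    print(f"{name}: start {offset}s after the interval boundary")
```

    The point is simply that scripts which would otherwise all start on the same interval boundary get spread over time.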

All replies

  • To get a sense of what is on those SharePoint 2010 agents, try running the running workflow report.  If you have set up the MP to monitor all sites and you have a lot of sites, you could have a high workflow count.

    The SharePoint 2010 MP does a lot of scripted work - see if your discoveries can be tuned back to once a day.  Look through the running workflow report and see if you are comfortable shutting off any of the workflows.  Do you also have the IIS MP on these servers?  Try making IIS not discover sites.


    Microsoft Corporation
    Monday, March 28, 2011 3:47 PM
  • Hello Dan,

    Many thanks for your answer.

    I'll try to run the workflow report as you suggest. We don't have that many sites, plus the load isn't that high because SharePoint 2010 doesn't host our main sites (they are still on 2007 at the moment).

    We have the IIS MP on those servers indeed. I'll see what I can do about IIS sites ;)

    Anyway, I'll keep you informed, thanks again for the suggestions !


    Bix Belgium
    Monday, March 28, 2011 3:51 PM
  • Dan,

    I've searched through the whole console and I cannot find the "running workflow report". Searching for this string on the Internet only gives ... links to TechNet posts you have written :)

    The only report I find with "workflow" in its name is in Reporting -> System Center Core Monitoring Reports -> "Data Volume by Workflow and Instance". I ran it, but considering the results, I suppose it's not the one you mentioned?

    Can you tell me where I can find this report and/or how to build it? Please excuse my limited experience with Reports and the probably dumb question.

    Meanwhile, I've noticed that another server is suffering from the "Agent processor utilization" alert: another Win2K8R2 server with the following roles: DHCP, DFSN and KMS.

    That leaves us with the SharePoint 2010 server, a remote RODC and this DHCP/KMS/DFSN server, and one of them with a CPU utilization graph similar to "Example 1" above, i.e. frequent peaks to 40-60 and rare peaks to 80-100.

     

     


    Bix Belgium
    Friday, April 01, 2011 12:31 PM
  • Monitoring --> management pack "Operations Manager", Agent Health view: click on the agent in the view and look at the tasks link. You will see the running workflow report.

     


    Microsoft Corporation
    Friday, April 01, 2011 3:52 PM
  • I see something similar under "Agents By Version" called "Show Running Rules and Monitors for this Health Service". Is that it?
    Friday, April 01, 2011 4:31 PM
  • I didn't find any task called "Running workflow", but rather, as andyinsdca suggests, "Show Running Rules and Monitors for this Health Service". Within this report there is a "Total Workflows running" value, so I suppose we're on the right track!

    I took 4 types of server in my domain, executed this task on them, and wrote down the "Total Workflows running" value:

    1. A server in critical state because of the Agent CPU utilization (SharePoint 2010 WFE server) : 1134
    2. A server among those usually in critical state but which was not in Critical state when the report was made (SharePoint 2010 Application Server) : 1488
    3. The same, but not in the SharePoint 2010 environment (Remote RODC) : 1262
    4. 2 servers with low Agent CPU utilization :
      - MOSS 2007 Web Front End : 1055
      - Exchange server : 1328

    If I'm reading the correct value, I don't think it explains the differences in CPU utilization between those servers...
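    A quick Python check of the counts listed above, grouping the servers by whether they usually show the alert. The grouping and the variable names are just for illustration.

```python
from statistics import mean

# "Total Workflows running" values as reported in the list above.
affected = {
    "SharePoint 2010 WFE": 1134,
    "SharePoint 2010 Application Server": 1488,
    "Remote RODC": 1262,
}
unaffected = {
    "MOSS 2007 Web Front End": 1055,
    "Exchange server": 1328,
}

affected_mean = mean(affected.values())
unaffected_mean = mean(unaffected.values())

print(f"affected mean:   {affected_mean:.1f}")
print(f"unaffected mean: {unaffected_mean:.1f}")

# How many affected servers run fewer workflows than the busiest unaffected one?
fewer = [name for name, n in affected.items() if n < max(unaffected.values())]
print(f"affected servers below the busiest unaffected server: {len(fewer)}")
```

    The two groups overlap: the unaffected Exchange server runs more workflows than two of the three affected servers, so the workflow count alone does not separate them.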

    As for the IIS lead, all SharePoint servers do have IIS and I'll give it a look, but I can already say that the remote RODC (which suffers from the same processor issue) doesn't have IIS installed.

    Argh, this is driving me crazy. I'm about to override the rule and set the threshold for agent CPU utilization to 60% on all the faulty servers, but this is so lame ... :-/

     

     


    Bix Belgium
    • Edited by Bixessss Monday, April 04, 2011 12:05 PM typos
    Monday, April 04, 2011 12:05 PM
  • Hi Bix,

    As this thread has been quiet for a while, we assume that the issue has been resolved. At this time, we will mark it as "Answered" as the previous steps should be helpful for many similar scenarios.

    In addition, we’d love to hear your feedback about the solution. By sharing your experience you can help other community members facing similar problems.

    Thanks,


    Yog Li
    Wednesday, April 13, 2011 6:11 AM
    Moderator
  • We get this as well on our Windows 2008 R2 agents, but not on our 2003 or 2008 monitored servers. Others on these forums have reported similar 'too much processor time' warnings on their 2008 R2 servers. To me, the common factor appears to be 2008 R2, so perhaps a bug in SCOM or Windows 2008 R2?

    Wednesday, April 13, 2011 12:16 PM
  • Hello,

    Sorry, I have been sick for the last week and haven't been able to get back to the issue.

    Thank you Jonathan for your solution, I just gave it a try:

    Using procmon, I captured events for 1 minute, filtering the process name to cscript.exe. Here are the numbers I get:

    • On a server usually suffering from heavy agent cpu utilization but OK this time (remote RODC) :
      55771 / 180847 events (30%)
    • On a server currently in Critical State with a critical error about Agent CPU Utilization (SharePoint 2010 WFE) :
      6695 / 115347 events (5%)
    • On a server not suffering from agent high cpu utilization (Exchange 2010) :
      14213 / 1374264 events (1%)
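    For reference, the shares above can be recomputed from the raw counts with a few lines of Python. The labels are shortened descriptions of the three servers; the counts are the ones listed.

```python
# (cscript.exe events, total events) per 1-minute capture, as listed above.
captures = {
    "Remote RODC (usually affected, OK during capture)": (55771, 180847),
    "SharePoint 2010 WFE (Critical during capture)": (6695, 115347),
    "Exchange 2010 (not affected)": (14213, 1374264),
}

# Share of cscript.exe events in each capture, in percent.
shares = {name: 100 * hits / total for name, (hits, total) in captures.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```

    To one decimal this gives 30.8%, 5.8% and 1.0%, so the figures in the list appear to be rounded down. Either way, the server that was in Critical state during the capture is not the one with the largest cscript.exe share.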

    So the numbers themselves don't sound promising... As you suggest, I suppose I should go through these thousands of events on a faulty machine, correlate them with their workflows and try to balance/stagger them...

    That's one hefty job, but I guess it's my only option?
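    Going through the events by hand could be shortened by exporting the capture to CSV and counting which script files cscript.exe touches. Below is a minimal Python sketch of that idea. It assumes the export has "Process Name" and "Path" columns (adjust the names to match your export), and the sample rows and script paths are invented for the example.

```python
import csv
import io
from collections import Counter


def count_script_paths(csv_file, process="cscript.exe",
                       extensions=(".vbs", ".js")):
    """Count events per script file touched by `process`.

    `csv_file` is any open text file (or file-like object) whose header
    row contains "Process Name" and "Path" columns (assumed names).
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        if row["Process Name"].lower() != process:
            continue
        path = row["Path"].lower()
        if path.endswith(extensions):
            counts[path] += 1
    return counts


# Invented sample rows standing in for a real export.
sample = io.StringIO(
    '"Time of Day","Process Name","PID","Operation","Path","Result","Detail"\n'
    '"10:00:01","cscript.exe","4120","ReadFile","C:\\Temp\\ScriptA.vbs","SUCCESS",""\n'
    '"10:00:01","cscript.exe","4120","ReadFile","C:\\Temp\\ScriptA.vbs","SUCCESS",""\n'
    '"10:00:02","cscript.exe","4188","ReadFile","C:\\Temp\\ScriptB.vbs","SUCCESS",""\n'
    '"10:00:02","svchost.exe","900","ReadFile","C:\\Temp\\ScriptB.vbs","SUCCESS",""\n'
)

for path, n in count_script_paths(sample).most_common():
    print(f"{n:5d}  {path}")
```

    For a real export, open the file with `open(path, newline="", encoding="utf-8-sig")` or whichever encoding the export uses. The scripts at the top of the list are the first candidates to map back to their workflows.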

    Or maybe, as Steve mentions in his reply, it is a "normal" issue on Win2008R2 servers?

    Thanks again for your time and your answer


    Bix
    Wednesday, April 13, 2011 2:53 PM