SCOM2012R2 - high count of omiagent processes on Linux (>100)

  • General discussion

  • Hi all,

    We just upgraded one of our SCOM environments from 2012 SP1 to 2012 R2 UR5. Since then I have found servers with a high count of omiagent processes, and I wanted to know whether that is normal.

    Agent Version: 1.5.1-150 (Labeled_Build - 20150121L) running on Suse Linux Enterprise Server 11 SP2.

    scomuser@justaserver:~> ps -ef |grep omi | wc -l
    162
    
    scomuser@anotherserver:~> ps -ef |grep omi | wc -l
    271
    
    Looks something like this:
    
    scomuser 28725 18387  0 Mar15 ?        00:02:18 /opt/microsoft/scx/bin/omiagent 10 13 --destdir / --providerdir /opt/microsoft/scx/lib --idletimeout 90 --loglevel WARNING
    scomuser 29188 18387  0 Mar15 ?        00:02:32 /opt/microsoft/scx/bin/omiagent 10 13 --destdir / --providerdir /opt/microsoft/scx/lib --idletimeout 90 --loglevel WARNING
    scomuser 29298 18387  0 Mar14 ?        00:03:14 /opt/microsoft/scx/bin/omiagent 10 13 --destdir / --providerdir /opt/microsoft/scx/lib --idletimeout 90 --loglevel WARNING
    ...

    Monday, March 16, 2015 7:24 AM

All replies

  • Typically not normal. What happens if you stop the agent [scxadmin -stop]? Do all omiagents shut down? If you restart the agent [scxadmin -start], do they all start back up again? Typically you should only see one or two omiagent daemons running at any given time, but it is possible the agent is not shutting down properly and is leaving instances in a hung state. Do you see any errors in the logs under /var/opt/microsoft/scx/logs?
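    A hedged sketch of those checks from the shell (the log directory name is taken from the post above and may differ slightly between agent versions):

    pgrep -c -f /opt/microsoft/scx/bin/omiagent                  # count omiagent processes before/after scxadmin -stop/-start
    grep -iE 'error|warn' /var/opt/microsoft/scx/log*/*.log | tail -n 50   # recent agent errors and warnings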

    I'm assuming you did not experience this when you were running SCOM 2012 SP1, correct? When converting the agents from SP1 to R2, did you do an upgrade or a new install of the agent? I'm just trying to get an idea of how I can repro this.

    Thanks,

    -Steve

    Monday, March 16, 2015 2:19 PM
    Moderator
  • Hi Steve,

    After stopping the agent, only the active processes shut down (omiserver and two omiagents). The rest of the hundreds of processes stay in the process list, and I have to kill them with "killall /opt/microsoft/scx/bin/omiagent". After starting the agent again, there are only three processes, but I don't know yet for how long; I will have to wait and check again.
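    For reference, the manual cleanup described above as a hedged sketch (the scxadmin path is assumed to be the usual tools directory; run as root):

    /opt/microsoft/scx/bin/tools/scxadmin -stop      # stops omiserver and the active omiagents
    killall /opt/microsoft/scx/bin/omiagent          # removes the orphaned omiagent processes
    /opt/microsoft/scx/bin/tools/scxadmin -start     # restarts the agent
    pgrep -c -f /opt/microsoft/scx/bin/omiagent      # should be back down to only a few processes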

    I have now deleted all the old log files and will check later whether the problem recurs. What I have seen so far is a lot of entries in omiserver.log like these:

    2015/03/16 16:13:28: WARNING: lost connection to agent running as [3733]
    2015/03/16 16:28:32: WARNING: lost connection to agent running as [3733]
    2015/03/16 16:58:31: WARNING: lost connection to agent running as [3733]

    And in omiagent.root.root.log:

    2015/03/16 16:13:28: ERROR: Error on read for socket 10: Connection reset by peer
    2015/03/16 16:28:32: WARNING: _WriteV - Error on writev for socket 10: Broken pipe
    2015/03/16 16:58:31: WARNING: _WriteV - Error on writev for socket 10: Broken pipe

    Correct, I did not experience this with SCOM 2012 SP1. I updated the agents by pushing them through the console; it was not a new install. What I can check is removing the agent completely from one of the servers, doing a new install, and seeing whether the problem still occurs.

    I will get back to you as soon as I have more information.

    Monday, March 16, 2015 4:11 PM
  • Hi,

    After removing the RPM package from a server and doing a new install, the problem has not appeared again so far. I have watched the behaviour for about 5-6 hours now and everything seems to work fine.
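    A hedged sketch of that clean reinstall (the package name reported by rpm is an assumption; confirm what is actually installed on your systems first):

    rpm -qa | grep -i scx      # confirm the installed agent package name and version
    rpm -e scx                 # remove the agent package
    # ...then redeploy the agent from the SCOM console (or install the matching
    # agent bundle manually) and keep an eye on the process count:
    watch -n 60 'pgrep -c omiagent'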

    Maybe I can isolate this more...

    Tuesday, March 17, 2015 1:32 PM
  • It most likely has something to do with the upgrade process, as SP1 uses the old OpenPegasus agents and R2 now uses the new OMI agents. My guess is that something got hung during the upgrade, which caused the omiagent to hang and not exit properly. If you want to try to isolate it, that is where I would start, since a clean install instead of an upgrade seems to work.

    Regards,

    -Steve

    Tuesday, March 17, 2015 1:44 PM
    Moderator
  • I've seen the same thing occur; however, the frequency of the occurrence makes it hard to know when you have resolved it, as it can take days or weeks to reoccur. I can confirm it only appears to happen to a small number of servers, and all were updated exactly the same way.

    The omiserver and omiagent logs both point to a break in connectivity, as you mentioned.

    A workaround would be simple: put in a process monitor. As up to 4 omiagents appears normal, set it to alert at 5. Either have someone respond to the alert, or set up a recovery to SSH/WinRM in and fix the issue.
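    A minimal sketch of such a check, assuming up to 4 omiagents is normal and anything above 5 should alert (the threshold and the binary path are assumptions to adjust for your environment):

    #!/bin/sh
    # Alert when more omiagent processes exist than expected.
    THRESHOLD=5
    COUNT=$(pgrep -c -f /opt/microsoft/scx/bin/omiagent)
    if [ "$COUNT" -gt "$THRESHOLD" ]; then
        echo "WARNING: $COUNT omiagent processes running (threshold $THRESHOLD)"
        exit 1
    fi
    echo "OK: $COUNT omiagent processes running"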

    I'm quite keen to hear if you find anything else out.

    Thursday, March 19, 2015 12:22 AM
  • I have the same issue and have been working with MS on this for 3 weeks now.

    It looks like there is a bug in the agent: if the output returned by the UNIX script/command is bigger than a certain size, you see more and more omiagent processes.

    I will update this thread as soon as I have something from MS.

    Thursday, March 19, 2015 3:30 PM
  • Hopefully it takes less time than the support case we had for an issue with OpenPegasus and its resource usage; after a couple of months they still had no idea, and we finally updated to R2, only to eventually hit this issue.

    It's clear something is breaking the communication between the agent processes, just not what; sadly, tracing fills disks too quickly to try to capture this when it occurs.

    It's nice to know that updating to the latest agent won't resolve the issue. Updating the UNIX management packs is a pain, but that's another story.

    Sunday, March 22, 2015 10:45 PM
  • We have verified that the issue originally posted is being caused by commands/scripts in custom monitors that return >64K of data. There is a 64K limit on any instance size handled by OMI, and command/script invocations return a single instance of output. Thus, only 64K of data for the total instance is supported, including StdOut + StdErr + ReturnCode. This is a bug in that we should not be leaving a dead omiagent process around when this happens. We intend to fix that bug, but it should be noted that we are not planning to change the 64K instance size limit; so while the proposed fix would avoid the issue of numerous omiagent processes, it would not make command/script monitors or rules functional if they return >64K of data.

    If you are experiencing this issue, you should dig into your command/script workflows, see what they are returning, and reduce the output to <64K.
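    A quick, hedged way to check how close a custom monitor's command gets to that limit (your_monitor_script.sh is a placeholder for whatever command the workflow actually runs):

    ./your_monitor_script.sh > /tmp/monitor_out 2>&1
    wc -c /tmp/monitor_out    # anything approaching 65536 bytes risks hitting the 64K instance limit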

    Regards,

    -Steve

    Monday, March 23, 2015 5:08 PM
    Moderator
  • Hi Steve,

    Are you planning to have a way to customize the 64K limit?

    And how will we detect whether the data is being truncated?

    Thanks,

    Marius

    Wednesday, March 25, 2015 4:46 PM
  • And what happens if, at most, you return a single line to SCOM for any monitor/rule where you implement a custom script?

    I don't imagine that will ever break 64K, even with an excessively large error message written to StdErr.

    This is nice to know for some of the diagnostics I'm looking at implementing, but currently nothing even remotely that large is returned by any of the custom UNIX monitors/rules we have put in place, as everything is designed to return only a single value or a single line of output.

    Would this also be an issue for WinRM calls and the values they return, or is it purely script execution?

    Monday, March 30, 2015 11:26 PM
  • The reason I ask about WinRM is that some of these are quite large; e.g. process monitoring on one of the servers where this occurs returns 249,216 bytes when you enumerate the SCX_UnixProcess provider. If this is also limited, I'm going to have lots of problems.
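    A rough, hedged way to see how much data that enumeration produces locally on the agent, assuming the omicli test tool shipped with the R2 agent is present under /opt/microsoft/scx/bin/tools:

    /opt/microsoft/scx/bin/tools/omicli ei root/scx SCX_UnixProcess | wc -c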
    Monday, March 30, 2015 11:54 PM
  • The same thing happened to me. What I did was:

    - Stop the scx-cimd service (service scx-cimd stop)

    - Kill each of the omiagent processes

    - Restart the scx-cimd service (service scx-cimd start)

    - Bring the omiagent back up (omiagent --loglevel 0)

    With this, CPU usage that had been at 100% dropped to roughly 5%.

    Thursday, August 27, 2015 9:01 PM
  • [Quoting Steve's reply above on the 64K OMI instance-size limit.]

    Hi Steve,

    Is there any news on this matter?

    Thanks

    Tommy

    Thursday, November 26, 2015 8:34 AM
  • I also found that the only monitors that exceeded the 64K response were the built-in ones, especially the process monitor. All of the ones I create return at most a one-line string, so they don't trigger this limitation (i.e. they all return a lot less than 1K).

    Monitoring of processes always seems to be correct; we just seem to have these omiagent processes stall, and only sometimes, with no guarantee of triggering the issue. To avoid wasting resources we simply monitor for this; otherwise you can end up with well over 100 of these omiagents hanging around in no time.

    A simple process monitor with a minimum of 0 and a maximum of 5 omiagents covers normal functionality but alerts as soon as you go above the maximum needed, in my experience monitoring around 500 UNIX servers of various OS flavours and functions.

    Then either an operator or a recovery task can address the issue (simply pkill the omiagent processes; the server process will create new ones, and there seems to be no real negative impact overall). 99% of the time this addresses it and you'll see nothing more occur; the remaining 1% may just need time to settle down or further investigation. I had one do this on Friday: it kept spawning more processes for 12 hours and then simply started to behave again. I was expecting to have to investigate further today.
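    A hedged sketch of such a recovery task (the threshold and path are assumptions; omiserver recreates the agent processes it needs after the kill):

    #!/bin/sh
    # Kill stray omiagent processes when the count exceeds the expected maximum.
    if [ "$(pgrep -c -f /opt/microsoft/scx/bin/omiagent)" -gt 5 ]; then
        pkill -f /opt/microsoft/scx/bin/omiagent
    fi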

    Regards

    Dwayne


    Sunday, November 6, 2016 10:34 PM
  • Well, there was bound to be someone still using 2012 pre-R2 in 2015, and maybe even a year on.

    The CIM agent is, to put it nicely, rubbish.

    It will go high CPU at the drop of a hat.

    I would recommend updating to 2012 R2 at minimum so you get the OMI provider. While it does give you the odd stray process, it won't break your production environment in a matter of seconds.

    If you can't upgrade, simply nice the CIM processes so they can't take more than a small share of the CPU time. In that state you can monitor for the agent using more than x% and generate alerts; without nicing the process, it will use so much CPU that you can't monitor its usage at all.
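    A hedged sketch of the nice approach (the scxcimserver/scxcimprovagt process names are assumptions based on the SP1-era Pegasus agent, so check what ps shows on your systems; note that renice only lowers scheduling priority rather than enforcing a hard percentage cap):

    # Drop the old CIM daemons to the lowest scheduling priority.
    renice -n 19 -p $(pgrep scxcimserver) $(pgrep scxcimprovagt)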

    We had a Microsoft support case open for this just after R2 came out. It pushed the business to approve the upgrade to R2, and the associated risk, and it was not a smooth upgrade for us.

    Given our non-trivial environment (29 monitoring servers), there were all sorts of things we needed to do to upgrade successfully to R2. Built-in, hard-coded time limits caused the database upgrades to keep rolling back, for instance, and we had to change the default maximum to 18 hours. Microsoft just said to delete data from our database and did not supply the registry key needed to let the upgrade complete without loss of data. In fact, we replicated the SQL cluster, and the clean-up scripts took longer to run than it took us to find that registry key, change it, and perform the upgrade successfully. A very disappointing support case.

    Regards

    Dwayne

    Sunday, November 6, 2016 10:55 PM