Reinvoking script in management pack from SCOM 2007 R2

  • Question

  • Hello,

    I am developing a management pack that gets alerts from some servers. The basic implementation is this: I first fetch data from the server and maintain a cache file (XML), and all the discoveries and rules then pull data from this cache file and populate SCOM. The task that performs this operation (loading the cache directory) works perfectly when everything goes smoothly. The problem occurs when there is a power failure on those servers. The task is:

    <Tasks>
    	<Task ID="MP.$TemplateConfig/TypeId$.Proxy.LoadCache.Task" Accessibility="Public" Enabled="true" Target="MP.$TemplateConfig/TypeId$.Proxy" Timeout="300" Remotable="true">
    	 <Category>Custom</Category>
    	 <ProbeAction ID="ProbeAction" TypeID="$Reference/Self$MP.Proxy.LoadCache.ProbeAction">
    		<CacheClass>some parameters</CacheClass>
    		<LogingLevel>0</LogingLevel>
    		<SecureInput>$RunAs[Name="MP.$TemplateConfig/TypeId$.Proxy.SecureReference"]/Password$</SecureInput>
    		<TimeoutSeconds>120</TimeoutSeconds>
    		<TypeId>$Target/Property[Type="$Reference/Self$MP.Proxy"]/TypeId$</TypeId>
    		<Url>$Target/Property[Type="$Reference/Self$MP.Proxy"]/Url$</Url>
    		<UserName>$RunAs[Name="MP.$TemplateConfig/TypeId$.Proxy.SecureReference"]/UserName$</UserName>
    	 </ProbeAction>
    	</Task>
    </Tasks>


    In case of a power failure, SCOM tries to fire the script but gets no response from the servers. So I overrode IntervalSeconds on the MP objects (discoveries and rules) to 300 seconds, to fetch data more frequently. If the servers come back up within 300 seconds, I think (though I am not sure) this task loads the cache directory properly; but if the fail-over stretches to, say, 10 minutes, then SCOM won't fetch any data. Can somebody tell me why this happens? Is it due to TimeoutSeconds? At the moment I have to reconnect to those servers myself (done through a custom monitoring template); once the servers respond to ping, the script starts working again.

    So I want to know: is there a way to re-invoke this task or my script (by any method) to re-establish the connection, instead of doing it manually?
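    One way to get that re-invocation without a manual step is a retry loop inside the probe script itself. A minimal sketch in plain Python (a 2007-era probe would actually be VBScript, but the pattern is the same; `flaky_fetch` is a hypothetical stand-in for the real server call):

```python
import time

def fetch_with_retry(fetch, attempts=5, delay=2.0):
    """Call fetch(); on connection failure, wait and retry instead of
    letting the whole workflow fail on the first unreachable server."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionError as exc:
            last_error = exc
            time.sleep(delay * (attempt + 1))  # simple linear back-off
    raise last_error

# Hypothetical stand-in for the real fetch: fails twice, then succeeds.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("server not reachable")
    return "<cache>fresh data</cache>"

result = fetch_with_retry(flaky_fetch, delay=0)
```

    Only when every attempt fails does the error propagate; short outages are absorbed within a single workflow run instead of producing a string of failed runs.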

    I also want to know about <TimeoutSeconds> with respect to UnitMonitor, Task, Rule, and so on: does it mean the same thing in each?
    Regards, Ravi

    Tuesday, August 9, 2011 1:29 PM

Answers

  • The XML above is for your task. It does not run unless it is run manually.

    Your question is about why scheduled workflows (discoveries, monitors) stop running after several failures. The answer is that if a workflow fails a few times in a row, it is unloaded and will not run again until you restart the health service that is running those workflows.

    Making the discoveries run more frequently makes the problem worse, since you are increasing the number of times the workflows fail in a row. By changing the timing with your overrides, you are almost certainly causing the problem. If anything, fix the power, and don't change the timing.
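    One way to avoid the unload is for the script itself to absorb connection failures, so the workflow still completes when the server is down. A minimal sketch of the pattern in plain Python (not a SCOM API; `run_discovery` and `server_down` are hypothetical names, and whether empty discovery data is appropriate depends on your MP):

```python
def run_discovery(collect):
    """Run one discovery pass; if the server is unreachable, return an
    empty result instead of raising, so the workflow completes and the
    health service does not count a failure (and never unloads the
    workflow for failing repeatedly)."""
    try:
        return collect()
    except ConnectionError:
        return []  # no instances this cycle; try again next interval

def server_down():
    raise ConnectionError("power failure")

print(run_discovery(server_down))        # -> []
print(run_discovery(lambda: ["srv01"]))  # -> ['srv01']
```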

     

     


    Microsoft Corporation
    • Marked as answer by Ravi_Raj Tuesday, August 9, 2011 5:18 PM
    • Unmarked as answer by Ravi_Raj Wednesday, August 10, 2011 8:41 AM
    • Marked as answer by Yog Li Thursday, August 18, 2011 8:07 AM
    Tuesday, August 9, 2011 3:26 PM
  • The XML above has nothing to do with your discoveries. Changing any part of it will have no impact on discoveries that have stopped.

    Yes, they stop permanently. You have to restart the health service that runs them to get them running again. Also consider whether these are the agents on the computers that lost power: you cannot make those run by changing management pack configuration while the computers are turned off.


    Microsoft Corporation
    • Marked as answer by Ravi_Raj Tuesday, August 9, 2011 5:18 PM
    • Unmarked as answer by Ravi_Raj Wednesday, August 10, 2011 8:42 AM
    • Marked as answer by Yog Li Thursday, August 18, 2011 8:08 AM
    Tuesday, August 9, 2011 3:37 PM

All replies

  • You just need to look at the data source or write action you are using to see what it means. For most MPs it simply means the amount of time a script or task is allowed to run.
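    The effect of that timeout can be sketched with an ordinary child process: a hard cap on run time, after which the script is killed and the run counts as a failure. A small illustration in Python (not SCOM code; the 5-second child against a 1-second budget is analogous to a workflow whose TimeoutSeconds is smaller than its run time):

```python
import subprocess
import sys

timed_out = False
try:
    # Child "script" that needs 5 seconds, run under a 1-second budget.
    subprocess.run([sys.executable, "-c", "import time; time.sleep(5)"],
                   timeout=1)
except subprocess.TimeoutExpired:
    timed_out = True  # child was killed; the workflow would report a failure
```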


    Rob Korving
    http://jama00.wordpress.com/
    Tuesday, August 9, 2011 3:21 PM
  • I get what you mean, but how can I sync the timings for all these events?

    What happens after TimeoutSeconds expires? Will the script stop permanently? How can I re-invoke it?


    Regards, Ravi
    Tuesday, August 9, 2011 3:33 PM
  • OK, I got the method, but now I am getting redundant data from the cache.

    Since SCOM pulls data from the cache directory, the behavior is that every time the discovery fetches data from the server, it replaces the old files there with new values. This is great.

    Now, during a fail-over the server does not respond to ping, and the discovery logs "Microsoft.XMLHTTP Send: The operation timed out" in the event log; but the old cache data is still there, so SCOM continues to fetch data from the cache, giving the false impression that the server is up and running.

    So I want to implement some way for SCOM to tell me that the server is down and not responding to ping.

    Again, how do TimeoutSeconds and IntervalSeconds come into play here?
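    One way to make a stale cache visible, assuming you are free to add a freshness check to the script that reads it, is to compare the cache file's modification time against a threshold. A sketch in plain Python (`MAX_CACHE_AGE` is a hypothetical value; in practice it should exceed your IntervalSeconds):

```python
import os
import tempfile
import time

MAX_CACHE_AGE = 600  # seconds -- hypothetical threshold

def cache_is_fresh(path, now=None, max_age=MAX_CACHE_AGE):
    """Treat the cache as valid only if it was rewritten recently;
    a stale file means recent fetches from the server have failed."""
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) <= max_age

# Demo: a freshly written cache file is fresh now, stale an hour later.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as f:
    f.write(b"<cache/>")
fresh_now = cache_is_fresh(f.name)
stale_later = cache_is_fresh(f.name, now=time.time() + 3600)
```

    A stale result can then be reported as "server down" instead of silently serving old data.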


    Regards, Ravi
    Wednesday, August 10, 2011 8:52 AM
  • Neither. Use the agent heartbeat feature (enabled by default) to detect agents that are not available when a server is down.
    Microsoft Corporation
    Monday, August 15, 2011 3:11 PM
  • Hi Ravi,

    Can you let me know how you are pulling data from the XML?

    Wednesday, August 17, 2011 2:51 AM