SCOM alert rule throws an exception when a lot of alerts occur

  • Question

  • Hello,

    I have a SCOM alert rule that uses the event log as its data source. Everything works in the normal case.

    But when a lot of alerts (event log records) are generated in a short time, SCOM cannot sync correctly. When I check the "Operations Manager" event log, it says:

    - SCOM2016: "Data was dropped due to too much outstanding data in rule ...."

    - SCOM2019: 

    "

    A rule has generated 50 alerts in the last 60 seconds.  Usually, when a rule generates this many alerts, it is because the rule definition is misconfigured.  Please examine the rule for errors. In order to avoid excessive load, this rule will be temporarily suspended until 2020-03-30T17:15:58.1712691+08:00. 
    Rule: Alarm.To.SCOM.Alert.Rule.1
    Instance: mock.xxxx.com 
    Instance ID: {6861ACE6-8026-572A-4AA1-0D6CA0C01303} 
    Management Group: scom2019

    "

    I tried to separate the one rule into multiple rules (using a filter), but it seems that all rules with the same target share the limit count. How can I fix this? Reducing the number of alerts from the event log source is not an option (there really are that many alerts sometimes).

    What is SCOM's strategy for limiting how many alerts a rule can generate? I could not find any documentation about it.

    Thanks




    Friday, March 20, 2020 10:34 AM

All replies

  • What kind of event is it picking up that is so frequent? It sounds like you are getting an alert storm from it, which is overloading the data flow.

    If it's a genuine alert then you may have to override/disable the rule for a while and sort out why it's coming up so many times. Or you can reconfigure the rule as a repeated event rule/monitor and specify a time window or a number of occurrences before an alert gets created, which should reduce the amount of data you are getting.


    Website: www.walshamsolutions.com Technical Blog: https://www.walshamsolutions.com/technical-blog Personal Blog: https://www.walshamsolutions.com/personal-blog Twitter: Dwalshampro

    Friday, March 20, 2020 11:05 AM
  • If it's a genuine alert then you may have to override/disable the rule for a while and sort out why it's coming up so many times.

    As I said, "reducing the event log source alert count is not an option". We manage more than 10,000 machines, and we have a separate service running on the SCOM server that syncs the events and alarms of those machines.

    Or you can reconfigure the rule as a repeated event rule/monitor and specify a time window or a number of occurrences before an alert gets created, which should reduce the amount of data you are getting

    I'm not sure what you mean. I tried to separate the one rule into many rules, each targeting a different event ID. Generally, each one looks like:

                    <Expression>
                      <SimpleExpression>
                        <ValueExpression>
                          <XPathQuery>EventDisplayNumber</XPathQuery>
                        </ValueExpression>
                        <Operator>Equal</Operator>
                        <ValueExpression>
                          <Value Type="UnsignedInteger">${Different-event-id}</Value>
                        </ValueExpression>
                      </SimpleExpression>
                    </Expression>


    But it does not seem to work; I still get the same error.

    What is the SCOM rule limit strategy? I could not find any documentation about it.

    • Edited by Qianbiao.NG Friday, March 20, 2020 11:51 AM
    Friday, March 20, 2020 11:45 AM
  • If you have 10,000 machines generating this much data then something has to be filtered to some degree, or you will constantly have configuration churn and eventually impact the Data Warehouse as well.

    What kind of event ID or event are you trying to monitor that is making so many alerts? Does the event you are trying to monitor come up constantly by default? Because if these are genuine error events and they are coming in this frequently, you will have to address them, or at least address the machines that are making the most noise.

    I am saying you can specify a count of how many events must occur within a certain time and generate an alert based on that, e.g. if I get 5 in one hour.

    I'm not sure if there is a specific limit on rule alert generation; it might be more about how much a rule generates within a certain interval, but I'm not certain whether this is documented either.
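
    The counting idea can be sketched in a few lines of Python. This is only an illustration of the logic, not SCOM's implementation or its rule configuration; the class name `RepeatedEventDetector`, its parameters and the event keys are hypothetical, and timestamps are plain seconds.

```python
from collections import defaultdict, deque


class RepeatedEventDetector:
    """Signal once `threshold` events with the same key fall inside `window` seconds."""

    def __init__(self, threshold=5, window=3600.0):
        self.threshold = threshold
        self.window = window
        # One queue of recent timestamps per event key (hypothetical grouping).
        self.seen = defaultdict(deque)

    def observe(self, key, now):
        """Record one event; return True when an alert should be raised."""
        stamps = self.seen[key]
        stamps.append(now)
        # Forget events that have slid out of the time window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) >= self.threshold:
            stamps.clear()  # start counting again after alerting
            return True
        return False


detector = RepeatedEventDetector(threshold=5, window=3600.0)
results = [detector.observe("disk-error", t) for t in (0, 600, 1200, 1800, 2400)]
```

    With these hypothetical values, only the fifth event inside the hour produces an alert, so five raw events collapse into one.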




    • Edited by Dwalsham Friday, March 20, 2020 12:07 PM more words
    Friday, March 20, 2020 12:05 PM
  • We have a lot of machines to manage, and those machines generate events/alerts such as "add device", "remove device", CPU temperature too high, memory not present, disk not healthy, etc.

    Currently, we limit the generation rate with a rate limiter that generates at most 50 alerts in any one-minute window (basically, it uses a list of length 50 to store the generation times of the most recent alerts). But this is not effective, and we want to improve the performance. Also, sometimes SCOM does not run the rule for minutes, then SCOM gets more than 50 alerts in the next minute, and the error still occurs.

    We hope we can get the details about the SCOM rule limit.
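
    A minimal sketch of the kind of limiter described above, assuming a sliding one-minute window and a cap of 50. The class and method names are hypothetical, and this is not the actual service code. Note that it only limits how fast events are written; it cannot control when SCOM reads them, which is the second problem mentioned above.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` events in any `window`-second span."""

    def __init__(self, limit=50, window=60.0):
        self.limit = limit
        self.window = window
        self.stamps = deque()  # times of the most recently allowed events

    def allow(self, now):
        """Return True and record `now` if another event may be emitted."""
        # Drop timestamps that are no longer inside the window.
        while self.stamps and now - self.stamps[0] >= self.window:
            self.stamps.popleft()
        if len(self.stamps) < self.limit:
            self.stamps.append(now)
            return True
        return False


limiter = SlidingWindowLimiter(limit=50, window=60.0)
# 80 events arriving within the same second: only the first 50 pass.
allowed = sum(limiter.allow(0.0) for _ in range(80))
```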

    Friday, March 20, 2020 12:46 PM
  • Hi,

    Which event ID are we monitoring? Do all the events show the same parameters? For example:

    EventID=4506
    Severity=Error
    Message=Data was dropped due to too much outstanding data in rule "%2" running for instance "%3" with id:"%4" in management group "%1"

    If one of the parameters differs, we may create several rules based on that criterion to reduce the number of records generated in a short period of time.

    Hope the above information helps.

    Regards,

    Alex Zhu
    -----------------------------------------------
    Monday, March 23, 2020 1:04 AM
  • Hello Alex,

    It seems one alert causes all the related rules to fail. I have separated the rule into multiple rules, grouped by event ID.

    For example, a rule targets the instance type "baremetal server", and I create 10 rules with different "EventDisplayNumber" values for this "baremetal server".

    But in fact, no matter which event ID causes the "data dropped" error, all 10 rules raise the same error. It seems SCOM counts the limit based only on the target type ("baremetal server")? This is unreasonable. If the limit applied per target instance, everything would be OK.

    Tuesday, March 24, 2020 7:29 AM