Some advice/suggestion on Management Pack architecture

  • Question

  • Hello,

    I would like help with the architecture of a management pack (I am very new to this field) and with improving my custom-developed MP.

    Scenario: I have developed a management pack (entirely script-based) that works by polling XML from a cache file, which the MP updates at a 10-minute interval. Everything, from the server hierarchy (populated in the Diagram View) to the alerts (faults in my case), is read from that cache file. Discovery works well, but faults (alerts) are collected by rules rather than monitors, because I fetch the fault definitions from an external source and populate them in SCOM.

    Also, there are 6 severity levels in my case, but I have to map them to three to be displayed in SCOM, so the rule sets the severity with this definition:

    <WriteActions>
      <WriteAction ID="GenerateAlert" TypeID="$Reference/Health$System.Health.GenerateAlert">
        <Priority>1</Priority>
        <Severity>$Data/Property[@Name='Severity']$</Severity>

    This works well in SCOM and everything runs smoothly. In the override wizard, the severity shows up like this:

    Problem:

    1) Overriding severity: In most cases people want to override the severity of a particular alert (other MPs all allow this). In my case, however, a single rule fetches all alerts for a particular component (class), across 6 different severity levels, so if I override the severity on this rule, every fault takes the overridden severity, which gives false information. I am not sure how to disable the severity override.

    I am looking for a better way to implement this.

    (In other MPs I have found that every definition lives in the MP and each rule/monitor produces one kind of alert, so anyone who wants to override it can do so without difficulty.)

    2) Monitors instead of rules: As I understand it, with XML polling the faults are gathered by a rule. But what if I have to implement this using monitors? Do I have to encode every fault definition in my MP? I have around 1400 faults, which is a big task. For example, one server component (say, the adapter) alone has 200 different kinds of faults. How can I implement this with monitors and trigger them in SCOM? (Or should I continue using XML polling?)


    Regards,
    Ravi

    • Edited by Ravi_Raj Tuesday, March 20, 2012 6:49 AM
    Monday, March 19, 2012 7:12 AM
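
The six-to-three severity mapping described in the question could look roughly like the sketch below. It is written in Python only to illustrate the logic (the MP in this thread uses VBScript). The six source level names are hypothetical, since the thread does not list them; the SCOM alert severity values 0, 1 and 2 stand for Information, Warning and Critical. Falling back to Warning for an unknown level is also an assumption.

```python
# SCOM alert severity values: 0 = Information, 1 = Warning, 2 = Critical.
SCOM_INFORMATION, SCOM_WARNING, SCOM_CRITICAL = 0, 1, 2

# Hypothetical six-level source scale; the real names are not given in the thread.
SOURCE_TO_SCOM = {
    "cleared": SCOM_INFORMATION,
    "info": SCOM_INFORMATION,
    "condition": SCOM_WARNING,
    "warning": SCOM_WARNING,
    "major": SCOM_CRITICAL,
    "critical": SCOM_CRITICAL,
}


def to_scom_severity(source_level):
    """Map a source severity name to a SCOM severity (assumed default: Warning)."""
    return SOURCE_TO_SCOM.get(source_level.strip().lower(), SCOM_WARNING)


print(to_scom_severity("Major"))
```

Because the mapped value travels in the property bag, a severity override on the rule replaces it for every fault the rule handles, which is the problem raised in point 1.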


All replies

  • Hi,

    Regarding override severity, I would like to share the following article with you, which may be useful for you:

    Alert Severity and Priority use with override

    http://blogs.msdn.com/b/mariussutara/archive/2007/12/17/alert-severity-and-priority-use-with-override.aspx

    Meanwhile, I would like to know how you develop the MP file.


    Alex Zhao

    TechNet Community Support

    Tuesday, March 20, 2012 9:59 AM
  • This link really helped me understand how severity overrides work. But in my case a rule fetches the alerts for a component (for example, the server 1 rule fetches the roughly 500 different kinds of alerts related to server 1; suppose 300 are mapped to info, 150 to warning and 50 to critical in SCOM). If I apply a severity override to this rule (say value 1), then all the alerts go to the warning category (regardless of whether they were critical, warning or info), which is false information. So I cannot apply any severity override.

    Is there a way to overcome this issue (disabling the override)? For example, creating a rule for each alert value (which is more tedious and labor-intensive): is that feasible? Can you suggest a better way of doing things?

    What architecture do Microsoft MPs follow? Do they have all the alerts hard-coded into the MPs?

    "Meanwhile, I would like to know how you develop the MP file."

    I am sorry, but I am not sure what you would like to know.


    Regards,
    Ravi

    Tuesday, March 20, 2012 11:19 AM
  • If you really want to control which kind of alert is triggered per event, a dedicated rule is the best option. This has the added benefit of letting you provide specific knowledge within the alert definition, so operators can quickly get an idea of what the problem (and solution) for the alert could be.

    Only use monitors if you can provide a counter-event indicating the error is cleared. Depending on the complexity of your management pack, these monitors can be attached to an authored class (e.g. adapter).

    Monitors can be nice if they can self-heal and you have the time to set them up (preferably against your own class), but rules are the quickest way to provide centralized monitoring for your application.

    Tuesday, March 20, 2012 4:30 PM
  • "If you really want to control which kind of alert is triggered per event, a dedicated rule is the best option."

    I understand this approach, but in my case, if I have 1400+ rules (script-based) running at once, won't this overburden my system?


    Regards,
    Ravi

    Wednesday, March 21, 2012 12:04 PM
  • Regarding monitoring, no, there will be no impact. You will still define one trigger for an event, just more specific than a "catch-all" rule.

    If you want to prevent alert storms (if an event occurs 100 times in one minute, this will fill up your SCOM with alerts), you can look into the alert-suppression settings of the rules you create. This way you can let alerts increase their internal counter instead of generating a new alert for every event.

    Wednesday, March 21, 2012 12:40 PM
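
The effect of alert suppression described in this reply can be modeled as follows. This is a simplified Python model, not SCOM code: the `Alert` class, the `raise_alert` function and the rule and fault names are all hypothetical. It assumes that a repeated alert from the same rule with the same suppression values bumps a repeat counter on the open alert instead of creating a new one.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    key: tuple
    repeat_count: int = 0  # assumed to start at 0 for the first occurrence


def raise_alert(open_alerts, rule_id, suppression_values):
    """Create a new alert, or bump the counter of a matching open alert."""
    key = (rule_id, tuple(suppression_values))
    alert = open_alerts.get(key)
    if alert is None:
        alert = Alert(key=key)
        open_alerts[key] = alert
    else:
        alert.repeat_count += 1
    return alert


open_alerts = {}
for _ in range(100):
    # Same rule, same fault code, same component: suppressed into one alert.
    raise_alert(open_alerts, "Rule.Fault.F100", ["F100", "adapter1"])
print(len(open_alerts))
```

Which properties go into the suppression values decides what counts as "the same" alert: adding the component keeps one alert per component, while using only the fault code collapses all components into one.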
  • If I can monitor 1400+ alerts with monitors, do I have to write 1400 different monitors (all getting their data from one common script)? If I do this, how is the health state of a component determined if, say, 500+ monitors target one component?


    Regards,
    Ravi

    Wednesday, March 21, 2012 1:51 PM
  • You can simplify your health model by dividing your events into low, medium and high importance events (if this is possible). You could then create 3 monitors or rules, each targeting one category of events, and pass your event data along to give a more detailed description of the event that triggered the monitor/rule.

    To be able to make your monitors auto-resolving, you must extend your script to also return a value on a healthy condition.

    Also see: http://technet.microsoft.com/en-us/library/ff629449.aspx

    This will show you how to pass data from your script to the monitors/rules.

    Wednesday, March 21, 2012 3:10 PM
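
To make the "return a value on a healthy condition" point concrete, here is a minimal Python sketch of the logic (the real data source in this thread is a VBScript returning property bags). The function name, the `FaultCode` and `State` property names and the `OK`/`BAD` values are assumptions for illustration.

```python
def state_bags(known_codes, active_codes):
    """Return one bag per known fault code, including the healthy ones."""
    active = set(active_codes)
    return [
        {"FaultCode": code, "State": "BAD" if code in active else "OK"}
        for code in known_codes
    ]


# Hypothetical fault codes: F200 is currently present in the cache.
for bag in state_bags(["F100", "F200", "F300"], ["F200"]):
    print(bag)
```

A monitor needs both outcomes: the `BAD` bag drives it to an unhealthy state, and the `OK` bag on a later run lets it return to healthy and auto-resolve its alert.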
  • Currently my architecture does not support generating events, and I am not using any agent that writes these alerts as events (if it did, I could create Windows event rules, which would be much faster). Instead, I am reading these alerts from a file (say, an XML file).

    Regards,
    Ravi

    Friday, March 23, 2012 8:13 AM
  • You return a property bag. The datasource doesn't matter... You just shouldn't give a value to things in your datasource, that's for your rule/monitor(type)...

    It's a lot better practice to create rules/monitors per situation and not try to group them together (e.g. one that bothers me is a Kerberos catch-all rule from AD: in a certain setup this alert triggers on a "known issue", but I can only disable it as a whole, so I would miss out on important Kerberos errors).

    Also, having 2000 possible "events" doesn't mean they are all relevant. Use SCOM to alert operators about problems, but have them figure out the exact issues. So don't try to pick them all up, only pick up the ones that really mean something for an operator.


    Rob Korving
    http://jama00.wordpress.com/

    Saturday, March 24, 2012 11:00 AM
  • I have tried this out: I created 504 rules (script-based, as there are 504 different types of faults in my case), all sharing one data source (a VBScript). The script scans for all faults with the fault number provided by the rule and displays them in SCOM. I found this method rather heavy because:

    a) CPU utilization becomes far too high (504 scripts running at once).

    b) If there are more than 50 faults with the same fault number, it generates the alert "A rule has generated 50 alerts in the last 60 seconds."

    c) Even if only some faults (say 100 out of 504) are present in the cache file, all the rules run at once, which is not required (I want only those 100 rules to run).

    I need to overcome these issues and make my MP much smarter. I figured out a way:

    a) Instead of running 504 scripts at once, I want one script to generate Windows events. This script connects to my servers and fetches all the alerts from there (say x faults in total) and generates x different events. My rules would then read Windows events and not run any script. This way I avoid running the script 504 times.

    b) I am still stuck on this one: suppose there are 100 faults with the same fault code. The rule that fetches this fault will generate more than 50 alerts in SCOM, and I guess I will get the error mentioned above.

    c) Can a rule trigger other rule(s) to run? If this is possible, I can make one rule that scans for all the different kinds of faults and triggers only the relevant rules.

    What can I do to overcome these issues? Is there a better way?

    In my case, I have to use a script to fetch the alerts and store them in a cache file or similar. I am not using any custom agent (which would generate events). If I have to populate Windows events, I also have to do that with a script.


    Regards,
    Ravi

    Wednesday, March 28, 2012 5:48 AM
  • If your script runs once for every rule, you don't have cookdown... There is no need to generate Windows event log events.

    The second I don't get. You've created 504 rules, yet one rule triggers more than 50 times in 1 minute... If that means that a certain error could be in there 50+ times, Windows events won't help. If that event is in there once, you've done something wrong...

    Datasource - script - returns property bags for all errors (I would prefer to return all possible errors and give each a state like ok/bad or a count of how many times it occurred; this will make monitors possible too).

    Rule - filters on one certain error only (contains a filter to pick up only the events from the datasource that you are interested in and ignores the rest).

    http://myitforum.com/cs2/blogs/vdipippo/pages/workflow-cook-down-in-operations-manager-2007.aspx


    Rob Korving
    http://jama00.wordpress.com/

    Wednesday, March 28, 2012 6:33 AM
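
The split suggested in this reply (one data source that returns every fault, and one filter per rule) can be sketched as below. This is a Python illustration of the logic only, since the MP in this thread uses a VBScript data source. The XML layout, attribute names and fault codes are hypothetical, as the thread does not show the real cache schema. The sketch assumes the script takes no per-rule parameter such as a fault code, because identical data source configuration is what allows cookdown.

```python
import xml.etree.ElementTree as ET

# Hypothetical cache layout; the thread does not show the real XML schema.
SAMPLE_CACHE = """<faults>
  <fault code="F100" component="adapter1" text="Link down"/>
  <fault code="F100" component="adapter2" text="Link down"/>
  <fault code="F200" component="adapter1" text="Fan failure"/>
</faults>"""


def read_all_faults(cache_xml):
    """Data source: parse the cache once and return one bag (dict) per fault."""
    root = ET.fromstring(cache_xml)
    return [dict(fault.attrib) for fault in root.iter("fault")]


def rule_filter(bags, fault_code):
    """Rule: keep only the bags for the one fault code this rule cares about."""
    return [bag for bag in bags if bag.get("code") == fault_code]


bags = read_all_faults(SAMPLE_CACHE)     # parsed once, shared by every rule
f100_alerts = rule_filter(bags, "F100")  # what the rule for F100 would see
f200_alerts = rule_filter(bags, "F200")  # what the rule for F200 would see
print(len(f100_alerts), len(f200_alerts))
```

Passing the fault code into the script, as in the 504-rule attempt above, gives each rule a different data source configuration, so the script runs once per rule instead of once in total.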
  • Yes, in my scenario there can be more than 50 or 100 alerts (all different) with the same fault code. I am passing the fault code to the script (provided by the rule). The script parses the cache, fetches all faults with that fault code at once and packs them into property bags.

    The problem lies here.


    Regards,
    Ravi

    Wednesday, March 28, 2012 6:49 AM
  • So you want 50-100 alerts, or just 1 when the status code is the same?

    The first means you need a filter that can distinguish between the alerts by something other than the status code (you obviously can, as you say they are different). If you just want 1 alert when this status code happens 1+ times, I suggest you change your script to just count the occurrences and return 1 bag per fault code with a count property and its value.


    Rob Korving
    http://jama00.wordpress.com/

    • Marked as answer by Yog Li Tuesday, April 3, 2012 3:54 AM
    Wednesday, March 28, 2012 8:05 AM
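
The "count the occurrences and return 1 bag per fault code" suggestion could be sketched like this in Python (the real script in this thread is VBScript). The cache layout and the `FaultCode` and `Count` property names are assumptions, not taken from the thread.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical cache layout; the thread does not show the real XML schema.
SAMPLE_CACHE = """<faults>
  <fault code="F100" component="adapter1"/>
  <fault code="F100" component="adapter2"/>
  <fault code="F100" component="adapter3"/>
  <fault code="F200" component="adapter1"/>
</faults>"""


def count_bags(cache_xml):
    """Return one bag per fault code, carrying the number of occurrences."""
    root = ET.fromstring(cache_xml)
    counts = Counter(fault.get("code") for fault in root.iter("fault"))
    return [
        {"FaultCode": code, "Count": count}
        for code, count in sorted(counts.items())
    ]


for bag in count_bags(SAMPLE_CACHE):
    print(bag)
```

With one bag per fault code, a rule raises at most one alert per run however often the fault occurs, which avoids the "50 alerts in the last 60 seconds" warning mentioned earlier in the thread.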
  • Yes, I have faults with the same fault code but different parameters. So do I have to create separate rules for these types of faults?

    Also, what if I want to trigger my rule(s) from some other rule?


    Regards,
    Ravi

    Wednesday, March 28, 2012 9:06 AM