What to Automate in your Monitoring Tools

  • General discussion

  • Hello:

    I wanted to start a discussion on the types of automatic remediation that you recommend turning on in your event monitoring tool. Automatic responses to events and incidents increase an organization's ability to respond to incidents, reduce the downtime those incidents cause, and improve process maturity overall.

    To start, and I am no technical expert, these are some of the automated actions I would implement in my monitoring tool:

    Event = running out of database extents
        Automated action = create incident ticket, increase extents, close alert/incident ticket
    Event = running out of disk space
        Automated action = create incident ticket, delete all temp files, close alert/incident ticket
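A minimal sketch of how event/action pairs like these could be wired together, assuming a simple lookup table from event name to remediation function. The event names, the `Ticket` class, and the handler functions are hypothetical placeholders for illustration, not the API of any particular monitoring tool:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Ticket:
    """Hypothetical incident ticket that records what automation did."""
    event: str
    status: str = "open"
    log: List[str] = field(default_factory=list)


def increase_extents(ticket: Ticket) -> bool:
    # Placeholder: a real handler would call the database's admin tooling.
    ticket.log.append("increased database extents")
    return True


def delete_temp_files(ticket: Ticket) -> bool:
    # Placeholder: a real handler would clean a known temp directory.
    ticket.log.append("deleted temp files")
    return True


# Assumed event names mapped to their automated action.
RUNBOOK: Dict[str, Callable[[Ticket], bool]] = {
    "db_extents_low": increase_extents,
    "disk_space_low": delete_temp_files,
}


def handle_event(event: str) -> Ticket:
    """Create a ticket, run the matching action, close the ticket on success."""
    ticket = Ticket(event=event)
    action = RUNBOOK.get(event)
    if action is None:
        ticket.log.append("no automated action; left for an operator")
        return ticket
    if action(ticket):
        ticket.status = "closed"
    return ticket
```

Events without an entry in the table stay open for a human, so the ticket always shows whether automation acted or not.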

    So what other actions could we add as best practices for our customers?

    Monday, May 5, 2008 7:02 PM

All replies

  • Hi Kathleen,
    Here are some examples that I use:

    Event = security logs cleared
        Automated action = create incident ticket, send an e-mail message to the security administrator, close alert/incident ticket

    Event = new computer added to the domain
        Automated action = create incident ticket, send an e-mail message to the regional domain administrator, close alert/incident ticket

    Event = heartbeat does not respond
        Automated action = create incident ticket, send a ping command to the server; on success close the alert/incident ticket, on failure send an e-mail message to the regional domain administrator
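The heartbeat example could be sketched roughly as follows, assuming the system `ping` command is available on the monitoring host. The function names and the `notify` callback are hypothetical; a real setup would use the monitoring tool's own notification channel:

```python
import subprocess
import sys
from typing import Callable


def ping(host: str) -> bool:
    """Send one echo request with the system ping command."""
    # Windows uses -n for the packet count; Linux and macOS use -c.
    count_flag = "-n" if sys.platform.startswith("win") else "-c"
    result = subprocess.run(
        ["ping", count_flag, "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # Exit code semantics differ between platforms, so treat this as a hint.
    return result.returncode == 0


def handle_missed_heartbeat(
    host: str,
    probe: Callable[[str], bool] = ping,
    notify: Callable[[str], None] = print,
) -> str:
    """Return the ticket state after verifying a missed heartbeat."""
    if probe(host):
        # Server answers, so the alert/incident ticket can be closed.
        return "closed"
    # Server does not answer: tell the administrator and keep the ticket open.
    notify(f"Heartbeat lost and ping failed for {host}")
    return "escalated"
```

Passing the probe and the notifier as parameters keeps the decision logic testable without touching the network.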

    There are more. Your idea for this post is great, because I'm using SCOM now and configuring exactly this.
    Best Regards

    Cleber Marques
    MOF Brasil Project
    www.clebermarques.com

    Monday, May 5, 2008 7:19 PM
    Moderator
  • Monitoring has many aspects, and all of them must be considered in the light of an information architecture for reporting, i.e.:

    • What
    • When
    • Where
    • Why
    • How

    I use my own 4-layer model to represent this, keeping in mind at all times that the reason you measure is to report, so that you can manage and improve your service delivery ("you can't manage what you can't measure"). Management needs metrics, and the metrics are derived from a reporting model.

    So what does the model look like? It has 4 layers, corresponding to

    1. CEO - needs information to manage the business
    2. CIO - needs information to manage IT
    3. Service Manager - needs information to manage services
    4. Operator - needs information to manage events

    Each layer gets input from the layer below plus additional information for that level. So L4 gets events from the event monitoring systems (such as SCOM); L3 gets consolidated L4 data plus service performance feedback from service customers (the business); L2 gets consolidated L3 information plus IT performance information (such as human resources, finances, business satisfaction, etc.); and L1 gets L2 information plus key statistics from other lines of business such as marketing, finance, sales, etc.

    A key component is that all layers rely on the layers ABOVE to specify what they need :-) This means that before you can decide what event-related data to capture at the L4 stage, the L3 layer must define the reporting that is needed to satisfy their needs (at this layer it will be the service level agreements that should state the information required), and so on for all the layers.

    Ultimately the measurements that are then set up are precisely what you need and not a bunch of "cool" stuff to measure :-)

    Cheers Shane

    Tuesday, May 6, 2008 12:59 PM
  • Hi,
    Another very common and important step in alarm-to-ticket automation that I come across in my customer scenarios is gathering information about the server related to the event. Once a ticket is created, the necessary server information is collected and made automatically available to the operator/technician in various fashions. This information could, for example, be:
    • Server name/location
    • SLA related to the server in question
    • System owner/responsible persons - contact details
    • Previous incident history
    • Special information related to this server (and maybe even this event), for example "run this particular script when this happens"

    Now this requires some development and configuration in order to set up, integrate, and correlate the various sources of information, but eventually it will give the monitoring operator/technician a heads-up about the system with minimum human interaction and time consumption.
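As a rough illustration of this enrichment step, here is a sketch that assumes the server details live in some lookup store. The `CMDB` and `INCIDENT_HISTORY` dictionaries, the server name, and all field names are made-up stand-ins for whatever sources a real integration would query:

```python
from typing import Any, Dict, List

# Hypothetical in-memory stand-ins for a CMDB and an incident history store.
CMDB: Dict[str, Dict[str, str]] = {
    "srv-db-01": {
        "location": "Datacenter A",
        "sla": "Gold",
        "owner": "dba-team@example.com",
        "notes": "Run the documented failover script if the database stops.",
    }
}
INCIDENT_HISTORY: Dict[str, List[str]] = {
    "srv-db-01": ["disk space low", "service restarted"],
}


def enrich_ticket(ticket: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the ticket with server context attached."""
    server = ticket["server"]
    record = CMDB.get(server, {})
    enriched = dict(ticket)
    # Fall back to "unknown" so the operator sees that data is missing.
    for key in ("location", "sla", "owner", "notes"):
        enriched[key] = record.get(key, "unknown")
    enriched["previous_incidents"] = list(INCIDENT_HISTORY.get(server, []))
    return enriched
```

Calling `enrich_ticket({"server": "srv-db-01", "event": "disk space low"})` returns the original fields plus location, SLA, owner, notes, and previous incidents, so the operator does not have to look them up by hand.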

    rgds,
    Klaus

    Tuesday, May 6, 2008 4:46 PM
  • In the new Team SMF the role of the Technology Area Manager is introduced. This role is responsible for the performance of the specific infrastructure components assigned to him/her. The basic idea here is that there are two aspects of importance to service delivery: the factual performance of the components and the subjective perception of the service by the end user. Responsibility for the performance of the infrastructure components can be delegated from the Operations Manager (who is accountable) to the Technology Area Manager. The Service (Level) Manager is of course responsible for managing the user perception (together with the Support Manager and the Support Team).
    In the regular Operations Review meeting the Technology Area Manager will have to report on the performance. This can then be input for the review meetings with the customers that the Service (Level) Manager will have to hold.
    So, what does the Technology Area Manager need to know in order to report truthfully to the Operations Manager and the Service (Level) Manager on past performance? That is the question.
    There is another thing to consider. The Technology Area Manager is responsible for the operational performance of the infrastructure components and therefore works within the limitations of capacity management. It sometimes takes several months to actually update an infrastructure (expand storage systems, install new switches, etc.). During this time the infrastructure needs to perform, and here lies the responsibility of the Technology Area Manager: to keep the systems running smoothly.

    Paul Leenards
    Getronics Consulting
     
    Tuesday, May 6, 2008 7:25 PM
  • Hi Kathleen

    There are lots of things you should consider when you create automated actions.

    In my small world, creating automated actions should follow your standard change rules and procedures. This means that the risk and impact of the automated action must be small and its outcome must be known.

    Many companies have chosen to restart Windows services automatically when they stop. This is one of the examples where you want to think really hard before doing that.

    The outcome of starting a service when it stops is not always known and should therefore be considered a risk.

    In this example, restarting a Windows service can be fatal if you are trying to update a SQL database to SP1. If the service starts during the upgrade, the installation will fail. For the people installing SP1 it can be a really hard job to find out what or who started the service, and troubleshooting can take a really long time. So this could be shooting yourself in the foot.

    So you must know the outcome and have a low risk before you start thinking about doing automated actions.
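One way to picture such a check is a small gate that every automated action has to pass before it runs. This is only a sketch under my own assumptions: the `AutomatedAction` fields, the risk labels, and the idea of a set of servers currently under maintenance are all hypothetical:

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class AutomatedAction:
    """Hypothetical description of an action as approved by change management."""
    name: str
    risk: str            # assumed labels: "low", "medium" or "high"
    outcome_known: bool


def may_run(action: AutomatedAction, server: str,
            servers_in_maintenance: Set[str]) -> bool:
    """Allow an action only if it is low risk, predictable, and safe right now."""
    # Never act on a server someone is working on, e.g. during an SP1 upgrade.
    if server in servers_in_maintenance:
        return False
    return action.risk == "low" and action.outcome_known
```

With a gate like this, a service restart that is harmless on a normal day is skipped automatically while the server is flagged for maintenance.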

    Deleting logs, or all logs, might not always be a good idea: how would you then understand what's going on? Will problem management need them?

    Just some thoughts

    I do agree with you all that the automated action should be traced in the service desk tool.

    My 5 c

    Thanks

    Anders
    Tuesday, May 20, 2008 7:00 PM