Invalid monitored object health state due to someone closing a monitor

  • Question

  • I've had this problem for a long time where the SCOM operators will manually close a monitor alert. The alert goes away, but the ghost of the monitor's error or warning health state remains in Health Explorer. This is bad because I won't get any more alerts if the error condition returns, or if it was never fixed in the first place. No matter how many times you tell people not to do this type of thing, they still do it.

    I came up with what I think might be the makings of a decent solution to this problem. I wrote a script that checks each object's health against the active monitor alerts and, if set up to do so, could reset the health state. In my searches I haven't come across much that speaks to this issue, or to fixing or preventing it altogether, so my apologies if this is a repeat question.

    I've included the script below. It's written to just report its findings and not take action at this time. I did some minor testing and so far it seems to work well. Any input as to whether this would be a good or bad solution would be appreciated. Or, if someone has a better idea altogether, that would be great too. Thanks in advance!

    cls
    # ************************************************************************************************************************************
    # Function Recurse-Health accepts the child nodes of a MonitoringObject's health state hierarchy.
    # It recurses through the hierarchy and returns the lowest-level child nodes that are in an Error or Warning state.
    # ************************************************************************************************************************************
    Function Recurse-Health {

      Param ($ChildNodes)
       
        ForEach ($ChildNode in $ChildNodes) {
          $ChildHealth = $ChildNode.item.healthstate
          If ($ChildHealth -eq "Error" -or $ChildHealth -eq "Warning") {
            $totalchildnodecount = $ChildNode.totalchildnodecount
           
            # Does the ChildNode have a getexternalmonitoringstatehierarchies property (aka health rollups)? If so use that instead of the childnodes property
            Try {
              $getexternalmonitoringstatehierarchiesTest = $ChildNode.item.getexternalmonitoringstatehierarchies()
            }
            Catch {
              $getexternalmonitoringstatehierarchiesTest = $False
            }
           
            # If there is more stuff to go through, keep going       
            If ($totalchildnodecount -ne 0 -and $getexternalmonitoringstatehierarchiesTest -eq $False) {
             $ChildNodes = $ChildNode.childnodes
             Recurse-Health -ChildNodes $ChildNodes
            }
            # Plow through health rollups, keep going
            ElseIf ($getexternalmonitoringstatehierarchiesTest -ne $False) {
              $ChildNodes = $ChildNode.item.getexternalmonitoringstatehierarchies()
              Recurse-Health -ChildNodes $ChildNodes
            }
            # Hey, i found one, return it!
            Else {
              $ChildNode
            }  
                      
          } # End If ($ChildHealth -eq "Error") {
        } # End ForEach ($ChildNode in $ChildNodes) {
     
    } # End Function Recurse-Health
    # ************************************************************************************************************************************

    # Begin Script
    # ************************************************************************************************************************************
    # ************************************************************************************************************************************
    Add-PSSnapin "Microsoft.EnterpriseManagement.OperationsManager.Client" -ErrorAction SilentlyContinue

    $RootMS = "RMS GOES HERE"  # Placeholder: replace with the name of your root management server
    $originalPath = Get-Location
    Set-Location "OperationsManagerMonitoring::" -ErrorVariable errSnapin
    $null = New-ManagementGroupConnection -ConnectionString $RootMS -ErrorVariable errSnapin
    Set-Location $RootMS -ErrorVariable errSnapin

    $oTable = @()

    # Get all the servers in scom
    $MC = Get-MonitoringClass | Where { $_.name -eq "Microsoft.Windows.Server.Computer" }
    # Get all the Servers that have an Error or Warning Health State
    $MO = $MC | Get-MonitoringObject | Where {$_.healthstate -eq 'Error' -or $_.healthstate -eq 'Warning'}

    # Get all monitor alerts in SCOM that are new (resolution state 0) or open (resolution state 128)
    $Alerts = Get-Alert | Where { ($_.ismonitoralert -ne $FALSE) -and (($_.resolutionstate -eq "0") -or ($_.resolutionstate -eq "128")) }

    # Start iterating through the servers to fix and report on suspect server health states
    ForEach ($Object in $MO) {
     
      # get just the server name, drop the rest of the junk on there
      $ServerName = ($Object.displayname).split(".")[0]
     
      # write server name (debugging stuff)
      $ServerName
      Write-Host " "
     
      # Get the alerts for this server from $Alerts
      $AlertObjects = $Alerts | Where { ((($_.monitoringobjectpath -like "*$ServerName*") -or ($_.monitoringobjectname -like "*$ServerName*") -or ($_.MonitoringObjectDisplayName -like "*$ServerName*") -or ($_.PrincipalName -like "*$ServerName*")) -and (($_.MonitoringObjectHealthState -eq "Error") -or ($_.MonitoringObjectHealthState -eq "Warning"))) }
       
      # Get the Health Explorer hierarchy
      $Hierarchy = $Object.getmonitoringstatehierarchy()
      # Get the top-level child nodes of the hierarchy
      $ChildNodes = $Hierarchy.childnodes
       
      # Call the recurse-health function pass in the starter nodes to find the monitor(s) in the hierarchy with error/warning state
      $MonitorObjects = Recurse-Health -ChildNodes $ChildNodes
       
      ForEach ($MonitorObject in $MonitorObjects) {
     
        # Create all kinds of variables for comparisons and stuff
        $MonitorName = $MonitorObject.item.monitorname
        $MonitorID = $MonitorObject.item.monitorID
        $Monitor = Get-Monitor $MonitorID
        $MonitorProblemID = $Monitor.id
        $HealthState = $MonitorObject.item.healthstate
        # Reuse the monitor fetched above instead of calling Get-Monitor a second time
        $AlertOnState = $Monitor.alertsettings.AlertOnState
         
        # set bAlert to false and see if you find a match
        $bAlert = "False"
         
        ForEach ($AlertObject in $AlertObjects) {
           
          # check and see if there is an alert to match the object health
          $AlertProblemId = $AlertObject.problemid
          If ($MonitorProblemID -eq $AlertProblemId) {
            # Found a match its a legit health status move on
            $bAlert = "True"
            $MatchedAlert = $AlertObject.name
          } # End If ($MonitorProblemID -eq $AlertProblemId) {
     
        } # End ForEach ($AlertObject in $AlertObjects) {
         
        #insurance against processing objects going healthy while the script runs
        If ($MonitorName -eq $null) {
          $bAlert = "True"
          $MatchedAlert = "Healthstate went green while processing"
        } # End If ($MonitorName -eq $null) {
         
        # The health status for this monitor object doesnt have a corresponding active alert, reset that mofo, report on it
        If ($bAlert -eq "False") {
         
          $MonitorName
          Write-Host " "
           
          # Determine if the monitor is even set to alert, if its not; skip it, but report on it      
          If ($HealthState -eq $AlertOnState) { # No matching alert found, this monitor should create an alert if an error condition exists, reset health
           
            Write-Host "Alert on state: " $AlertOnState
            Write-Host "Health state: " $HealthState
           
            Write-Host " "
            $oRow = New-Object psobject
            $oRow | Add-Member -MemberType NoteProperty -name ServerName -value $ServerName
            $oRow | Add-Member -MemberType NoteProperty -name Status -value "Not Legit Alert/Health - Reset Health"
            $oRow | Add-Member -MemberType NoteProperty -name Description -value $MonitorName
            $oRow | Add-Member -MemberType NoteProperty -name AlertOnState -value $AlertOnState
            $oRow | Add-Member -MemberType NoteProperty -name HealthState -value $HealthState
            $oTable += $oRow
           
            # Reset the health
            #$MonitorObject.item.reset() 
            Write-Host "-------------------------------------------------------------------------------------------------------------------------------"
           
          } # End If ($HealthState -eq $AlertOnState) {
         
          Else { # No matching alert, however there is a suspect health error/warning state where alert is not generated
           
            Write-Host "Alert on state: " $AlertOnState
            Write-Host "Health state: " $HealthState
           
            Write-Host " "
            $oRow = New-Object psobject
            $oRow | Add-Member -MemberType NoteProperty -name ServerName -value $ServerName
            $oRow | Add-Member -MemberType NoteProperty -name Status -value "Suspect - Check for warning or error health state but no alert configured"
            $oRow | Add-Member -MemberType NoteProperty -name Description -value $MonitorName
            $oRow | Add-Member -MemberType NoteProperty -name AlertOnState -value $AlertOnState
            $oRow | Add-Member -MemberType NoteProperty -name HealthState -value $HealthState
            $oTable += $oRow
            
            Write-Host "-------------------------------------------------------------------------------------------------------------------------------"
           
          } # End Else

        } # If ($bAlert -eq "False") {
         
        # The health status for this monitor object has a corresponding active alert, move along, nothing to see here
        Else {
           
          Write-Host "Found a Legit Alert"
          Write-Host "Legit"
          $MatchedAlert
          Write-Host " "
          $oRow = New-Object psobject
          $oRow | Add-Member -MemberType NoteProperty -name ServerName -value $ServerName
          $oRow | Add-Member -MemberType NoteProperty -name Status -value "Legit Alert/Health - Take No Action"
          $oRow | Add-Member -MemberType NoteProperty -name Description -value $MatchedAlert
          $oRow | Add-Member -MemberType NoteProperty -name AlertOnState -value $AlertOnState
          $oRow | Add-Member -MemberType NoteProperty -name HealthState -value $HealthState
          $oTable += $oRow
          write-Host "-------------------------------------------------------------------------------------------------------------------------------"
           
        } # End Else {
         
         
      } # End ForEach ($MonitorObject in $MonitorObjects) {
     
    } # ForEach ($Object in $MO) {

    # export log
    If ((Test-Path c:\temp) -eq $False) { md c:\temp  }
    $oTable | Export-CSV c:\temp\ServerHealthResult.csv -notypeinformation

    Read-host "Results are located here: c:\temp\ServerHealthResult.csv. Press Enter to continue"
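
    The leaf search that Recurse-Health performs can be tried without a management group. The following is a minimal sketch, assuming PowerShell 3.0 or later. The New-MockNode helper, the Find-UnhealthyLeaf function, and the monitor names are hypothetical stand-ins for the SCOM hierarchy objects, and external rollups are left out.

    ```powershell
    # Hypothetical stand-in for a health hierarchy node (not a SCOM type).
    function New-MockNode {
        param([string]$Name, [string]$HealthState, [object[]]$ChildNodes = @())
        [pscustomobject]@{
            Item       = [pscustomobject]@{ MonitorName = $Name; HealthState = $HealthState }
            ChildNodes = $ChildNodes
        }
    }

    # Walk only the unhealthy branches and emit the nodes that have no children.
    function Find-UnhealthyLeaf {
        param([object[]]$Nodes)
        foreach ($Node in $Nodes) {
            # Healthy branches cannot contain the monitor we are looking for.
            if ($Node.Item.HealthState -notin 'Error', 'Warning') { continue }
            if ($Node.ChildNodes.Count -gt 0) {
                Find-UnhealthyLeaf -Nodes $Node.ChildNodes
            }
            else {
                $Node
            }
        }
    }

    # Made-up two-level hierarchy for illustration.
    $tree = @(
        New-MockNode 'Availability' 'Error' @(
            (New-MockNode 'Heartbeat' 'Success'),
            (New-MockNode 'Service Check' 'Error')
        )
        New-MockNode 'Performance' 'Success' @(
            (New-MockNode 'CPU' 'Success')
        )
    )

    $leaves = @(Find-UnhealthyLeaf -Nodes $tree)
    $leaves | ForEach-Object { $_.Item.MonitorName }
    ```

    Unlike the script, the sketch passes each node's children straight into the recursive call rather than reassigning the variable being looped over, which keeps the loop easier to follow.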



    Friday, May 17, 2013 9:54 PM

All replies

  • Hello,

    The following thread seems to discuss a similar issue:

    Automatic Alert Resolution closes the alerts but doesn't reset the Health State !
    http://social.technet.microsoft.com/Forums/en-US/operationsmanagergeneral/thread/2e110a7b-63ee-4106-89bd-fff76da05f0a/

    Please check if it can help you.

    Thanks,


    Yog Li
    TechNet Community Support

    Tuesday, May 21, 2013 10:11 AM
  • Thank you for the extra info on the subject. One of the links in that thread went to a blog about a connector someone wrote, which seemed interesting, but I was unable to download it from work. It could be that it's blocked here for some reason. I'll have to check it out from home.

    I think I’ll step through each orphaned health state the script finds and make sure my logic is sound and that it sets the state as expected. I'll report back with how well it did. If no one knows of a better way to fix or prevent this I’ll edit and pretty up my script a bit and maybe it can at least help someone else in a similar predicament.
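
    One way to step through the findings is to load the exported report and pull out only the rows flagged for a reset. This is a sketch that assumes the CSV columns the script writes (ServerName, Status, Description, AlertOnState, HealthState). The sample rows, server names, and temporary file are made up for illustration.

    ```powershell
    # Made-up sample rows in the same shape the script exports.
    $sample = @(
        [pscustomobject]@{ ServerName = 'SRV01'; Status = 'Not Legit Alert/Health - Reset Health'; Description = 'Disk Free Space'; AlertOnState = 'Error'; HealthState = 'Error' }
        [pscustomobject]@{ ServerName = 'SRV01'; Status = 'Legit Alert/Health - Take No Action'; Description = 'Service Stopped'; AlertOnState = 'Error'; HealthState = 'Error' }
        [pscustomobject]@{ ServerName = 'SRV02'; Status = 'Not Legit Alert/Health - Reset Health'; Description = 'CPU Utilization'; AlertOnState = 'Warning'; HealthState = 'Warning' }
    )

    # Write the sample to a temporary file so the real report is not touched.
    $path = Join-Path ([System.IO.Path]::GetTempPath()) 'ServerHealthResult-sample.csv'
    $sample | Export-Csv -Path $path -NoTypeInformation

    # Keep only the reset candidates, then count them per server for manual review.
    $candidates = @(Import-Csv -Path $path | Where-Object { $_.Status -like 'Not Legit*' })
    $perServer  = @($candidates | Group-Object -Property ServerName | Select-Object Name, Count)
    $perServer | Format-Table -AutoSize

    Remove-Item -Path $path
    ```

    Pointing $path at c:\temp\ServerHealthResult.csv and dropping the sample and cleanup lines would run the same filter against a real report.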

    Tuesday, May 21, 2013 5:28 PM
  • I found an issue in my testing. Overall it works pretty well; however, it doesn't account for monitors that affect the health state but do not alert. Sometimes, for whatever reason, you'll have a monitor that is overridden to not alert, or a three-state monitor that only alerts at critical. Both of these circumstances throw a small wrench in the works.

    I'll see if I can work some logic in for that and re-post my findings. I think that if those cases can be handled, and maybe written to a log for review, everything will be hunky-dory.
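
    The two cases above could be separated from true reset candidates with a small decision function. This is only a sketch under stated assumptions: Get-HealthVerdict and its parameters are hypothetical, the caller is assumed to already know whether alerting is enabled for the monitor (working that out from overrides is not shown), and an alert-on state of Warning is assumed to cover Error as well.

    ```powershell
    # Hypothetical helper: decide what to do with one unhealthy monitor.
    function Get-HealthVerdict {
        param(
            [string]$HealthState,          # current state: Warning or Error
            [string]$AlertOnState,         # state at which the monitor alerts
            [bool]$AlertsEnabled = $true,  # ASSUMPTION: supplied by the caller
            [bool]$HasOpenAlert = $false   # true when a matching open alert exists
        )
        # Rank the states so "at or above the alert threshold" can be tested.
        $rank = @{ 'Success' = 0; 'Warning' = 1; 'Error' = 2 }

        if ($HasOpenAlert) { return 'Legit' }
        if (-not $AlertsEnabled) { return 'Review: monitor does not alert' }
        # ASSUMPTION: alerting on Warning also alerts on Error.
        if ($rank[$HealthState] -ge $rank[$AlertOnState]) { return 'Reset candidate' }
        return 'Review: state below alert threshold'
    }

    # A three-state monitor that only alerts at critical, currently in warning.
    $belowThreshold = Get-HealthVerdict -HealthState 'Warning' -AlertOnState 'Error'
    # A monitor that should have alerted but has no open alert.
    $orphaned = Get-HealthVerdict -HealthState 'Error' -AlertOnState 'Warning'
    # A monitor overridden to not alert.
    $silent = Get-HealthVerdict -HealthState 'Error' -AlertOnState 'Error' -AlertsEnabled $false

    $belowThreshold, $orphaned, $silent
    ```

    Only the "Reset candidate" verdict would ever lead to a reset; the two "Review" verdicts are meant for the log.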

    Wednesday, May 29, 2013 5:18 PM
  • I updated the script in the first comment. It's not perfect yet, but I think it's improved.

    I also thought that rollup alerts might be another problem, since the script can't account for those either. The rabbit hole seems never-ending. The current script reports on the health states it finds suspect and does not act on them. Actually, it won't act on anything; that part is commented out. If the "alert on state" from the monitor and the "health state" from the object don't match, it leaves the state alone and flags it as suspect. It actually found a few misconfigured alerts for me, which was nice. I haven't figured out how to take into account overrides and health rollups that cause alerts. Still a work in progress.
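
    Instead of commenting the reset in and out, the report-only behaviour could be made a switch using PowerShell's built-in -WhatIf support. A minimal sketch: Invoke-HealthReset is a hypothetical wrapper, and the script block passed to it is a stand-in that only counts calls, since the actual reset call is not confirmed here.

    ```powershell
    # Hypothetical wrapper: run the reset action only when not in -WhatIf mode.
    function Invoke-HealthReset {
        [CmdletBinding(SupportsShouldProcess = $true)]
        param(
            [Parameter(Mandatory = $true)][string]$Target,
            [Parameter(Mandatory = $true)][scriptblock]$ResetAction
        )
        if ($PSCmdlet.ShouldProcess($Target, 'Reset health state')) {
            & $ResetAction | Out-Null
            return $true
        }
        return $false
    }

    # Stand-in action that counts how often a reset would really have happened.
    $counter = @{ Resets = 0 }
    $action  = { $counter.Resets++ }

    # Report only: prints a "What if" message and skips the action.
    $dryRun = Invoke-HealthReset -Target 'SRV01 / Disk Free Space' -ResetAction $action -WhatIf
    # Live run: the action is executed.
    $live = Invoke-HealthReset -Target 'SRV01 / Disk Free Space' -ResetAction $action

    "Dry run acted: $dryRun; live run acted: $live; resets performed: $($counter.Resets)"
    ```

    The return value makes it easy to add an "action taken" column to the report rows.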

    Thursday, May 30, 2013 2:46 PM