SCOM monitoring health state change alerts are very high....

  • Question

  • The number of SCOM health state change alerts is very high. We think the resulting data load caused two of our management servers to go gray.

    Below are the alert counts for the last 7 days:

    Count  Alert (display name, then monitor ID)
    8492  Monitoring Host Private Bytes Threshold Microsoft.SystemCenter.Agent.MonitoringHost.PrivateBytesThreshold
    3502  Monitoring Host Handle Count Threshold Microsoft.SystemCenter.Agent.MonitoringHost.HandleCountThreshold
    2513  Asynchronous Group Policy Setting Causing Delay Monitor Microsoft.Windows.GroupPolicy.2008.Runtime.ApplicationofGroupPolicy.System.ForcesynchronousGroupPolicyprocessing.EventBased.UnitMonitor
    2308  Certificate lifespan SystemCenterCentral.Utilities.Certificates.CertificateAboutToExpire.Monitor
    2237  Certificate validity SystemCenterCentral.Utilities.Certificates.CertificateValidity.Monitor
    1974  Average Logical Disk Seconds Per Transfer Microsoft.Windows.Server.10.0.LogicalDisk.AvgDiskSecPerTransfer
    1621  Available Megabytes of Memory Microsoft.Windows.Server.2008.OperatingSystem.MemoryAvailableMBytes
    1546  KHI: Exchange Control Panel connectivity (External) transaction failures. _45CBC307_03C8_44E3_B504_79D338F98F6D_
    1149  Average Logical Disk Seconds Per Transfer Microsoft.Windows.Server.2008.LogicalDisk.AvgDiskSecPerTransfer
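
    To decide where tuning pays off first, the counts above can be ranked by their share of the total. The following is only a sketch: the function name `rank_by_share` and the short labels are made up for illustration, and the numbers are copied from the table above.

```python
# Counts copied from the 7-day table above; the short labels are illustrative only.
ALERT_COUNTS = {
    "MonitoringHost.PrivateBytesThreshold": 8492,
    "MonitoringHost.HandleCountThreshold": 3502,
    "GroupPolicy asynchronous processing": 2513,
    "Certificate lifespan": 2308,
    "Certificate validity": 2237,
    "AvgDiskSecPerTransfer (Server.10.0)": 1974,
    "MemoryAvailableMBytes (Server.2008)": 1621,
    "Exchange ECP connectivity (External)": 1546,
    "AvgDiskSecPerTransfer (Server.2008)": 1149,
}


def rank_by_share(counts):
    """Return (label, count, percent of total) tuples, highest count first."""
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [(label, count, 100.0 * count / total) for label, count in ranked]


if __name__ == "__main__":
    for label, count, percent in rank_by_share(ALERT_COUNTS):
        print(f"{count:6d}  {percent:5.1f}%  {label}")
```

    With these figures, the two Monitoring Host monitors together account for roughly 47% of the listed state changes, so they are a reasonable place to start.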

    Help needed here...


    Thanks, Shiva ravichandran.

    Friday, September 13, 2019 5:09 PM

All replies

  • Hello Shiva,

    The default thresholds for many monitors are set very low, which can cause a flood of alerts. You should reconfigure the thresholds for the Monitoring Host Private Bytes Threshold and Monitoring Host Handle Count Threshold monitors. Kevin Holman's blog post explains how:

    https://kevinholman.com/2017/05/29/stop-healthservice-restarts-in-scom-2016/

    For the other alerts that have a high count, I suggest you tune the thresholds as well (if they have thresholds).

    Best regards,
    Leon


    Blog: https://thesystemcenterblog.com

    Friday, September 13, 2019 6:18 PM
  • Hi Shiva,

    I agree with Leon on tuning the thresholds, but there could also be other reasons for the management servers going gray. How long do they stay in this state? What do you do to bring them back to a healthy state?

    Did you check the events on the affected servers at the time they went gray?

    Usually, when the health service reaches the thresholds, it restarts and everything should be fine for a while (until the threshold is breached again), but the servers should not go gray.

    Regards,


    (Please take a moment to "Vote as Helpful" and/or "Mark as Answer" where applicable. This helps the community, keeps the forums tidy, and recognizes useful contributions. Thanks!) Blog: https://blog.pohn.ch/ Twitter: @StoyanChalakov

    Friday, September 13, 2019 8:06 PM
    Moderator
  • Hi Leon Laude 

    I appreciate your quick response.

    The link you shared helped me a lot, and I've changed the threshold values as well.

    I expect that will resolve the first two state change alerts below, but the other alerts I listed still need to be looked into:

    ( 8492  Monitoring Host Private Bytes Threshold Microsoft.SystemCenter.Agent.MonitoringHost.PrivateBytesThreshold
    3502  Monitoring Host Handle Count Threshold )

    2513  Asynchronous Group Policy Setting Causing Delay Monitor Microsoft.Windows.GroupPolicy.2008.Runtime.ApplicationofGroupPolicy.System.ForcesynchronousGroupPolicyprocessing.EventBased.UnitMonitor
    2308  Certificate lifespan SystemCenterCentral.Utilities.Certificates.CertificateAboutToExpire.Monitor
    2237  Certificate validity SystemCenterCentral.Utilities.Certificates.CertificateValidity.Monitor
    1974  Average Logical Disk Seconds Per Transfer Microsoft.Windows.Server.10.0.LogicalDisk.AvgDiskSecPerTransfer
    1621  Available Megabytes of Memory Microsoft.Windows.Server.2008.OperatingSystem.MemoryAvailableMBytes
    1546  KHI: Exchange Control Panel connectivity (External) transaction failures. _45CBC307_03C8_44E3_B504_79D338F98F6D_
    1149  Average Logical Disk Seconds Per Transfer


    Thanks, Shiva ravichandran.

    Friday, September 13, 2019 8:27 PM
  • As Stoyan said, you will also need to find the root cause of these alerts: are you actually experiencing problems, or do the monitors just need more tuning?

    The following ones:

    • Average Logical Disk Seconds Per Transfer (Microsoft.Windows.Server.10.0.LogicalDisk.AvgDiskSecPerTransfer)
    • Available Megabytes of Memory
    • Average Logical Disk Seconds Per Transfer (Microsoft.Windows.Server.2008.LogicalDisk.AvgDiskSecPerTransfer)

    These can fire frequently when systems or databases are busy, which is perfectly normal, but you should still verify that the systems are healthy. If they are generally healthy, you may need to raise the thresholds a bit.

    As for the certificate alerts, these occur because you actually have certificates that are expiring soon or are no longer valid, and these should be checked. If you consider the thresholds too low, you can raise them, but certificates are important, so you should definitely take a closer look at these.

    Sometimes expired certificates haven't been cleaned up. Make sure they get removed so that SCOM won't alert on them.
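
    As a rough illustration of that triage, the sketch below sorts certificates into "expired" (cleanup candidates), "expiring" (needs renewal) and "ok" based on the expiry date. The function name `classify_certificate` and the 30-day `warn_days` window are assumptions for the example; they are not the management pack's actual settings.

```python
from datetime import date


def classify_certificate(not_after, today, warn_days=30):
    """Classify a certificate by its expiry date.

    warn_days is a hypothetical warning window chosen for this example.
    """
    days_left = (not_after - today).days
    if days_left < 0:
        return "expired"    # candidate for cleanup so it stops alerting
    if days_left <= warn_days:
        return "expiring"   # should be renewed or replaced
    return "ok"


if __name__ == "__main__":
    today = date(2019, 9, 13)
    # Made-up expiry dates, purely for demonstration.
    for not_after in (date(2019, 9, 1), date(2019, 9, 30), date(2020, 9, 13)):
        print(not_after, classify_certificate(not_after, today))
```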

    The Exchange alert indicates that some of the transactions in the Exchange Control Panel connectivity test failed. I suggest you review the event logs, make sure all services are running, and check that Exchange is healthy.

    Also check with the team responsible for these systems (Exchange, for example) whether this alert is relevant or necessary to have.



    Friday, September 13, 2019 8:39 PM
  • Hi Stoyan Chalakov,

    I thought the extra load was what caused the management servers to go gray, so we moved a large number of agents from the affected management servers to other management servers. I also found that most of the gateway servers report to the affected management servers. Because this is a PROD environment, I couldn't investigate further, and I couldn't fail the gateway servers over to another management server either.

    The grayed management servers logged event errors indicating that data insertion into the operational database and the data warehouse was not working properly, which is why I migrated the agents.

    The events are below:

    ---------------------------------------------------------------------------------------------------------------------------

    Log Name:      Operations Manager
    Source:        Health Service Modules
    Date:          6/25/2019 2:35:26 PM
    Event ID:      31551
    Task Category: Data Warehouse
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      sdffas
    Description:
    Failed to store data in the Data Warehouse. The operation will be retried.
    Exception 'InvalidOperationException': The given value of type String from the data source cannot be converted to type nvarchar of the specified target column. 

    One or more workflows were affected by this.  

    Workflow name: Microsoft.Exchange.15.MailboxStatsSubscription.Rule 
    Instance name: sfdfsfdaf
    Instance ID: {1EE4544E-32BC-65BB-D4A1-E7525C61C10C} 
    Management group: sdfsf

    ----------------------------------------------------------------------------------------------------------------------------------------

    Log Name:      Operations Manager
    Source:        OpsMgr Connector
    Date:          6/25/2019 2:34:36 PM
    Event ID:      20034
    Task Category: Availability
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      sdfsf
    Description:
    The health service {D3B679C2-390B-69A3-C8DA-24EE57C3CD0F} running on host sdfsafdas and serving management group sfdsadf with id {D7B0417C-E590-184F-0767-23427F4A27A1} is not healthy.  Entity state change flow is stalled with pending acknowledgement.


    ----------------------------------------------------------------------------------------------------------------------------------------

    Log Name:      Operations Manager
    Source:        OpsMgr SDK Service
    Date:          6/25/2019 2:32:18 PM
    Event ID:      26319
    Task Category: None
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      sfdsdafsda
    Description:
    An exception was thrown while processing GetUserRolesForOperationAndUser for session ID uuid:849e1102-d445-4993-8b4c-618f02e35991;id=22369.
     Exception message: Value does not fall within the expected range.
     Full Exception: System.ArgumentException: Value does not fall within the expected range.
       at Microsoft.EnterpriseManagement.Interop.Security.Auth.IAzApplication2.InitializeClientContextFromStringSid(String SidString, Int32 lOptions, Object varReserved)
       at Microsoft.EnterpriseManagement.Mom.Sdk.Authorization.AzManHelper.GetScopedRoleAssignmentsForUser(Int32 operationNumericId, String userName)
       at Microsoft.EnterpriseManagement.Mom.Sdk.Authorization.AuthManager.GetUserRolesForOperationAndUser(Guid operationId, String userName)
       at Microsoft.EnterpriseManagement.Mom.Sdk.Authorization.AuthorizationService.GetUserRolesForOperationAndUser(Guid operationId, String userName)
       at Microsoft.EnterpriseManagement.ServiceDataLayer.SecurityConfigurationService.GetUserRolesForOperationAndUser(Guid operationId, String userName)
       at Microsoft.EnterpriseManagement.Mom.ServiceDataLayer.SdkDataAccessBackCompatProxy.GetUserRolesForOperationAndUser(Guid operationId, String userName)

    ----------------------------------------------------------------------------------------------------------------------------------------

    Log Name:      Operations Manager
    Source:        OpsMgr Connector
    Date:          6/25/2019 2:38:17 PM
    Event ID:      20038
    Task Category: Availability
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      sdfsadf
    Description:
    The health service {D3B679C2-390B-69A3-C8DA-24EE57C3CD0F} running on host sdfasfd and serving management group sdfsafd with id {D7B0417C-E590-184F-0767-23427F4A27A1} is not healthy.  Alert flow is stalled with pending acknowledgement.

    ----------------------------------------------------------------------------------------------------------------------------------------

    Log Name:      Operations Manager
    Source:        HealthService
    Date:          6/25/2019 2:45:01 PM
    Event ID:      2115
    Task Category: None
    Level:         Warning
    Keywords:      Classic
    User:          N/A
    Computer:      safasdf
    Description:
    A Bind Data Source in Management Group safdasf has posted items to the workflow, but has not received a response in 1500 seconds.  This indicates a performance or functional problem with the workflow.
     Workflow Id : Microsoft.SystemCenter.CollectPublishedEntityState
     Instance    : sfasfdsa 
     Instance Id : {D3B679C2-390B-69A3-C8DA-24EE57C3CD0F}

    ----------------------------------------------------------------------------------------------------------------------------------------
    Source:        DataAccessLayer
    Date:          8/28/2019 4:56:05 PM
    Event ID:      33333
    Task Category: None
    Level:         Warning
    Description:

    Data Access Layer rejected retry on SqlError:
     Request: SqlConnection.Open
     Class: 14
     Number: 983
     Message: Unable to access availability database 'OperationsManager' because the database replica is not in the PRIMARY or SECONDARY role. Connections to an availability database is permitted only when the database replica is in the PRIMARY or SECONDARY role. Try the operation again later.


    ------------------------------------------------------------------------------------------------------------------------

    Data was dropped due to too much outstanding data in rule "Microsoft.SystemCenter.Agent.MaintenanceMode" running for instance asfdasfd with id:"{220AFA74-0E0C-EB55-EB3C-04B9D5A7BD23}" in management group sdfsf


    Log Name:      Operations Manager
    Source:        HealthService
    Date:          8/28/2019 4:56:13 PM
    Event ID:      4506
    Task Category: None
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      sdfasfd
    Description:
    Data was dropped due to too much outstanding data in rule "Microsoft.SystemCenter.Agent.MaintenanceMode" running for instance sdfsafd with id:"{220AFA74-0E0C-EB55-EB3C-04B9D5A7BD23}" in management group sdfsaf 


    ----------------------------------------------------------------------------------------------------------------------


    Log Name:      Operations Manager
    Source:        Health Service Modules
    Date:          8/28/2019 5:36:45 PM
    Event ID:      31551
    Task Category: Data Warehouse
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      sfdssdfas
    Description:
    Failed to store data in the Data Warehouse. The operation will be retried.
    Exception 'InvalidOperationException': The given value of type String from the data source cannot be converted to type nvarchar of the specified target column. 

    One or more workflows were affected by this.  

    Workflow name: Microsoft.Exchange.15.MailboxStatsSubscription.Rule 
    Instance name: sdfasfd
    Instance ID: {1EE4544E-32BC-65BB-D4A1-E7525C61C10C} 
    Management group: sdfsafd


    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------


    Log Name:      Operations Manager
    Source:        HealthService
    Date:          9/10/2019 10:01:59 AM
    Event ID:      1202
    Task Category: Health Service
    Level:         Warning
    Keywords:      Classic
    User:          N/A
    Computer:      asdfasdf
    Description:
    New Management Pack with id:"Microsoft.Solaris.10", version:"7.6.1076.0" conflicts with cached Management Pack. Condition indicates wrong server configuration.

    Thanks, Shiva ravichandran.


    Friday, September 13, 2019 8:43 PM
  • I've seen some of these events before, and in those cases they were caused by performance issues, although something else may be behind them here.

    I recommend checking the state and health of the SQL Server AlwaysOn Availability Group that hosts the SCOM operational database and data warehouse, and checking the SQL Server logs as well.



    Friday, September 13, 2019 8:54 PM
  • Hi Shiva,

    Again, I agree with Leon. I am fairly sure the issue is with your databases, so you need to start looking there.

    As Leon suggested, make sure your SQL AlwaysOn Availability Group is healthy and that you can reach your OperationsManager database on it.

    Because I noticed that there are also warnings with event ID 2115 and errors with event ID 4506, I would definitely recommend following the steps in this article:

    How to troubleshoot Event ID 2115-related performance problems in Operations Manager

    Please check your SQL AG, follow all steps from the article and then get back to us.
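
    While working through the 2115 events, it can help to pull out which workflow is stalled and for how long. The sketch below does that for pasted event descriptions. The helper names are made up, and `likely_target` is only a naming heuristic (workflow IDs containing "DataWarehouse" write to the data warehouse, the others to the operational database), so treat its answer as a hint rather than a diagnosis.

```python
import re

# Matches the wait time and workflow id in a 2115-style event description.
EVENT_2115 = re.compile(
    r"not received a response in (?P<seconds>\d+) seconds.*?"
    r"Workflow Id\s*:\s*(?P<workflow>\S+)",
    re.DOTALL,
)

# Description text taken from the 2115 event posted earlier in this thread.
SAMPLE = """
A Bind Data Source in Management Group safdasf has posted items to the workflow,
but has not received a response in 1500 seconds.  This indicates a performance
or functional problem with the workflow.
 Workflow Id : Microsoft.SystemCenter.CollectPublishedEntityState
"""


def stalled_workflows(log_text):
    """Return {workflow id: longest reported wait in seconds}."""
    worst = {}
    for match in EVENT_2115.finditer(log_text):
        seconds = int(match.group("seconds"))
        workflow = match.group("workflow")
        worst[workflow] = max(seconds, worst.get(workflow, 0))
    return worst


def likely_target(workflow_id):
    """Naming heuristic only: guess which database the workflow writes to."""
    if "DataWarehouse" in workflow_id:
        return "data warehouse"
    return "operational database"


if __name__ == "__main__":
    for workflow, seconds in stalled_workflows(SAMPLE).items():
        print(f"{workflow}: {seconds}s -> {likely_target(workflow)}")
```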

    Thanks,



    Friday, September 13, 2019 9:51 PM
    Moderator