Specific Management Server causes Linux heartbeat failures

  • Question

  • We have an environment with 2 management servers and 2 gateway servers, and whenever one of the management servers is added to a Unix/Linux resource pool, any Unix/Linux client server that attempts to connect to that server begins to suffer heartbeat failures.  Here are the steps/information that have been gathered so far:


    • The SCOM environment is SCOM 2016 UR3 on all servers.  The Linux servers are CentOS 6.5 and 7.  Management packs are either 7.6.1064.0 or 7.6.1076.0 depending on whether the web package contained updates for them.
    • Certificates have been exchanged from all SCOM servers - this has been double checked
    • Putting the offending management server into its own resource pool and using it to install the SCOM agent and configure a Linux/Unix server results in the server being configured and discovered, but after about 5 minutes the heartbeat begins to fail.
    • Moving the monitored client server from the situation described above to a resource pool without the offending management server results in the client heartbeating again.  Moving it back to the original resource pool with the broken management server will cause heartbeats to fail again.
    • The problem management server has its firewall turned off and it also has 238 Windows client agents using it as their primary management server without issue.  I can also ping from the Linux/Unix servers to this problem management server without issue.  It has been rebooted recently with no effect.
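    A ping from the Linux side only shows ICMP reachability in one direction. The winrm test suggested in this thread targets the agent on TCP 1270, so it may also be worth confirming that the problem management server can open that port on the Linux host. A minimal Python sketch of such a check, to be run on the management server; the function name is illustrative and not part of SCOM:

```python
import socket


def can_connect(host: str, port: int = 1270, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 1270 is the WS-Management port used in the winrm command in this
    thread; adjust it if your agents listen elsewhere.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

    Calling can_connect("<UNIX/Linux server>") from both the working and the non-working management server gives a quick comparison. A True result only proves the port is open; it does not validate certificates or credentials.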

    I'm at a bit of a loss as to where to go from here.

    Tuesday, July 25, 2017 4:28 PM

All replies

  • Run the below from the problematic Management Server:

    winrm enumerate http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_Agent?__cimnamespace=root/scx -username:<UNIX/Linux user> -password:<UNIX/Linux password> -r:https://<UNIX/Linux server>:1270/wsman -auth:basic -encoding:utf-8

    Replace the username, password and servername as per your environment, and let us know the result.

    Another place to check is the Run As account distribution. Make sure the Unix/Linux Run As account is distributed to the problematic MS as well.

    Thanks!

    Monday, July 31, 2017 2:47 AM
  • Output looks to be the same from both management servers; I tested with the monitoring account.  I can confirm that the monitoring and maintenance accounts are both distributed to the correct resource pools and to the actual servers individually (as part of troubleshooting).

    InstanceID = null
    Caption = SCX Agent meta-information
    Description = Release_Build - 20170411
    ElementName = null
    InstallDate
        Datetime = 2017-07-25T15:44:45Z
    Name = scx
    OperationalStatus = null
    StatusDescriptions = null
    Status = null
    HealthState = null
    CommunicationStatus = null
    DetailedStatus = null
    OperatingStatus = null
    PrimaryStatus = null
    VersionString = 1.6.2-339
    MajorVersion = 1
    MinorVersion = 6
    RevisionNumber = 2
    BuildNumber = 339
    BuildDate = 2017-04-11T00:00:00Z
    Architecture = x64
    OSName = CentOS Linux
    OSType = Linux
    OSVersion = 7.0
    KitVersionString = 1.6.2-339
    Hostname = afnbrkacsbackup.afni.net
    OSAlias = UniversalR
    UnameArchitecture = x86_64
    MinActiveLogSeverityThreshold = INFO
    MachineType = Virtual
    PhysicalProcessors = 1
    LogicalProcessors = 1
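    To check that the output really is identical from both management servers, the two results can be compared property by property rather than by eye. A minimal Python sketch, assuming each winrm result has been saved as plain text in the "Name = Value" layout shown above; the function names are illustrative:

```python
from typing import Dict, Optional, Tuple


def parse_properties(text: str) -> Dict[str, str]:
    """Collect 'Name = Value' lines into a dict, ignoring indentation.

    Lines without ' = ' (such as the bare 'InstallDate' header) are skipped;
    the nested 'Datetime = ...' line is stored under the key 'Datetime'.
    """
    props: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:
            props[key.strip()] = value.strip()
    return props


def diff_properties(
    a: Dict[str, str], b: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {name: (value_in_a, value_in_b)} for every differing property."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}
```

    An empty result from diff_properties means both servers received the same agent data, which would point away from the agent itself and toward something on the management server side.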

    Monday, July 31, 2017 2:03 PM
  • Hi,
    we had a similar problem this week, and it was caused by new certificates. After UR2, a new server gets a SHA256 certificate and not an old SHA1 one. Have you exchanged the new certificates too?

    https://blogs.technet.microsoft.com/momteam/2017/03/01/deprecating-sha1-certificates-in-system-center-operations-manager-for-unixlinux-monitoring/

    Thursday, August 3, 2017 9:55 AM
  • Looking at the servers' Linux certs, both the working one and the non-working one show SHA256 for the "xplat" certs that we exchanged between the management servers, which were generated 7/21/2017.  One thing I do notice: when I run scxcertconfig.exe -list, it shows 2 extra certs that aren't on the other management server or the gateway.

    1: CN=SCX-Certificate, T=SCX94a1f46d-2ced-4739-9b6a-1f06156ca4ac, DC=WorkingServer        not-before: 07/21/2017 21:02:00; not after: 07/21/2027 21:02:00; (No private key container)
    2: CN=SCX-Certificate, T=SCX94a1f46d-2ced-4739-9b6a-1f06156ca4ac, DC=WorkingGateway    not-before: 07/24/2017 20:09:36; not after: 07/24/2027 20:09:36; (No private key container)
    3: CN=SCX-Certificate, T=SCX94a1f46d-2ced-4739-9b6a-1f06156ca4ac, DC=NotWorkingServer        not-before: 07/21/2017 21:00:37; not after: 07/21/2027 21:00:37; Private key container: {8DB6AE2B-23F4-4E28-BC96-4D71B66AC08C}
    1: CN=SCX-Certificate, T=SCX633376D2-E3E2-4f31-8461-D09259ACEF3D, DC=NotWorkingServer        not-before: 04/11/2016 15:40:46; not after: 04/11/2026 15:40:46; (No private key container)
    2: CN=SCX-Certificate, T=SCX633376D2-E3E2-4f31-8461-D09259ACEF3D, DC=WorkingGateway        not-before: 04/11/2016 16:01:53; not after: 04/11/2026 16:01:53; (No private key container)

    I removed the bottom two certs using the -remove command but it also removed the current SCX-Certificate used by the management server.
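    Since removal can take the current certificate with it, it may help to identify the leftover entries from a saved copy of the listing before deleting anything. A minimal Python sketch, assuming the line layout of the scxcertconfig.exe -list output quoted above; it groups entries by their T= value and reports groups in which no entry has a private key container. That rule is only a heuristic matching what is seen in this thread, not documented SCOM behavior, and all names in the code are illustrative:

```python
import re
from collections import defaultdict
from typing import Dict, List, NamedTuple


class CertEntry(NamedTuple):
    title: str            # the T=... value
    dc: str               # the DC=... value (server name)
    not_before: str
    has_private_key: bool


# Pattern assumed from the listing in this thread, not from tool documentation.
LINE_RE = re.compile(
    r"T=(?P<title>[^,]+),\s*DC=(?P<dc>\S+)\s+"
    r"not-before:\s*(?P<not_before>[^;]+);\s*"
    r"not after:\s*(?P<not_after>[^;]+);\s*(?P<key>.*)$"
)


def parse_listing(text: str) -> List[CertEntry]:
    """Parse saved listing text; lines that do not match are ignored."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.search(line.strip())
        if m:
            entries.append(CertEntry(
                title=m["title"].strip(),
                dc=m["dc"],
                not_before=m["not_before"].strip(),
                has_private_key=m["key"].startswith("Private key container"),
            ))
    return entries


def groups_without_private_key(entries: List[CertEntry]) -> Dict[str, List[CertEntry]]:
    """Return T= groups in which no certificate has a private key container."""
    groups: Dict[str, List[CertEntry]] = defaultdict(list)
    for entry in entries:
        groups[entry.title].append(entry)
    return {t: es for t, es in groups.items()
            if not any(e.has_private_key for e in es)}
```

    Run against the five lines above, this flags only the SCX633376D2-E3E2-4f31-8461-D09259ACEF3D group, the two entries dated 2016. Treat the result as a list of candidates to review, not as a removal list.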


    • Edited by JN1226 Thursday, August 3, 2017 5:08 PM Figured out how to remove them
    Thursday, August 3, 2017 5:03 PM
  • Removing those two old certs appears to have fixed the issue.  I connected all the Linux servers to just that SCOM server and have experienced no issues since.  The only problem I have now is that there isn't a private key container for the key I had to re-import for that server.  I'm wondering if one will be auto-generated the next time I need to sign a cert for a Linux server.  There isn't a -generate option.
    • Edited by JN1226 Thursday, August 3, 2017 7:26 PM
    Thursday, August 3, 2017 7:26 PM
  • Hi all.

    I have the same issues, two Management Servers, but only one with the Heartbeat issue.

    I checked the certificates, and both servers are ok.

    Any suggestion or idea? Thanks!

    Friday, September 6, 2019 12:44 PM
  • Hi,

    could you please open a new thread for this, with some additional details about your environment and the related events from the Operations Manager event log on the management servers that are part of the Linux monitoring resource pool?

    Thanks and regards,


    Blog: https://blog.pohn.ch/ Twitter: @StoyanChalakov

    Sunday, September 8, 2019 7:56 PM
    Moderator