All manually installed agents suddenly disappeared from "Agent Managed" - SCOM 1807

  • Question

  • Hi!

    We have a near-catastrophe in our SCOM environment 8(

    All the manually installed agents for non-domain Windows servers (which had been installed successfully) suddenly disappeared from the "Administration - Agent Managed" pane. But they are still present in "Windows Computer", all in a healthy state! There are no alerts at all; I only noticed when I tried to get performance data from some of them. I can't start any task on them from the console, despite their visibly healthy status.

    Those agents had been working for over half a year with no issues. All certificates installed on them are valid until 2021. All of them were connected to the same SCOM server.

    Exploring the Operations Manager logs on both sides didn't make things clear. On the SCOM server side, the log is full of messages saying that servers which aren't part of the management group tried to establish a connection. On the agent side, the logs have a similar bunch of messages saying that the Health Service tried to establish a connection with the main SCOM server, then with the failover ones, but those connections were immediately closed (code 20000, as far as I remember). It looks as if they have no certificates to connect with, but those certificates are in place on all SCOM servers as well as on all agents. Additionally, the healthy status of all those servers in "Windows Computers" combined with their complete absence from "Agent Managed" leaves me confused.

    All the other agents, which run on domain servers, work normally.

    I tried to:

    - flush and restart the Health Service on some of those non-domain servers in the way we all know (stop the agent service, delete the Health Service State folder, start the service). No luck at all.

    - flush and restart the Health Service on the SCOM server using the tools provided in the Operations Manager pane. Same result.

    It looks as if certificate authentication doesn't work at all. But in the past, when that kind of issue happened, the server objects immediately turned grey and a critical alert was raised. None of that here: the objects look healthy, and there are no alerts.

    I need some help to resolve this. What could be the root of the problem? Any help would be appreciated.

    Thanks in advance.

    Saturday, October 19, 2019 8:19 AM

Answers

  • All the objects were broken by a forgotten task in Task Manager on one of the SCOM servers. This task simply removed all instances of the Microsoft.Windows.Computer class whose domain name property was not equal to our AD domain name. It was a very special thing made a long time ago to prepare some workflows, and it was run by mistake.

    So all of the servers turned into orphaned objects.

    The fix was very simple, following this article:

    https://kevinholman.com/2018/05/03/deleting-and-purging-data-from-the-scom-database/

    After purging, all server objects appeared in the Pending Management pane, were then approved, and returned to a healthy state as before.

    MY THANKS TO ALL FOR THE SUPPORT!!!
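
The selection rule described above can be sketched roughly as follows. This is only an illustration, not the actual task: the inventory, the names `AD_DOMAIN` and `selected_for_removal`, and the case-insensitive comparison are all assumptions. It also assumes that, for non-domain machines, the domain property holds the workgroup name, as the observation in the replies about the one surviving DMZ server suggests.

```python
# Hypothetical inventory: all names and domain values are made up for illustration.
AD_DOMAIN = "corp.example"

computers = [
    {"name": "app-01", "domain": "corp.example"},  # domain-joined server
    {"name": "dmz-01", "domain": "WORKGROUP"},     # default workgroup
    {"name": "dmz-02", "domain": "WORKGROUP"},
    {"name": "dmz-03", "domain": "corp.example"},  # workgroup named like the AD domain
]


def selected_for_removal(items, ad_domain):
    """Return the names whose domain property differs from ad_domain.

    The comparison is case-insensitive, which is an assumption of this sketch.
    """
    return [c["name"] for c in items if c["domain"].lower() != ad_domain.lower()]


removed = selected_for_removal(computers, AD_DOMAIN)
print(removed)
```

Under these assumptions, only the machines in the default "WORKGROUP" are selected, while a non-domain machine whose workgroup happens to carry the AD domain name is left alone.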


    Tuesday, October 22, 2019 7:10 PM

All replies

  • Hi Andrew,

    Sounds like a bizarre issue. We need to dig deeper into the Operations Manager event log on both the SCOM management server and the DMZ Windows agents to better understand it.

    Please post the different event errors and their IDs.

    Do you recall any changes made lately to your SCOM environment?

    Could you check that the SCOM database and SCOM management group are healthy?

    If you run the SQL query below against the SCOM database, what is the "IsDeleted" value for your non-domain Windows servers?

    SQL query:

    SELECT
        [FullName],
        [DisplayName],
        [IsDeleted]
    FROM dbo.[BaseManagedEntity]
    WHERE [FullName] LIKE '%Windows.Computer%'


    I just wanted to check if your non-domain Windows servers got deleted by mistake from the Operations Console.
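
To make it clearer what the result of such a check looks like, here is a small sketch that runs a narrowed version of the query against an in-memory SQLite table. It is a mock only: the table is a three-column stand-in for the real SCOM table, the rows are invented, and the `dbo.` schema prefix is dropped because SQLite does not have it. The extra `IsDeleted = 1` filter is an addition for illustration.

```python
import sqlite3

# In-memory mock of the three columns used by the query (not the real SCOM schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE BaseManagedEntity (FullName TEXT, DisplayName TEXT, IsDeleted INTEGER)"
)

# Invented sample rows for illustration.
rows = [
    ("Microsoft.Windows.Computer:app-01.corp.example", "app-01.corp.example", 0),
    ("Microsoft.Windows.Computer:dmz-01", "dmz-01", 1),
    ("Microsoft.SQLServer.Database:db1", "db1", 0),
]
conn.executemany("INSERT INTO BaseManagedEntity VALUES (?, ?, ?)", rows)

# Same LIKE filter as the query above, narrowed to entities flagged as deleted.
deleted = conn.execute(
    "SELECT DisplayName FROM BaseManagedEntity "
    "WHERE FullName LIKE '%Windows.Computer%' AND IsDeleted = 1 "
    "ORDER BY DisplayName"
).fetchall()
print(deleted)
```

An empty result from the real database would mean that no Windows computer objects are currently flagged as deleted.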

    Best regards,
    Leon


    Blog: https://thesystemcenterblog.com LinkedIn:

    Saturday, October 19, 2019 9:22 AM
  • Leon, thanks a lot!

    I will send you all the data on Monday.

    "Do you recall any changes made lately to your SCOM environment?"

    Not at all. Actually, we haven't made any significant changes.

    Also, I doubt that those servers were deleted. In that case, as I see it, all their objects would be totally gone and I wouldn't see them in the "Windows Computer" pane in a HEALTHY state either.


    Saturday, October 19, 2019 9:46 AM
  • About how many agents are we talking about?

    Sometimes it may take some time before the agents are gone from the Monitoring pane, but I’m having a hard time believing this is the case.

    There is also the known phenomenon of "orphaned objects" (read more about it HERE); this can happen if you delete an agent but the computer object still stays in the Monitoring pane.



    Saturday, October 19, 2019 9:54 AM
  • "About how many agents are we talking about?"

    It happened to 20+ manually installed agents (25 or 26, to be precise). Literally all the manually installed Windows agents disappeared.

    "Sometimes it may take some time before the agents are gone from the Monitoring pane, but I’m having a hard time believing this is the case."

    I haven't seen such a strange issue since I set those agents up.

    "this can happen if you delete an agent but the computer object still stays in the Monitoring pane."

    Our team never intended to delete those agents. Not at all: they run at very critical DMZ points 8( so I don't think orphaned objects are the case here.


    Saturday, October 19, 2019 10:16 AM
  • I've already obtained some info from the logs.

    Here is an event from the SCOM server:

    -------------------------------------------------------------

    Log Name:      Operations Manager
    Source:        OpsMgr Connector
    Date:          19.10.2019 13:34:37
    Event ID:      20000
    Task Category: None
    Level:         Information
    Keywords:      Classic
    User:          N/A
    Computer:      sc-om-ms-03
    Description:
    A device which is not part of this management group has attempted to access this Health Service. 
    Requesting Device Name : server-05
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="OpsMgr Connector" />
        <EventID Qualifiers="16384">20000</EventID>
        <Level>4</Level>
        <Task>0</Task>
        <Keywords>0x80000000000000</Keywords>
        <TimeCreated SystemTime="2019-10-19T10:34:37.443661600Z" />
        <EventRecordID>11122583</EventRecordID>
        <Channel>Operations Manager</Channel>
        <Computer>sc-om-ms-03</Computer>
        <Security />
      </System>
      <EventData>
        <Data>server-05</Data>
      </EventData>
    </Event>

    ----------------------------------------------------------------------------------------------------

    Here's part of server-05's event log:

    -----------------------------------------------------------------------------------------------------

    Log Name:      Operations Manager
    Source:        OpsMgr Connector
    Date:          19.10.2019 13:34:58
    Event ID:      20070
    Task Category: None
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      server-05
    Description:
    The OpsMgr Connector connected to sc-om-ms-03, but the connection was closed immediately after authentication occurred.  The most likely cause of this error is that the agent is not authorized to communicate with the server, or the server has not received configuration.  Check the event log on the server for the presence of 20000 events, indicating that agents which are not approved are attempting to connect.
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="OpsMgr Connector" />
        <EventID Qualifiers="49152">20070</EventID>
        <Level>2</Level>
        <Task>0</Task>
        <Keywords>0x80000000000000</Keywords>
        <TimeCreated SystemTime="2019-10-19T10:34:58.529606500Z" />
        <EventRecordID>511333</EventRecordID>
        <Channel>Operations Manager</Channel>
        <Computer>server-05</Computer>
        <Security />
      </System>
      <EventData>
        <Data>sc-om-ms-03</Data>
      </EventData>
    </Event>

    -----------------------------------------------------------------------------------------------------------
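
When many such events need to be compared, the exported Event Xml can be summarized with a few lines of Python. The following is a minimal sketch using only the standard library; the function name `summarize_event` is hypothetical, and the XML is a trimmed copy of the server-side event quoted above.

```python
import xml.etree.ElementTree as ET

# Namespace used by the exported Windows event XML.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copy of the server-side event from the thread.
EVENT_XML = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="OpsMgr Connector" />
    <EventID Qualifiers="16384">20000</EventID>
    <Computer>sc-om-ms-03</Computer>
  </System>
  <EventData>
    <Data>server-05</Data>
  </EventData>
</Event>"""


def summarize_event(xml_text):
    """Return (event_id, reporting_computer, data_values) for one exported event."""
    root = ET.fromstring(xml_text)
    event_id = int(root.findtext("e:System/e:EventID", namespaces=NS))
    computer = root.findtext("e:System/e:Computer", namespaces=NS)
    data = [d.text for d in root.findall("e:EventData/e:Data", NS)]
    return event_id, computer, data


print(summarize_event(EVENT_XML))
```

For the event above, the data value is the requesting device (server-05) and the reporting computer is the management server, which matches the 20070 error logged on the agent side.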

    It looks like there are issues with certificate authentication. But all the certificates are in place on both sides; we checked them twice. They are totally valid.

    We also checked the connection on port 5723 between the two servers. It is fine.
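
A port check of this kind can be sketched as below. This is not how the check in the thread was done (the tool used is not stated); `can_connect` is a hypothetical helper, and the 3-second timeout is an arbitrary choice.

```python
import socket


def can_connect(host, port=5723, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Run on the agent, `can_connect("sc-om-ms-03")` would test the path to the management server from the thread. A successful TCP connection only shows that the port is reachable; it says nothing about whether certificate authentication or agent approval will succeed afterwards.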

    But we've also got another POSSIBLE clue.

    In server-05's Operations Manager agent log we noted the time when the trouble started 8)

    Right at that time, an upgrade of the corporate firewall began; it COULD be the reason.

    Some old firewall rules might have been deleted. I am going to ask about it.

    I suppose there are now no rules allowing connections between the AD certification center and the DMZ servers (server-05 included). Server-05 couldn't be checked by the AD cert center, so the connection couldn't be established.

    This is only my guess, which I will check on Monday.




    Saturday, October 19, 2019 3:34 PM
  • This is why I asked whether any changes had been made recently. A firewall upgrade is a significant change and can definitely have an impact.

    It shouldn’t make the agents disappear from the Operations Console however...

    Check the connectivity for the non-domain Windows servers and report back once done.



    Saturday, October 19, 2019 3:46 PM
  • Leon, I have an additional question for you.

    Is it mandatory for a non-domain server with the SCOM agent installed to have an open connection to an AD certification center?

    Saturday, October 19, 2019 4:19 PM
  • There's no need to have an open connection to the AD certification (PKI) server.



    Saturday, October 19, 2019 8:21 PM
  • Hi Andrew,

     

    The most likely cause of this issue is a firewall rule. A Network Monitor capture may be a helpful tool to diagnose it.

     

    Hope this helps.

     

    Best regards.

    Crystal


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Monday, October 21, 2019 1:53 AM
  • So here is the news.

    "Could you check that the SCOM database and SCOM management group are healthy?"

    Our SCOM database and our SCOM management group are both healthy.

    The SQL query result (run against the database) shows that there are no servers in a deleted state.

    We are now working together with the network admins to find the root cause.


    Monday, October 21, 2019 7:17 AM
  • Thanks for the information, keep us updated!


    Monday, October 21, 2019 7:22 AM
  • Sure, we are now working with our network admins to find the cause.
    Monday, October 21, 2019 7:23 AM
  • Another tricky thing:

    When I put a certificate into the certificate store on a non-domain server, should the certificate template name be shown as a readable phrase ("OpsMgr Template" in my case)? I see the name as a set of digits delimited by dots ("11.2.3.8.9..", that kind of thing).

      

    Monday, October 21, 2019 10:36 AM
  • I'm no PKI expert, but I don't think that will cause any issues here.


    Monday, October 21, 2019 6:50 PM
  • Thank you, Leon, I think the same.

    Our network admins have opened a case with the vendor to find the right solution. They consider the traffic captured between the agents and the SCOM servers very suspicious... but that is not the only possibility.

    Another tricky thing has been found. There is exactly one DMZ server where the agent is working!!!!!

    The one and only thing that distinguishes it from the others: its workgroup has the same name as the AD domain 8( The others simply have "WORKGROUP" as the group name.

    Also, there are about 10 servers which were removed from the SCOM environment, but they are on again and their agents are asking for approval. I think that might be a reason too.

    And last but not least, people from Head Office put the SCOM agent install into an SCCM deployment a few days ago and tried to deploy it... 8( I am going crazy with this; every item on the list might be a SERIOUS cause! So I will just keep digging.





    Monday, October 21, 2019 8:11 PM
  • Looks like there are many possible causes here. What's important now is to do only one thing at a time; this way it's easier to identify the root cause of the issue.

    Write down the things you do, otherwise you may forget which one was the solution.



    Monday, October 21, 2019 9:08 PM
  • Hi Andrew,

     

    Thanks for sharing, and I am glad that the root cause of the issue has been found. Congratulations! If there’s anything we can help with in the future, feel free to post in our forum. We can discuss it together.

     

    Have a nice day!

     

    Best regards.

    Crystal



    Wednesday, October 23, 2019 6:17 AM