agents in trusted child domain showing as not monitored

  • Question

  • Hi All,

    I have gone around the houses with this, including to Microsoft, and nobody has been able to help so far. I am hoping someone here will at least be able to point me in the right direction.
    My scenario:
    I have 4 management servers in the parent domain, let's say:

    SCOM1.parent.domain.local
    SCOM2.parent.domain.local
    SCOM3.parent.domain.local
    SCOM4.parent.domain.local

    I have a few servers I need to manage in different child domains, let's say:
    SCOMAgent.Child.Parent.domain.local

    There is a full two-way transitive trust between the two domains, hence I am NOT using certs.

    I have run Wireshark and confirmed that the network isn't the problem (however, I must admit networking isn't my strong point, so I will gladly follow any pointers or re-confirm anything).

    I have used portqry / telnet and confirmed that I am able to connect to the following ports:

    From agent to MS: telnet 5723, 135, 139, 445
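For anyone who wants to repeat this check without telnet, here is a minimal Python sketch of the same idea. The function name `check_tcp_ports` is made up for illustration, and the host name is just the example from this post. Note that it only shows whether a TCP connection to each listed port succeeds; it says nothing about the dynamic ports RPC may negotiate after talking to port 135.

```python
import socket


def check_tcp_ports(host, ports, timeout=3.0):
    """Return {port: True/False} depending on whether a TCP connect succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection resolves the name and completes the TCP handshake
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # covers refused connections, timeouts and name resolution failures
            results[port] = False
    return results


# Example call (host name is the placeholder used in this thread):
# check_tcp_ports("SCOM1.parent.domain.local", [5723, 135, 139, 445])
```

Run it from the agent towards the management server, and again from the management server towards the agent if you push-install.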

    As the SCOM Run As account I am using a domain account (parent\scom_action), and I have manually added this user to the local Administrators group of the agent server, so we can rule out any permissions issue.

    In SCOM my security settings (agent onboarding) for agents are "Review first", and yes, I normally check that the agent hasn't appeared in "Pending Management".

    The parent\scom_action account is a local admin of the machine I am deploying the agent to.

    My agentpendingaction table is empty, so there are no agents blocking any new actions.

    I am able to deploy successfully to other servers in the same parent.domain.local domain; I am just having trouble with agents in child.parent.domain.local.

    The errors I am getting are:
    On the agent:
    The OpsMgr Connector connected to SCOM1.Parent.local, but the connection was closed immediately after authentication occurred.  The most likely cause of this error is that the agent is not authorized to communicate with the server, or the server has not received configuration.  Check the event log on the server for the presence of 20000 events, indicating that agents which are not approved are attempting to connect.
    This is followed by:
    The Health Service cannot verify the future validity of the RunAs account Parent\scom_action for management group SCOM_GROUP due to an error retrieving information from Active Directory (for Domain Accounts) or the local security authority (for Local Accounts).  The error is The RPC server is unavailable.(0x800706BA).
    And followed by:
    The Health Service was unable to validate any user accounts in management group SCOM_GROUP.
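As a side note on the 0x800706BA code in the second event: it is a standard HRESULT, and splitting it into its parts shows it wraps Win32 error 1722 (RPC_S_SERVER_UNAVAILABLE), which matches the "RPC server is unavailable" text. A minimal Python sketch of that arithmetic (the helper name `decode_hresult` is made up for illustration):

```python
FACILITY_WIN32 = 7  # facility value used when a Win32 error is wrapped in an HRESULT


def decode_hresult(hr):
    """Split a 32-bit HRESULT into its failure flag, facility and error code."""
    hr &= 0xFFFFFFFF
    return {
        "failure": bool(hr >> 31),       # top bit set means failure
        "facility": (hr >> 16) & 0x7FF,  # 11-bit facility field
        "code": hr & 0xFFFF,             # low 16 bits: the underlying error code
    }


info = decode_hresult(0x800706BA)
print(info)  # {'failure': True, 'facility': 7, 'code': 1722}
```

So the agent is failing on an RPC call while validating the account, rather than on the 5723 connection to the management server itself.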

    The only error I see on the management server is: "SCOMAgent.child.parent.domain.local is not heartbeating"

    I am not sure what to do next, to be honest, and any pointers would be gratefully accepted.

    Thursday, December 6, 2012 11:21 AM

All replies

  • Hi

    Have you tried configuring the agents in the sub-domain to use Local System as the agent action account? This seems to be the issue:

    The Health Service cannot verify the future validity of the RunAs account Parent\scom_action for management group SCOM_GROUP due to an error retrieving information from Active Directory (for Domain Accounts) or the local security authority (for Local Accounts).  The error is The RPC server is unavailable.(0x800706BA).

    Cheers

    Graham


    Regards Graham New System Center 2012 Blog! - http://www.systemcentersolutions.co.uk
    View OpsMgr tips and tricks at http://systemcentersolutions.wordpress.com/

    Thursday, December 6, 2012 11:31 AM
  • Hi Graham,

    Yes I have, but just to be sure, this is the process I am going through:

    Discovery - find agent on child domain - deploy

    Use childdomain\administrator to find servers

    Use Local System

    Deployment successful

    Agent still showing in console as not monitored (same errors as first post)

    This is what you mean, right?

    Thursday, December 6, 2012 11:45 AM
  • Yep.

    Can the servers in the child domain reach the domain controllers in the parent domain? I think it is ports 88 and 389 for Kerberos / LDAP that would need to be open.

    http://msdn.microsoft.com/en-us/library/ms960403(v=cs.70).aspx

    It sounds like a similar issue to this one. Could you confirm with Wireshark that you see the same traffic as they saw?

    http://amradmin.wordpress.com/2011/09/13/issue-with-the-scom-agent-authentication-against-the-scom-management-server-if-you-have-multi-domain-environment/

    Cheers

    Graham


    Thursday, December 6, 2012 11:56 AM
  • OK, I have confirmed that.

    From the agent I can telnet to a DC in the parent domain on ports 88 and 389. This rules out any firewall rules, right?

    Also, I have run Wireshark and I can see:

    An LDAP request and a response of success

    Various Kerberos packets

    No errors that jump out at me...

    Thursday, December 6, 2012 1:01 PM
  • I wonder if you are seeing this issue:

    http://nocentdocent.wordpress.com/2012/10/26/opsmgr-2012-agents-across-slow-wan-links-are-unable-to-communicate/

    Cheers

    Graham


    • Marked as answer by Humjill Thursday, December 20, 2012 1:43 PM
    Thursday, December 6, 2012 1:07 PM
  • I saw a similar problem. All the necessary ports were open, including Kerberos and LDAP, so it is highly probable that this is the issue Graham linked, though it could also be some network configuration problem.
    Thursday, December 6, 2012 9:42 PM
  • Thanks both. The script from the link you sent isn't working for me at the moment. I will let you know what the results are, but from initial pings it doesn't look like it. Alex, you said you have seen a similar problem; do you mind if I ask what your problem was in the end and how you resolved it?
    Thursday, December 6, 2012 9:46 PM
  • As I said, every required port was open, DNS names resolved (forward and reverse zones), SPNs were registered, and the management servers were configured to accept new agents automatically. I could RDP to those servers with no issues, but it took several minutes to log in. So everything was allowed, but the links were very slow or there was some problem with the network hardware. I saw event ID 20071 and then 20070 on the agent side, and 20002 on the MS. Restarting the SCOM services on the MS and the agents did not help. There were about 20 agents at the remote site, all with the same symptom. Nothing was done on either side, but one week later everything worked without anyone's involvement (I had not even restarted the SCOM services). So it seems to have been a network issue (there was a lot of network hardware between the sites). You need a deeper network investigation, and you should also investigate the DCs in both domains. I have never seen a similar situation over a good communication channel.
    Friday, December 7, 2012 9:08 AM
  • Update:

    Another weird issue: I had installed 3 agents in the child domain, 2 on test 2008 R2 servers and 1 on the child domain DC (a 2003 server). This morning when I checked the management console, one of the 2008 agents in the child domain had reported in and was in a 'green' healthy state (after spending over a week in a grey state, and without me doing anything). Then about an hour ago the same agent went back into a grey state, and the other 2008 agent is now showing as 'healthy'... What on earth is going on?

    Monday, December 10, 2012 2:39 PM
  • Hi

    It seems that you are also a victim of the issue Graham sent you the link for: http://nocentdocent.wordpress.com/2012/10/26/opsmgr-2012-agents-across-slow-wan-links-are-unable-to-communicate/ . I have also posted some additional information on my blog and extended the script; it may help you out. Check here: http://blog.scomfaq.ch/2012/12/09/scom-2012-event-id-20070-agent-across-slow-wan-links/

    Stefan


    Blog: http://blog.scomfaq.ch

    Monday, December 10, 2012 8:02 PM
  • Thanks Stefan and everyone else. The script tells me the agent lookup takes on average 700 milliseconds to finish. In your experience, Stefan, what is an acceptable response time for communication to work well?
    Tuesday, December 11, 2012 9:34 AM
  • I haven't personally seen this but Daniele Grandini in his article states:

    - The ICMP latency between the Management Server and the Domain Controller is above 150 msec (this is no fixed rule)

    - This specific issue manifests itself when the Active Directory lookup takes more than 1000 msec (more or less);
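If you want to collect your own timings to hold against that rough 1000 msec figure, a minimal Python sketch could look like the one below. It is only an illustration: `sample_latency_ms` and `slow_samples` are made-up names, `operation` stands for whatever lookup you are able to run (it is not the script from the article), and the limit is the approximate value quoted above, not a documented threshold.

```python
import time

LOOKUP_LIMIT_MS = 1000.0  # approximate figure from the article, not a hard rule


def sample_latency_ms(operation, repeats=5, pause_s=0.0):
    """Call operation() several times and return each duration in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()  # hypothetical: your own directory lookup goes here
        samples.append((time.perf_counter() - start) * 1000.0)
        if pause_s:
            time.sleep(pause_s)  # optional gap between measurements
    return samples


def slow_samples(samples, limit_ms=LOOKUP_LIMIT_MS):
    """Return only the samples that took longer than limit_ms."""
    return [s for s in samples if s > limit_ms]
```

Taking several samples spread over time is the point here, since a single fast or slow reading tells you very little on a WAN link.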

    If the agents have been green (healthy) intermittently, then it suggests that it is not a configuration issue with the agent as such, but some sort of communication / authentication issue. That suggests you are seeing the same issue. At present there isn't a straightforward workaround, as you can't force the agents to use certificates to get around the problem.

    The notes at the bottom of Daniele's article are not promising - http://nocentdocent.wordpress.com/2012/10/26/opsmgr-2012-agents-across-slow-wan-links-are-unable-to-communicate/


    Tuesday, December 11, 2012 8:00 PM
  • Hi Humjill

    Well, I have some bad news. First, if I compare my values and calculate the arithmetic mean for the time the agent was green and the time the agent was grey in one problematic domain, no specific value stands out as an indicator. In my case the specific "agent-simulated" query took about 505 ms on average. I had very high values (up to 994 ms) but also very low values (as low as 2 ms).

    If I calculate the arithmetic mean for a non-problematic domain, I end up with about 114 ms...
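To illustrate why the mean alone is not a good indicator here, this is a small Python sketch that reports the spread alongside it. The four sample values are made up; they are only chosen to fall in the 2 ms to 994 ms range mentioned above and to average about 505 ms. `summarize_ms` is a hypothetical helper name.

```python
import statistics


def summarize_ms(samples):
    """Return mean, median, min and max of latency samples given in milliseconds."""
    return {
        "mean": statistics.fmean(samples),
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }


# Made-up measurements for illustration only, not real data from this thread.
example = [2.0, 994.0, 510.0, 514.0]
print(summarize_ms(example))
# {'mean': 505.0, 'median': 512.0, 'min': 2.0, 'max': 994.0}
```

A mean of about 505 ms looks harmless, but the same data includes a lookup of nearly a second, which is the kind of value the article associates with the problem.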

    The second piece of bad news is that I had Microsoft support on the phone today and they don't have any solution for it. The support staff promised to let me know if there turns out to be a solution other than installing a domain controller in the same site as the management server.

    So what is the solution? Installing a domain controller from the problematic domain in the same site where the management server lives.

    I hope this helps...

    Stefan


    Tuesday, December 11, 2012 10:09 PM
  • Thanks Stefan,

    I have run the query on a few problematic domains (all across WAN links). I will post the results here, and hopefully I might be able to proceed with the 'workaround'. Do you know if this workaround sorts out the problem 100%, i.e. the problem hasn't come back since you installed a DC in the same site where the MS lives?

    Wednesday, December 12, 2012 8:56 AM
  • OK, so I have tested across 4 child domains, and the mean value I get back across all of them is around 500 ms. As with Stefan, sometimes I would get back a very low value of 1 or 2 ms, and sometimes as high as 900 ms. All this time the agent stayed 'grey'. Frustrating...
    Wednesday, December 12, 2012 9:57 AM
  • Hi Humjill

    Yep, it is frustrating...

    I am sure that if you install a domain controller from the problematic domain in the same site where your management server lives, the problem will be gone. I will implement this solution for the customer this year or maybe next year.

    I think we can choose this as a solution...

    Cheers,

    Stefan


    • Marked as answer by Humjill Thursday, December 13, 2012 9:54 AM
    Wednesday, December 12, 2012 5:22 PM
  • I don't think I am going to be able to deploy this at my client site for now, so I will mark the case as answered for the time being.
    Thursday, December 13, 2012 9:54 AM
  • Hi Stefan,

    Thanks for helping us out ("us" being Humjill too).  I wondered if you had an opportunity to test the suggested fix of adding a "local" DC?  Many thanks. Hutton. 

    Tuesday, January 8, 2013 9:38 AM
  • Hi Hutton

    The domain controller is implemented, but I haven't had a chance to check... hold on :)

    Stefan


    Tuesday, January 8, 2013 9:42 AM
  • ...I am back... Yep, the fix now works at the customer. I can confirm it...

    Tuesday, January 8, 2013 10:30 AM
  • Hi Stefan,

    That's great news. Thanks for getting back to us.


    Tuesday, January 8, 2013 9:07 PM
  • Hi
    I had the same case with greyed-out agents.
    I opened port TCP 389 from MP to DCs and the problem was resolved.

    Hope this helps someone
    http://silentcrash.com/
    Wednesday, May 14, 2014 6:36 AM