Active Directory MP: no alert for events 1566, 1311, 2904, 1865 & 467?

  • Question

  • Greetings,

    We have a Win2K8 Active Directory with Win2K8R2 servers: 3 DCs and 12 remote RODCs.

    One of the RODCs was completely out of service (AD-wise) for about one month (we spotted this by chance), but SCOM didn't report anything.

    The server was indeed up, the SCOM agent was working fine, the network was OK, and files were accessible.

    But the whole AD database on this server was corrupted. Users at the branch office where the RODC is located were redirected to other DCs, so this was not business-critical. Still, I find it very strange that SCOM (or rather the AD MP) didn't report anything, either through computer states or through the Alerts view.

    Thousands of the following events showed up in the Custom Views -> Server Roles -> Active Directory Domain Services log on this server:

    Event 1566 (warning)
    All directory servers in the following site that can replicate the directory partition over this transport are currently unavailable.
    Site: [our central site]

    Event 1311 (error)
    The Knowledge Consistency Checker (KCC) has detected problems with the following directory partition.
    Directory partition:
    CN=Configuration,DC=[our domain]
    There is insufficient site connectivity information for the KCC to create a spanning tree replication topology. Or, one or more directory servers with this directory partition are unable to replicate the directory partition information. This is probably due to inaccessible directory servers.
    (...)

    Event 2904 (error)
    This event documents additional REPAIR PROCEDURES to resolve the NTDS KCC Event 1311 on a read-only Active Directory Domain Controller.
    Local Site:
    CN=[branch office site],CN=Sites,CN=Configuration,DC=[our domain]
    User Action:
    (...)

    Event 1865 (warning)
    The Knowledge Consistency Checker (KCC) was unable to form a complete spanning tree network topology. As a result, the following list of sites cannot be reached from the local site.
    Sites:
    [our central site]

    Event 467 (error)
    NTDS (524) NTDSA: Database C:\Windows\NTDS\ntds.dit: Index backlink_index of table link_table is corrupted (0).

    Once we had all this info, we managed to fix the problem quite rapidly.

    I understand that SCOM cannot report errors when the agent is unable to contact the management servers, but here we have an apparently healthy SCOM service that is simply NOT reporting what seem to be really serious problems.

    I am the only SCOM admin who writes overrides, and I double-checked: there isn't any mention of those events anywhere in our custom override MPs.

    As I don't have sufficient knowledge to analyse the original AD MP, I am asking the question here:

    Why is the AD MP unable to detect such problems?

    • Edited by Bixessss Wednesday, November 9, 2011 2:48 PM
    Wednesday, November 9, 2011 2:40 PM
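For anyone who wants to check a DC by hand for the five event IDs listed in the question, here is a minimal Python sketch. It only builds an Event Log XPath filter string and tallies already-exported records; the helper names (`build_xpath`, `count_matches`) and the `(event_id, message)` record shape are hypothetical, and nothing here reflects how the AD MP itself is implemented.

```python
from collections import Counter

# Event IDs quoted in the question above.
AD_EVENT_IDS = (1566, 1311, 2904, 1865, 467)


def build_xpath(event_ids):
    """Return an Event Log XPath filter matching any of the given event IDs."""
    clause = " or ".join(f"EventID={event_id}" for event_id in event_ids)
    return f"*[System[({clause})]]"


def count_matches(records, event_ids=AD_EVENT_IDS):
    """Count records per event ID.

    `records` is an iterable of (event_id, message) tuples, a hypothetical
    shape for events that were exported from the log beforehand.
    """
    wanted = set(event_ids)
    return Counter(event_id for event_id, _ in records if event_id in wanted)


if __name__ == "__main__":
    print(build_xpath(AD_EVENT_IDS))
    sample = [(1311, "KCC problem"), (1311, "KCC problem"), (7, "unrelated")]
    print(count_matches(sample))
```

The generated string can be pasted into an Event Viewer custom view (XML filter) or passed to a command-line query tool such as `wevtutil qe` through its `/q:` option. The log channel to query is left out on purpose, since it should be confirmed on the server in question.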

All replies

  • The nature of an RODC does not allow SCOM techniques to be used to check whether replication works or not.

    Try tuning the AD Client Monitoring MP to include the RODC in its synthetic LDAP transactions.

    Wednesday, November 9, 2011 8:42 PM
  • Thanks for your answer Pavel,

    I will try this as soon as time allows.

    Meanwhile, can you explain a little more about the nature of RODCs vs. SCOM techniques?

    I would be glad to pass this on to my colleagues, who are currently frowning on SCOM for not having reported those issues ;-)


    Bix
    Thursday, November 10, 2011 3:43 PM
  • http://social.technet.microsoft.com/Forums/en-US/operationsmanagermgmtpacks/thread/dada5270-6c70-4146-b314-d841176a8b08

    An RODC does not allow the SCOM agent to update records on the DC, as the DC is read-only.

    So there is no replication from the RODC to AD, but there is from AD to the RODC. Sorry, I was not 100% correct: SCOM could have notified you about broken replication from AD to the RODC. As for why it did not... do you have any evidence of a SCOM agent malfunction in the RODC event logs (OperationsManager)?

    And I do remember a note about AD MP deployment, something like "for the MP to work correctly, you need to deploy SCOM agents on each DC in AD". Do you have agents deployed everywhere? Did you set proxy enabled for each agent? (just checking)
    Thursday, November 10, 2011 7:29 PM
  • The RODC might report that other DCs aren't replicating, but the alert will not mention which DC is reporting this. I say "might" for a reason, because I think the AD container/change in question is not necessarily transferred to an RODC, as it doesn't have a complete copy of AD (never tested, so I really don't know :)). In any case, it will never generate an alert that the RODC was "slow" with replication.

    Rob Korving
    http://jama00.wordpress.com/
    • Edited by rob1974 Friday, November 11, 2011 1:44 PM
    Friday, November 11, 2011 1:43 PM
  • In reply to the two questions asked earlier:

    1. Do you have any evidence of SCOM agent malfunction in RODC event logs (OperationsManager)?

    2. Do you have agents deployed everywhere? Did you set proxy enabled for each agent? (just checking)

    1. Actually yes, I found a few lines in that event log on the faulty RODC:

    Event 61 (warning)
    AD Replication Monitoring : The following DCs have not updated their OpsMgrLatencyMonitor objects within the specified time period (24 hours). This is probably caused by either replication not occurring, or because the 'AD Replication Monitoring' script is not running on the DC.

    Funny that the "Domain" value is the FQDN of the faulty RODC.

    There are a few other errors, but I don't think they are related (event IDs 19, 15 and 7).

    2.

    Every single monitored server has an agent up and running, including all DCs/RODCs. Not *all* servers are proxy-enabled, but all RODCs/DCs are. I usually wait for the critical alert saying that proxy enabling is required on a specific server before enabling it. Should I enable it on absolutely all servers nonetheless?


    Bix
    Monday, November 14, 2011 4:57 PM
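The Event 61 text quoted above describes a staleness check: each DC is expected to refresh its OpsMgrLatencyMonitor object, and any DC whose object has not changed within the threshold (24 hours in the message) is flagged. A minimal Python sketch of that idea follows. It is an illustration only, not the AD MP's actual script; the function name `stale_dcs`, the dictionary input and the host names are assumptions made for the example.

```python
from datetime import datetime, timedelta


def stale_dcs(last_update, now, max_age=timedelta(hours=24)):
    """Return the sorted names of DCs whose last update is older than max_age.

    `last_update` maps a DC name to the time its monitor object was last
    refreshed (a hypothetical input; real values would come from AD).
    """
    return sorted(dc for dc, stamp in last_update.items() if now - stamp > max_age)


if __name__ == "__main__":
    now = datetime(2011, 11, 14, 12, 0)
    seen = {
        "dc01.example.local": now - timedelta(hours=2),     # refreshed recently
        "rodc07.example.local": now - timedelta(days=30),   # silent for a month
    }
    print(stale_dcs(seen, now))
```

A check of this kind can only flag a DC if some other monitored server evaluates the timestamps, which is one reason the quoted guidance about deploying agents on every DC matters.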
  • Hi Bixessss,

    We are getting hundreds of the above errors on our domain as well (1566, 1311, 2904, 1865).

    Please advise what steps you took to correct it.

    Thanks

    Monday, February 13, 2012 10:06 AM
  • Hello Ndomingo,

    One of the DCs had one of its NICs deactivated because we had a NIC-teaming problem with them. We ran some Windows Update on the server, which apparently included new drivers for the NIC. This re-activated the NIC, which requested (and received) an address from DHCP (unteamed, that is), and that caused a major mess.

    To solve the problem, we released the wrong IP and deactivated the unwanted NIC.


    Bix

    Monday, February 13, 2012 1:09 PM