DPM 2010 can't talk to agents

  • Question

  • I just installed a fresh DPM 2010 server to do some testing. I was able to deploy the agents to my two clients just fine, and the agents appear to be version 3.0.7696.0.

    They worked fine once installed and then stopped over the weekend. Both servers I'm protecting run Server 2003: one is just backing up some files, and the other is backing up a couple of small SQL Server 2005 databases.

    Over the weekend both servers stopped responding. You could RDP to them, but you would just get a black screen without any response. The local console acted the same way. There was nothing in the event logs (which you could still get to from a remote machine) about what might have happened on either machine. If you tried to open services.msc against either machine remotely, the MMC would just hang. We have rebooted both servers, and both are up and functioning on the network; however, the DPM server now states this for both servers:

    Protection agent version: 3.0.7696.0
    Error: Data Protection Manager Error ID: 318
     The agent operation failed because DPM was unable to identify the computer account for beltmonitor.beltmann.com.
    Detailed error code: The trust relationship between this workstation and the primary domain failed
    Recommended action: Verify that both beltmonitor.beltmann.com and the domain controller are responding. Then in Microsoft Management Console (MMC), open the Group Policy Object Editor snap-in for the local computer and verify the local DNS client settings in Local Computer Policy\Computer Configuration\Administrative Templates\Network\DNS Client.

    DNS appears fine, and my DCs are working just fine. I can ping by name and IP all day long. I've used WBEMTEST to test the remote agent connection over DCOM; it worked just fine last week, but now it won't connect at all. It would appear to be a DCOM issue on both servers that have the agent installed. I have since run all Windows updates hoping that would show some sign of change, but nothing. I'm at a loss. No firewalls or anything like that are in play here. One thing I find interesting is that services.msc is unresponsive if opened on either box with the agent installed.

    Any ideas?

    • Moved by MarcReynolds Tuesday, October 5, 2010 1:46 PM (From:Data Protection Manager)
    Tuesday, October 5, 2010 1:38 PM

All replies

  • Let's test various connectivity between the DPM server and protected server. We'll need to test basic connectivity, SMB, RPC, and WMI/DCOM. We'll need to test in both directions, from DPM server to protected server and from protected server to DPM server.

    The commands below need to be run from an administrative command prompt. It is a good idea to test from both the DPM server and the protected server. The account used should be an administrative account on both servers.

    Basic connectivity is tested by using ping. If ICMP traffic is blocked ping commands will fail but that is OK.
      ping <protected server name>

    Next test SMB (file sharing).
      net view \\<protected server name>

    Now test RPC and connectivity to Service Control Manager (SCM). This displays a list of services on the remote server when successful.
      Sc \\<protected server name> query

    Lastly test WMI/DCOM. When successful this command lists some basic information about the remote server.
      Wmic /node:"<protected server name>" OS list brief

    If any of the tests after ping fail that may be where the problem is. Please note any error messages encountered.

    /Steve


    Steve L [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.
    Tuesday, October 5, 2010 1:58 PM
    Moderator
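The four checks in the reply above can be scripted so they run in the same order every time. The following is a minimal Python sketch, not part of the original advice: the helper names are made up for illustration, and it assumes a non-zero exit code from each tool means the check failed. A failed ping is ignored, as the reply suggests, because ICMP may simply be blocked.

```python
import subprocess

def build_checks(server):
    """Return (label, command line) pairs in the order the reply lists them."""
    return [
        ("ping", f"ping {server}"),
        ("SMB", f"net view \\\\{server}"),
        ("RPC/SCM", f"sc \\\\{server} query"),
        ("WMI/DCOM", f'wmic /node:"{server}" OS list brief'),
    ]

def run_command(command):
    """Run one command line on Windows and return its exit code."""
    return subprocess.run(command, capture_output=True, text=True).returncode

def first_suspect(server, runner=run_command):
    """Return the label of the first failing check after ping, or None."""
    for label, command in build_checks(server):
        failed = runner(command) != 0
        # A failed ping is tolerated because ICMP may simply be blocked.
        if failed and label != "ping":
            return label
    return None
```

Run it from an administrative prompt on the DPM server against the protected server, then again in the other direction, and keep the raw error text of whichever command fails.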
  • Update:

    If I reboot the servers, DCOM testing with WBEMTEST initially works just fine. I have rebooted one of them and it is responding as I would expect. I am still getting the error in the DPM console, but DCOM works. It appears that over time DCOM fails, services.msc fails, and remote RDP and reboots via the shutdown command fail.

    Tuesday, October 5, 2010 2:51 PM
  • So DPM is probably more of a victim than a causal factor. You indicated you patched the servers (Windows Update) to alleviate this but no luck. Also ensure your drivers (NIC, etc.) are up to date on the protected servers.

    Once updated then we should check this. Your pattern of failure reminds me of the issue 2003 servers have with the Scalable Networking Pack (SNP). Please see the following article for information on this.

    An update to turn off default SNP features is available for Windows Server 2003-based and Small Business Server 2003-based computers

    /Steve


    Steve L [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.
    Tuesday, October 5, 2010 3:02 PM
    Moderator
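To check the SNP settings mentioned above without opening regedit, the output of `reg query HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters` can be parsed. This Python sketch assumes the three value names below are the ones the KB article covers (they are recalled from memory, so verify them against the article) and that `reg query` prints one `name REG_DWORD 0x...` line per value.

```python
# Value names assumed from the SNP KB article; verify before relying on them.
SNP_VALUES = ("EnableTCPChimney", "EnableRSS", "EnableTCPA")

def snp_states(reg_query_output):
    """Map each SNP value name to 'enabled', 'disabled', or 'not set'."""
    states = {name: "not set" for name in SNP_VALUES}
    for line in reg_query_output.splitlines():
        parts = line.split()
        # Data lines look like: <name> REG_DWORD 0x<hex>
        if len(parts) == 3 and parts[0] in states and parts[1] == "REG_DWORD":
            states[parts[0]] = "enabled" if int(parts[2], 16) else "disabled"
    return states
```

A value reported as "not set" deserves a closer look rather than being treated as disabled, since the default when the value is absent depends on the service pack level.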
  • Very interesting.

    I'm looking into this now. It seems to be dang close to what I'm experiencing. The servers are indeed running Server 2003 with SP2.

    Thanks so much for the help. I'll report back on what happens once I check my NIC drivers and my TCP offload settings.

    Wednesday, October 6, 2010 2:31 PM
  • I have looked at the registry on the affected machines. All of the registry settings in the KB article above are already set properly; I didn't have to change them.

    I did download a newer NIC driver for the server. I'll apply it and see if the results change at all, but as far as the KB article and SNP are concerned, it appears to already be disabled.

    Wednesday, October 6, 2010 4:11 PM
  • Any update on this issue?

    I am experiencing the same issues on all of my 2003 servers with DPM 2010.

    Trying to migrate from Symantec Backup Exec.

    Uninstalled the Symantec agent.

    Rebooted.

    Patched the server.

    Rebooted.

    Installed the DPM 2010 agent.

    The server runs fine for about a day, then locks up just as described above. Most services are still operational, but RDP shows a black screen and the console is locked up with a solid gray screen.

    Monday, October 11, 2010 6:08 PM
  • John,

    Did you check the NIC/registry settings for the Scalable Networking Pack (SNP) that can cause this behavior?

    /Steve


    Steve L [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.
    Tuesday, October 12, 2010 11:45 AM
    Moderator
  • Yes.

    I have checked the registry and both servers are set properly. I took one server and upgraded the NIC drivers from the manufacturer.

    At this point that server remained working fine.

    The second server later locked up, so I updated the drivers there as well.

    I thought all was OK, but I have now found that the second server's agent status will update to OK, while the first server just sits saying "attempting...".

    I logged in to the first server and it won't pull up services.msc. It is all locked up again.

    I thought I had it figured out, and now I guess not. WBEMTEST connects, but I can't seem to get it to query any information via WMI.

    I am still connected via RDP, but the server won't respond to anything. I can't issue a shutdown or a remote shutdown. It responds to clicks and I can navigate around; I just can't seem to issue any Windows commands or open things. I have to hard boot the server for things to work properly again for an undetermined amount of time.

    Back to the drawing board.

    Wednesday, October 13, 2010 4:24 PM
  • The server that can't open services.msc and such seems to be having other issues than DPM. It would seem that DPM is a victim. If a reboot fixes the server for some length of time then it starts having problems then I suspect some kind of memory/resource leak. About all I can recommend is using Performance Monitor to try to find the culprit.

    /Steve


    Steve L [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.
    Wednesday, October 13, 2010 6:53 PM
    Moderator
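One simple way to look for the kind of leak suggested above is to log a few Performance Monitor counters (for example `\Process(*)\Handle Count` or `\Memory\Pool Nonpaged Bytes`) at a fixed interval and check which ones climb steadily between reboot and hang. The Python sketch below fits a least-squares slope to each series. The counter names in the usage example and the threshold are hypothetical, and a steady slope is only a hint, not proof of a leak.

```python
def growth_per_hour(samples):
    """Least-squares slope of (hours, value) samples; needs two distinct times."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den

def steadily_growing(counters, min_growth):
    """Return names of counters whose fitted growth per hour is at least min_growth."""
    return sorted(
        name for name, samples in counters.items()
        if growth_per_hour(samples) >= min_growth
    )
```

For example, `steadily_growing({"handles": [(0, 100), (1, 110), (2, 120)]}, 5)` flags `handles`, because its fitted growth is 10 per hour.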
  • Thanks, Steve. I'll see what I can do with perfmon.

    Both servers were fine for years; then we installed the DPM agent and bam, they hang. Nothing has changed on either machine other than the DPM agent.

    I'm going to try perfmon and see what happens.

    Thursday, October 14, 2010 8:23 PM
  • Nothing.

    I ran perfmon against some of the memory counters and all resources stayed in check, yet both servers are hung. I can navigate them fine, but I can't seem to run services.msc. I open a blank MMC. When I open services.msc, mmc.exe starts in Task Manager but nothing appears on the screen. The machine acts strangely in other areas. You can click reboot all day long and it doesn't do anything. You can send a remote reboot and it thinks it is set to reboot but doesn't do it.

    I'm going to try to completely remove the agent from one of the two machines, or at least disable it via msconfig, and see if that changes the behavior.

    Friday, October 15, 2010 1:45 PM
  • If you can try testing this. On one of the servers set the DPMRA service to disabled (that way it won't start). Reboot it and see if it misbehaves. You can then use the other server as a control with DPMRA configured as normal.

    When looking at performance data see what the DPMRA process is doing when the server is in state.

    /Steve


    Steve L [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.
    Friday, October 15, 2010 1:49 PM
    Moderator
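The control experiment described above (DPMRA disabled on one server, left as normal on the other) can be set up remotely with `sc config`. This is a small Python sketch that only builds the command lines; the function names are illustrative, and it assumes the service is registered under the name DPMRA, as the reply refers to it.

```python
def dpmra_start_command(server, start_type):
    """Build the sc command line that sets the DPMRA start type on a server."""
    if start_type not in ("auto", "demand", "disabled"):
        raise ValueError(f"unsupported start type: {start_type}")
    # sc expects the space after "start=".
    return f"sc \\\\{server} config DPMRA start= {start_type}"

def dpmra_query_command(server):
    """Build the sc command line that shows the current DPMRA configuration."""
    return f"sc \\\\{server} qc DPMRA"
```

Running the query command after the change confirms which start type each server actually has before the observation period begins.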
  • I did set the process to be watched under perfmon. It didn't show anything strange. I disabled the service, and even with it disabled the system still hung over the course of a 24-hour period. Again I couldn't open services.msc and various other things. I have since removed the agent using the DPM server and rebooted the affected machine. I have not had an issue since that time. The true test will be whether it is still running tomorrow when I get to the office.

    Tuesday, October 19, 2010 10:10 PM
  • Update:

    The machine that I removed the agent from works just fine and has been stable for more than 24 hours.

    The machine that still has the agent, with the service set to disabled, froze up after I set the service to disabled. After I hard booted the server yesterday evening it continues to respond normally so far, but I wouldn't trust it in my production environment just yet.

    Clearly the server that I have removed the agent from is working as expected. This tells me that the agent removal solved the problem.

    I'm out of ideas.

    Wednesday, October 20, 2010 1:22 PM
  • If the DPMRA service is indeed disabled then it is not likely DPMRA is hurting the box (it is not running). We do install a filter driver as well. I've not seen these cause issues as you describe but I do not doubt what you are seeing in your testing.

    If this were a case of mine, I'd troubleshoot it with perfmon and probably by getting some memory crash dumps of the box while it is hanging.

    Regardless of that it really feels like a leak/resource issue. Maybe some driver doesn't work well with the DPM filter driver. Like I said, I am not sure since I have not seen this behavior.

    In the system and application logs on the server that hangs I'd look for any patterns of services or applications crashing leading up to the unresponsive state.

    Make sure all the drivers for NICs and other hardware are up to date on the hanging servers. Make sure that any third-party antivirus/firewall is the latest version with the latest definitions.

    /Steve


    Steve L [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.
    Wednesday, October 20, 2010 5:53 PM
    Moderator
  • I can report this and feel 90% sure about it:

    When the DPMRA service is set to manual, I experience the behavior.

    When I set DPMRA to disabled, I still experience the behavior.

    If I set the DPMRA service to disabled and then reboot the system, the server seems to run OK. (This one doesn't make much sense, as disabled means it shouldn't run and isn't running. I've verified that it isn't via Task Manager.)

    That said, on one of the two boxes that I've deployed to, I've removed the protection groups and the agent for that server, and life on that server is back to normal and as stable as it was prior to the install.

    I don't seem to have anything in my Event Viewer logs. They are clean as can be.

    I don't see any dump files created, but I'll have another check.

    Perfmon didn't give me anything to go on as far as monitoring DPMRA. I've given up here for now but will revisit.

    I have excluded the DPM directories on both servers from being scanned by our AV (Trend Micro). I have also gone as far as completely removing the AV on one of the servers, and that had no effect.

    You say there is a filter driver. I'm wondering if this is somehow causing the issue, and if setting DPMRA to disabled and rebooting the machine somehow stabilizes the filter or keeps it from running. No idea.

    I have also downloaded the latest manufacturer drivers from Dell and HP for both NICs on both systems.

    We have had great results with DPM 2010 in a more modern environment, but not in this one. The boss has now pulled the plug and doesn't want to "waste any more time on it".


    That doesn't mean I don't want to find the cause. I have dozens of 2003 servers that I would love to back up, and separate clients, so a fix, or at least a cause, would be great to find.

    Thursday, October 21, 2010 5:54 PM
  • I appreciate your tenacity in chasing this issue. What makes this odd is that I've not encountered a problem with DPM protecting 2003 servers. The fact that it happens to both servers really points to some kind of interoperability issue that DPMRA helps manifest.

    I am not a perfmon guru but I would look at everything not just what DPMRA is doing. Kernel mode stuff may shed light on other things that are affecting the box.

    When I mentioned a crash dump I was describing how I'd troubleshoot this in a customer case. We'd set the box up to do a full memory dump and when the box was in state we'd crash it. Then go through the dump. There are many tools and such to chase performance issues like this but that is handled by some peers.

    /Steve


    Steve L [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.
    Friday, October 22, 2010 1:41 PM
    Moderator
  • Just wanted to add to this case as I too had a very similar problem in that 5 of our production servers were crashing regularly and grey screening with the same symptoms as above, and DPM was failing in the same manner. This started within a couple of days of me installing DPM 2010 agents to all of them so naturally I suspected DPM was the cause.

    Anyway, after a whole bunch of troubleshooting I finally gave in and logged a call with MS product support. Because the servers had greyscreened rather than bluescreened, we were unable to generate a memory dump, so we set up the NMI crash dump switch in the registry:

    http://support.microsoft.com/kb/927069 and http://support.microsoft.com/kb/969028

    I generated the bluescreen condition using the NMI switch in the iLO console - http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00797875/c00797875.pdf

    The analysis of the memory dump in my case was that TmEvtMgr was causing a deadlock. This is a Trend Micro file.

    I then upgraded Trend Micro OfficeScan to 10.5 and installed the latest patches, and this caused all the affected servers to greyscreen again (during business hours... doh!), so it appears that the greyscreens were being caused when the Trend update ran to update the program files. After disabling TmEvtMgr in Device Manager (under hidden devices) and manually uninstalling Trend, I was able to successfully upgrade to 10.5 on the affected servers, and we have not had the grey screen problem since.

    Wednesday, December 1, 2010 10:38 AM
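For anyone repeating the NMI approach from the last reply, the registry change can be expressed as two `reg add` commands. The Python sketch below only builds the command lines. The key and value names (`CrashDumpEnabled` set to 1 for a complete memory dump, `NMICrashDump` set to 1) are as recalled from the linked KB articles and should be verified there. A reboot is needed before the setting takes effect, and the page file must be large enough to hold a complete dump.

```python
CRASH_CONTROL_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\CrashControl"

def reg_add_dword(value_name, data):
    """Build a reg add command line that writes one DWORD under CrashControl."""
    return f'reg add "{CRASH_CONTROL_KEY}" /v {value_name} /t REG_DWORD /d {data} /f'

def nmi_dump_commands():
    """Commands that request a complete dump and let an NMI trigger it."""
    return [
        reg_add_dword("CrashDumpEnabled", 1),  # 1 = complete memory dump (assumed)
        reg_add_dword("NMICrashDump", 1),      # allow an NMI to crash the box
    ]
```

Once the server hangs, the NMI is sent from the out-of-band management console (iLO in the reply above), and the resulting dump is what support analyzes.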