DPM 2010 Synch Issues over WAN, consistently end after xx minutes with semaphore timeout error.

  • Question

  • Running DPM 2010 on Win2k8R2x64 server with approx 10-12 protection groups protecting about 90 members across 33 servers.  Most (31) of those are on LAN.  Very few alerts are ever generated for any of these.  Synch and recovery point jobs are staggered for the various protection groups to divide up the workload as best as possible.

    One server is remote over WAN (768k circuit), agent version 3.0.7696.0, compression enabled, bandwidth throttling set to 256k/640k (7a-4p business hours).  Remote server is Win2k3R2 w/sp2.  Protecting 2 volumes, D:\ & F:\.  Each volume has roughly 250GB of files/folders.  D:\ sees higher data change than F:\.

    I have successfully created the replicas for both manually (using a USB drive) and have succeeded in running a consistency check for both.  Now comes the fun part, keeping them synch'd and creating recovery points.

    F:\ drive synch jobs are all succeeding running within 5-15 minutes every hour.  Recovery points are also successfully being created nightly.

    D:\ drive synch jobs fail consistently.  They run for 46-49 (usually 47) minutes and then fail.  The amount of data transferred in that time varies according to the bandwidth throttling schedule (80-90MB during business hours, and 160-240MB outside business hours).
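    As a sanity check on those numbers, here is a rough Python sketch of how much data a throttled link can move in 47 minutes. It assumes the 256k/640k throttle values are kilobits per second and ignores compression and protocol overhead, so treat the results as approximate upper bounds. The function name is made up for illustration.

```python
# Rough upper bound on data moved through a throttled link.
# Assumption: throttle values (256k / 640k) are kilobits per second.

def max_transfer_mb(throttle_kbps: float, minutes: float) -> float:
    """Return the most data, in megabytes (10**6 bytes), the link can carry."""
    bytes_per_second = throttle_kbps * 1000 / 8
    return bytes_per_second * minutes * 60 / 1_000_000


for label, kbps in (("business hours", 256), ("off hours", 640)):
    print(f"{label}: ~{max_transfer_mb(kbps, 47):.0f} MB in 47 minutes")
```

    Under those assumptions the ceilings come out at roughly 90 MB and 226 MB, which is close to the 80-90MB and 160-240MB actually transferred. If that holds, the job is moving data at about the throttled rate the whole time, and the cutoff looks tied to elapsed time rather than to the amount of data sent.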

    I don't understand why.  While truing up the replicas after creating them manually, the consistency check ran (on D:\ and subsequently F:\) for up to 120 HOURS without an error.  But now that the replica is reported consistent, the D:\ synch job won't stay running for even a fraction of that time (like I said, 46-49 minutes).

    I've monitored the WAN and while utilization is high to this site, it's not overly saturated.  Plus the F:\ synch jobs are running just fine (Successfully) in the midst of all this.

    No DCOM errors on either end.

    No other errors reported on either end except what's captured by the DPM Alerts:  (#1 below happens 8 times out of 10)

    1.)  The DPM service was unable to communicate with the protection agent on <dpm server> (ID 52 Details: The semaphore timeout period has expired (0x80070079))

    2.)  The DPM service was unable to communicate with the protection agent on <protected server>. (ID 65 Details: The connection has been broken due to keep-alive activity detecting a failure while the operation was in progress (0x80072744))

    Recovery points for F:\ are being created just fine too.  However, recovery points for D:\ are now failing immediately (I even tried creating one without synchronizing).  It ends with the following alert:

    1.) Recovery point creation jobs for Volume D:\ on <Protected server> have been failing. The number of failed recovery point creation jobs = x.
     If the datasource protected is SharePoint, then click on the Error Details to view the list of databases for which recovery point creation failed. (ID 3114)

    This is a File/Print server, not a SharePoint server...

    **Update** I have also verified that the Windows Firewall is set to Off/Allow/disabled for each individual interface on both the DPM and protected server.

    Any help is greatly appreciated,

    Thanks,

    Mark

     


    Mark
    Wednesday, October 13, 2010 10:12 PM

All replies

  • Any suggestions?

    Synchs ending after 47 minutes (+/- 1 minute) regardless of network bandwidth throttling, utilization, or server workload just does not make sense.  Here's what else I've tried since the original post:

    From both servers (DPM & protected server), right after a synch ended, I was able to:

    ** ping one another

    ** net view \\<server>

    ** sc \\<server> query

    ** wmic /node:"<server>" os list brief

    All tests proved successful.
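    For anyone who wants to repeat these four checks right after each failed synch, a minimal Python sketch might look like the following. The function names are made up for illustration, the commands are the Windows ones listed above, and "success" is assumed to mean an exit code of 0, which is a simplification.

```python
import subprocess


def build_checks(server: str) -> list[str]:
    """Return the four command lines from the list above (Windows syntax)."""
    return [
        f"ping -n 2 {server}",
        f"net view \\\\{server}",
        f"sc \\\\{server} query",
        f'wmic /node:"{server}" os list brief',
    ]


def run_checks(server: str) -> dict[str, bool]:
    """Run each check on Windows and record whether it exited with code 0."""
    results = {}
    for command in build_checks(server):
        done = subprocess.run(command, capture_output=True, text=True)
        results[command] = done.returncode == 0
    return results
```

    Calling run_checks("<server>") from each side gives a quick pass/fail table. It only confirms basic reachability, RPC, and WMI access at that moment, so a clean result does not rule out a drop in the middle of a long transfer.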

    Also I ran vssadmin list writers immediately following failure (again on both servers) and was able to confirm all the writers are OK and Stable.
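    Reading the writer list by eye gets tedious, so here is a small Python sketch that scans the text output of vssadmin list writers and reports any writer that is not Stable or that shows a last error. It assumes the usual English output layout ("Writer name:", "State:", "Last error:" lines); the function name is hypothetical.

```python
import re


def problem_writers(vssadmin_output: str) -> list[str]:
    """Return names of writers that are not Stable or that report an error."""
    problems = []
    name = None
    for raw_line in vssadmin_output.splitlines():
        line = raw_line.strip()
        match = re.match(r"Writer name: '(.*)'", line)
        if match:
            name = match.group(1)
        elif line.startswith("State:") and "Stable" not in line:
            if name not in problems:
                problems.append(name)
        elif line.startswith("Last error:") and "No error" not in line:
            if name not in problems:
                problems.append(name)
    return problems
```

    The text can come from a saved file or from subprocess.run(["vssadmin", "list", "writers"], capture_output=True, text=True).stdout on the server itself. An empty list corresponds to the "all writers OK and Stable" result described above.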

    I installed Netmon 3.4 on the remote server and ran a capture (and watched it) during an entire synch.  Nothing out of the ordinary happened network-wise.  The specific trace of the DPMRA.exe conversation showed what I guess is an initial "handshake" of TCP, RPC, and DCOM packets back and forth, followed by a stream of just TCP traffic (I assume the data blocks).  Then, just before the "failure" 47 minutes later, there was another flurry of intermixed TCP, RPC, and DCOM packets before the conversation ended.

     


    Mark
    Friday, October 15, 2010 2:29 PM
  • Mark, did you get an answer on this? We are having the same problem. The two sites are west coast and midwest, 25Mbps fiber Internet with IPSEC tunnel at each end. They fail at 65 minutes (give or take). I see retransmissions in the sniffer output but nothing remarkable.
    Thursday, December 9, 2010 2:50 AM
  • I've had a similar issue as well. I think there are a few articles out there regarding IPSEC and the semaphore timeout problem. Some people adjusted the key lifetime to a larger interval and that worked for them. I also found out that there is a problem if you have a WAN accelerator. After sniffing the packets I noticed that there is a reset packet that drops the connection. The best thing is to check the network/routes to see if there's anything significantly different between the working and failing paths. Hope this helps.
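    If you export raw packets from a capture and want to spot the reset yourself, the check is just one bit in the TCP header. Below is a minimal Python sketch, not tied to any particular capture tool: is_reset looks at the flags byte of a raw TCP header, and make_header is a made-up helper that builds a bare 20-byte header for trying it out. The port numbers in the usage note are arbitrary.

```python
import struct

TCP_FLAG_RST = 0x04  # RST bit within the TCP flags byte (header offset 13)


def is_reset(tcp_header: bytes) -> bool:
    """Return True if the RST flag is set in a raw TCP header."""
    if len(tcp_header) < 14:
        raise ValueError("need at least 14 bytes of TCP header")
    return bool(tcp_header[13] & TCP_FLAG_RST)


def make_header(src_port: int, dst_port: int, flags: int) -> bytes:
    """Build a minimal 20-byte TCP header (no options) for illustration."""
    data_offset = 5 << 4  # header length of 5 32-bit words, in the high nibble
    return struct.pack(
        "!HHIIBBHHH",
        src_port, dst_port,
        0, 0,              # sequence and acknowledgement numbers
        data_offset, flags,
        8192, 0, 0,        # window size, checksum, urgent pointer
    )
```

    For example, is_reset(make_header(50000, 50001, 0x14)) is True (RST plus ACK), while a plain ACK header with flags 0x10 gives False. Which side sends the reset (an endpoint or a device in the middle such as a WAN accelerator) still has to be worked out by capturing on both ends and comparing.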

     

    Lee

    Friday, December 10, 2010 6:21 PM