DPM2012: Recovery point creation failed on selected SQL databases in a protection group

  • Question

  • I have a protection group set up to protect 13 SQL databases, all within the same SQL instance, and consistently 5 of them fail recovery point creation with the following error:

    DPM failed to communicate with the protection agent on backup1.riddlesdown.local because the agent is not responding. (ID 43 Details: Internal error code: 0x8099090E)

    Recovery points for the other 8 are created without any problems.  If I go back and manually create recovery points for the 5 that failed they work.

    There are no firewalls between the DPM server and the server I am trying to back up.  The DPM server is running DPM2012 and is physical; the server I am trying to back up is a Hyper-V guest.  The agent status is OK and it responds when refreshed.

    I'm having the same issue on a couple of other servers, and the job always runs for 00:05:35.  Is that some sort of timeout period?

    I've run a Wireshark capture at the time of recovery point creation and can see there is communication, but I can't tell whether something is going wrong because I don't really know what I'm looking for.

    Does anybody have any ideas, or is anyone able to interpret the Wireshark capture?  I've uploaded it to SkyDrive in case anybody can: https://skydrive.live.com/#cid=2CAC77A324AACA98&id=2CAC77A324AACA98%21124



    • Edited by adamf83 Wednesday, August 8, 2012 2:26 PM
    Wednesday, August 8, 2012 2:22 PM

All replies

  • A couple of questions:

    How is the protection group configured for SQL? Are you just using express full backups, or are you running incremental syncs as well?

    Are you also doing a VHD backup of the guest that holds the SQL databases?

    Are you able to check the recovery model of the databases in SQL? Is there any correlation between the recovery model (full/simple) and the databases that are failing?
    Has the recovery model changed on those databases since the creation of the protection group?

    Are you co-locating SQL data on the same volume?

    I haven't looked at the capture, but if you're concerned about network communication or firewall blocks, temporarily disable the firewall on both the DPM and SQL servers and see whether the issue still occurs. It doesn't sound like a network issue, but if the network isn't already open, doing this temporarily for testing will at least rule that out.

    I remember an incident with similar behaviour on DPM 2010, discussed on the TechNet forums some time back. A number of fixes were tried in that case, and after a SQL server restart the troublesome databases started protecting.

    Wednesday, August 8, 2012 10:07 PM
  • Danny,

    The PG is configured to take 3 express backups a day and incremental syncs every hour.

    I'm not doing a VHD backup of the guest that holds the SQL databases.

    Recovery models are all set to full and haven't changed since the creation of the PG.

    Yes, the SQL data is all co-located on the same volume.

    There are no firewalls; it's already an open network.  I guess my reasoning for taking the network capture is that the error is related to communication.

    I'll try a restart of the server and see how that goes.

    Thursday, August 9, 2012 6:53 AM
  • Danny,

    I restarted the server and recovery point creation has failed for the same 5 databases.

    Any other ideas?

    Thursday, August 9, 2012 1:36 PM
  • If you click on modify disk allocation of the protection group for the SQL databases, are the co-location volumes split, or are all databases residing on one volume? If they are split, are the databases that are experiencing issues on the same co-located volume?

    What is the size of the databases that are not protecting, and how much space do you have on the volumes where the log files for those databases reside? When protection is scheduled, can you look in the event log to see whether the failures correlate with the SQL engine pausing and resuming I/O on the DB?

    When you manually run protection, are you running an express full backup? Is the scheduled express full failing, or only the scheduled incremental, or both?

    Thursday, August 9, 2012 11:43 PM
  • They are all on one volume.

    The databases failing to protect are all under 100MB.  Space left on the volume where the logs are stored is 67GB.

    I've looked at the event logs on the server at the time a recovery point has failed.  All I can see is:

    Log was backed up. Database: SiconDMS, creation date(time): 2012/02/20(15:06:00), first LSN: 22:1216:1, last LSN: 22:1236:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {'C:\Program Files\Microsoft SQL Server\MSSQL10_50.SAGE200\MSSQL\DATA\DPM_SQL_PROTECT\SAGE200\SAGE200\SiconDMS.ldf\Backup\Current.log'}). This is an informational message only. No user action is required.

    There are only events for the databases that are not failing.  Nothing for the databases that are failing.

    When I manually run protection, I choose the express full option.

    Friday, August 10, 2012 2:28 PM
  • Just to add, I have manually run protection choosing incremental this time and it completes.  It's just the scheduled jobs that seem to be a problem.
    Friday, August 10, 2012 8:15 PM
  • How big are the five DBs that fail?  How big are the ones that don't?
    Saturday, August 11, 2012 11:59 AM
  • The ones that fail are all less than 60MB, with the largest being 54MB and the smallest 4MB.  Those that don't fail are all less than 70MB, with the largest being 68MB and the smallest 2MB.
    Saturday, August 11, 2012 12:07 PM
  • Well, as a basic troubleshooting step, try moving the backup times for the protection group.  It sounds like network congestion on the DPM server NIC.

    Also, please post the results of this command for both the DPM server and the SQL server:

    netsh int tcp show global


    • Edited by ACorbs1 Monday, August 13, 2012 1:51 PM
    Monday, August 13, 2012 1:48 PM
  • I agree with that.  I've done some of my own testing.  I've offset the sync time by 5 minutes, and since doing that there have been no more failures.  If I move it back 5 minutes, the failures start again.

    Here is the output from the DPM server:

    TCP Global Parameters
    ----------------------------------------------
    Receive-Side Scaling State          : enabled
    Chimney Offload State               : enabled
    NetDMA State                        : enabled
    Direct Cache Acess (DCA)            : disabled
    Receive Window Auto-Tuning Level    : normal
    Add-On Congestion Control Provider  : ctcp
    ECN Capability                      : disabled
    RFC 1323 Timestamps                 : disabled

    Output from the SQL server:

    TCP Global Parameters
    ----------------------------------------------
    Receive-Side Scaling State          : enabled
    Chimney Offload State               : automatic
    NetDMA State                        : enabled
    Direct Cache Acess (DCA)            : disabled
    Receive Window Auto-Tuning Level    : normal
    Add-On Congestion Control Provider  : ctcp
    ECN Capability                      : disabled
    RFC 1323 Timestamps                 : disabled

    Monday, August 13, 2012 2:20 PM
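
The two `netsh int tcp show global` listings above are easier to compare mechanically than by eye. The following is a minimal Python sketch, not part of the original thread, that parses both outputs as posted and reports the settings that differ. The function names `parse_tcp_globals` and `diff_settings` are made up for this illustration, and the parser assumes every setting line has the form `Name : value`.

```python
# Outputs copied verbatim from the post above (including netsh's own spelling).
DPM_SERVER_OUTPUT = """
TCP Global Parameters
----------------------------------------------
Receive-Side Scaling State          : enabled
Chimney Offload State               : enabled
NetDMA State                        : enabled
Direct Cache Acess (DCA)            : disabled
Receive Window Auto-Tuning Level    : normal
Add-On Congestion Control Provider  : ctcp
ECN Capability                      : disabled
RFC 1323 Timestamps                 : disabled
"""

SQL_SERVER_OUTPUT = """
TCP Global Parameters
----------------------------------------------
Receive-Side Scaling State          : enabled
Chimney Offload State               : automatic
NetDMA State                        : enabled
Direct Cache Acess (DCA)            : disabled
Receive Window Auto-Tuning Level    : normal
Add-On Congestion Control Provider  : ctcp
ECN Capability                      : disabled
RFC 1323 Timestamps                 : disabled
"""


def parse_tcp_globals(text):
    """Return a dict of setting name -> value from 'Name : value' lines."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" : ")
        # Header and separator lines contain no ' : ' and are skipped.
        if sep and name.strip():
            settings[name.strip()] = value.strip()
    return settings


def diff_settings(first, second):
    """Return {name: (first_value, second_value)} for settings that differ."""
    names = sorted(set(first) | set(second))
    return {n: (first.get(n), second.get(n))
            for n in names if first.get(n) != second.get(n)}


if __name__ == "__main__":
    dpm = parse_tcp_globals(DPM_SERVER_OUTPUT)
    sql = parse_tcp_globals(SQL_SERVER_OUTPUT)
    for name, (dpm_value, sql_value) in diff_settings(dpm, sql).items():
        print(f"{name}: DPM={dpm_value} SQL={sql_value}")
```

On the outputs posted here, the only difference between the two servers is the Chimney Offload State (enabled on the DPM server, automatic on the SQL server).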
  • On both servers, run this from an elevated command prompt:

    netsh int tcp set global congestion=none
    netsh int tcp set global rss=disabled
    netsh int tcp set global chimney=disabled

    Afterwards, move the job back to run at the original time and test.

    • Marked as answer by adamf83 Tuesday, August 14, 2012 4:50 PM
    Monday, August 13, 2012 2:35 PM
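
For anyone applying the three settings from the answer above to several servers, here is a hedged Python sketch that builds the same `netsh` command lines and, by default, only prints them. It is not from the original thread. The helper names `build_commands` and `apply_changes` are invented for this example, the setting names and values are passed to `netsh` exactly as written in the answer, and actually executing them (with `dry_run=False`) is assumed to happen on Windows from an elevated prompt.

```python
import subprocess

# Setting/value pairs taken from the answer above, passed through unchanged.
TCP_GLOBAL_CHANGES = [
    ("congestion", "none"),
    ("rss", "disabled"),
    ("chimney", "disabled"),
]


def build_commands(changes=TCP_GLOBAL_CHANGES):
    """Return one netsh argument list per setting."""
    return [["netsh", "int", "tcp", "set", "global", f"{name}={value}"]
            for name, value in changes]


def apply_changes(commands, dry_run=True, runner=subprocess.run):
    """Print each command; execute it only when dry_run is False."""
    results = []
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError if netsh reports failure.
            results.append(runner(command, check=True))
    return results


if __name__ == "__main__":
    apply_changes(build_commands())  # dry run: prints the three commands
```

Keeping the dry run as the default makes it possible to review the exact commands before changing anything; running `netsh int tcp show global` before and after is a simple way to confirm the result.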
  • When the scheduled jobs fail, are there incremental syncs scheduled for the same time that express fulls are scheduled to run? It's just that the same 5 databases were failing every time, which seems a little too consistent. Offsetting the incremental start times might be avoiding a conflict between the express full and the incremental.

    Monday, August 13, 2012 9:39 PM
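
The scheduling question raised in the last reply can be checked on paper. The Python sketch below is not from the original thread. It lists which express full start times coincide with an hourly incremental sync, with and without the 5-minute offset described earlier. The thread says the protection group takes 3 express fulls a day and syncs every hour but never gives the actual times, so the express full times in the code are hypothetical placeholders. The check only compares start times and does not model how long each job runs.

```python
# Hypothetical express full start times; the thread does not state the real ones.
EXPRESS_FULL_TIMES = ["08:00", "12:00", "20:00"]


def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes past midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def sync_start_times(interval_min, offset_min=0):
    """Minutes past midnight at which incremental syncs start."""
    return list(range(offset_min, 24 * 60, interval_min))


def collisions(express_full_times, interval_min, offset_min=0):
    """Return the express full times that share a start minute with a sync."""
    syncs = set(sync_start_times(interval_min, offset_min))
    return [t for t in express_full_times if to_minutes(t) in syncs]


if __name__ == "__main__":
    print("No offset:   ", collisions(EXPRESS_FULL_TIMES, 60))
    print("5 min offset:", collisions(EXPRESS_FULL_TIMES, 60, offset_min=5))
```

With these placeholder times, every express full collides with an on-the-hour sync, and a 5-minute offset removes all of the collisions. That is consistent with the behaviour reported in the thread, but it does not by itself distinguish a job conflict from network congestion.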