DPM SP1 ERROR - DPM could not run the backup, number of currently running jobs reached its limit

  • Question

  • DPM Error

    Type: Recovery point
    Status: Failed
    Description: DPM could not run the backup job for the data source because the number of currently running backup and recovery jobs on the Production server has reached its limits.
    Data source: \Backup Using Child Partition Snapshot\Server name
    Production Server: XXXXXX (ID 3185 Details: Internal error code: 0x809909E5)
     

    I have 6 DPM servers backing up a 6-node Windows Server 2012 Hyper-V cluster. There are 220 VMs to back up in total, spread across the 6 DPM servers.

    Each of the 6 DPM servers has 1 protection group, with the same name on every server. Backups start at 1 AM, so on a DPM server with 35 VMs, half of them will fail with the error above.

    How do we thread the jobs, or have them wait until there is room for the next job to run?

    Any help appreciated

    Friday, March 15, 2013 4:34 PM

Answers

  • View Jobs is fixed: the SCOM console must be opened by IP address or server name, not by an alias.

    To date I am backing up 250 VMs in a night with no issues (DPM Server 2012 SP1 backing up a Server 2012 SP1 8-node cluster with 15 CSVs).

    • Marked as answer by Anncex Thursday, September 12, 2013 3:47 PM
    Thursday, September 12, 2013 3:46 PM

All replies

  • Each PS (production server, in this case a node of the Hyper-V cluster) is configured by default to run 8 backup jobs at a time, and I believe you are hitting this limit. You can validate this in one of two ways: run the jobs from the different DPM servers at different times, or increase the number of express full backups that can run at once from the default of 8 to 18 (6 DPM servers, each of which sends out 3 backup jobs by default). Please note that this number has a bearing on how the Hyper-V host performs: as the number of concurrent backups increases, the IOPS, network, and processor resources available to the production environment go down. This change needs to be made in the following file on each production server:

    C:\Program Files\Microsoft Data Protection Manager\DPM\bin\DsResourceLimits.xml
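
For anyone following the numbers, here is a minimal sketch of the arithmetic behind the suggested value of 18. It only uses the figures quoted in this reply; the variable and function names are made up for illustration and are not part of DPM.

```python
# Figures quoted in the reply above; the names are illustrative only.
DPM_SERVERS = 6             # DPM servers protecting the cluster
JOBS_PER_DPM_SERVER = 3     # backup jobs each DPM server sends out by default
DEFAULT_LIMIT_PER_NODE = 8  # default concurrent backup jobs per production server


def peak_jobs(dpm_servers: int, jobs_per_server: int) -> int:
    """Worst case: every DPM server targets the same node at the same time."""
    return dpm_servers * jobs_per_server


worst_case = peak_jobs(DPM_SERVERS, JOBS_PER_DPM_SERVER)
print(worst_case)                           # 18
print(worst_case > DEFAULT_LIMIT_PER_NODE)  # True: some jobs get rejected
```

With all 6 DPM servers starting at the same time, a single node can be asked for up to 18 jobs while only 8 are allowed, which would explain the ID 3185 failures.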

    BTW, is there any reason you have 6 DPM servers? What are the average VM size and churn rate?

    Monday, March 18, 2013 11:26 AM
  • I am going to break up the backup start times of the 6 DPM servers (i'll keep you posted)
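
A rough sketch of how staggered start times could be worked out, in case it helps others planning the same thing. The server names and the 45-minute gap are hypothetical; the actual schedule still has to be set on each protection group in DPM.

```python
from datetime import datetime, timedelta


def staggered_starts(first_start: str, servers: list[str], gap_minutes: int) -> dict[str, str]:
    """Give each DPM server its own start time, gap_minutes apart."""
    base = datetime.strptime(first_start, "%H:%M")
    return {
        name: (base + timedelta(minutes=index * gap_minutes)).strftime("%H:%M")
        for index, name in enumerate(servers)
    }


# Hypothetical server names and gap; only the 01:00 start comes from the thread.
servers = [f"DPM{n}" for n in range(1, 7)]
schedule = staggered_starts("01:00", servers, gap_minutes=45)
for name, start in schedule.items():
    print(name, start)  # DPM1 01:00 ... DPM6 04:45
```

The gap should be long enough for one server's jobs to finish before the next server starts, which depends on VM sizes and churn.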

    The reason we have 6 DPM servers is that we were testing and the storage was inside those 6 servers. Each server has 6.5 TB of capacity, for a total of 39 TB. We will be backing up a 6-node Hyper-V 2012 cluster using CSVs, with expansion, so we needed the space to back up the 6 nodes, which are 10 TB each. We are testing so that we can provide a solution for our prod environment, as there are not many backup solutions ready to back up Server 2012. We tried NetBackup and TSM, but they have many issues, and DPM is working great for the time being, except for the threading of the jobs with the error above. The VMs in this cluster range from 2.68 GB to 125 GB in allocated disk size.

    Monday, March 18, 2013 2:37 PM
  • Hi Neela,

    The C:\Program Files\Microsoft Data Protection Manager\DPM\bin\DsResourceLimits.xml file seems to include limits for multiple job types. Can you tell us what job types 1-4 are, so we can modify the right one for recovery points?

    <?xml version="1.0"?>
    <DatasourceLimits>
      <Writer isParallelRecoveryAllowed="true" version="0" writerId="66841cd4-6ded-4f4b-8f17-fd23f8ddc3de">
        <MaxLimit type="1" value="8"/>
        <MaxLimit type="2" value="8"/>
        <MaxLimit type="3" value="8"/>
        <MaxLimit type="4" value="8"/>
      </Writer>
    </DatasourceLimits>


    Please remember to click “Mark as Answer” on the post that helps you, and to click “Unmark as Answer” if a marked post does not actually answer your question. This can be beneficial to other community members reading the thread. Regards, Mike J. [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights.


    Monday, March 18, 2013 10:26 PM
    Moderator
  • Hi Mike,

    Types 2 and 3 are not applicable to Hyper-V. For Hyper-V only type=1 is applicable, so please change the limit for type=1 to the desired value. Keep in mind that increasing this value may affect PS performance, as more concurrent backups use more resources on the PS.
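
As an illustration only, the edit could be scripted along these lines. This is a sketch that works on a copy of the XML posted earlier in the thread; the function name and the value 18 are assumptions, and it does not touch the real file. Back up DsResourceLimits.xml before changing it by any method.

```python
import xml.etree.ElementTree as ET

# Copy of the file contents posted earlier in this thread.
SAMPLE = """<?xml version="1.0"?>
<DatasourceLimits>
  <Writer isParallelRecoveryAllowed="true" version="0" writerId="66841cd4-6ded-4f4b-8f17-fd23f8ddc3de">
    <MaxLimit type="1" value="8"/>
    <MaxLimit type="2" value="8"/>
    <MaxLimit type="3" value="8"/>
    <MaxLimit type="4" value="8"/>
  </Writer>
</DatasourceLimits>"""


def set_type1_limit(xml_text: str, new_value: int) -> str:
    """Return the XML with every type="1" MaxLimit set to new_value."""
    root = ET.fromstring(xml_text)
    for limit in root.iter("MaxLimit"):
        if limit.get("type") == "1":
            limit.set("value", str(new_value))
    return ET.tostring(root, encoding="unicode")


# 18 is the value suggested earlier in the thread, not a recommendation.
updated = set_type1_limit(SAMPLE, 18)
print(updated)
```

Only the type="1" entry changes; types 2 to 4 keep their original value of 8.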


    Thanks. Please mark this post as answers if it helps.
    “This post is provided "AS IS" with no warranties, and confers no rights”.

    Tuesday, March 19, 2013 5:05 AM
  • I ran my backup last night and 94 jobs failed for all kinds of reasons, so spreading the load didn't help much. There must be a way to run Hyper-V backups using DPM. I need to back up 161 virtual machines, but running consistency checks all day is painful! I will keep testing and I hope I don't have to call Microsoft :( I'll keep you posted.
    Tuesday, March 19, 2013 7:34 PM
  • What other errors are you getting?

    Are you aware of the announcement at the top of this Hyper-V forum? Do you have that fix installed and the change implemented?

    <relevant snip from announcement>
    V2 - Windows Server 2012 Hotfix available that helps resolve DPM 2012 Sp1 Hyper-V backup problems.

    ******** New V2 announcement ********

    The Windows team has just released a V2 of the fix to address CSV backup issues, and it is available for download today.  This will address the known memory leak issue along with some other issues that were discovered during testing.

    This fix supersedes the original fix.

    Virtual machine enters a paused state or a CSV volume goes offline when you try to create a backup of the virtual machine on a Windows Server 2012-based failover cluster
    http://support.microsoft.com/kb/2813630

    NOTE: After you install the hotfix, CSV volumes do not enter paused states as frequently. Additionally, a cluster's ability to recover from expected paused states that occur when a CSV failover does not occur is improved. 

    To avoid CSV failovers, you may have to make additional changes to the computer after you install the hotfix. For example, you may be experiencing the issue described in this article because of the lack of hardware support for Offloaded Data Transfer (ODX). This causes delays when the operating system queries for the hardware support during I/O requests.

    In this situation, disable ODX by changing the FilterSupportedFeaturesMode value for the storage device that does not support ODX to 1. For more information about how to disable ODX, go to the following Microsoft website:

    General information about how to deploy ODX
    http://technet.microsoft.com/en-us/library/jj200627

    If you continue to see problems protecting Windows 2012 Hyper-V guests after installing the above hotfix, please open a support case for further investigation.
    >snip<

    You can also try disabling TRIM, which sometimes helps. Run this from an administrative command prompt:

       fsutil behavior set disabledeletenotify 1



    Tuesday, March 19, 2013 8:15 PM
    Moderator
  • HI Mike, Thanks for reviewing this for me. I will try installing the hotfix tomorrow on one of my DPM servers to see if it makes a difference.

    Not sure if this is related but 2 of my cluster nodes rebooted at some point last night, still researching why.....

    Last night there were 94 failures. I've been running consistency checks all day and am down to 38 to go. The errors were:

    1) DPM encountered a retryable VSS error. (ID 30112 Details: VssError:The writer experienced a transient error.  If the backup process is retried, the error may not reoccur.

    2) The DPM service was unable to communicate with the protection agent on XXXXXX (ID 52 Details: The semaphore timeout period has expired (0x80070079))

    3) Change Tracking has been marked inconsistent due to one of the following reasons
    1. Unexpected shutdown of the protected server
    2. Unforeseen issue in DPM Bitmap failover during cluster failover of one or more datasources sharing the tracked volume. (ID 30501 Details: Unknown error (0xe0062000) (0xE0062000))

    4) DPM could not run the backup job for the data source because the number of currently running backup and recovery jobs on the Production server has reached its limits.
    Data source: \Backup Using Child Partition Snapshot\XXXXXX
    Production Server: XXXXXXXXX (ID 3185 Details: Internal error code: 0x809909E5)

    5) DPM was not able to complete this job within the allotted time. (ID 911)

    Tuesday, March 19, 2013 9:02 PM
  • Hi,

    The windows hotfix needs to be applied to all of the cluster nodes.

    I can explain most of these errors, see my Answers inline.

    1) DPM encountered a retryable VSS error. (ID 30112 Details: VssError:The writer experienced a transient error.  If the backup process is retried, the error may not reoccur.

    1A) DPM is the victim of an underlying Windows VSS error, check the application event log for more details.

    2) The DPM service was unable to communicate with the protection agent on XXXXXX (ID 52 Details: The semaphore timeout period has expired (0x80070079))

    2A) This is a network or server performance issue.

    Diagnostic steps when "Semaphore timeout" is hit during network transfer:

    1. Check if the protected server (sender) or DPM (receiver) was under stress or inaccessible during the time of failure – from event logs from both the machines. Retry should work if the packet loss was because of either of the servers being inaccessible or under stress for a period.

    2. Check if the network between the PS and the DPM is flaky – retransmit count from ‘netstat -s’ or perfmon counters can give an idea.

    3. If the network is expected to be flaky, setting a higher TCP/IP maximum retransmission timeout as described in
    http://support.microsoft.com/kb/170359 might help: increase TcpMaxDataRetransmissions to 10 or more.

    4. Else contact network support engineer to diagnose the packet loss issue – netmon captures from both machines, packet route and network layout/devices will be required to start the investigation.

    5. Take some performance monitor logs on both DPM and Protected server side.
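
For step 2, a small sketch of pulling the retransmit figure out of 'netstat -s' output. It assumes the Windows layout where the TCP section lists "Segments Sent" and "Segments Retransmitted"; the sample numbers are invented for illustration, and the thread gives no threshold for what counts as flaky.

```python
import re

# Invented sample in the style of Windows 'netstat -s' TCP statistics.
SAMPLE_NETSTAT = """
TCP Statistics for IPv4

  Active Opens                        = 1520
  Passive Opens                       = 310
  Segments Received                   = 984213
  Segments Sent                       = 1000000
  Segments Retransmitted              = 25000
"""


def retransmit_ratio(netstat_output: str) -> float:
    """Return retransmitted segments divided by sent segments (first TCP section found)."""

    def counter(label: str) -> int:
        match = re.search(rf"{re.escape(label)}\s*=\s*(\d+)", netstat_output)
        if match is None:
            raise ValueError(f"counter not found: {label}")
        return int(match.group(1))

    sent = counter("Segments Sent")
    retransmitted = counter("Segments Retransmitted")
    return retransmitted / sent if sent else 0.0


ratio = retransmit_ratio(SAMPLE_NETSTAT)
print(f"{ratio:.1%}")  # 2.5%
```

Comparing the ratio before and after a backup window on both the PS and the DPM server gives a rough idea of whether retransmits climb while jobs are running.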

    Some good and basic perfmon counters to take to see if the servers are under stress are below.

    Logical Disk/Physical Disk
    ******************
    \% Idle Time
    • 100% idle to 50% idle = Healthy
    • 49% idle to 20% idle = Warning or Monitor
    • 19% idle to 0% idle = Critical or Out of Spec
    \Avg. Disk sec/Read or \Avg. Disk sec/Write (values are reported in seconds)
    • .001 to .015 (1 ms to 15 ms) = Healthy
    • .015 to .025 (15 ms to 25 ms) = Warning or Monitor
    • .026 (26 ms) or greater = Critical or Out of Spec
    \Current Disk Queue Length (for all instances)
    • 80 requests for more than 6 minutes = Indicates possibly excessive disk queue length.
    Memory
    *******
    \Pool Nonpaged Bytes
    • Less than 60% of pool consumed = Healthy
    • 61% - 80% of pool consumed = Warning or Monitor
    • Greater than 80% of pool consumed = Critical or Out of Spec
    \Pool Paged Bytes
    • Less than 60% of pool consumed = Healthy
    • 61% - 80% of pool consumed = Warning or Monitor
    • Greater than 80% of pool consumed = Critical or Out of Spec
    \Available MBytes
    • 50% of free memory available or more = Healthy
    • 25% of free memory available = Monitor
    • 10% of free memory available = Warning
    • Less than 100 MB or 5% of free memory available = Critical or Out of Spec
    Processor
    *******
    \% Processor Time (all instances)
    • Less than 60% consumed = Healthy
    • 61% - 90% consumed = Monitor or Caution
    • 91% - 100% consumed = Critical
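
Some of the thresholds above can be sketched as simple checks for use against exported perfmon samples. The function names are made up, the latency check assumes the disk counters report seconds (as the counter name suggests), and only the disk and pool thresholds are covered.

```python
def classify_disk_idle(idle_percent: float) -> str:
    """Disk idle time: 50-100 healthy, 20-49 warning, below 20 critical."""
    if idle_percent >= 50:
        return "Healthy"
    if idle_percent >= 20:
        return "Warning"
    return "Critical"


def classify_disk_latency(seconds: float) -> str:
    """Average disk read/write latency, assumed to be in seconds."""
    if seconds <= 0.015:
        return "Healthy"
    if seconds <= 0.025:
        return "Warning"
    return "Critical"


def classify_pool_usage(percent_consumed: float) -> str:
    """Paged or nonpaged pool: up to 60 healthy, up to 80 warning, above 80 critical."""
    if percent_consumed <= 60:
        return "Healthy"
    if percent_consumed <= 80:
        return "Warning"
    return "Critical"


print(classify_disk_idle(30))        # Warning
print(classify_disk_latency(0.030))  # Critical
print(classify_pool_usage(45))       # Healthy
```

Running the same checks on samples from both the DPM server and the protected server makes it easier to see which side is under stress during the backup window.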

    3) Change Tracking has been marked inconsistent due to one of the following reasons
    1. Unexpected shutdown of the protected server
    2. Unforeseen issue in DPM Bitmap failover during cluster failover of one or more datasources sharing the tracked volume. (ID 30501 Details: Unknown error (0xe0062000) (0xE0062000))

    3A) Any time a node crashes, we lose data change tracking, so this error is by design. Troubleshoot the cause of the node crash.

    4) DPM could not run the backup job for the data source because the number of currently running backup and recovery jobs on the Production server has reached its limits.
    Data source: \Backup Using Child Partition Snapshot\XXXXXX
    Production Server: XXXXXXXXX (ID 3185 Details: Internal error code: 0x809909E5)

    4A) Solution was provided in earlier response.

    5) DPM was not able to complete this job within the allotted time. (ID 911)

    5A) Most likely you have limited the amount of time a consistency check (CC) can run; increase the time it is allowed to run.
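
The answers above can be condensed into a small lookup keyed on the DPM error ID, which may help when sorting through a long list of failed jobs. This is a sketch only: the notes paraphrase this reply, and the function and table names are made up.

```python
import re

# Notes paraphrased from the answers in this reply, keyed by DPM error ID.
TRIAGE_NOTES = {
    30112: "Underlying Windows VSS error; check the application event log.",
    52: "Network or server performance issue (semaphore timeout).",
    30501: "Change tracking lost after a node crash; find the cause of the crash.",
    3185: "Concurrent job limit reached on the production server.",
    911: "Job ran out of its allotted time; allow the consistency check to run longer.",
}


def triage(alert_text: str) -> str:
    """Pick the DPM error ID out of an alert description and return the matching note."""
    match = re.search(r"\(ID (\d+)", alert_text)
    if match is None:
        return "No DPM error ID found."
    return TRIAGE_NOTES.get(int(match.group(1)), "No note recorded for this ID.")


print(triage("Production Server: XXXXXXXXX (ID 3185 Details: Internal error code: 0x809909E5)"))
print(triage("DPM was not able to complete this job within the allotted time. (ID 911)"))
```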



    Tuesday, March 19, 2013 9:21 PM
    Moderator
  • Hi Mike, I'm applying the patch today. Last night I had 79 failures and a new error:

    Volume Shadow Copy Service error: Unexpected error querying for the IVssWriterCallback interface.  hr = 0x80070005, Access is denied. This is often caused by incorrect security settings in either the writer or requestor process.

    THX

    Wednesday, March 20, 2013 4:14 PM
  • Hi Mike, I am still having issues backing up all of my 166 VM guests. The weird part is that some do complete, but others are getting:

    Type: Consistency check
    Status: Failed
    Description: DPM was unable to establish a connection with the Virtual Machine Manager (VMM) server.
    Server name: XXXXXXX.
    Exception Message: Type: System.TimeoutException, Message: This request operation sent to net.tcp://XXXXXXX.Domain.net:6070/VmmHelperService/TcpEndpoint did not receive a reply within the configured timeout (00:01:00).  The time allotted to this operation may have been a portion of a longer timeout.  This may be because the service is still processing the operation or because the service was unable to send a reply message.  Please consider increasing the operation timeout (by casting the channel/proxy to IContextChannel and setting the OperationTimeout property) and ensure that the service is able to connect to the client. (ID 33400)
     More information
    End time: 4/9/2013 5:06:58 AM
    Start time: 4/9/2013 5:05:53 AM
    Time elapsed: 00:01:05
    Data transferred: 0 MB
    Cluster node -
    Source details: \Backup Using Child Partition Snapshot\XXXXXXX
    Protection group: XXXXXXX Hyper-V Cluster
    Items scanned: 0
    Items fixed: 0

    Tuesday, April 9, 2013 5:03 PM
  • I'm still working on my issues, but I was finally able to get 5 out of 6 DPM servers to complete backups of 150 VMs.

    Here are a few things I had to do:

    1) Patch DPM server with patch rollup 2

    2) Disable ODX

             Disable ODX support. To do so, type the following command:

                 Set-ItemProperty hklm:\system\currentcontrolset\control\filesystem -Name "FilterSupportedFeaturesMode" -Value 1

    3) Disable TRIM

    I also disabled TRIM by running this command on all nodes:

       fsutil behavior set disabledeletenotify 1

    4) Update the agent on the host

    I seem to be back in business.

    Now if only Operations Manager would show "View Jobs", I could hand this project off to the ops team.

    Wednesday, April 24, 2013 10:04 PM