Replica Inconsistent - unexpected error

  • Question

  • We have a server on which a consistency check will not complete successfully for one of its volumes (the other volumes are fine).

    The consistency check fails with this:

    An unexpected error occurred during job execution. (ID 104 Details: Internal error code: 0x80990A51)

    On the protected server, the DPM error log has many lines that look like this around the time the job fails:

    Failed: GetFileHandleById failed to open file, frn:0x00BA00000001009E: 0x80070057

    Any suggestions as to what I should try to fix this problem?

    Wednesday, August 25, 2010 5:37 PM

Answers

  • The problem appears to be caused by a WAN acceleration device that was put in place at the site recently.  It didn't hit us that it was the cause of the problem because there were no issues for over a week after installation.  Once other file servers in this site started exhibiting the same issue with DPM, it clicked.
    • Proposed as answer by ShaneB. _ Monday, August 30, 2010 3:33 PM
    • Marked as answer by Rod Savard Monday, August 30, 2010 3:45 PM
    Monday, August 30, 2010 2:35 PM

All replies

  • Hi

    What version of DPM are you running, 2007 or 2010? Microsoft released a fix for the handling of files that are in use by other processes; with it, the DPM server "skips" such a file.

    Verify that your antivirus doesn't have a file in quarantine. If it does, DPM will never be able to back up that file, because the AV has it locked.

    BR

    Robert Hedblom


    Check out my DPM blog @ http://robertanddpm.blogspot.com
    Wednesday, August 25, 2010 8:06 PM
    Moderator
  • Thank you for your reply.  We are running 2007 w/ all hotfixes applied.

    I thought that because DPM uses VSS, locked files are not an issue and nothing is ever "skipped" for being locked?

    Wednesday, August 25, 2010 9:48 PM
  • We have run CHKDSK on the volume to verify there are no filesystem errors.  That did not help.

    Anyone else have an idea?

    Thursday, August 26, 2010 3:43 PM
  • It could be the local AV on the server that you're trying to protect. Check whether your AV has any files in quarantine. If so, the AV has completely isolated the file and DPM will never be able to back it up.

    Microsoft has released a "skip file" feature that would fix this. Verify that you really have all the hotfixes and patches installed.

    BR

    Robert Hedblom


    Friday, August 27, 2010 7:03 AM
    Moderator
  • Hi Robert, yes, I saw you suggest checking AV in your first reply.  I did check and there are no files in quarantine. I would have been surprised if that was the cause, as we have hundreds of file servers protected by our DPM infrastructure and I'm sure several of them *do* have files in quarantine.  It seems to present no problem for DPM.

    Any other ideas?  Is there a way to see exactly which file there's a problem with during a consistency check? All I see in the log is a file reference number (frn), not a path.

    Friday, August 27, 2010 1:39 PM
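
    On the question of identifying the file: the `frn` value in the agent log looks like a 64-bit NTFS file reference number, in which the low 48 bits are the MFT record number and the high 16 bits are a sequence number. A minimal sketch of that split, assuming this interpretation holds for the DPM log (`split_frn` is a hypothetical helper name, not a DPM tool):

```python
def split_frn(frn: int) -> tuple[int, int]:
    """Return (sequence_number, mft_record_number) for a 64-bit NTFS file reference.

    Hypothetical helper for illustration; assumes the logged frn is a
    standard NTFS file reference number.
    """
    sequence = (frn >> 48) & 0xFFFF      # high 16 bits
    record = frn & 0xFFFFFFFFFFFF        # low 48 bits
    return sequence, record


if __name__ == "__main__":
    seq, rec = split_frn(0x00BA00000001009E)  # value quoted in the first post
    print(f"sequence={seq} record={rec}")
```

    On Windows versions where `fsutil file queryfilenamebyid` is available, passing the volume and the full FRN may resolve it to a path. Whether that subcommand exists on the protected server is something to check, not a given.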
  • I believe these lines in the DPMRA error log are relevant:

    DE4    0CEC    08/27    17:29:02.167    03    workitem.cpp(206)            Idle Timer FIRED For WorkItem = 0000000000489B20, WorkItem GUID = {5E20F7FB-0F24-4AF3-B4B7-A56B13DCCC68}, TimerOrWaitFired: True
    0DE4    0DA8    08/27    17:29:02.183    05    fsmtransition.cpp(142)    [000000000048C6    |TaskID=828B77FD-16D4-4907-A5C7-55BC85DBC08E    IsCancelEvent: completion: 0xa006, signature: 0xaabbcc00, hr: 0x0!
    0DE4    0E40    08/27    17:29:02.183    22    hwvreceiversubtask.cpp(148)    [0000000005090C    |TaskID=828B77FD-16D4-4907-A5C7-55BC85DBC08E    Failed: F: lVal : m_pBuffersQueue->AddElement(pBuffer): 0x80990a51
    0DE4    0E40    08/27    17:29:02.183    22    dsmreceiversubtaskbase.cpp(451)    [0000000005090C    |TaskID=828B77FD-16D4-4907-A5C7-55BC85DBC08E    Failed: F: lVal : hr: 0x80990a51
    0DE4    0E40    08/27    17:29:02.183    22    dsmreceiversubtaskbase.cpp(253)    [0000000005090C    |TaskID=828B77FD-16D4-4907-A5C7-55BC85DBC08E    Failed: F: lVal : OnReadCompleted(dwNumberOfBytes, pAgentOvl, dwError): 0x80990a51
    0DE4    0E40    08/27    17:29:02.183    22    dsmreceiversubtaskbase.cpp(204)    [0000000005090C    |TaskID=828B77FD-16D4-4907-A5C7-55BC85DBC08E    Failed: F: lVal : ProcessWaitCompletion(dwNumberOfBytes, pAgentOvl, dwError): 0x80990a51
    0DE4    0E40    08/27    17:29:03.105    22    dsmsendersubtaskbase.cpp(155)    [0000000000421A    |TaskID=828B77FD-16D4-4907-A5C7-55BC85DBC08E    CDsmSenderSubTaskBase received session closed completion in WAIT state
    0DE4    0E40    08/27    17:29:03.105    22    dsmsubtaskbase.cpp(228)    [0000000000421A    |TaskID=828B77FD-16D4-4907-A5C7-55BC85DBC08E    Session closed before data move completed
    0DE4    0E40    08/27    17:29:03.105    22    dsmsendersubtaskbase.cpp(157)    [0000000000421A    |TaskID=828B77FD-16D4-4907-A5C7-55BC85DBC08E    Failed: F: lVal : OnSessionClosed(dwNumberOfBytes, pAgentOvl, dwError): 0x80072746
    0DE4    0E40    08/27    17:29:03.105    22    dsmsendersubtaskbase.cpp(279)    [0000000000421A    |TaskID=828B77FD-16D4-4907-A5C7-55BC85DBC08E    Failed: F: lVal : ProcessWaitCompletion(dwNumberOfBytes, pAgentOvl, dwError): 0x80072746
    0DE4    075C    08/27    17:29:09.745    39    aasubtask.cpp(911)    [00000000003F42    |TaskID=828B77FD-16D4-4907-A5C7-55BC85DBC08E    <?xml version="1.0"?>
    0DE4    075C    08/27    17:29:09.745    39    aasubtask.cpp(911)    [00000000003F42    |TaskID=828B77FD-16D4-4907-A5C7-55BC85DBC08E    <Status xmlns="http://schemas.microsoft.com/2003/dls/StatusMessages.xsd" StatusCode="-2137453999" Reason="Error" CommandID="RAReadDatasetFixup" CommandInstanceID="daeb8fd9-2d38-4e90-8cde-31c1ba0054e0" GuidWorkItem="5e20f7fb-0f24-4af3-b4b7-a56b13dccc68" TETaskInstanceID="828b77fd-16d4-4907-a5c7-55bc85dbc08e"><ErrorInfo xmlns="http://schemas.microsoft.com/2003/dls/GenericAgentStatus.xsd" ErrorCode="998" DetailedCode="-2137453999" DetailedSource="2"/><RAStatus><RAReadDatasetFixup xmlns="http://schemas.microsoft.com/2003/dls/ArchiveAgent/StatusMessages.xsd"><LWVStatus BytesTransferred="40807976" NumberOfFilesTransferred="487307" NumberOfFilesFailed="0" DataCorruptionDetected="false"/><FixupStatus BytesTransferred="48758784" NumberOfFilesTransferred="15565" NumberOfFilesFailed="0" DataCorruptionDetected="false"/></RAReadDatasetFixup></RAStatus></Status>
    0DE4    0E00    08/27    17:29:10.527    39    freesnapshotsubtask.cpp(1046)    [00000000050A3D    |TaskID=828B77FD-16D4-4907-A5C7-55BC85DBC08E    CIdleCleanupSnapshotSubTask: 1 DS

    Friday, August 27, 2010 6:23 PM
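
    The error codes quoted in this thread can be read as standard 32-bit HRESULT values. The sketch below (Python; `decode_hresult` is a hypothetical helper name) splits a value into its failure bit, facility, and code. It shows that the `StatusCode="-2137453999"` in the XML status is the signed decimal form of `0x80990A51`, and that `0x80072746` carries Win32 code 10054. Reading the latter as a network-level reset is an interpretation, not something the log states outright.

```python
def decode_hresult(value: int) -> dict:
    """Split a 32-bit HRESULT into its standard bit fields.

    Hypothetical helper for illustration; it is not part of DPM.
    """
    u = value & 0xFFFFFFFF  # accept signed decimal or unsigned hex input
    return {
        "hex": f"0x{u:08X}",
        "failure": bool(u >> 31),        # severity bit (bit 31)
        "facility": (u >> 16) & 0x7FF,   # bits 16-26
        "code": u & 0xFFFF,              # bits 0-15
    }


if __name__ == "__main__":
    # Values taken from the posts above.
    for raw in (-2137453999, 0x80070057, 0x80072746):
        print(decode_hresult(raw))
```

    Facility 7 is FACILITY_WIN32, so code 87 in `0x80070057` is ERROR_INVALID_PARAMETER and code 10054 in `0x80072746` is WSAECONNRESET (connection reset by peer).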
  • Is the file share that shows the replication error located on a volume that has been expanded, for example on a SAN?
    Saturday, August 28, 2010 9:29 AM
    Moderator
  • No, it is DAS.  It has not been expanded since we originally put this server on DPM, and it had been working fine for many weeks.
    Saturday, August 28, 2010 3:27 PM
  • Try modifying the protection group that the file share is a member of. Run through the wizard and then try a consistency check.

    /Robban


    Monday, August 30, 2010 12:34 PM
    Moderator
  • The problem appears to be caused by a WAN acceleration device that was put in place at the site recently.  It didn't hit us that it was the cause of the problem because there were no issues for over a week after installation.  Once other file servers in this site started exhibiting the same issue with DPM, it clicked.
    • Proposed as answer by ShaneB. _ Monday, August 30, 2010 3:33 PM
    • Marked as answer by Rod Savard Monday, August 30, 2010 3:45 PM
    Monday, August 30, 2010 2:35 PM
  • Hello Rod,

    If you don't mind me asking, what sort of "acceleration device" was put in place? Were there any other issues or was it just DPM that was failing?

    Thanks,
    Shane

    Monday, August 30, 2010 3:33 PM
  • Riverbed Steelhead appliance.  We are going to work with Riverbed to see what we need to do to get it to work properly.  It may be a simple configuration error.
    Monday, August 30, 2010 3:45 PM
  • I would be interested to hear what Riverbed comes back with.  We are using both DPM and Riverbed at remote sites.  I haven't seen that specific internal error code, and a manually created replica consistency check ran fine (just slowly), but scheduled syncs are failing for no apparent reason after almost exactly 47 minutes.  (posted here: http://social.technet.microsoft.com/Forums/en-US/dpmsetup/thread/81b1ce13-0b54-4adb-9c51-8272a72455cc)
    Mark
    Thursday, October 14, 2010 3:04 PM
  • The problem was out-of-date software on one of the Steelheads, combined with trying to use the Full Transparency feature.  We updated the software on that Steelhead device and haven't had a problem since.

    We are getting VERY impressive DPM traffic optimization with Steelhead appliances.

    Friday, October 15, 2010 2:59 AM