DPM 2010 fails to create replica for 7 million files and subsequent consistency checker jobs never complete "Items scanned:" for millions of files

  • Question

  • SUMMARY: 
         * We are moving protection jobs from a DPM 2007 server to DPM 2010 server running on Windows Server 2008 R2 Standard (1 CPU, quad-core with 16 GB RAM).
         * We are trying to protect a 750 GB E:\ volume containing ~7 million small files over a LAN.
         * The original replica creation failed, and every subsequent consistency check has also failed to complete, usually stopping far short of the total number of files on the volume (as indicated by the "Items scanned:" numbers).
         * Most other jobs migrated to the DPM 2010 server have been successful with no issues.

    QUESTION:
     Should I delete the replica for the 750 GB volume and start over again? Would I have a better chance of successfully creating the replica rather than running repeated consistency check attempts which fail?

    NOTE: I would attempt this only after all other DPM protection jobs have been migrated to the DPM 2010 server (so I wouldn't be modifying the protection group while attempting this). I noticed that a number of error messages indicated that jobs were probably cancelled due to a change in the protection group, which is true in some cases but not all. Sometimes I would kick off another consistency check at the end of the nightly migrations, only to find in the morning that the job had failed with a stuck "Items scanned:" indicator.

    NOTE: The DPM 2007 server had no problems protecting all of these files, so the total number of files shouldn't be a problem for DPM 2010, right?

    LOGS:
     Attached are the relevant logs (from oldest to newest) when any data was synchronized to the replica, starting with the replica creation log.


    ================================================

    Type: Replica creation
    Status: Failed
    Description: The job was cancelled. The user either cancelled the job or modified the associated protection group. (ID 908)
     More information
    End time: 12/15/2012 4:09:47 PM
    Start time: 12/15/2012 3:16:07 PM
    Time elapsed: 00:53:40
    Data transferred: 24,654.19 MB
    Cluster node -
    Source details: E:\
    Protection group: SEATTLE SERVERS

    ================================================

    Type: Consistency check
    Status: Failed
    Description: The job was cancelled. The user either cancelled the job or modified the associated protection group. (ID 908)
     More information
    End time: 12/16/2012 12:13:45 AM
    Start time: 12/15/2012 4:11:37 PM
    Time elapsed: 08:02:07
    Data transferred: 75,622.58 MB
    Cluster node -
    Source details: E:\
    Protection group: SEATTLE SERVERS
    Items scanned: 908364
    Items fixed: 695289

    ================================================

    Type: Consistency check
    Status: Failed
    Description: Number of files skipped for synchronization due to errors has exceeded the maximum allowed limit of 100 files on this data source (ID 32538 Details: Internal error code: 0x809909FE)
     More information
    End time: 12/17/2012 1:11:25 AM
    Start time: 12/16/2012 12:15:54 AM
    Time elapsed: 24:55:31
    Data transferred: 345,636.95 MB
    Cluster node -
    Source details: E:\
    Protection group: SEATTLE SERVERS
    Items scanned: 5317679
    Items fixed: 4418911

    ================================================

    Type: Consistency check
    Status: Failed
    Description: Number of files skipped for synchronization due to errors has exceeded the maximum allowed limit of 100 files on this data source (ID 32538 Details: Internal error code: 0x809909FE)
     More information
    End time: 12/17/2012 11:25:10 PM
    Start time: 12/17/2012 9:22:28 AM
    Time elapsed: 14:02:41
    Data transferred: 2,363.03 MB
    Cluster node -
    Source details: E:\
    Protection group: SEATTLE SERVERS
    Items scanned: 1453801
    Items fixed: 4

    ================================================

    Type: Consistency check
    Status: Failed
    Description: DPM failed to communicate with <<servername>> because of a communication error with the protection agent. (ID 53 Details: Server execution failed (0x80080005))
     More information
    End time: 12/18/2012 1:17:23 AM
    Start time: 12/17/2012 11:29:15 PM
    Time elapsed: 01:48:08
    Data transferred: 174.80 MB
    Cluster node -
    Source details: E:\
    Protection group: SEATTLE SERVERS
    Items scanned: 120734
    Items fixed: 0

    ================================================

    Type: Consistency check
    Status: Failed
    Description: The job was cancelled. The user either cancelled the job or modified the associated protection group. (ID 908)
     More information
    End time: 12/18/2012 3:47:04 PM
    Start time: 12/18/2012 7:42:24 AM
    Time elapsed: 08:04:40
    Data transferred: 286.72 MB
    Cluster node -
    Source details: E:\
    Protection group: SEATTLE SERVERS
    Items scanned: 200443
    Items fixed: 0

    ================================================

    Type: Consistency check
    Status: Failed
    Description: The job was cancelled. The user either cancelled the job or modified the associated protection group. (ID 908)
     More information
    End time: 12/18/2012 4:12:43 PM
    Start time: 12/18/2012 3:49:57 PM
    Time elapsed: 00:22:45
    Data transferred: 8,961.30 MB
    Cluster node -
    Source details: E:\
    Protection group: SEATTLE SERVERS
    Items scanned: 35493
    Items fixed: 16

    ================================================

    Type: Consistency check
    Status: Failed
    Description: The job was cancelled. The user either cancelled the job or modified the associated protection group. (ID 908)
     More information
    End time: 12/18/2012 4:17:06 PM
    Start time: 12/18/2012 4:13:11 PM
    Time elapsed: 00:03:55
    Data transferred: 15.76 MB
    Cluster node -
    Source details: E:\
    Protection group: SEATTLE SERVERS
    Items scanned: 167
    Items fixed: 0

    ================================================

    Type: Consistency check
    Status: Failed
    Description: The job was cancelled. The user either cancelled the job or modified the associated protection group. (ID 908)
     More information
    End time: 12/19/2012 4:54:36 PM
    Start time: 12/18/2012 4:17:37 PM
    Time elapsed: 24:36:59
    Data transferred: 1,951.22 MB
    Cluster node -
    Source details: E:\
    Protection group: SEATTLE SERVERS
    Items scanned: 1193491
    Items fixed: 12

    ================================================

    Type: Consistency check
    Status: Failed
    Description: The job was cancelled. The user either cancelled the job or modified the associated protection group. (ID 908)
     More information
    End time: 12/20/2012 7:11:54 AM
    Start time: 12/19/2012 5:03:37 PM
    Time elapsed: 14:08:16
    Data transferred: 9,525.46 MB
    Cluster node -
    Source details: E:\
    Protection group: SEATTLE SERVERS
    Items scanned: 385323
    Items fixed: 3
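
    To compare the runs above, the "Time elapsed:", "Data transferred:" and "Items scanned:" fields of each job can be turned into rates. The following is only a minimal sketch, not anything DPM provides: the function names parse_job and summarize are made up for illustration, and the total of 7,000,000 items is the approximate file count quoted in the summary, not an exact figure.

```python
# Hypothetical helper for reading the job details pasted above.
# Field names match the log text; everything else is an assumption.

SAMPLE = """\
Type: Consistency check
Status: Failed
Time elapsed: 24:55:31
Data transferred: 345,636.95 MB
Items scanned: 5317679
Items fixed: 4418911
"""


def parse_job(text):
    """Split 'Key: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def summarize(fields, total_items=7_000_000):
    """Compute throughput and scan coverage for one job.

    total_items is the approximate number of files on the volume
    (about 7 million according to the summary).
    """
    hours_part, minutes_part, seconds_part = (
        int(part) for part in fields["Time elapsed"].split(":")
    )
    hours = hours_part + minutes_part / 60 + seconds_part / 3600
    megabytes = float(fields["Data transferred"].replace(",", "").split()[0])
    scanned = int(fields.get("Items scanned", 0))
    return {
        "mb_per_hour": megabytes / hours,
        "items_per_hour": scanned / hours,
        "percent_scanned": 100 * scanned / total_items,
    }


summary = summarize(parse_job(SAMPLE))
```

    For the 24:55:31 run, this puts the scan at roughly three quarters of the estimated file count before it failed, which is the furthest any of the attempts got.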

    Thanks,

    Chris


    • Edited by SFMChris Friday, December 21, 2012 5:47 PM minor edit
    Friday, December 21, 2012 5:44 PM

Answers

  • UPDATE: To resolve this issue (and before attempting to delete the partial replica and start over), I researched the application that had created the millions of files (FileNet) and discovered that it is known to cause file system errors. I ended up running chkdsk three times, fixing hundreds of thousands of errors on the disk. After the third chkdsk pass, the consistency check was able to run all the way through without stalling and create the replica. For the last 2 weeks or so, DPM has been able to synchronize the replica successfully, and the files are now being protected.

    CONCLUSION: I don't know if there is an upper limit to the number of files per volume that DPM can protect, but I can say that we are protecting over 7 million files on a single volume with DPM 2010.

    RECOMMENDED IMPLEMENTATION PLAN: If you are planning to protect a file server with millions of files, then once you have installed the DPM agent and it is talking to the DPM server, I recommend running CHKDSK repeatedly until it comes up clean (configure it to run at startup, then reboot for each pass). I believe you should then have no issues with the replica creation.
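
    One way to decide whether another repair pass is needed is to run CHKDSK in read-only mode (no /f switch) between reboots and look at its exit code. The sketch below is an assumption-laden illustration rather than the procedure I actually used: the helper names check_volume and needs_another_pass are hypothetical, the exit-code meanings are the ones listed in Microsoft's chkdsk documentation, and it must be run from an elevated prompt on the protected server.

```python
import subprocess

# Exit codes as described in Microsoft's chkdsk documentation.
CHKDSK_EXIT_MEANINGS = {
    0: "No errors were found.",
    1: "Errors were found and fixed.",
    2: "Disk cleanup was performed, or was skipped because /f was not specified.",
    3: "Could not check the disk, or errors were left unfixed (for example, /f was not specified).",
}


def check_volume(volume, run=subprocess.run):
    """Run a read-only CHKDSK on the volume (e.g. "E:") and return its exit code.

    The run parameter exists only so the call can be replaced in tests;
    by default it is subprocess.run.
    """
    result = run(["chkdsk", volume], capture_output=True, text=True)
    return result.returncode


def needs_another_pass(exit_code):
    """Treat anything other than 'no errors found' as a reason to repair again."""
    return exit_code != 0


# Example use on the file server (Windows, elevated prompt):
#   code = check_volume("E:")
#   print(CHKDSK_EXIT_MEANINGS.get(code, "Unknown exit code"))
#   if needs_another_pass(code):
#       print("Schedule another chkdsk /f at startup and reboot.")
```

    The repair itself still happens at startup as described above; this only reports whether the volume looks clean afterwards.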

    Chris

    • Marked as answer by SFMChris Thursday, January 10, 2013 5:36 PM
    Thursday, January 10, 2013 5:36 PM