Is anyone successfully protecting Storage Server 2008 R2 with SIS volumes?

  • Question

  • We are having major issues with our DPM 2010 server (running on 2008 R2 SP1 with DPM version 3.0.7707.0) ever since we started protecting our new 3-node Dell NX3000 cluster running Storage Server 2008 R2 SP1 with SIS enabled volumes.  Our DPM server started freezing within 8 hours of a power cycle shortly after creating new PG's for the new cluster.  This turned out to be a deadlock issue with SIS on the DPM side which required installing a hotfix. 


    Unfortunately, after installing the unreleased hotfix our DPM server has still had a multitude of problems:  inconsistent volumes; missing volumes after disk management tasks (rescan disks, create new slice, add new volumes); synchronizations failing; backups to tape failing after a very small amount of data written; etc. 


    We have three DPM servers protecting a variety of LAN and WAN servers.  Only one of these DPM servers protects SIS enabled shares on a storage server cluster and it is the only server having issues.  We have also replicated some of our problems on a separate test environment. The DPM server with the issues was rock solid for over a year protecting non-SIS data sources. 

    The DPM server that is having the problems:

    • Dell R510 w/ 48GB RAM, PERC H700 and H800 controllers
    • Qty 48 - 2TB disks (12 internal, 36 external on MD1200's)
    • DPM storage pool consists of eight ~8.5TB volumes
    • TL4000 w/ four LTO-5 drives (off two dual-port 6Gb SAS controllers)


    This server used to protect our file server cluster which ran on 2008 SP2 with approximately 27TB of data spread across 6 volumes (8TB, 8TB, 8TB, 8TB, 4TB, 4TB).  Everything worked great and we somehow did not run into the issue of trying to back up "millions of files" to tape.  Disk IO was not a problem and the server could easily stream 4 jobs to tape.  Our backup to tape window was less than a day. 


    When we migrated to the new Storage Server cluster, we split up our large volumes into many smaller mounted volumes.  Each department was given their own volume and share.  So far we have 45 mounted volumes with SIS enabled and saving quite a bit of storage space.  Volume size ranges from 128GB to 8TB.  (5x128GB, 6x256GB, 2x384GB, 9x512GB, 2x768GB, 7x1024GB, 3x1536GB, 5x3072GB, 5x4096GB, 1x8192GB)
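
    As a quick sanity check on the layout above, the breakdown can be tallied in a few lines. This is only a sketch: the counts and sizes are copied from the list in this post, and the total is provisioned volume size, not the amount of data stored.

```python
# Volume layout as listed in the post: (number of volumes, size of each in GB).
layout = [
    (5, 128), (6, 256), (2, 384), (9, 512), (2, 768),
    (7, 1024), (3, 1536), (5, 3072), (5, 4096), (1, 8192),
]

# Count the volumes and add up the provisioned capacity.
volume_count = sum(count for count, _ in layout)
total_gb = sum(count * size for count, size in layout)
total_tb = total_gb / 1024

print(f"{volume_count} volumes, {total_gb} GB provisioned ({total_tb:.1f} TB)")
```

    The count comes out to the 45 volumes mentioned above.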


    The NX3000 nodes have 96GB RAM; 10Gb NICs for iSCSI and LAN; and storage is on our EMC SAN.  I loaded each node with 96GB since during our tests with storage server we knew SIS would eat quite a bit of RAM.  We had also hit the file cache bug on our old cluster so better safe than sorry.  So far there have not been any issues on the file cluster side.  The cluster is configured in an active/active/test setup with three cluster resource groups.  Most of our data is split between two resource groups and the "test" resource group has a few non-production shares.  


    A run-down of the problems we've hit since adding SIS is as follows:


    When DPM creates new disk volumes (either by creating a new PG or expanding replica/RP volumes), volumes may go missing.  The disk partitions are still online; however, the disk label is blank instead of "DPM_vol-<guid>."  A re-scan of the disks may bring the volume back or it may cause others to go offline.  Re-scanning volumes causes the same problems as above.  It is like playing Russian roulette.   When the volume comes back after being missing it is marked as inconsistent, which requires a check that usually fails.


    Synchronizations often fail with:

    • DPM failed to communicate with the protection agent on DPMSERVERNAME because the agent is not responding. (ID 43 Details: Internal error code: 0x8099090E)


    Tape jobs fail with:

    •  (ID 2019 Details: An existing connection was forcibly closed by the remote host (0x80072746))
    •  The protection agent on DPMSERVERNAME was temporarily unable to respond because it was in an unexpected state. (ID 60 Details: Internal error code: 0x809909B0)
    •  DPM failed to communicate with the protection agent on DPMSERVERNAME because the agent is not responding. (ID 43 Details: Internal error code: 0x8099090E)


    Our configuration may or may not be larger than many here, but I'd love to hear from those of you who are or have tried to protect Storage Server with DPM.  Happy?  Sad?  Same boat as us?

    Thanks,

    -Wayne



    Thursday, March 1, 2012 4:26 PM

All replies

  • have you seen this thread?

    http://social.technet.microsoft.com/Forums/en-US/dataprotectionmanager/thread/6c27016f-2fa9-40ce-83dc-4d135e4b9e21/


    Jeffrey S. Patton Assistant Director of IT School of Engineering Computing Services University of Kansas 1520 West 15th Street Lawrence, KS. 66045-7621 | http://patton-tech.com

    Monday, March 19, 2012 1:52 PM
  • Yes, I've contributed to that thread as well.  The purpose for this thread was to see if anyone was protecting Storage Servers and not having problems.  So far nobody has chimed in, but people usually don't head to the forums unless they are having problems. 

    We ended up reverting to Server 2008 SP2 on the DPM server but we are still having issues protecting some SIS enabled volumes.  Replica creations, consistency checks, syncs and RP creations failing due to the agent on the local DPM server not responding, etc.  The DPM console is also extremely slow now when doing syncs, CC's and replica creations on SIS volumes.  

    Monday, March 19, 2012 2:01 PM
  • I thought your name looked familiar ;-)

    Funny, that is exactly what I'm protecting, in almost the same setup: a pair of NX3000's clustered with an MD3000i backend, which is the only difference. SIS is enabled and the latest hotfix for SIS is on all three servers. We also did the volume split that you mentioned; we did that because we were noting that DPM was having issues backing up so much user data. ~3200 users on a 4TB disk, with millions of files...while technically supported...DPM didn't appear to be able to keep up with the churn. We had better luck after splitting the drives.


    Jeffrey S. Patton Assistant Director of IT School of Engineering Computing Services University of Kansas 1520 West 15th Street Lawrence, KS. 66045-7621 | http://patton-tech.com

    Monday, March 19, 2012 2:06 PM
  • Since reverting to 2008 SP2 on the DPM server we have been OK for the most part except continued failure of some jobs (tape backups and syncs) due to unknown issues.  The major showstopper now is how consistency checks behave on SIS enabled volumes.  We recently had an incident on the file cluster which caused volumes to fail to go offline during a move between nodes which marked them as dirty -- which requires a consistency check on the DPM side.  (This behavior can be replicated with clean volumes just by initiating a CC manually) 

    The volumes in question have been protected by DPM for probably 2 months and the recovery point volume size remains fairly static (2 snaps per day, 32 days protection).  Most of our volumes have a very low data churn rate.  Here are stats from a few volumes before the consistency checks:

    volume   replica (GB)   RP (GB)
    vol1              240        19
    vol2             1519       158
    vol3             1889       162
    vol4              518        20

    The RP volumes were about 25% larger than the size used to allow for some growth and autogrow is enabled. What happens during the CC is that DPM will transfer a huge amount of data (sometimes about the size of the replica volume itself) and the items scanned/fixed counters are almost equal as if it is seeing every file as needing to be re-synced with DPM.  

    If you don't manually set the RP volume size to be insanely large (relative to the steady state size) the first CC will fail due to insufficient RP space.  It will also auto-grow and another CC will kick off.  The second will fail causing another auto-grow.  If you keep re-running the CC without manually setting the RP volume to a huge size the repeated consistency checks will cause the RP volume size to reach sizes greater than the original replica volume -- even with no data changed on the original protected volume.   

    We are now seriously considering disabling SIS on our file cluster due to the issues with DPM.  We are going to try DPM2012 to see if anything is fixed, but I'm not holding my breath.


    • Edited by Wayne Justin Sunday, May 6, 2012 1:22 AM table format
    Sunday, May 6, 2012 1:22 AM
  • It appears that consistency checks in DPM on SIS-enabled volumes are severely broken.  We were hoping that DPM 2012 would resolve the issue, but my latest test confirms that it does not. 

    I used a Storage Server 2008 R2 server with SIS and non-SIS volumes and a DPM 2012 server on top of 2008 R2.  The data on the Storage Server had not changed in several months and has been fully groveled for duplicates.  After the initial replications and creation of several recovery points I tested consistency checks on a SIS and non-SIS volume that were in the healthy state. 


    The non-SIS volume CC behaved as normal and completed very quickly:

    • 113,700 items scanned
      0 items fixed
      9.56 MB transferred


    The volume status did not change after the check:

    • Replica volume: 250.00 GB allocated, 169.24 GB used | Recovery Point volume: 24.00 GB allocated, 1.28 GB used


    The SIS-enabled volume was another story. 
    The volume status before the check:

    • Replica volume: 258.03 GB allocated, 171.62 GB used | Recovery Point volume: 45.61 GB allocated, 2.34 GB used


    The first check failed after the RP volume ran out of space causing an auto-grow.  The status of that check was:

    • 526,224 items scanned
      168,490 items fixed
      76,169.79 MB transferred


    The replica/RP volume stats changed to the following after the failure:

    • Replica volume: 258.03 GB allocated, 160.82 GB used | Recovery Point volume: 71.27 GB allocated, 56.59 GB used


    I noticed that the replica usage shrank (the allocation itself did not change) -- probably as a result of the failed check causing files to be deleted and their blocks moved to the RP volume?


    The second CC completed with the following job stats:

    • 263,112 items scanned
      73,060 items fixed
      26,445.16 MB transferred


    The volume info after the completed consistency check changed to:

    • Replica volume: 258.03 GB allocated, 171.62 GB used | Recovery Point volume: 71.27 GB allocated, 67.31 GB used


    It looks like we will now have to burn TAM hours to resolve this issue with DPM along with our missing volume issue. 
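
    To put the job stats above side by side, here is a small sketch that computes the share of scanned items DPM "fixed" in each check and how much the recovery point volume grew. All numbers are copied from this post; the labels and the helper name `fixed_ratio` are just illustrative.

```python
# Consistency check stats from this post: (items scanned, items fixed, MB transferred).
checks = {
    "non-SIS volume": (113_700, 0, 9.56),
    "SIS volume, first CC (failed)": (526_224, 168_490, 76_169.79),
    "SIS volume, second CC": (263_112, 73_060, 26_445.16),
}


def fixed_ratio(scanned, fixed):
    """Fraction of scanned items that the check reported as fixed."""
    return fixed / scanned if scanned else 0.0


for name, (scanned, fixed, mb) in checks.items():
    ratio = fixed_ratio(scanned, fixed)
    print(f"{name}: {ratio:.1%} of items fixed, {mb / 1024:.1f} GB transferred")

# Recovery point volume usage on the SIS volume, before the first CC and after the second.
rp_used_before_gb = 2.34
rp_used_after_gb = 67.31
rp_growth_gb = rp_used_after_gb - rp_used_before_gb
print(f"RP volume usage grew by {rp_growth_gb:.2f} GB with no change to the source data")
```

    Roughly a third of the items were flagged on the SIS volume versus none on the non-SIS volume, even though neither data set had changed.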


    Thursday, May 10, 2012 7:25 PM
  • "Cease and De-SIS"

    We did open a case with MS regarding the consistency check issue and it is indeed a bug where the replica is marked as invalid when it should not be.  Unfortunately, we have not received a fix or ETA on a fix as our case was closed due to contract issues and we are waiting for MS to re-open under a new contract.  

    With that said, I believe we are going to abandon SIS due to a number of factors:

    • Replica creation times of SIS volumes are much slower.
    • Restoration times of SIS volumes are much slower.
    • Tape jobs run slower with SIS volumes.
    • SIS causes missing volumes on a DPM server running 2008 R2.  
    • Microsoft moving to chunk-dedupe and possibly abandoning SIS in Server 2012+.  


    As our data set size has grown quite a bit I did a number of tests on replica creation and restoration times to test our RTO.  In our testing we have discovered that SIS causes much longer job times compared to the same data on a non-SIS volume.  For example:


    Volume A is SIS'd and contains about 650GB of data.   The SIS savings are:


    === Analysis of volume 'R:\A' on FILESR2 ===
    Common store files:                  102187
    Link files:                          319743
    Inaccessible link files:             0
    Space saved:                         350735429 KB


    Replica creation of this volume takes about 6.5 hours.  Restoring the volume takes 11.5 hours.  


    To test the same data without SIS, robocopy was used to transfer the data to a second volume that did not have SIS installed.  The data set bloated to about 1TB, which was expected.  Creating a new replica in DPM took just under 4 hours and restoring the data took 4.5 hours.  So transferring MORE data took less time.  While watching the resource monitor and performance monitors on the storage server it appears that the junction point creation is slowing things down greatly.  


    I did the same tests on a second volume that has much more SIS usage due to some not so educated users.  The SIS savings for the second volume are:


    === Analysis of volume 'R:\B' on FILESR2 ===
    Common store files:                  194646
    Link files:                          1484157
    Inaccessible link files:             0
    Space saved:                         618897990 KB


    Disk recovery time for the SIS volume was 26 hours vs 9 hours de-SIS'd.  SIS recovery transferred 1.2TB and de-SIS'd recovery transferred 1.8TB.  Given this, I believe we have no choice but to abandon SIS.  We will certainly miss the disk savings, but the problems and headaches we have had with SIS (in relation to DPM) are just not worth it.  
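
    For anyone who wants to compare their own numbers, here is a rough sketch that parses a report in the format pasted above, converts the space saved to GB, and compares the effective recovery rates from the volume B test. The function names are my own, and the parser only assumes the "Label: number" layout shown in this post, not any documented output format.

```python
import re

KB_PER_GB = 1024 * 1024

# Report text copied from the post above (volume A).
REPORT_A = r"""
=== Analysis of volume 'R:\A' on FILESR2 ===
Common store files:                  102187
Link files:                          319743
Inaccessible link files:             0
Space saved:                         350735429 KB
"""


def parse_sis_report(report):
    """Collect 'Label: number' lines into a dict; other lines are ignored."""
    stats = {}
    for line in report.splitlines():
        match = re.match(r"\s*([^:=]+):\s+(\d+)(?:\s*KB)?\s*$", line)
        if match:
            stats[match.group(1).strip()] = int(match.group(2))
    return stats


def gb_per_hour(terabytes, hours):
    """Effective transfer rate, treating 1 TB as 1024 GB."""
    return terabytes * 1024 / hours


stats_a = parse_sis_report(REPORT_A)
saved_gb_a = stats_a["Space saved"] / KB_PER_GB
links_per_store_a = stats_a["Link files"] / stats_a["Common store files"]
print(f"Volume A: {saved_gb_a:.1f} GB saved, {links_per_store_a:.2f} links per common store file")

# Volume B disk recovery: 1.2 TB in 26 hours with SIS vs 1.8 TB in 9 hours de-SIS'd.
sis_rate = gb_per_hour(1.2, 26)
plain_rate = gb_per_hour(1.8, 9)
print(f"Volume B recovery: {sis_rate:.1f} GB/h with SIS vs {plain_rate:.1f} GB/h without")
```

    The saved space on volume A (about 334 GB) plus the roughly 650GB on disk lines up with the ~1TB seen after the robocopy, so the slowdown is not explained by the amount of data moved.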


    • Edited by Wayne Justin Friday, July 27, 2012 3:25 PM formatting
    Friday, July 27, 2012 3:22 PM
  • Hey Wayne,

    Sorry you've not gotten any better results in your case. I don't think that I have either. We left it with Microsoft recommending that I upgrade the RAM on my DPM server beyond 4GB. I'm not sure where that stands now as I'm no longer with that group on campus. I did hear that they upgraded RAM, but they just did a major overhaul of storage. They moved from an MD3000i solution to a Compellent array, and are now exploring other backup solutions as well.

    I think DPM is great; I don't know if our issue was SIS as much as it was data churn. I still assert that in production use, DPM can't keep up with the amount of churn generated by nearly 3600 unique users logging into workstations with folder redirection. I could totally be wrong, but this opinion has been formed from over 4 years of working with DPM and trying to keep up-to-date backups of highly accessed files.

    Thanks,


    Jeffrey S. Patton Systems Specialist, Enterprise Systems University of Kansas 1001 Sunnyside Ave. Lawrence, KS. 66045 (785) 864-0242 | http://patton-tech.com

    Friday, July 27, 2012 5:52 PM
  • Hi Wayne,

    Here is what we are successfully using:

    We have DPM 2010 and, since last month, DPM 2012 servers, with our three Windows 2008 R2 fileservers converted to storage servers. We back up about 100TB of fileserver data with 2 DPM servers without problems (in relation to the size of the data). After some general initial issues with DPM in the first 3 months, it has been working fine for over a year. Our underlying SAN consists of multiple iSCSI 1Gbit/10Gbit EqualLogic arrays. 

    We have about 1500+ users accessing the files. File shares are from 10-15TB with 2-3+ million files. Format is NTFS with 4k or 16k block size.

    We are using dedicated fileservers with 2 processors and about 32GB RAM; same with the DPM servers. There are some things we had to adjust with DPM. We had to set a higher value for the agent timeout (15 minutes, I think). With that setting we fixed the communication errors to the remote host. 

    Another issue I remember was that too many broken or inaccessible files (for example broken ACLs, viruses, etc.) caused failed backups for us. The DPM error message was a bit misleading, as it also stated that the connection to the remote host was lost, but reading the DPM logs showed the real errors.

    Definitely the most bullshit feature of DPM is the use of dynamic disks. We had so many issues with broken dynamic disks. Don't know why they can't just use normal disks or SMB shares. 

    We also had some normal and Premier cases with MS regarding Storage Server and DPM, but Microsoft's knowledge of this data size and combination is not that great. Although DPM is working fine for us, it is probably not made for this kind of usage.

    Deduplication will hopefully get a lot better with native dedup in Server 2012, as DPM will then really be aware of the host dedup.

    best wishes,

    Stefan


    • Edited by mediasyst Thursday, August 23, 2012 9:03 PM
    Thursday, August 23, 2012 8:53 PM
  • @mediasyst, how did you convert your servers to storage servers? Windows 2008 Storage Server is an OEM sku, so I'm curious about that. Also, it sounds like you are now backing up your servers with DPM 2012? is that correct?

    In the discussions I've had with Microsoft, and Wayne's group and by your accounts I think it's safe to say, DPM is awesome for certain types of backups. also, if you've been backing up your SIS enabled data on 2012 for about a month now, I would like to get an update after a few months, just to see if you've encountered any issues. I know when we rolled to DPM 2010 we didn't have any problems until we started to get several recovery points, and heavy load.

    Thanks,


    Jeffrey S. Patton Systems Specialist, Enterprise Systems University of Kansas 1001 Sunnyside Ave. Lawrence, KS. 66045 (785) 864-0242 | http://patton-tech.com

    Thursday, August 23, 2012 10:17 PM
  • Hello Jeffrey,

    There is a patch for storage server branding (KB982050) and for SIS filter driver (KB976833) installation. The storage server itself is just an upgrade, I think. DPM 2012 has felt much more stable than DPM 2010 since release; we have been testing it since the DPM 2012 beta. It feels more like DPM 2010 QFE4.

    best wishes,

    Stefan

    Friday, August 24, 2012 10:35 AM
  • Thanks for the feedback, Stefan.  Are you actually using SIS on many of your storage server volumes?  Once we enabled SIS on our 50+ volumes we have had all sorts of issues ranging from server deadlocks (see sticky thread "DPM Server becomes unresponsive / hangs when protecting Storage Server SIS enabled volume" in General forum) to the other issues outlined in my last post.  

    Our biggest problem now is the problem where the SYSTEM registry hive has grown to 200+ MB in size on each of our file cluster nodes which causes a 2+ hour delay for each node to boot up to a usable state.  There is a hot-fix available for servers with Hyper-V installed, but that doesn't do us much good.  The symptoms are the same but the cause is probably slightly different.


    Friday, August 24, 2012 6:09 PM
  • Hello Justin,

    We are using a single bare-metal server for each fileserver. We tested virtualized and also clustered servers but we are staying with the normal servers. We have a cluster solution for other things like DB and printing, but I think clustering is not that optimal for large fileservers as far as we have tested. SIS is enabled on all our volumes.

    best wishes stefan

    Thursday, August 30, 2012 4:55 PM
  • One last question.  Have you run "sisadmin /v <volumes>" against your SIS'd volumes to see what your SIS savings are?  


    Tuesday, September 4, 2012 6:02 PM
  • This is the hotfix you need to fix your issue. It wasnt public at the time we had the same issues. I had to open a case with MS. Now the hotfix is public. This fixed the lockups we had.

    http://support.microsoft.com/kb/2608658/EN-US

    Wednesday, September 5, 2012 4:14 PM
  • We applied that hotfix long ago before it was even public.  That fixed our lockup issues, but we still have serious unpatched issues with handling SIS volumes in DPM.  
    Wednesday, September 5, 2012 5:24 PM
  • Just want to chime in and say I had nothing but problems getting consistent replicas of the data volume on my implementation of Windows Storage Server 2008 R2 Enterprise. I would run a manual copy for the replica creation, then the initial consistency check would run for 4-6 days, claim 90% of the items needed "fixing", and would copy 8-9TB before failing (there's about 10TB on the volume) with an error stating the agent was in an unexpected state. I verified hotfix levels, tried throttling the agent, disabled antivirus, added RAM to both the protected server and the DPM server, etc. etc. Nothing made any improvement.

    Based solely on this thread, I disabled SIS, ran another manual replica copy, and started another initial consistency check. It ran for 38 hours, fixed about 2% of the items, and transferred 200GB. And it worked.

    SIS was garnering us about 1.2TB of data savings. It is really disappointing to have to disable it, but after a month of unsuccessfully trying to get a replica with SIS enabled, then getting a replica with no problem and being in sync for the last two weeks with SIS disabled, the choice is pretty easy to make.

    SIS and DPM don't seem to play well together.


    Thursday, October 11, 2012 8:11 PM
  • Thanks for sharing your experiences.  Your last sentence sums things up nicely.  We are now waiting for MS to help resolve the issue on our file cluster (registry bloat) before we de-SIS and nuke/re-deploy our DPM servers.  We will certainly miss the SIS savings, but the savings aren't worth the troubles.  
    Wednesday, October 17, 2012 2:26 PM
  • Hello All,

    FYI - We are investigating DPM performance issues related to SIS enabled volumes.  We have discovered that if you have files that were SIS'd in the Recycle Bin, it will negatively affect CC performance, so I would make sure that the Recycle Bin is emptied on SIS enabled volumes.


    Regards, Mike J. [MSFT]

    Wednesday, October 17, 2012 9:33 PM
    Moderator
  • Hey Mike,

    I know this is an old thread, but I also just spent over 2 weeks working with Microsoft on a support call.  2008 Storage Server, SIS volumes, being protected in a primary to secondary DPM 2010 environment (DPM installed on 2008 R2 SP1).  The volumes were disappearing, DPM was unable to write to tape, etc. 

    In my case we ended up installing the LDR of KB2608658 on the DPM server, and while most issues were resolved, it did not stop the volumes from disappearing.

    What I found, and I am wondering if you have seen this, is that the DPM server was being backed up including a system state backup.  When the system state backup is initiated on the DPM server, something in the VSS operation after the WBENGINE service started up found information about the reparse points of the SIS common store (the protected data), kicked those volumes 'offline', and gave a few hundred thousand Ntfs 141 alerts barking about the SIS common store.  Doesn't make much sense, but it is what it is.  

    I've stopped protection of the system state and file system of the DPM server to let things settle down, and am debating on going any further with the MS disk support team (they offered, but I've already wasted a month on this problem).  Not sure if it is worth your time, but I remembered going through this post rather thoroughly before spending money on a support call.  

    Regards,

    Kyle

    Tuesday, December 18, 2012 3:15 PM
  • Mike:  Is there any update on the performance issue for SIS volumes on the DPM side?  We are about to lose 10TB in savings by de-SIS’ing our disks and it would be great if this could be avoided.  Sadly, backing up 40TB of non-SIS data will spool to tape faster than 30TB of SIS’d data given our experience. 

    Could the performance issue with SIS also be due to using Server 2008 on the DPM side instead of 2008 R2?   Our hands are tied here due to missing volumes on 2008 R2.  SC2012 SP1 isn't out yet, so we can't deploy DPM 2012 on Server 2012, which I really wish we could use.  

    I pray that chunk dedupe on a 2012 server won't have similar issues (performance and reliability) being backed up by DPM 2012. 

    Tuesday, December 18, 2012 4:44 PM
  • Kyle, thanks for the addition to the thread and I’m sorry that you have had to go through some of the same pain we have endured over the last year.  

    As it turns out, we were finally able to repair the registry issues this last weekend which have plagued our Storage Server cluster since before May.  Three registry hives were growing out of control due to a bug (no hotfix yet, so manual deletion and reboot was the fix) in VSS.

    Unfortunately we never found a solution for the missing volume problem on 2008 R2.  Please keep us informed with your progress, and I hope you find a satisfactory resolution quickly.  

    Tuesday, December 18, 2012 4:47 PM
  • Unfortunately, it appears that DPM 2012 SP1 still has the bug where consistency checks on a good SIS volume will "fix" a large number of files causing the recovery point volume to grow very large.  I tested by restoring a SIS'd volume to a test storage server and subsequently protecting this test server by a fresh 2012 SP1 install.  The groveler service was disabled and SIS usage was fairly heavy (114,165 common store files; 358,817 link files; 430GB saved) on this test volume.  

    I waited for DPM to complete a few days worth of syncs and recovery points (all static data) and ran a manual CC.  The result:  1,161,386 fixed files out of 1,231,128 scanned and 650GB transferred.  

    *sigh*  

    Tuesday, January 1, 2013 7:15 PM
  • We are not running DPM.  We have a 2-node Storage Server 2008 R2 file server cluster (~5 million files = 20TB) and we are thinking about turning on SIS.  I did check and we do have the deadlock hotfix (979040) applied.  We do fail over between the nodes quite often for MS updates, etc., and I do not want to enable SIS if it will increase the chances that the volumes would be marked as dirty, etc.  What do you guys think?  Are we in scope of having the same possible issues mentioned in this thread?

    Thanks,

    Dan

     
    • Edited by CaptLazarus Wednesday, February 20, 2013 12:03 AM
    Wednesday, February 20, 2013 12:02 AM
  • Hey CaptLazarus,

    No, I don't think you are going to have issues. However, I also think you are in the wrong forum if you are not using DPM. I would seek verification of your concerns elsewhere, as all of the problems here relate to file protection of SIS volumes using Data Protection Manager 2010, 2012, and SC SP1.

    It's a different story if you are going to be using DPM in the future to protect those SIS volumes. In that case, this forum thread speaks for itself.

    Regards,

    Kyle


    Wednesday, February 20, 2013 3:30 PM
  • Thanks Kyle - I was not sure if we were in scope and I do appreciate your feedback
    Wednesday, February 20, 2013 3:55 PM