CSV stays at "Backup in progress, redirected access" long after the backups are done

    Question

  • I've got an issue that has come up a few times in the last 2 weeks or so. I first noticed this while I was troubleshooting a different issue (CSV failure when ownership changes during backup), but before we installed the hotfix which resolved that issue.

    Now, about 4 times in the last 2 weeks, I've found the CSV on my cluster in redirected status when it shouldn't be. It actually shows "Backup in progress, redirected access". It won't let me turn off redirected mode, or even take the CSV offline.

    This blog post on troubleshooting redirected access seems to describe exactly my issue, under #3:

    http://blogs.technet.com/b/askcore/archive/2010/12/16/troubleshooting-redirected-access-on-a-cluster-shared-volume-csv.aspx

    But it seems to point fingers at the backup application, which would be DPM in this case.  

    What could be going wrong that keeps the cluster from knowing the backup is done and releasing the CSV?

    Friday, March 02, 2012 8:30 PM

All replies

  • Could you please provide the information below, along with the cluster logs and the corresponding events associated with this issue?

    1. Which VSS provider are you using: the Microsoft VSS provider or a storage VSS provider?
    2. Which storage box are you using?
    3. Is the connection FC or iSCSI?

    Sajeed AM

    Monday, March 05, 2012 9:10 AM
  • 1. HP Storageworks VSS provider

    2. HP P2000 G3 

    3. FC

    Monday, March 05, 2012 2:39 PM
  • Hi,

    Please let us know which backup solution you are using. Try the following:

    1. Please ensure you have installed Command View EVA on one of the cluster nodes.

    2. Start a command prompt as Administrator, type vssadmin list providers, and confirm that the HP StorageWorks VSS provider is listed.

    3. Run the command vssadmin list shadows while the disk is in redirected access mode.

    Normally a snapshot operation puts the disk into redirected mode and then automatically switches it back to direct access mode. This should take 3 to 5 minutes at most. If it takes longer, run the command in step 3 and check which provider created the running snapshot. If it is the Microsoft provider, reconfigure the HP StorageWorks VSS provider. If it is the HP VSS provider, contact the HP vendor.
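The decision steps above can be sketched roughly in Python. This is only an illustration, not an HP or Microsoft tool: the function name `triage_redirected_csv` is made up, and it assumes the provider name appears somewhere in the text printed by vssadmin list shadows. The "No items found" string is the message vssadmin prints when no shadow copies exist.

```python
# Message vssadmin prints when there are no shadow copies to list.
NO_SHADOWS_MESSAGE = "No items found that satisfy the query."

# Upper bound quoted above for a normal snapshot-related redirect.
NORMAL_REDIRECT_MINUTES = 5


def triage_redirected_csv(minutes_redirected: float,
                          list_shadows_output: str) -> str:
    """Suggest a next step from how long the CSV has been redirected and
    from the text printed by `vssadmin list shadows` (hypothetical helper)."""
    if minutes_redirected <= NORMAL_REDIRECT_MINUTES:
        return "wait: still within the normal snapshot window"
    if NO_SHADOWS_MESSAGE in list_shadows_output:
        return "no shadow copy present: the CSV was not released"
    text = list_shadows_output.lower()
    # Assumption: the provider name shows up in the listing text.
    if "microsoft software shadow copy provider" in text:
        return "software provider in use: reconfigure the HP VSS provider"
    if "hp storageworks" in text:
        return "hardware provider in use: contact the HP vendor"
    return "unknown provider: inspect the output manually"


if __name__ == "__main__":
    print(triage_redirected_csv(120, NO_SHADOWS_MESSAGE))
```

The function only classifies text you paste into it; it does not run vssadmin or touch the cluster.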


    Sajeed AM

    • Proposed as answer by Shabarinath Monday, March 05, 2012 5:12 PM
    • Unproposed as answer by Gai-jin Wednesday, March 07, 2012 3:00 PM
    Monday, March 05, 2012 4:03 PM
  • I'm not sure what you're asking by "what is the backup solution you are using"?  We use DPM, if that's what you're looking for.

    1) I'm not familiar with that software.  Since it has EVA in the name, I'm guessing that it may not be applicable to the MSA line.  

    2) HP StorageWorks P2000/MSA2000 VSS Provider and MS Software Shadow Copy provider 1.0 are both listed

    3) "No items found that satisfy the query." 

    Just to be clear, the CSV stays in this "in progress" state for hours after the backup jobs are done. We're not talking about a few minutes of redirected access. It has reset itself back to online status a couple of times now when it was stuck. I don't have exact times, but it seemed to be about 24 hours after it first went into the "Backup in progress, redirected access" status.

    Tuesday, March 06, 2012 6:41 PM
  • It is clear that the snapshot operation is hanging. Please capture the output of vssadmin list providers.


    Sajeed AM

    Wednesday, March 07, 2012 5:52 AM
  • Here it is:
    C:\Windows\system32>vssadmin list providers
    vssadmin 1.1 - Volume Shadow Copy Service administrative command-line tool
    (C) Copyright 2001-2005 Microsoft Corp.
    
    Provider name: 'Microsoft Software Shadow Copy provider 1.0'
       Provider type: System
       Provider Id: {b5946137-7b9f-4925-af80-51abd60b20d5}
       Version: 1.0.0.7
    
    Provider name: 'HP StorageWorks P2000/MSA2000 VSS Provider'
       Provider type: Hardware
       Provider Id: {bd04cbf9-212c-4553-9ea5-c5bfb05ccc8f}
       Version: 2.8.0.19
    
    
    C:\Windows\system32>

    Unfortunately, this is an intermittent problem, and it isn't in this 'stuck' state right now. It happens every few days, and I'm not sure what the trigger is, so I can't reproduce the issue on demand.
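For anyone collecting this output from several nodes, here is a minimal Python sketch that turns the vssadmin list providers text into records, so hardware and system providers can be told apart. The sample text is copied from the output above; the function name `parse_providers` is just an illustrative choice.

```python
import re
from typing import Dict, List

# Provider section of the output shown in this post.
SAMPLE_OUTPUT = """\
Provider name: 'Microsoft Software Shadow Copy provider 1.0'
   Provider type: System
   Provider Id: {b5946137-7b9f-4925-af80-51abd60b20d5}
   Version: 1.0.0.7

Provider name: 'HP StorageWorks P2000/MSA2000 VSS Provider'
   Provider type: Hardware
   Provider Id: {bd04cbf9-212c-4553-9ea5-c5bfb05ccc8f}
   Version: 2.8.0.19
"""


def parse_providers(output: str) -> List[Dict[str, str]]:
    """Return one dict per provider found in `vssadmin list providers` text."""
    providers: List[Dict[str, str]] = []
    for raw_line in output.splitlines():
        line = raw_line.strip()
        match = re.match(r"Provider name: '(.+)'$", line)
        if match:
            # A new provider section starts here.
            providers.append({"name": match.group(1)})
        elif providers and ":" in line:
            # Indented "key: value" lines belong to the latest provider.
            key, value = line.split(":", 1)
            providers[-1][key.strip()] = value.strip()
    return providers


providers = parse_providers(SAMPLE_OUTPUT)
hardware = [p["name"] for p in providers if p.get("Provider type") == "Hardware"]
print(hardware)
```

Banner lines such as the vssadmin version header are ignored because they appear before the first "Provider name" line.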
    Wednesday, March 07, 2012 3:00 PM
  • Sorry for the delay :(

    It seems that the Microsoft Software Shadow Copy provider is also enabled. Before further troubleshooting, please ensure that on the storage side the host type is set to Windows 2008. If it is set to Windows, please change it to Windows 2008.

    If the disk is still in redirected access mode, run the command vssadmin list shadows to confirm whether any snapshot is present. If so, we can clear the snapshot and return the disk to direct access manually.

    To troubleshoot further and find a permanent solution, I need to know the backup configuration in detail. Please describe the steps you followed on the host machines and on the storage side to set up backups, as well as the backup schedules.


    Sajeed AM

    Sunday, March 11, 2012 2:09 PM
  • I expect the MS VSS provider is there because it is used for backups of the host. Is that not what you would expect to see?

    I'm not familiar with anywhere in the P2000 interface where I can set what OS the host uses.  I looked through the settings again, but I don't see anything like that.  

    The CSVs are once again stuck in the BIP state, and the vssadmin list shadows command still shows "No items found that satisfy the query". The shadows are getting deleted, but the cluster doesn't seem to get the notice that the backup job is done and that it should return the CSV to its normal state.

    What do you want to know about the configuration?

    The DPM server is an HP DL380 G7, running Windows Server 2008 R2 SP1 and DPM 2010 (ver 3.0.7707.0), backing up to local SATA drives and a Quantum tape library.

    The cluster consists of 2 HP DL580 G7 servers running Windows Server 2008 R2 SP1, and an HP P2000 G3 SAN. The SAN is directly connected to each server via FC.

    The cluster and backups worked together fine for the most part up until late last year. At that time, we occasionally saw an issue with the CSV failing. We were able to determine that the CSV failed when the VM load was split between the two nodes of the cluster and the backup of the guests started. When the cluster finished snapshotting the CSV for one node and tried to switch ownership to the other node, the CSV would fail. This only occurred if we used the hardware VSS provider. I worked with HP for weeks troubleshooting from a SAN point of view, and with MS for a week or so trying to troubleshoot this issue from the OS side. As it turned out, the solution didn't come from my ticket with MS or HP, but in the form of a recently released hotfix (KB2637197) that someone from MS advised me of in my thread about the issue in the DPM forum. That hotfix did seem to correct the CSV failure. However, a week or so before that hotfix was applied, I noticed this problem with the drive staying in BIP status. I wonder if this "stuck in BIP state" issue might have the same root cause as the failing CSV issue. They seem very similar, so I tend to think they are related on some level. I don't know what changed that caused the CSV failure issue to start, but if the same thing also causes the stuck-in-BIP issue, that issue would have been masked, since the CSV failed before it could ever be released to return to normal I/O.

    The cluster is currently hosting 4 freshly built test servers, just for the purpose of troubleshooting this issue. Production guests were moved from the cluster to run as local guests on one of the Hyper-V nodes, due to the CSV problems we've been fighting. The test guests are running 2008 R2 w/ SP1. They have no applications or roles installed and no load on them; they sit idle all day except for being backed up. Guests #1 and #3 run on node 1, and Guests #2 and #4 run on node 2.

    Before we started having any issues, DPM backed up the Hyper-V guests once a day, overnight. Once production load was off of the cluster, the test servers were built and set up with a more frequent backup schedule, for the purpose of testing the various changes and fixes we made with HP and MS while troubleshooting the problem. As of now, the backups are running several times a day, in some cases with 1 hour between backups, in other cases 2-4 hours, depending on time of day. The test backup jobs never take more than 10 minutes total when they work correctly, so even 1 hour between recovery points should allow more than enough time.
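Since exact times of the stuck state have been hard to pin down, here is a small Python sketch of how periodically recorded CSV states could be scanned for runs that exceed the 10-minute job duration mentioned above. The sample timestamps are hypothetical, and how the states get recorded (for example a scheduled task logging the CSV status) is left out on purpose.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

BIP_STATE = "Backup in progress, redirected access"


def stuck_periods(samples: List[Tuple[datetime, str]],
                  limit: timedelta = timedelta(minutes=10)
                  ) -> List[Tuple[datetime, datetime]]:
    """Return (start, end) for each run of BIP_STATE lasting longer than limit.

    samples must be sorted by time. For a run that ended, `end` is the first
    sample where the state had cleared; for a run still open at the end of
    the data, `end` is the last sample seen.
    """
    periods = []
    run_start = None
    last_seen = None
    for when, state in samples:
        if state == BIP_STATE:
            if run_start is None:
                run_start = when
            last_seen = when
        else:
            if run_start is not None and when - run_start > limit:
                periods.append((run_start, when))
            run_start = None
    if run_start is not None and last_seen - run_start > limit:
        periods.append((run_start, last_seen))
    return periods


# Hypothetical samples: one normal 5-minute redirect, then a stuck one.
t0 = datetime(2012, 3, 6, 1, 0)
samples = [
    (t0, "Online"),
    (t0 + timedelta(minutes=5), BIP_STATE),
    (t0 + timedelta(minutes=10), "Online"),
    (t0 + timedelta(minutes=60), BIP_STATE),
    (t0 + timedelta(minutes=65), BIP_STATE),
    (t0 + timedelta(minutes=180), BIP_STATE),
]
stuck = stuck_periods(samples)
print(stuck)
```

Lining up the reported start times with the DPM job history would show which backup job preceded each stuck period.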

    The MS support case is still open, but the support engineer believes this to be a SAN issue. The HP support engineer is reviewing the logs again, together with the cluster logs in which the MS engineer found some errors, but doesn't believe the problem is on their end. At this point, MS support is in a holding pattern waiting for HP to show that it's not a SAN issue.

    Does that cover everything you wanted?




    • Edited by Gai-jin Tuesday, March 13, 2012 4:10 PM
    Tuesday, March 13, 2012 3:24 PM
  • Is it possible that this problem could be caused somehow by having cluster guests running on node2 and local (not clustered) guests running on node 2 as well, with both being backed up in the same timeframe?  


    As I'm trying to piece together when this problem started, it seems like it was right after we first moved the production servers off the cluster to run locally on the NODE2 server, and added TEST servers running on the cluster to use for troubleshooting another issue. Both the production servers, running as local Hyper-V guests, and the test servers, running as cluster resources on each node, would be getting backed up around the same time of day. This doesn't happen every day, but when it does it's always around the same time, and it always seems to be NODE2 that doesn't release the CSV from the "backup in progress" state when the backup of the cluster guest is finished.
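One way to test this hunch is to line up the backup job times and look for clustered and local guest backups that overlap on the same node. A rough Python sketch follows; the job names and times are made up for illustration and would need to be replaced with real values from the DPM job history.

```python
from datetime import datetime
from typing import Dict, List, Tuple


def overlapping_backups(jobs: List[Dict]) -> List[Tuple[str, str]]:
    """Return (clustered, local) job-name pairs that run on the same node
    during overlapping time windows."""
    clustered = [j for j in jobs if j["kind"] == "clustered"]
    local = [j for j in jobs if j["kind"] == "local"]
    pairs = []
    for c in clustered:
        for l in local:
            same_node = c["node"] == l["node"]
            # Two intervals overlap when each starts before the other ends.
            overlap = c["start"] < l["end"] and l["start"] < c["end"]
            if same_node and overlap:
                pairs.append((c["name"], l["name"]))
    return pairs


# Hypothetical job history; all names and times are placeholders.
jobs = [
    {"name": "TEST2", "node": "NODE2", "kind": "clustered",
     "start": datetime(2012, 4, 2, 1, 0), "end": datetime(2012, 4, 2, 1, 10)},
    {"name": "PROD1", "node": "NODE2", "kind": "local",
     "start": datetime(2012, 4, 2, 1, 5), "end": datetime(2012, 4, 2, 1, 30)},
    {"name": "TEST1", "node": "NODE1", "kind": "clustered",
     "start": datetime(2012, 4, 2, 1, 0), "end": datetime(2012, 4, 2, 1, 10)},
]
conflicts = overlapping_backups(jobs)
print(conflicts)
```

If the stuck days match the days with overlaps on NODE2 and the clean days do not, that would support the theory; it would not prove the cause.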

    Monday, April 02, 2012 6:03 PM
  • There is a recently released hotfix available that addresses similar issues. Please install this fix, which is scheduled to be included in Service Pack 2 for Windows Server 2008 R2, on each of the cluster nodes.

    2674551 Redirected mode is enabled unexpectedly in a Cluster Shared Volume when you are running a third-party application in a Windows Server 2008 R2-based cluster
    http://support.microsoft.com/default.aspx?scid=kb;EN-US;2674551

    There are additional details about the problem in a blog post titled "FIXED: Cluster Shared Volumes (CSV) in redirected access mode after installing McAfee VSE 8.7 Patch 5 or 8.8 Patch 1" at http://blogs.technet.com/b/askcore/archive/2012/03/18/fixed-cluster-shared-volumes-csv-in-redirected-access-mode-after-installing-mcafee-vse-8-7-patch-5-or-8-8-patch-1.aspx. Please try the hotfix even if you are not using the McAfee AV solution.

    Friday, April 27, 2012 7:47 PM
  • I've seen this with your exact SAN config.

    And I have a workaround.

    NB: HOST = Hyper-V host node of the cluster.

    In brief, the host VSS component communicates with the SAN through the CAPI proxy.

    The CAPI proxy gets its information by reading the first LUN of the SAN (the one with the lowest ID in the mapping).

    This FAILS if the HOST has a Persistent Reservation on the first LUN, with an error saying that it cannot talk to the controller.

    So the VSS provider fails the snapshot on one HOST and not on the OTHERS (if you have an N-node cluster), in a very pseudorandom-but-elusive way.

    Pseudorandom-but-elusive because the CSV is migrated to the host where the VM is running, so it's difficult to find the logic of the problem; because the problem is hidden in an obscure log file of the CAPI component; and because, if you use a disk as quorum, you often use the first disk mapped, so the correlation is hard to spot.

    SOLUTION: if you have N hosts

    create N tiny LUNs

    map the FIRST LUN to the FIRST HOST with ID 0

    map the SECOND LUN to the SECOND HOST with ID 0

    and so on for every HOST

    then DO NOT USE THESE LUNs.

    REALLY, DO NOT USE THEM.

    Because of this, the FIRST LUN of each HOST will never be PERSISTENT RESERVED by MS CLUSTERING, and VSS will work on all hosts.

    Also, Storage Manager for SANs can then give SAN info on all nodes of the cluster.
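To make the mapping concrete, here is a small Python sketch that only builds the plan described above as data: one tiny placeholder LUN per host, each mapped to exactly one host with LUN ID 0. It does not talk to the SAN; the volume names and the host names NODE1 and NODE2 are placeholders, and the actual mapping still has to be done in the array's management interface.

```python
from typing import Dict, List


def placeholder_lun_plan(hosts: List[str]) -> List[Dict[str, object]]:
    """Build one mapping entry per host: a dedicated tiny LUN at ID 0.

    The entries are a plan only; nothing is created on the array.
    """
    plan = []
    for index, host in enumerate(hosts, start=1):
        plan.append({
            "volume": f"placeholder-{index}",  # hypothetical volume name
            "mapped_to": host,                 # visible to this host only
            "lun_id": 0,                       # lowest ID, read by the CAPI proxy
            "use_for_data": False,             # never add it to the cluster
        })
    return plan


plan = placeholder_lun_plan(["NODE1", "NODE2"])
for entry in plan:
    print(entry)
```

Each host ends up with its own ID 0 LUN, which is the part that keeps the cluster from ever placing a reservation on it.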

    The sad part is that HP support seems unaware of this, and I have not found a way to reach whoever is in charge of RECEIVING reports about this problem.

    I know it seems like a joke, but it isn't.

    I also tried posting in the desert called the HP SUPPORT FORUM, with no result.

    The post that inspired me is http://h30499.www3.hp.com/t5/Disk-Array/CAPI-VDS-VSS-problems/m-p/5695987/highlight/true#M45633

    My post was here: http://h30499.www3.hp.com/t5/Disk-Array/Issues-with-VSS-and-VDS-on-Win-2008-R2-SP1-cluster-MSA2324i/m-p/5737973/highlight/true#M45682

    If someone has a saint inside HP, please point them to this problem.


    • Edited by Manfri Thursday, March 28, 2013 8:36 PM
    Thursday, March 28, 2013 8:35 PM