Backup of VM puts CSV in redirected mode and fails

  • Question

  • I have a Hyper-V cluster with 3 nodes, all running Windows Server 2008 R2.

    All nodes are connected to an HP P2000 SAS SAN.

    There are a few VMs living on this cluster, all Windows Server 2008 R2 systems.

    Backups of these systems work fine.

     

    Last week I did a P2V of a Windows SBS 2008 box, and backup of this server fails.

    When I try a child partition backup of this server, the job in DPM runs for a while and then ends with event error ID 3114.

    I can see on the cluster that right after the backup job starts, the CSV volume changes into redirected mode; it then goes in and out of redirected mode until the backup job fails.

    If I create a recovery point of any of the other VMs, the CSV volume does not go into redirected mode while running the backup, and it ends successfully.

    From the event log on the node that the SBS VM lives on, I find these errors.

    Event 8, volsnap:

    The flush and hold writes operation on volume \\?\Volume{c7b814a9-0958-441e-ad1e-8d8de912b1d3} timed out while waiting for a release writes command.

     

    Event 12298, VSS:

     

    Volume Shadow Copy Service error: The I/O writes cannot be held during the shadow copy creation period on volume \\?\Volume{c7b814a9-0958-441e-ad1e-8d8de912b1d3}\. The volume index in the shadow copy set is 0. Error details: Open[0x00000000, The operation completed successfully.

    ], Flush[0x00000000, The operation completed successfully.

    ], Release[0x80042314, The shadow copy provider timed out while holding writes to the volume being shadow copied. This is probably due to excessive activity on the volume by an application or a system service. Try again later when activity on the volume is reduced.

    ], OnRun[0x00000000, The operation completed successfully.

    ]. 

     

    Operation:

       Executing Asynchronous Operation

     

    Context:

       Current State: DoSnapshotSet

     

     

     

     

    Both of these errors come repeatedly every 30 seconds or so until the backup job fails.

    No data is transferred to the DPM server during the job.

     

    Since this affects only one VM, I can only think that something inside the VM itself is causing this?

    It is the only SBS server in the cluster.
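    The event details above describe the VSS "flush and hold" window: Open and Flush succeed, but the Release step times out with 0x80042314 because the hardware provider does not finish committing the snapshot before writes must be released. A minimal sketch of that sequence (not the real VSS implementation; the ~10-second hold window is an assumption about the volsnap default):

    ```python
    HOLD_WINDOW_SECONDS = 10  # assumption: volsnap holds writes for ~10 s


    def snapshot_attempt(provider_commit_seconds):
        """Model one DoSnapshotSet attempt: Open and Flush succeed, then the
        hardware provider must commit the snapshot before the hold window
        ends, or the Release step fails with 0x80042314 (event 12298)."""
        if provider_commit_seconds > HOLD_WINDOW_SECONDS:
            return "Release[0x80042314]: provider timed out while holding writes"
        return "Release[0x00000000]: shadow copy created"


    print(snapshot_attempt(25))  # slow commit -> the timeout seen above
    print(snapshot_attempt(3))   # fast commit -> snapshot succeeds
    ```

    In other words, the volsnap event 8 and VSS event 12298 are two views of the same timeout: the provider kept the volume in the held-writes state longer than the window allows.
    
    
    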

     

    Running vssadmin list writers on the node that this VM resides on outputs:

     

    Writer name: 'Microsoft Hyper-V VSS Writer'

       Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}

       Writer Instance Id: {541ec79f-5efc-41df-9397-01d4bb37880b}

       State: [9] Failed

       Last error: Timed out
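    A failed-writer state like this can also be spotted programmatically by parsing the vssadmin output. A small sketch (the failed_writers helper is illustrative, not a DPM or Windows tool), using the exact output format shown above:

    ```python
    import re

    SAMPLE = """\
    Writer name: 'Microsoft Hyper-V VSS Writer'
       Writer Id: {66841cd4-6ded-4f4b-8f17-fd23f8ddc3de}
       Writer Instance Id: {541ec79f-5efc-41df-9397-01d4bb37880b}
       State: [9] Failed
       Last error: Timed out
    """


    def failed_writers(vssadmin_output):
        """Return (name, last_error) for every writer whose state is not Stable."""
        failures = []
        for block in re.split(r"\n(?=Writer name:)", vssadmin_output.strip()):
            name = re.search(r"Writer name: '([^']+)'", block)
            state = re.search(r"State: \[(\d+)\]\s*(\S.*)", block)
            error = re.search(r"Last error: (.+)", block)
            if name and state and state.group(2).strip() != "Stable":
                failures.append((name.group(1),
                                 error.group(1).strip() if error else ""))
        return failures


    print(failed_writers(SAMPLE))
    ```
    
    
    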

     

     

     

    If anyone has any idea how to solve this, please let me know.


    Tuesday, January 17, 2012 10:45 AM

Answers

  • After a lot of testing I finally found the problem.

    There was a problem with the firmware version on the SAN. After upgrading the firmware, the backup now performs as expected.

    Before I upgraded the firmware, I could see in the SAN log that when the backup started, the snapshot was created in the SAN. Then, as soon as it was committed and available, it was deleted. In Windows I got the error messages described above.

    I applied different hotfixes that reduced the problem, but they never solved it completely.

    http://support.microsoft.com/kb/2549533  The cluster service fails because of timeouts in the writer.

    http://support.microsoft.com/kb/2494162  Increases the time and number of retries before the writer fails.

    These two hotfixes made my life better but never made the problem go away completely.

    I could still see in the SAN log that snapshots were created but deleted immediately after they were committed. After the hotfixes were applied, the snapshots were created and deleted 5-10 times before one was finally successful and data transfer started. Before the hotfixes, it tried a few times and then the job failed.

    Now that the correct firmware is on the SAN, I can see the snapshot is created and then committed, and data transfer starts on the first attempt.
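    The behaviour seen in the SAN log can be sketched as a retry loop (illustrative only; the function names and retry counts model what the logs showed, not DPM or VSS internals):

    ```python
    def run_backup_job(max_retries, snapshot_survives_on_attempt):
        """Model the observed behaviour: each attempt creates and commits a
        snapshot on the SAN; with the bad firmware the SAN deletes it right
        after commit, so the job retries until it gives up."""
        for attempt in range(1, max_retries + 1):
            survives = (snapshot_survives_on_attempt is not None
                        and attempt >= snapshot_survives_on_attempt)
            if survives:
                return f"data transfer started on attempt {attempt}"
            # SAN log: snapshot created, committed, then deleted -> retry
        return "job failed: 0x80042314 (provider timed out holding writes)"


    # Bad firmware, few retries: the job fails (as before the hotfixes)
    print(run_backup_job(3, None))
    # KB2494162 raises the retry count: occasionally a snapshot survived
    print(run_backup_job(10, 7))
    # Fixed firmware: the first snapshot sticks
    print(run_backup_job(3, 1))
    ```
    
    
    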



    • Marked as answer by Fredrik Weme Thursday, February 23, 2012 9:50 AM
    • Edited by Fredrik Weme Thursday, February 23, 2012 9:52 AM
    Thursday, February 23, 2012 9:49 AM

All replies

  • Is the CSV volume where this VM is running owned by the same cluster node or a different node?


    Thanks, Wilson Souza - MSFT This posting is provided "AS IS" with no warranties, and confers no rights
    Tuesday, January 17, 2012 11:40 AM
  • The CSV volume is owned by the same node that the VM in question is running on.

    I didn't mention this earlier, but we are using hardware providers, and this seems to be working fine for all the other VMs.

     

     

    After writing the initial post I removed the VM from the protection group and deleted the replica.

    I re-added the VM and started a new synchronization. For over 40 minutes the backup job did nothing other than post the same event errors as stated above. But suddenly, as I am writing this, it kicked off a consistency check and is now copying the replica over, maxing out the bandwidth on the backup network.

    While this is happening, the CSV volume is NOT in redirected mode.

     

    So, since this is also how I got the initial replica of this VM the first time, it seems that I am:

    Not able to create an initial replica.

    Not able to create a recovery point.

    But I AM able to get the backup done through a consistency check.

    Why does creating a recovery point fail (no data transferred whatsoever) when a consistency check works?

     

    These events were posted on the node just before the consistency check started:

    Event 1 VDS Basic Provider

    Unexpected failure. Error code: 490@01010004

    Event 3 FilterManager

    Filter Manager failed to attach to volume '\Device\HarddiskVolume280'.  This volume will be unavailable for filtering until a reboot.  The final status was 0xc03a001c.

    Event 51 Disk

    An error was detected on device \Device\Harddisk5\DR68 during a paging operation.

     

     

    Events 3 and 51 were posted several times with the same message.

    When the consistency check finished, the backup was OK, and I now had one good recovery point.

     

    I then tried to create another recovery point (express backup),

    and this again fails with the same errors as earlier, and the CSV enters redirected access.

     

     

     

     

     

     

    I don't know if this has any relevance, but this DPM server is also backing up a different cluster in a different domain which does not have hardware providers. All backups are running fine from that cluster as well.

     

     

     


    Tuesday, January 17, 2012 12:05 PM
  • Hi Fredrik.

    Consistency check (CC) is a mechanism DPM uses to compare what is in the replica volume to what you have in the production environment. In your case, because the initial replica didn't transfer a thing, the CC acted like 'the initial replica'.

    One of the 'mandates' for a successful recovery point (RP) is that the replica needs to be in a consistent state and no other job should be running against that source when a scheduled recovery point starts.

    From what you are saying, the only successful job was the CC, and subsequent RPs failed as well.

    If you try another CC, will that work? If needed, for troubleshooting purposes we can force DPM to use the software provider to see how it behaves. Did you check the VM itself for errors (Application/System log) at the time you were trying to run the backup?

     


    Thanks, Wilson Souza - MSFT This posting is provided "AS IS" with no warranties, and confers no rights
    Tuesday, January 17, 2012 6:11 PM
  • Hi again Wilson, and thank you for taking the time to reply.

     

    Since my last post the backup has been completing successfully.

    However, when the backup starts, no matter whether I trigger a recovery point or it starts from schedule, it will still throw the same errors in the event log on the node, and the CSV volume will go into redirected access. But now this happens only once; earlier it happened continuously until the backup failed.

    Now, after the first errors and after the CSV has gone into redirected mode, the backup will continue and complete with the CSV volume in normal mode.

    There are no new event errors after the initial ones, only the informational event log entries that are normal when the backup completes successfully.

    This still happens only during backup of this one particular VM. It does not matter which node it lives on. The only thing that separates this VM from the others is that it is an SBS 2008 server that was converted from physical to virtual using SCVMM.

    Is there anything I could look for inside the VM's own logs? Something that could disturb the cluster node's writers?


    Thursday, January 19, 2012 2:46 PM