Hyper-V Production Checkpoint - "Application state wasn't included..." & Veeam Backups Failing

  • Question

  • I have been working with Veeam support on an issue that is causing our backups to fail due to VSS issues. Veeam fails the job because it sees failures in the VSS writers.

    When I run a "production" backup in Hyper-V, which should grab the "application state", I receive the message below, which to me contradicts what a "production" backup is supposed to do: back up the application state.

    When I run "vssadmin list writers", I do see a few that say "Timed out".

    The issue is across W2K12R2, W10, W2K16 and W2K19 guest systems.

    We have also looked over the event log, and nothing stands out as the cause of the issue. I created a clone of one of the VMs and re-registered the .DLLs, but that did not fix it either.

    Production Checkpoint -  Application State Wasn't Included

    Coming from VMware I am still working on learning some of the different technologies in Hyper-V. Any help is greatly appreciated. 

    Thank you

    Wednesday, July 3, 2019 4:58 PM

All replies

  • When I run a "production" backup in Hyper-V, which should grab the "application state"

    No, that's not true. "[Running] application state" -- the "running" qualifier from the message is important -- captures the entire system, which involves CPU state, the active contents of memory, and in-flight I/O. Only a standard checkpoint can capture that.

    Production checkpoints depend on VSS writers in the guest. Applications that subscribe to VSS or provide their own writers will flush their I/O and pause operations while VSS captures a snapshot. We call the resulting backup "application consistent", but it is NOT a capture of the running application state.

    You can test this by reverting to a standard checkpoint and then to a production checkpoint. The standard checkpoint reverts to the EXACT condition that the VM was in when the checkpoint was captured. The production checkpoint will act as though the guest OS had crashed, but if the applications are VSS-aware, they would have no data loss.

    The VSS writers problem is something else, though. But your checkpoints appear to be working as intended.


    Eric Siron
    Altaro Hyper-V Blog
    I am an independent contributor, not an Altaro employee. I accept all responsibility for the content of my posts. You accept all responsibility for any actions that you take based on the content of my posts.

    Wednesday, July 3, 2019 5:49 PM
  • Eric,

    Thank you for your reply. Do you recommend "Production" or "Standard" checkpoints? Coming from a VMware environment, I am used to a standard snapshot, so the VSS checkpoint process is new to me.

    With regard to your last comment on VSS writers ("The VSS writers problem is something else, though. But your checkpoints appear to be working as intended."), what are your suggestions for fixing this?

    Wednesday, July 3, 2019 7:18 PM
  • I don't think that question has an easy answer. The only simple recommendation I have: only use checkpoints infrequently and for very short periods. I don't use vSphere often enough to remember but I believe that it also has two different snapshot modes that roughly equal the standard/production choice... one saves memory state etc. and the other doesn't, or something like that.

    I have an article that goes into greater detail. More information on standard vs. production starts about 1/3 of the way down. The parts on choosing between standard and production start just past the halfway mark.

    Also, since you got here in relation to a backup question, the checkpoints automatically created by VSS during a host-level backup are neither standard nor production checkpoints. They are more like the production style because they trigger guest VSS operations.


    Eric Siron
    Altaro Hyper-V Blog

    Wednesday, July 3, 2019 9:27 PM
  • On the VSS problems that you're experiencing, I haven't even tried to fix a VSS problem in a while, and never in conjunction with the Veeam product.


    Eric Siron
    Altaro Hyper-V Blog

    Wednesday, July 3, 2019 9:28 PM
  • Hi,

    Just want to confirm the current situation.

    Please feel free to let us know if you need further assistance.

    Best Regards,

    Candy



    Friday, July 5, 2019 4:45 AM
  • Eric,

    The Veeam portion is just additional information. If I pull up the writer status it does say timed out, so I would like to resolve the VSS issue first. 

    Can anyone offer assistance with VSS troubleshooting?

    Friday, July 5, 2019 7:26 PM
  • Eric,

    In a VMware environment, typically the snapshot is a point in time. If the machine is powered on there is the option for quiescence which copies the memory of the VM as well. 

    My concern is for VMs that run SQL: when a production checkpoint is taken, is it actually grabbing the running state of SQL, or of any transactional database, so that everything is as it should be when it is restored? That is why the message concerns me: it suggests that the running state is not being captured.


    Friday, July 5, 2019 7:38 PM
  • Capturing running state sounds better than it is. For any system serving non-local clients or acting as a client itself, it can cause problems. With production checkpoints, Hyper-V will trigger VSS in the guest. VSS-aware applications, like Microsoft SQL, will respond by flushing everything out to disk so that there's nothing useful in memory or CPU to capture. That's good because if you ever recover to such a checkpoint, the VM will act like it's booting up from a crash state. It will know that something happened, and every application will be given a chance to start fresh and possibly carry out recovery operations if it has that capability.

    With a standard checkpoint, the VM knows nothing except that its clock somehow got behind. All of those in-flight transactions that clients gave up on or assumed had committed hours, days, weeks, months, or however long ago, the VM continues along with them like nothing had happened. If the system is a client to another system, it does the same. If it's a partner or member of some kind of HA setup, who knows what will happen? What if you recover that checkpoint and then, for whatever reason, decide that you need to recover to that checkpoint again?

    Eric Siron
    Altaro Hyper-V Blog

    Sunday, July 7, 2019 3:19 PM
  • Eric,

    The SQL portion does make sense and is what I would expect out of a VSS operation. That said, do production-level (VSS) checkpoints only make sense on VMs that have transactional databases installed? Would a system that hosts DHCP, Jenkins, or something similar really take advantage of VSS?

    Your previous comment about production checkpoints making the guest OS look like it had crashed cleared something up. I had been confused for a while, wondering why the VM always said it had not shut down properly when I logged in. With your explanation, it makes sense: the backup kicks off a production-checkpoint-level backup, so VSS runs and creates that state.

    All of the above takes me back to my VSS question: why do the VSS writers show a "timed out" state and look as though they have failed when a "production" level checkpoint is created? To bring the backup scenario back in: because these writers are not in a clean state, our backups fail. I know you work with Altaro, so, bias aside, I am not asking you to fix that portion. Ultimately, what do I have to do to get these VMs to report a non-failed VSS state? What am I missing on the VSS state?

    Monday, July 8, 2019 2:04 PM
  • I would use production checkpoints for anything VSS-aware without question. I would be inclined to use them with any non-VSS-aware application that has a robust recovery model. Really, I would default to the production checkpoint and only change to standard after a reasoned argument. For example, I know that once upon a time, MySQL was not VSS-aware and would suffer unrecoverable data mangling if its host lost power. If that's still the case today, then I would use standard checkpoints with MySQL. For DHCP, I would feel safe with production checkpoints. I am unfamiliar with Jenkins' architecture.

    I promise that if I knew how to fix your VSS problem, I would tell you. The Altaro team has earned my respect so I don't mind throwing clicks their way, but they're going to be OK no matter what. If your backup vendor works for you, we're good; I'm just glad that you're taking your backups seriously enough to HAVE a backup vendor. I will not deliberately leave anyone hanging if I can help. I understand VSS at a procedural level but have not needed to work on it in several years, that's all. If this were happening to me, I would start digging into the VMs and the host looking for the answer to "why can't the snapshot complete in time?" Performance barriers that only cause meaningful problems during heavy activity like a backup? Application bound up? In-guest VSS broken? Any results from a fishing expedition in the logs? I recall taking VSSADMIN.EXE for a spin a few times, that might help.
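
    For anyone who wants to script the VSSADMIN check rather than read the output by eye, here is a minimal sketch in Python. It assumes the usual "Writer name / State / Last error" layout of "vssadmin list writers" output. The sample text is illustrative only (placeholder GUIDs and made-up states, not the output from this thread), and the helper names run_vssadmin, parse_writers, and unhealthy are hypothetical.

```python
import re
import subprocess

# Illustrative sample only: it follows the usual layout of
# "vssadmin list writers" output, but the entries, placeholder GUIDs and
# state codes are made up for this example, not taken from the thread.
SAMPLE_OUTPUT = """\
Writer name: 'System Writer'
   Writer Id: {00000000-0000-0000-0000-000000000001}
   Writer Instance Id: {00000000-0000-0000-0000-0000000000a1}
   State: [1] Stable
   Last error: No error

Writer name: 'WMI Writer'
   Writer Id: {00000000-0000-0000-0000-000000000002}
   Writer Instance Id: {00000000-0000-0000-0000-0000000000a2}
   State: [9] Failed
   Last error: Timed out
"""

# Only these three fields are needed to spot a writer in a bad state.
FIELD = re.compile(r"^\s*(Writer name|State|Last error):\s*(.*?)\s*$")


def run_vssadmin():
    """Return the raw text of 'vssadmin list writers'.

    Assumption: run inside the Windows guest from an elevated prompt.
    This helper is defined but not called in the demo below.
    """
    result = subprocess.run(
        ["vssadmin", "list", "writers"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_writers(text):
    """Split vssadmin-style output into one dict per writer."""
    writers = []
    for line in text.splitlines():
        match = FIELD.match(line)
        if not match:
            continue  # skip Writer Id, Instance Id, banners, blank lines
        key, value = match.groups()
        if key == "Writer name":
            writers.append(
                {"name": value.strip("'"), "state": None, "last_error": None}
            )
        elif writers:
            field = "state" if key == "State" else "last_error"
            writers[-1][field] = value
    return writers


def unhealthy(writers):
    """Writers whose last error is missing or anything but 'No error'."""
    return [w for w in writers if w["last_error"] != "No error"]


# Demo against the sample text; swap in run_vssadmin() on a real guest.
for writer in unhealthy(parse_writers(SAMPLE_OUTPUT)):
    print(f"{writer['name']}: {writer['state']} ({writer['last_error']})")
```

    Running the same check right after a backup attempt and again after a reboot of the guest may help show whether the writers only fall into the timed-out state once a checkpoint is taken.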


    Eric Siron
    Altaro Hyper-V Blog

    Monday, July 8, 2019 2:53 PM
    I have reviewed the VSSADMIN portion, and we even did a VSS trace. The odd thing is that the operation looks like it completes, but the VSS writers are in a timed-out state. I get the same "complete" message that I posted earlier for both VMs; one shows "timed out" and the other does not, but both complete. I have tried moving the VMs to other hosts, with the same result. 10 of them have time-out errors and 2 complete.

    The 2 VMs with the heaviest storage and processor utilization complete fine from a backup standpoint (no VSS time-out), but a simple DHCP server does not.

    This is the VSSadmin output:

    Does anything stand out?

    Friday, July 12, 2019 9:43 PM