HCI Node Error 5120 on Reboot with DPM Agent Installed

  • Question

  • Hi,

    I have a newly deployed 3-node Windows Server 2016 (Build 2724) hyper-converged Hyper-V/S2D cluster which is operating normally. The intention is to use DPM 2016 LTSB UR6 to perform VM-level backups of the workloads hosted on the cluster.

    After deploying the DPM agents (5.0.375.0) to the cluster nodes, I've started to see cluster errors when rebooting a node, specifically a CSV auto-pause error for each CSV in the cluster:

    "CSVName has entered a pause state because of 'STATUS_UNEXPECTED_NETWORK_ERROR (c00000c4)'"

    There are no backup jobs running when a node is rebooted. If I remove the DPM agents, the nodes reboot without error.

    Also, it may be a red herring, but the cluster logs generated during a node reboot contain entries about an error closing a file named 'DpmFilterTrace.txt', which is stored on the CSVs. I'm not sure whether this is normal or an indication of an issue.
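
    For anyone wanting to check their own logs for similar entries, here is a minimal Python sketch that filters a text cluster log for the two strings mentioned above. The search strings come from this post; the script name, the function name and the UTF-8 encoding are assumptions, so adjust them to however your log was exported.

```python
import sys
from typing import Iterable, Iterator, Tuple

# Search strings taken from this thread: the filter trace file name and the
# status code reported in the CSV auto-pause error.
NEEDLES = ("DpmFilterTrace", "c00000c4")


def find_matches(
    lines: Iterable[str], needles: Tuple[str, ...] = NEEDLES
) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, line) for every line containing any needle.

    Matching is case-insensitive because log output does not always use the
    same casing for file names and hex codes.
    """
    lowered = tuple(n.lower() for n in needles)
    for number, line in enumerate(lines, start=1):
        text = line.lower()
        if any(n in text for n in lowered):
            yield number, line.rstrip("\r\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python scan_cluster_log.py <path-to-cluster-log>
    # The encoding is an assumption; undecodable bytes are replaced, not fatal.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for number, line in find_matches(handle):
            print(f"{number}: {line}")
```

    Comparing the line numbers (and the timestamps on the matched lines) should show whether the file-close errors and the auto-pause events occur together during the reboot.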

    Would appreciate any input from people protecting workloads on HCI clusters using DPM.

    Thanks in advance,

    James

    Tuesday, May 21, 2019 7:59 AM

All replies

  • Hello James,

    DPM fully supports backing up Hyper-V virtual machines stored on Storage Spaces Direct (S2D). However, I have seen these 5120 errors before when backing up clusters. I believe they may be related to backups, but I have experienced them not only with DPM but with other backup software as well. I cannot recall the exact cause, but I think there was a misconfiguration somewhere.

    There was also an update in late 2018 that fixed some CSV issues on Windows Server 2016 Hyper-V clusters, so make sure that your S2D cluster nodes are fully up to date.

    I suggest you also run the cluster validation wizard to see if there are any warnings/errors as well.

    Also make sure that your hardware is compatible with your operating system. You can check it here:
    https://www.windowsservercatalog.com


    Note:
    Update Rollup 7 has been released for DPM 2016.

    Update Rollup 7 for System Center 2016 Data Protection Manager

    Best regards,
    Leon


    Blog: https://thesystemcenterblog.com

    Tuesday, May 21, 2019 8:27 AM
  • Hi Leon,

    Thanks for the reply. I don't believe this is being caused by backup activity, as it occurs when there are no protection groups configured. I could understand 5120s popping up due to PFC/QoS misconfiguration under a backup load, but I can't see that being the case here.

    I'm happy with the cluster configuration, including S2D, and cluster validation passes with flying colours. I'm also aware of last year's issues with S2D and am currently patched to Jan 2019.

    I'll take a look at UR7, thanks.

    James


    Tuesday, May 21, 2019 8:55 AM
    Since you're not backing anything up yet, you could try to start from a clean state by deleting all the DpmFilter files from the System Volume Information folder on each CSV.

    To delete the DpmFilter files, follow the steps below:

    1. Open a Command Prompt (Admin) and run:

    fltmc unload DpmFilter             

    (Run on all cluster nodes)

    2. From one of the Cluster nodes, run:

    psexec -s cmd

    (PsExec, https://docs.microsoft.com/en-us/sysinternals/downloads/psexec, is a Sysinternals tool used here to run cmd under the SYSTEM account.)

    3. In this new Command Prompt (running as SYSTEM) run the following:

    CD C:\ClusterStorage
    For /d %i in (volume*) do del /s "%i\System Volume Information\DpmFilter*" 

    Note: The DpmFilter* wildcard matches DPMFilterBitmap{Guid}, DPMFilterStatus, DPMFilterLog and DPMFilterTrace*.


    4. Verify that the files are deleted by running the following:

    dir /s /b DpmFilter*
    exit


    5. Load the filter again on all nodes:

    fltmc load DpmFilter

    (On all cluster nodes)
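
    If you would rather script the check in step 4, the following is a rough Python sketch of the same verification. It assumes Python is available on the node and that it is run from the SYSTEM prompt opened in step 2, since System Volume Information is not normally readable otherwise. The script and function names are made up for illustration. Unlike the dir command, it only looks inside each volume's System Volume Information folder.

```python
import os
import sys
from typing import List


def find_dpmfilter_files(csv_root: str) -> List[str]:
    """Return DpmFilter* files under <csv_root>/Volume*/System Volume Information."""
    found = []
    for entry in sorted(os.listdir(csv_root)):
        # Mirror the (volume*) pattern from step 3, ignoring case as cmd does.
        if not entry.lower().startswith("volume"):
            continue
        svi = os.path.join(csv_root, entry, "System Volume Information")
        if not os.path.isdir(svi):
            continue
        # Recurse like "del /s" does, matching names case-insensitively.
        for dirpath, _dirnames, filenames in os.walk(svi):
            for name in sorted(filenames):
                if name.lower().startswith("dpmfilter"):
                    found.append(os.path.join(dirpath, name))
    return found


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_dpmfilter.py C:\ClusterStorage
    leftovers = find_dpmfilter_files(sys.argv[1])
    for path in leftovers:
        print(path)
    print(f"{len(leftovers)} DpmFilter file(s) remaining")
```

    An empty result means the cleanup in step 3 removed everything the wildcard matches, and the filter can be loaded again.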



    • Marked as answer by james87878978 Tuesday, June 11, 2019 3:00 PM
    • Unmarked as answer by james87878978 Tuesday, June 18, 2019 10:31 AM
    Tuesday, May 21, 2019 9:10 AM
  • Hi Leon,

    Does that stem from an MS KB somewhere?

    The DPM agent has been removed from and reinstalled on each node. Does that not accomplish the same thing?

    James

    Tuesday, May 21, 2019 10:12 AM
    There’s no KB article for this, unfortunately. An agent reinstall should accomplish more or less the same thing.

    Are you certain that these errors did not occur before installing the DPM agents?

    Do you see any VSS related errors in the Application log of the cluster nodes?



    Tuesday, May 21, 2019 10:32 AM
    I'm fairly confident the errors didn't previously occur, but it is a new cluster. If I remove the DPM agents I don't seem to get the errors, whereas they appear on every reboot when the DPM agents are installed.

    No VSS errors in the App log.

    Tuesday, May 21, 2019 10:55 AM
    Although you have your servers fully patched, you could have a look at the document linked below. It isn't exactly the same error as you're getting, but they are similar:

    Troubleshoot Storage Spaces Direct
    https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/troubleshooting-storage-spaces#event-5120-with-statusiotimeout-c00000b5

    Since you are not yet using DPM, I don't think the DPM agent is to blame. Normally these errors can be caused by VSS pausing I/O when preparing a snapshot.

    For example: If you have a backup job (or any other activity that can trigger creation of a snapshot) configured to run at the time this error is logged, then you can safely ignore it.
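
    For comparison, the two 5120 variants mentioned in this thread carry different status codes: c00000c4 in the original post and c00000b5 in the linked article. The Python sketch below splits each value into the standard NTSTATUS fields (severity, facility, code). The name table only holds these two codes and the helper names are illustrative, so treat it as a quick aid rather than a complete decoder.

```python
from typing import NamedTuple

# Only the two codes that appear in this thread; not a complete table.
KNOWN_NAMES = {
    0xC00000C4: "STATUS_UNEXPECTED_NETWORK_ERROR",
    0xC00000B5: "STATUS_IO_TIMEOUT",
}


class NtStatus(NamedTuple):
    severity: int  # bits 31-30: 0 success, 1 informational, 2 warning, 3 error
    facility: int  # bits 27-16
    code: int      # bits 15-0
    name: str


def decode_ntstatus(value: int) -> NtStatus:
    """Split a 32-bit NTSTATUS value into its severity, facility and code fields."""
    severity = (value >> 30) & 0x3
    facility = (value >> 16) & 0xFFF
    code = value & 0xFFFF
    return NtStatus(severity, facility, code, KNOWN_NAMES.get(value, "unknown"))


for text in ("c00000c4", "c00000b5"):
    status = decode_ntstatus(int(text, 16))
    print(f"{text}: severity={status.severity} facility={status.facility} "
          f"code=0x{status.code:04X} name={status.name}")
```

    Both values are error-severity codes, but they name different conditions (an unexpected network error versus an I/O timeout), which is worth keeping in mind when applying guidance written for the other one.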


    Do you notice any strange behavior or any issues when these errors are happening? If not, you could try backing up some virtual machines and see how it behaves.



    Tuesday, May 21, 2019 12:47 PM
    I spent last summer with MS support on that issue, so I know it pretty well. It's not a related issue. Plus, I have other clusters that are identical but without DPM, and they are not experiencing the issue.

    I agree that backup activities per se are not the cause, but having just retested with and without the DPM agent installed, the agent seems to cause the error simply by being installed and idle. I'm not sure why, but it does.

    No strange behaviour that I've noticed, just the logged 5120 cluster errors. It is as if the DPM agent holds a file lock on each CSV, as per the log entry I mentioned at the start of the thread. I'm not sure why I seem to be the only one experiencing this, though.

    I'm going to try UR7 along with a later Windows CU.

    James

    Tuesday, May 21, 2019 2:01 PM
    Let us know how it goes. If the situation stays the same, it might be worth escalating this to Microsoft support once again.



    Tuesday, May 21, 2019 2:04 PM
  • No change with UR7 and May CU. Time for a support call.
    Wednesday, May 22, 2019 2:00 PM