File Server 2019 hangs when backed up by DPM 2019

  • Question

  • Hello everybody,

    At a customer's site, there is a DPM 2019 backing up 2 - 3 TB of files. Everything is running Windows Server 2019. The file server is running on a VMware cluster, DPM is a physical machine. The volumes to be backed up are ReFS formatted. DPM is using modern backup storage. All patches are current.

    When we try to back up the file server's data volumes, the file server becomes unresponsive or nearly unresponsive. We see the SYSTEM process keeping one CPU core at 100% for hours, and the file server cannot even reboot. We have had to hard power off the VM and restart it several times now.

    What can we do in order to properly back up that file server's files with DPM again please?


    Best Regards, Stefan Falk

    Wednesday, April 1, 2020 4:16 PM

All replies

  • Hi Stefan,

    Can you identify which process is consuming all CPU performance?

    Also what kind of hardware is the file server running on?

    Does the file server have any antivirus software installed? Or are you using Windows Defender?

    Best regards,
    Leon


    Blog: https://thesystemcenterblog.com LinkedIn:

    Wednesday, April 1, 2020 4:26 PM
  • Hello Leon,

    Thanks for your super-fast response!

    - According to task manager, it is the "system" process.

    - The fileserver is running virtually on VMware. Shall I ask for details on the underlying hardware?

    - The file server does not have any antivirus software installed.

    - Windows Defender is still installed, but we had turned off real-time scanning already for it.


    Best Regards, Stefan Falk

    Wednesday, April 1, 2020 4:32 PM
  • How much memory and how many vCPUs does the file server have? Just to get an idea.

    You could ensure that Windows Defender has the following exclusions in place:

    DPM installation folders:

    • %ProgramFiles%\Microsoft System Center\DPM\DPM\XSD
    • %ProgramFiles%\Microsoft System Center\DPM\DPM\Temp\MTA
    • %ProgramFiles%\Microsoft System Center\DPM\bin

    Processes

    The following processes should be excluded from real-time scanning:

    • DPMRA.exe
    • CSC.exe
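
    For reference, the exclusions listed above could also be applied from a script rather than through the Defender UI. The sketch below is a minimal, hedged example in Python: it only builds a PowerShell command line around the `Add-MpPreference` cmdlet (using its `-ExclusionPath` and `-ExclusionProcess` parameters) and prints it; nothing is executed. The helper name `build_exclusion_command` and the fallback to `C:\Program Files` are assumptions made for illustration, and the folder and process lists are simply the ones quoted in this post.

```python
import os

# Folders and processes quoted in the post above; folders are relative to %ProgramFiles%.
DPM_FOLDERS = [
    r"Microsoft System Center\DPM\DPM\XSD",
    r"Microsoft System Center\DPM\DPM\Temp\MTA",
    r"Microsoft System Center\DPM\bin",
]
DPM_PROCESSES = ["DPMRA.exe", "CSC.exe"]


def build_exclusion_command(program_files, folders=DPM_FOLDERS, processes=DPM_PROCESSES):
    """Return an argv list that would call Add-MpPreference via PowerShell (hypothetical helper)."""
    # Single quotes keep the paths literal inside the PowerShell command string.
    paths = ",".join("'{}\\{}'".format(program_files, f) for f in folders)
    procs = ",".join("'{}'".format(p) for p in processes)
    script = "Add-MpPreference -ExclusionPath {} -ExclusionProcess {}".format(paths, procs)
    return ["powershell.exe", "-NoProfile", "-Command", script]


if __name__ == "__main__":
    # Assumption: fall back to the default location when %ProgramFiles% is not set.
    root = os.environ.get("ProgramFiles", r"C:\Program Files")
    print(" ".join(build_exclusion_command(root)))
```

    Printing the command first makes it easy to review the exact paths before running anything on a production server.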



    Wednesday, April 1, 2020 4:39 PM
  • Hello Leon,

    The fileserver has 4 virtual CPUs and fixed 16 GB RAM (8.8 GB RAM in use currently).

    As I wrote, real-time scanning is completely deactivated in Windows Defender, so we should not have to deal with scan exclusions, should we?

    However, cloud-based protection was still active. I disabled that. Could that cause the file server to become unresponsive just because we try to back it up?


    Best Regards, Stefan Falk


    • Edited by Stefan Falk Monday, April 6, 2020 2:15 PM typo
    Monday, April 6, 2020 2:15 PM
  • I would still make sure all antivirus exclusions are in place. I never trust antivirus software, whether it is on or off, enabled or disabled.

    How many backups are running on the file server during the 100% CPU usage peak? A single one or multiple?

    I have not seen 100% CPU usage during DPM backups. Is there much other activity on your file server while the DPM backups run?

    Also note that DPM 2019 RTM does not support backing up ReFS-formatted volumes. Support for backing up workloads on ReFS-formatted volumes was added with Update Rollup 1; more info HERE.



    Wednesday, April 8, 2020 9:24 PM
  • Hello Leon,

    Thanks again for your input.

    DPM 2019 RU1 was successfully installed on 2020-03-04.

    I have excluded the whole "%ProgramFiles%\Microsoft Data Protection Manager\DPM" folder and the processes csc.exe and dpmra.exe from Windows Defender on the file server. I am trying to arrange a test window with the customer (we fear the file server will die again, so we must plan ;-)

    On that file server, only two data volumes are backed up. Originally it was a single NTFS volume (where the problem occurred suddenly but persistently, after months of working fine). Just to see whether things would improve, we moved the data to two ReFS volumes in the last few weeks, but the problem persisted.

    Only the formerly one NTFS and now two ReFS volumes are backed up. We synchronize once an hour and do 3 recovery points per day distributed over work times.

    Best Regards, Stefan Falk

    Saturday, April 11, 2020 11:26 AM
  • This is strange indeed, I have not experienced this kind of issue previously.

    Putting the CPU usage aside for a moment, how are the backups performing? Are they completing, getting stuck, or simply slow?

    Is there any software installed on the file server that could be interfering with the backups? Any other backup software or other scheduled jobs?

    How many files are those 2-3 TB of data? Hundreds of thousands or millions?

    Are the volumes deduplicated?



    Wednesday, April 15, 2020 10:27 AM
  • Hello Leon,

    The backups do not complete even after waiting several *days*.

    Installed are:

    - the file server role, deduplication, and Resource Manager for file server (just for monitoring),

    - VMware tools (to get the virtualized NIC drivers),

    - and, surprisingly, something called "XXConsole - Super Console Generator". Apparently somebody installed that without recording it in our log document. Maybe it was installed on the fly together with a tool used to sync the files from the NTFS to the ReFS volumes. It seems to be this: http://www.xxcopy.com/xxcopy43.htm. I will uninstall it at the next opportunity.

    Nothing else is installed on that file server VM.

    One of the ReFS drives has exactly 3,778,039 files currently, the other one has 5,881,211 files. The NTFS volume was deduped, the ReFS volumes currently are not deduped. Only the ReFS volumes are in backup currently and nevertheless the file server hangs when backing them up.

    I have got a schedule to retry the backup after adding the Windows Defender exceptions tomorrow night and of course will post the results here.


    Best Regards, Stefan Falk


    • Edited by Stefan Falk Thursday, April 16, 2020 5:20 PM updated file counts
    Thursday, April 16, 2020 4:57 PM
  • Okay, the next step would be to check the DPM logs for any clues. You'll find the logs here:

    DPM server

    • %ProgramFiles%\Microsoft System Center\DPM\DPM\Temp\DPMRACurr.errlog

    Protected server

    • %ProgramFiles%\Microsoft Data Protection Manager\DPM\Temp\DPMRACurr.errlog

    You can upload the logs to Microsoft OneDrive / Google Drive and share the link here.



    Thursday, April 16, 2020 5:52 PM
  • Hello Leon,

    I could not find a direct hint to the problem, but you can download the log files of the DPM and the file server from https://1drv.ms/u/s!AnuxuhxP0zblsbJfiPZNL5wfcD5Frg?e=f2TMt9

    Please tell me when you have done so, I will remove the file from OneDrive then.

    I tried consistency checks last Friday evening and things seemed fine at first: CPU usage was low, and data flowed to the DPM server. However, the operation left the volumes marked inconsistent in DPM.

    The customer retried a consistency check on Monday morning and the file server was again fully unresponsive. According to the file server's event log, he hard-reset the file server VM at 2020-04-20 06:34:30 (German time).

    Hopefully your eyes find something valuable in the logs. If I can do anything else, please let me know. Thank you!


    Best Regards, Stefan Falk

    Tuesday, April 21, 2020 4:59 PM
  • There are a lot of logs. Can you tell me approximately when the freeze happened, as a date and time frame? There are thousands of lines to go through.

    Also have you considered using throttling on your file server?



    Wednesday, April 22, 2020 8:04 AM
  • Hello Leon,

    Thanks for looking into the logs! The freeze must have happened at most about 30 minutes before the reboot. The customer started the consistency check, saw the file server become unresponsive, and rebooted the file server.


    Best Regards, Stefan Falk

    Wednesday, April 22, 2020 8:20 AM
  • Hello Leon,

    Do you happen to have news for me?


    Best Regards, Stefan Falk

    Monday, April 27, 2020 8:31 AM
  • Hi Stefan,

    Apologies for the late reply, there were some warnings in the logs, a lot about the following:

    WARNING	Parameter: [0x80070002], FilePath = \\?\Volume{3aab9c43-fc06-4532-a8d3-463aa07c55ba}\d48bc208-99fb-4cfd-990b-d2cf90abb174\FailedFilesLog.txt

    To open the FailedFilesLog.txt you can refer to the steps in the following thread:
    What is the best strategy for opening the FailedFilesLog.txt ?
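
    The warning line quoted above can also be picked apart mechanically, which helps when a log has thousands of lines. What follows is a minimal sketch in Python, assuming the warning lines keep the `Parameter: [0x...], FilePath = ...` shape shown above; the function name `summarize_warnings` is hypothetical, and how the log file is read (path, encoding) is left open. For orientation, 0x80070002 is the HRESULT form of the Win32 "file not found" error.

```python
import re
from collections import Counter

# Matches the shape of the warning quoted in this post:
#   WARNING<tab>Parameter: [0x80070002], FilePath = <path>
WARNING_RE = re.compile(
    r"WARNING\s+Parameter:\s*\[(0x[0-9A-Fa-f]{8})\],\s*FilePath\s*=\s*(.+?)\s*$"
)


def summarize_warnings(lines):
    """Count (hresult, file path) pairs among lines that look like the quoted warning."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            counts[(match.group(1).lower(), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "WARNING\tParameter: [0x80070002], FilePath = "
        r"\\?\Volume{3aab9c43-fc06-4532-a8d3-463aa07c55ba}"
        r"\d48bc208-99fb-4cfd-990b-d2cf90abb174\FailedFilesLog.txt",
        "some unrelated log line",
    ]
    # Print the most frequent warnings first: count, error code, path.
    for (code, path), n in summarize_warnings(sample).most_common():
        print(n, code, path)
```

    Grouping by path shows quickly whether the warnings all point at the same FailedFilesLog.txt or at many different replica folders.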

    Did you try using throttling as I mentioned earlier?

    As this is something I haven't experienced before, before doing any changes I would set up a Performance Monitor on the file server and monitor the following components:

    • Processor
    • Memory
    • Network Interfaces
    • Logical Disks

    You can refer to the link below on how to use Performance Monitor:
    How to use Performance Monitor on Windows 10

    This is simply to get a pattern to when this happens, and maybe get a better understanding why it happens.
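
    As one way to capture the counters listed above without clicking through the Performance Monitor UI, the following Python sketch builds a command line for the built-in `typeperf` tool. It is an illustration under assumptions: the specific counter paths, the 5-second interval, the sample count, and the output file name are illustrative choices, not something stated in this thread, and the script only prints the command rather than running it.

```python
# Counter paths chosen for illustration, one per component named above.
COUNTERS = [
    r"\Processor(_Total)\% Processor Time",
    r"\Memory\Available MBytes",
    r"\Network Interface(*)\Bytes Total/sec",
    r"\LogicalDisk(*)\Avg. Disk Queue Length",
]


def build_typeperf_command(output_csv, interval_s=5, samples=720, counters=COUNTERS):
    """Return an argv list for typeperf: -si interval, -sc sample count, -f format, -o output."""
    return (
        ["typeperf"]
        + list(counters)
        + ["-si", str(interval_s), "-sc", str(samples), "-f", "CSV", "-o", output_csv]
    )


if __name__ == "__main__":
    # 720 samples at 5 s each cover one hour; the file name is a placeholder.
    print(build_typeperf_command(r"C:\PerfLogs\fileserver.csv"))
```

    Because every CSV row carries a timestamp, the last rows written before a hang can be lined up with the reboot time from the event log.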



    Monday, April 27, 2020 8:56 AM
  • Hello Leon,

    Thanks for your message. We did not throttle yet - but could this cause a simple file server to "die"?

    I will check the FailedFilesLog.

    If I take the perfmon log you mentioned, how would it help me diagnose the problem? Shall I record how many seconds it takes until the machine goes wild?



    Best Regards, Stefan Falk

    Monday, April 27, 2020 9:03 AM
  • High network traffic can cause a server to halt. I'm not sure whether DPM can do this, but I've seen it with other applications.

    Performance Monitor would help us narrow down the time frame in which this happens, and would also give us a trend to understand it better.

    Another option would be to try performing a clean install of the DPM agent.



    Monday, April 27, 2020 9:16 AM
  • Hi,

    You mention that the volumes are deduplicated. If you back up deduplicated volumes, your DPM server also needs to have the Dedup feature installed, or else it can't understand the way the volume is structured.

    So add a new drive to your file server, enable dedup on both DPM and the file server, move the files, and back up the files with DPM. Then sync the drives and move the share.

    Monday, June 29, 2020 8:59 AM
  • Hello Torben,

    Thanks for your input. We are aware of that and have Dedup installed on the DPM from day one. So that should not be a problem.


    Best Regards, Stefan Falk

    Monday, June 29, 2020 9:12 AM
  • Hi Stefan,

    Would it be possible for you to collect a Procmon trace at the next occurrence? When you start collecting, don't let Procmon use virtual memory to hold the data; instead, select a local volume where Procmon will save all its traces.
    Click File -> Backing Files -> select a local folder. Collect the traces for 2-3 minutes and share them if possible. I know it will be difficult to capture Procmon data during that time, but if you can, please share it. If Procmon is not possible, try Process Explorer and capture it before the server becomes completely unresponsive.

    Monday, June 29, 2020 10:49 AM
  • Hello Aayoosh,

    I assume you mean procmon on the file server, right?

    Yes, we can do this, but I need to make an appointment with the customer, as trying the backup will freeze the file server and the servers are being used 24/7. I'll return here.


    Best Regards, Stefan Falk

    Monday, June 29, 2020 11:05 AM
  • Yes - Procmon and Process Explorer data need to be collected on the protected file server at the time of high CPU usage. While collecting with Process Explorer, double-click the System process (or the most heavily utilized process at that time), go to the Threads tab, and take a screenshot.

    What is the OS build on the file server? Run WINVER from the Run box and share the build number, and also share the REFS.sys driver version on that file server.

    Do you have any other file server with a ReFS volume where DPM backup is working fine? If yes, share its OS build and REFS.sys driver version as well.

    Monday, June 29, 2020 11:59 AM
  • Hello Aayoosh,

    On the file server, winver reports Windows Server 2019 version 1809, build 17763.1282, and refs.sys is version 10.0.17763.1192.

    Unfortunately, this seems to be the only ReFS file server being backed up by DPM to which we have access.

    I have a slot scheduled to test and take the Procmon logs this Friday at 20:00 (German time) and will report here what I find.


    Best Regards, Stefan Falk

    Monday, July 6, 2020 8:54 AM
  • REFS version 10.0.17763.1192 looks buggy to me. A handful of customers are complaining about performance issues when using REFS version 10.0.17763.1192. Please share the data so that I can confirm this. If you don't want to make the data publicly available, you can send me the download link in a LinkedIn DM so that only I can access it.
    Monday, July 6, 2020 9:30 AM
  • Hi Stefan,

    I would also be interested in analyzing a memory dump of the hung server. Since the file server VM is running on VMware, follow the link below to generate a memory dump so that we can see whether a lock is occurring.

    http://www.vmwarearena.com/how-to-generate-crash-dump-for-vmware-virtual-machine-guest-os-hung-issues/

    See if you can find a way to share it.



    Monday, July 6, 2020 4:26 PM
  • Hello Aayoosh,

    Today I wanted to test things out. However, procmon and procmon64 insist that "Capture requires administrator group membership". Of course, I am logged on as an admin (I tried both domain admin and local admin), I explicitly tried "Run as administrator", and I checked that the local policy allows the local Administrators group to load and unload device drivers. Procmon was freshly downloaded from live.sysinternals.com.

    Wtf?


    Best Regards, Stefan Falk

    Friday, July 10, 2020 6:10 PM
  • Is it only the file server, or are you facing this issue on other servers as well? Try some other server that is in the same OU as the file server.

    Friday, July 10, 2020 6:20 PM
  • Hello Aayoosh,

    Thanks for the super-fast response. I tried exactly that and found that the local administrators group has this privilege. That should be sufficient, right?


    Best Regards, Stefan Falk

    Friday, July 10, 2020 6:49 PM
  • It should be enough. It looks like Procmon is having issues attaching its filter driver, hence it's throwing that error.

    Also try to execute the exe using "Run as different user", then enter either the local admin or the domain admin to launch it. If it still doesn't work, try PsExec to launch Procmon with SYSTEM privileges. Use the command below:

    PsExec.exe -s Procmon.exe


    Friday, July 10, 2020 7:00 PM
  • Hello Aayoosh,

    I tried run as, failed the same way. I tried PS C:\> .\PsExec.exe -i C:\Procmon.exe, same error. Very strange, I never had this issue with procmon for decades...


    Best Regards, Stefan Falk

    Friday, July 10, 2020 7:11 PM
  • Yes - I have not seen it either. Okay, try this method; it should work.

    Open cmd using PsExec with the SYSTEM account, and from the new CMD window launch Procmon.exe. So technically we are trying to open Procmon directly under the SYSTEM account. Let's see if this works for you.

    Friday, July 10, 2020 7:13 PM
  • Thanks again!

    I tried

    psexec -i cmd.exe

    CMD popped up. In there I tried c:\procmon.exe - and the same error message appeared.

    However: psexec -i -s did the job. SORRY! I read -i instead of -s.

    Procmon is running now. I just have to disturb the customer on his Friday evening (it is 21:20 here) to check whether I may still run the disruptive test (this is a 24/7 shop). I'll return here!


    Best Regards, Stefan Falk

    Friday, July 10, 2020 7:19 PM
  • Great! 

    Don't forget to change the backing files option I mentioned earlier. All the logs need to be saved to disk instead of in RAM. Also, if you can generate the memory dump, that will be helpful. It is explained in the link below:

    http://www.vmwarearena.com/how-to-generate-crash-dump-for-vmware-virtual-machine-guest-os-hung-issues/

    Having extra data can be helpful, since you cannot afford downtime every now and then.

    Friday, July 10, 2020 7:23 PM
  • Hi again,

    Yes, procmon writes to a file in C:\ (hopefully, space will be sufficient). DPM is just doing a consistency check of the file server's ReFS volume. Note: The customer told me he remembered that the hang also occurred when we backed up the files from an NTFS volume (we copied the files to a new ReFS volume on a new virtual disk to rule out damage to the NTFS file system or its underlying virtual disk).

    When the file server hangs, I will try to produce the memory dump. Can I mail you the link to all the files privately?


    Best Regards, Stefan Falk

    Friday, July 10, 2020 7:32 PM
  • Okay - so if it also hangs while backing up an NTFS volume, then it is not related to the REFS driver. I think it is more of a Windows-related issue. It could also be a bottleneck in the OS itself.

    Yes, you can certainly DM me privately on my LinkedIn account and share the download link.

    Friday, July 10, 2020 7:37 PM
  • I'm not on LinkedIn, nor on Xing nor on any other social platform :-) Any other way?

    Best Regards, Stefan Falk

    Friday, July 10, 2020 7:38 PM
  • Okay no problem- You can drop me an email at aayoosh.moitro@gmail.com 
    Friday, July 10, 2020 7:42 PM
  • Very kind - thank you! I'll mail you the links when the tests are finished. Thank you so much for your valued support.

    Best Regards, Stefan Falk

    Friday, July 10, 2020 7:46 PM
  • Hello Aayoosh,

    This is a bit funny: Backups went smoothly! Procmon logged until C: was nearly full, and the consistency checks of the large NTFS and ReFS volumes of the file server went just fine. Up to now, we have no errors.

    My assumption is that some update for Windows or DPM really fixed something. We've got another PSS call for DPM and VMware and see similar improvements.

    So, what can I say? I will return here if we run into problems again. And until then, thank you very much for your assistance!


    Best Regards, Stefan Falk


    • Edited by Stefan Falk Monday, July 13, 2020 3:10 PM clarified
    Monday, July 13, 2020 3:10 PM
  • Hi Stefan,

    Appreciate the update. Could you please check if you are still at the same REFS version on the File server i.e. 10.0.17763.1192 ?


    Monday, July 13, 2020 3:12 PM
  • Hello Aayoosh,

    You are _really_ fast :-)

    Yes, refs.sys is still version 10.0.17763.1192.


    Best Regards, Stefan Falk

    Monday, July 13, 2020 3:15 PM
  • Well then, let's keep it under monitoring for the next few days. If the issue reoccurs, report back here. :)

    Monday, July 13, 2020 3:18 PM