DPM 2019 VM hangs almost every day

  • Question

  • Hello.

    I have a Hyper-V DPM 2019 UR1 VM, which is backing up just a couple volumes from another VM. The protection group is configured to create RP every hour.

    What happens is that on 9/10 days (roughly), the VM hangs at around 0:05. Shortly after midnight, the CPU utilization rises (see the image below for an example) - probably due to an RP/sync job running, and if I recall correctly, there's DPM DB maintenance running around this time.

    Since the VM doesn't crash, reboot, or generate a memory dump, there are no events (errors) in the event log that would hint at what's wrong. The VM simply stops responding; if I'm logged in at that time, I get disconnected (RDP times out), the VM console is black with no logon screen, etc.

    Any ideas what might be causing this and how to troubleshoot this further?

    The Hyper-V host (Hyper-V Server 2019) this DPM VM runs on is fine; there are other DPM VMs running on it which have no problems of this kind. The VM configuration version is 9.

    I'm about to reinstall the DPM VM, but I'd like to troubleshoot this further before going that route.

    Wednesday, June 17, 2020 7:06 AM

All replies

  • Hi,

    Is the virtual machine under-provisioned? A VM usually becomes unresponsive like this when it cannot handle the load it's under.

    Does it have sufficient vCPUs and RAM? 

    Is it running the SQL Server locally?

    Around midnight DPM runs its pruning/maintenance job, so running any other job at the same time can slow things down and cause high CPU. That said, the CPU usage shown is not that alarming; if it were stuck at 95-100%, you'd have bigger issues.
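
    If it helps, a quick way to check what the VM actually has assigned is to query the Hyper-V host; a minimal sketch (the VM name is just a placeholder):

    # Run on the Hyper-V host; "DPM2019" is a placeholder VM name
    $vm = Get-VM -Name "DPM2019"

    # vCPU count assigned to the guest
    Get-VMProcessor -VM $vm | Select-Object VMName, Count

    # Memory configuration (dynamic memory, startup/min/max) and current usage/demand
    Get-VMMemory -VM $vm | Select-Object VMName, DynamicMemoryEnabled, Startup, Minimum, Maximum
    $vm | Select-Object Name, State, CPUUsage, MemoryAssigned, MemoryDemand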

    Best regards,
    Leon


    Blog: https://thesystemcenterblog.com

    • Marked as answer by MarkosP Friday, July 17, 2020 9:03 AM
    Wednesday, June 17, 2020 8:20 AM
    Not really under-provisioned ... 4 vCPUs, 32 GB dynamic vRAM (16 GB startup & minimum). Yes, SQL is local.

    As I've said above, there are other DPM VMs running on the same host with the same HW configuration, and those back up a lot more datasources.

    This particular DPM server was actually created to isolate and better diagnose a problem with old RPs not being removed, so it's only backing up 3 volumes from a target VM using volume protection. DPM should be able to handle that even with a much lower HW configuration.

    Here's the CPU load on a day when it didn't hang:

    So, nothing horrible - 60-80% CPU usage for about 20 minutes.


    • Edited by MarkosP Wednesday, June 17, 2020 8:34 AM
    Wednesday, June 17, 2020 8:32 AM
  • Has the local SQL Server been configured with minimum & maximum memory, and other best practices?

    Over-provisioning can also slow down a VM; I've personally never been a fan of dynamic RAM when running a VM with SQL Server (although it is supported).

    You can run Performance Monitor with CPU, disk, memory and network adapter counters during the times this starts happening to see more of what's going on, or run it for about 24 hours starting at noon so you capture the whole night.

    https://www.windowscentral.com/how-use-performance-monitor-windows-10
    (Applies to Windows Server as well)
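
    If it's easier to script than to click through the Perfmon GUI, a rough PowerShell equivalent of the same collection (counter list, interval and output path are only examples) could be:

    # Sample the suggested counters every 15 seconds for ~24 hours and write a .blg
    # that can be opened in Performance Monitor afterwards.
    $counters = @(
        '\Processor(_Total)\% Processor Time',
        '\Memory\Available MBytes',
        '\LogicalDisk(*)\Avg. Disk sec/Read',
        '\LogicalDisk(*)\Avg. Disk sec/Write',
        '\Network Interface(*)\Bytes Total/sec'
    )
    Get-Counter -Counter $counters -SampleInterval 15 -MaxSamples 5760 |
        Export-Counter -Path 'C:\PerfLogs\DPM-overnight.blg' -FileFormat BLG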

    The RDP disconnects could also happen when the network throughput is so high that the VM's network adapter cannot handle it; I have experienced this before.


    Blog: https://thesystemcenterblog.com

    Wednesday, June 17, 2020 8:44 AM
    Hello MarkosP,

    Is it only happening at around 12:00 AM, or at least around that time?

    In order to troubleshoot, we need to isolate the issue first: is DPM causing it, or is it something related to the OS itself? We can try disabling the pruning job (the DPM GC job) for the time being and see if the issue reoccurs.

    Use the SQL query below to identify the ScheduleID:

    SELECT SCH.ScheduleId, SCH.JobDefinitionId, jd.CreationTime
    FROM tbl_JM_JobDefinition JD
    JOIN tbl_SCH_ScheduleDefinition SCH
    ON JD.JobDefinitionId = SCH.JobDefinitionId
    WHERE JD.Type = '282faac6-e3cb-4015-8c6d-4276fcca11d4'
    AND JD.IsDeleted = 0
    AND SCH.IsDeleted = 0

    Once you have the ScheduleID, you can disable the job from SQL Server Agent (SQL UI) for at least 4-5 days and keep the server under monitoring.
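
    For reference, the disable/enable can also be scripted; a hedged sketch, assuming the SqlServer module is available and that the Agent job is named after the ScheduleId GUID returned by the query above (instance name is a placeholder):

    # Disable the SQL Agent job that corresponds to the pruning schedule
    $scheduleId = 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee'   # <-- ScheduleId from the SELECT above
    $query = "EXEC dbo.sp_update_job @job_name = N'$scheduleId', @enabled = 0;"
    Invoke-Sqlcmd -ServerInstance 'DPMSERVER\MSDPM2019' -Database 'msdb' -Query $query
    # Re-enable later by running the same statement with @enabled = 1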

    If it turns out to be DPM pruning, we can enable verbose logs for DPM and see what's happening in the back end, along with some other data.

    Also, please run the SQL stored procedure below to see if SQL has reported any I/O issues or errors during that time frame.

    sp_readerrorlog
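
    The same check can be run from PowerShell if that's more convenient; the instance name and search string below are just examples:

    # Read the current SQL Server error log (0 = current log, 1 = SQL error log) and filter for I/O messages
    Invoke-Sqlcmd -ServerInstance 'DPMSERVER\MSDPM2019' -Query "EXEC sp_readerrorlog 0, 1, N'I/O'" |
        Format-Table LogDate, ProcessInfo, Text -AutoSize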

    NOTE: It looks like your DPM server is for testing and doesn't have too many datasources under protection, hence I have suggested disabling pruning for the next few days. Otherwise it is not recommended.



    Wednesday, June 17, 2020 6:32 PM
  • Hi Aayoosh,

    Last night was one of those days when it didn't hang; nevertheless, I've disabled the pruning job per the instructions. We'll see if it has any effect.

    It's not really a test server, but disabling the pruning for couple days for diagnostic purposes is fine.

    Leon, I haven't really done any SQL optimizations, to be honest (not even min/max memory; we do that on heavy-duty SQL servers, which I haven't considered the DPM SQL instance to be until now). All we usually do is reconfigure the autogrowth parameters to avoid too much disk fragmentation. I could understand this being some contention issue on a very busy server, but I find that hard to believe, as it's the least loaded of all the DPM servers we have and they're all configured the same way.

    I'll set up a perfmon trace to see if it can reveal something more than our regular monitoring system.

    Edit: There's also a SQL backup maintenance task running at midnight (backup of the DPM DB), so there are 3 things starting at midnight: pruning, the SQL backup, and the DPM backups. If disabling pruning helps, I'll try moving the DPM DB backup to a later time.
    • Edited by MarkosP Thursday, June 18, 2020 3:50 PM
    Thursday, June 18, 2020 3:46 PM
    Let us know how everything goes; you can also have a look at the SQL logs around midnight to check if there's anything else going on.

    Blog: https://thesystemcenterblog.com

    Thursday, June 18, 2020 3:52 PM
  • Hi MarkosP,

    Thanks for the update,

    Well, running the DPMDB backup alongside the pruning job isn't really a good idea, so disabling the pruning job for a few days may help us understand the root cause here.

    However, I would like to get some more data before the next update:

    What is the DPMDB size currently, and what is the fragmentation on the DPMDB?

    https://techcommunity.microsoft.com/t5/system-center-blog/dpmdb-maintenance-part-2-identifying-and-dealing-with/ba-p/350453

    Use the above link to check the DPM DB fragmentation percentage.
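
    Both checks can also be run as plain queries; a minimal sketch (instance and database names are placeholders):

    # Overall DPMDB size
    Invoke-Sqlcmd -ServerInstance 'DPMSERVER\MSDPM2019' -Database 'DPMDB' -Query 'EXEC sp_spaceused'

    # Index fragmentation, worst first
    $frag = "
    SELECT OBJECT_NAME(ips.object_id) AS TableName, i.name AS IndexName,
           ips.avg_fragmentation_in_percent, ips.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN sys.indexes AS i ON i.object_id = ips.object_id AND i.index_id = ips.index_id
    ORDER BY ips.avg_fragmentation_in_percent DESC;"
    Invoke-Sqlcmd -ServerInstance 'DPMSERVER\MSDPM2019' -Database 'DPMDB' -Query $frag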

    Also, we can mimic the scenario right now by taking a DPMDB backup (using a SQL query or your maintenance job) and running the pruning job (using the ScheduleID we identified above) at the same time, and see if the VM hangs again.

    Thursday, June 18, 2020 4:16 PM
    No hangs since I disabled the pruning job, so I've re-enabled it and moved the backup job to run half an hour later (at 0:30).

    The DPM DB is very small, ~600 MB. Fragmentation is also low, since we run a weekly index reorg task.

    I'll let you know how it goes.

    Monday, June 22, 2020 6:34 AM
    Leave it enabled during the same period it was disabled (same days), so it's easier to identify the root cause, and have Performance Monitor running during this time as well.

    Personally I've never witnessed the pruning jobs causing issues, but it might also just be a coincidence.


    Blog: https://thesystemcenterblog.com

    Monday, June 22, 2020 6:47 AM
    Eh, it must be the pruning job, because the VM just hung ... when I wrote before that I re-enabled the pruning job, I also ran it manually via SQL Agent ... a couple of minutes later and here we go again.
    Monday, June 22, 2020 6:53 AM
  • Since the affected DPM server is small, I'd try a clean install and check if the issue occurs again.

    If the problem occurs again and you are running everything on a supported platform, I'd raise a ticket to Microsoft about the issue.


    Blog: https://thesystemcenterblog.com

    Monday, June 22, 2020 7:28 AM
    Hi MarkosP,

    If you are willing to troubleshoot on the forum and try a few more things, we can do our best, but it will take some time.

    Now that we know the cause of this issue, we can isolate it a bit further. Please help me answer the queries below:

    1. Please run sp_readerrorlog in SQL Management Studio. I want to confirm that we are not hitting any bottleneck on the SQL side before we explore other aspects of the OS/DPM. Run the SP and see if there are any complaints during the pruning job.

    2. Does the DPM server have AV installed? If so, have you added the required exclusions for the DPM services/paths etc.? Please share the fltmc output from the DPM server; I would like to see the altitudes of all the filter drivers installed on it.

    3. Run the DiskSpd tool against the DPM storage pool and share the output. While it runs, keep a close eye on the VM performance. Also run it against the SQL drive; we need to see the performance of both drives individually.

    Use the command below to get the disk IOPS (point it at the correct drive letter):

    diskspd.exe -b64K -d300 -L -o8 -t4 -r -w70 -c500M D:\test\Tes2.dat

    Now, since you have already run the pruning job manually, I am assuming it may have removed the eligible RPs. Could you run the pruning job again and see if it affects the VM? Try it a few times.

    If the above doesn't help us much, I would be interested in seeing the MSDPM verbose logs during the pruning operation.


    Monday, June 22, 2020 7:50 PM
  • Hi.

    1. There are only messages about the DPM DB being backed up according to the maintenance plan.

    2. Yes, the default Windows Defender, but with real-time protection disabled (on all our DPM servers), as it caused severe issues with backups in the past (despite having all the appropriate exclusions in place).

    3. All DPM VMs share the same local storage configured as tiered storage space (regular, not S2D). The SQL drive is on the same storage, only pinned to the SSD tier.

    Also, when I reran the pruning job, it finished in about 20 seconds, but then there was load on the system (caused by the 'System' process). It removed some of the RPs, then hung. After a reboot, I ran the job again; this time it finished, removing the remaining eligible RPs.

    For reference, here's a screenshot of the system process load when removing RPs (I assume this is normal):


    Link to the diskspd results for the SQL drive: https://pastebin.com/AUeP9YZ8

    Link to the diskspd results for the DPM storage drive: https://pastebin.com/gHQFABXG



    • Edited by MarkosP Tuesday, June 23, 2020 8:27 AM
    Tuesday, June 23, 2020 8:27 AM
  • The disks appear to be performing OK, so the issue is likely not there.

    I would check the SQL best practices to see if there are any fixes/configurations that can be applied; otherwise I'd try a fresh reinstall, apply every best practice there is before deploying DPM, and then monitor the situation.


    Blog: https://thesystemcenterblog.com

    Tuesday, June 23, 2020 8:55 AM
    Hi MarkosP,

    Thanks for the update,

    In the above screenshot I can see the CcCopyReadEx function being called 4 times with high CPU load. The CcCopyReadEx routine copies data from a cached file to a user buffer, and that's where it could be causing a huge performance lag on the server. Along with that, I can see that the ReFS.sys stack is also loaded, so it makes sense that the bottleneck could be coming from ReFS I/O operations with respect to ReFS extents (clone blocks).

    What OS build are you using for 2019? Make sure it is fully patched. Also look for ReFS events in Event Viewer (Applications and Services Logs -> Microsoft -> Windows -> ReFS).
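
    A quick way to gather both from PowerShell (the exact ReFS channel name can vary, so it is discovered first rather than assumed):

    # OS build of the guest
    Get-ComputerInfo | Select-Object OsName, OsVersion, OsBuildNumber

    # Find the ReFS event log channel(s) present on this machine...
    Get-WinEvent -ListLog *ReFS* | Select-Object LogName, RecordCount

    # ...then pull the last day of entries (adjust LogName to whatever the previous command returned)
    Get-WinEvent -FilterHashtable @{ LogName = 'Microsoft-Windows-ReFS/Operational'; StartTime = (Get-Date).AddDays(-1) } |
        Select-Object TimeCreated, Id, LevelDisplayName, Message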

    I would also like to get ProcMon traces when the issue occurs. Try running ProcMon, but don't use virtual memory to capture the data; instead point it at a local volume where ProcMon will save its traces (File -> Backing Files -> select a local folder). Collect the traces and share them if possible.

    Hopefully we will get something before the VM completely freezes, so we can see what has been happening in the OS.

    Tuesday, June 23, 2020 9:33 AM
  • Hi.

    CcCopyReadEx is called many times during that process; this was just a screenshot to show what's going on.

    Also, here's the output of fltmc which I forgot to post earlier:

    Filter Name                     Num Instances    Altitude    Frame
    ------------------------------  -------------  ------------  -----
    FsDepends                               6       407000         0
    DpmFilter                               0       385300         0
    storqosflt                              0       244000         0
    wcifs                                   0       189900         0
    CldFlt                                  0       180451         0
    Dedup                                   3       180450         0
    FileCrypt                               0       141100         0
    luafv                                   1       135000         0
    npsvctrig                               1        46000         0
    Wof                                     1        40700         0

    OS is WS2019 Std 1809 build 17763 (LTSC) with 2020-05 CU installed.

    There are no errors in the ReFS event log, only informational summaries with I/O latencies, and no events around midnight when the pruning occurs.

    The VM had been running fine since the last reboot, until last night when it hung again (even with the backup job moved).

    I'll try to get a procmon trace.


    • Edited by MarkosP Thursday, June 25, 2020 2:00 PM
    Thursday, June 25, 2020 1:59 PM
    CcCopyReadEx is expected to be called many times during the purge, as it needs to copy data from a cached file to a user buffer; when there are multiple instances running at the same time, it chokes the OS and causes the hang.

    In the screenshot you posted, there were 3 heavily utilized CcCopyReadEx calls while the others were sitting idle. During the purge, I believe each instance uses its own resources and causes a huge bottleneck. It is more of a ReFS issue that I am suspecting now, and DPM is just a victim here.

    Could you share the ReFS.sys driver version from your working and non-working DPM servers? Are they on the same version or different?

    Thursday, June 25, 2020 2:53 PM
    The refs.sys version is 10.0.17763.1192. On the other DPM servers it is 10.0.17763.1075, but that will be because this server is newer than the others and they haven't gone through patch maintenance yet.

    ReFS parameters are "optimized" on all DPM servers per https://charbelnemnom.com/2018/08/how-to-optimize-refs-performance-with-system-center-data-protection-manager-windowsserver-refs-scdpm-dpm/
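
    For anyone comparing versions, the driver version can be read directly on each server, e.g.:

    # ReFS driver version on this machine
    (Get-Item "$env:SystemRoot\System32\drivers\ReFS.sys").VersionInfo.ProductVersion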
    Thursday, June 25, 2020 6:55 PM
    Okay - yes, I was expecting a different version, but I'm surprised to see that the non-working DPM server has the newer one. I am now quite certain that it's related to some ReFS code. As a recommendation, do not update the OS on the working DPM servers, at least for now.

    Those ReFS keys won't help much, as those fixes already shipped in Windows Server 2019.

    Now I would like to get some more data to look into the memory side:

    1) Let's try to get a memory dump of the VM while it is in the hung state. When the VM is hung, run the command below on the host to generate a guest memory dump:

    Debug-vm "DPM VM Name" -InjectNonMaskableInterrupt -Force

    Make sure the guest is configured to generate a full memory dump. Once the dump is generated, try to send it over and let me see if we can understand the ReFS behavior.
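
    A minimal sketch of switching the guest to a complete memory dump from PowerShell (standard CrashControl registry settings; a reboot is required and the page file on the system drive needs to be at least RAM size plus a little headroom):

    # Inside the guest: 1 = Complete memory dump
    $crash = 'HKLM:\SYSTEM\CurrentControlSet\Control\CrashControl'
    Set-ItemProperty -Path $crash -Name CrashDumpEnabled -Value 1
    # Confirm the setting and where the dump will be written
    Get-ItemProperty -Path $crash | Select-Object CrashDumpEnabled, DumpFile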

    2) ProcMon would be helpful as well. Try to get a ProcMon trace after running the manual prune job; even if the VM doesn't hang, the trace will still be helpful.

    *******************************************************

    While we collect that data and spend some time analyzing it, in the interim we could also try a few more things to isolate the issue even further. See if you can follow the steps below to generate some more information (a host-side sketch follows after the steps):

    Export the working DPM VM from Hyper-V (the one where the pruning job is not causing a hang).

    Import the same VM on Hyper-V and then disconnect it from the network.

    Boot it up and manually install the latest KB so that its ReFS version matches the non-working DPM server, i.e. 10.0.17763.1192.

    Once done, trigger a manual pruning job from SQL and see if we can reproduce the issue.
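
    If it helps, the host-side part of those steps could be scripted roughly like this (VM name and paths are placeholders):

    # Export the working DPM VM
    Export-VM -Name 'DPM-WORKING' -Path 'E:\VMExport'

    # Import it as a copy with a new ID so it can run alongside the original
    $vmcx  = Get-ChildItem 'E:\VMExport\DPM-WORKING\Virtual Machines\*.vmcx' | Select-Object -First 1
    $clone = Import-VM -Path $vmcx.FullName -Copy -GenerateNewId -VhdDestinationPath 'E:\VMClone' -VirtualMachinePath 'E:\VMClone'

    # Keep the clone off the network before booting it
    Get-VMNetworkAdapter -VM $clone | Disconnect-VMNetworkAdapter
    Start-VM -VM $clone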

    Let me know if you are up for it. Thanks. 


    Thursday, June 25, 2020 7:59 PM
  • Hi.

    Re 1) The VM hung one time, but I forgot to configure a full memory dump, so I'm now waiting for another occurrence.

    Re 2) I've collected a trace; what exactly should I look for in it?

    I've also done as you suggested: exported one of the working DPM VMs, imported it back, disconnected it from the network, manually installed the latest CU (which brought the refs.sys version to 1192) and ran the pruning job from SQL several times. There was no increase in CPU load, not even for the System process. I let the VM run overnight (the backup jobs obviously failed) and tried running the pruning job again several times - same behavior.

    Wednesday, July 1, 2020 11:14 AM
    Would it be possible to share the ProcMon trace? Without seeing the traces, it is difficult to say where to start. See if you can upload the ProcMon traces.

    The new VM probably didn't exhibit the same issue because it didn't have many RPs to prune. If you exported the working VM after the prune job had completed on that server, it won't help much, because by the time you import it and re-run the pruning there will be nothing left to prune, and hence not much disk activity. And of course running the pruning job over and over won't help either, as new RPs are not being created.

    However, you could try changing the retention range for all the PGs and bringing it down significantly. Once done, run pruning using the SQL schedule ID and see what the utilization on the server looks like.

    For example: if you have PGs where the average retention is 10 days, I would change it to 5-6 days and then let the pruning job take care of the rest.

    NOTE: All the changes to be done on the EXPORTED VM only. Just making myself clear :)


    Wednesday, July 1, 2020 12:11 PM
  • Hello,

    We have had the same problem for two weeks.

    We have two VMs (one DPM VM, one SQL 2017 VM) running on a 2019 Hyper-V host (no other VMs running).
    All servers are Windows Server 2019 with the latest KBs installed.

    Shortly after midnight (00:30), the DPM 2019 VM becomes unresponsive.
    It is not reachable via RDP or the Hyper-V console. On the Hyper-V host I see high CPU on the DPM VM, so something is running that takes all the resources.

    At this time no PGs are scheduled (the next one begins at 2 o'clock in the morning).

    After rebooting the machine, all PGs run without problems, until midnight the next day.

    This problem started 2 weeks ago. Before that (the machine has been running for nearly 2 months), it never occurred.


    Thursday, July 2, 2020 4:42 AM
    Try to stop the pruning job using the schedule ID and see if it helps, and also share the refs.sys version. It could be related to ReFS. Try to capture Process Explorer and ProcMon data during pruning as well.

    Thursday, July 2, 2020 5:48 AM
    Aayoosh, is there any private way to share the trace with you? I'm not really keen to post it publicly on the internet.

    I've tried shortening the retention period, but it doesn't seem to have any effect - probably because it can't contact the network/protected agents (though I guess that should have no effect on pruning)? I've repeatedly run the pruning job and for good measure even restarted the server, but the DPM UI still reports the original number of RPs for each DS. I've also verified this using PowerShell in case there's some UI bug.
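
    For reference, the per-datasource RP count check in PowerShell looks roughly like this with the DPM module (server name is a placeholder):

    # Count recovery points for each datasource on the DPM server
    $ds = Get-DPMDatasource -DPMServerName 'DPMSERVER'
    foreach ($d in $ds) {
        $rps = @(Get-DPMRecoveryPoint -Datasource $d)
        [pscustomobject]@{ Datasource = $d.Name; RecoveryPoints = $rps.Count }
    }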

    Thursday, July 2, 2020 9:56 AM
    Connect with me via LinkedIn; we will figure out a way to share the logs.

    Have you tried running the pruneshadowcopies script manually? If yes, did it help? Take a checkpoint of the VM (a production checkpoint), run that script and see if it removes the RPs.

    Thursday, July 2, 2020 10:35 AM
  • pruneshadowcopiesdpm.ps1 fails because it can't make a connection to AD:

    connect-dpmserver : The server <redacted> could not be found in Active Directory. (ID: 31144)
    Ping this server to verify that it is accessible and then retry the operation. If the server is accessible, verify that
     the DPM service is running and that the current user has a domain account and is a member of the local administrators group. If the problem persists, contact your DPM administrator.

    I have a full memory dump of the problematic VM and I've sent you an invitation on LinkedIn.

    Friday, July 3, 2020 9:16 AM
  • Hi,

    I looked at the ProcMon traces and didn't see any sort of bottleneck there. I didn't see any high CPU usage either. I noticed that the DPM server had 2 CPUs and only 8 GB RAM, so I'm guessing you shared a PML trace from some other DPM server.

    Please re-upload the PML trace from the non-working DPM server during the pruning job, along with the full memory dump.

    Friday, July 3, 2020 12:11 PM
  • Hi Markos,

    Thanks for uploading the memory dump. Below is my analysis.

    I checked for certain types of locks in the dump (spinlocks, queued locks, deadlocks etc.) but found none, so I didn't see any abnormalities in that respect. However, the analysis below shows why the server went unresponsive.

    First I checked the CPU configuration and noticed that the DPM server has only 2 cores:

    1: kd> !cpuinfo
    CP  F/M/S Manufacturer  MHz PRCB Signature    MSR 8B Signature Features
     0  6,85,4 GenuineIntel 3492 ffffffff00000000                   211b3dfe
     1  6,85,4 GenuineIntel 3492 ffffffff00000000                   211b3dfe
                          Cached Update Signature ffffffff00000000
                         Initial Update Signature ffffffff00000000


    After that I looked at the top 5 processes with the highest memory commit:

     Pid ImageName                        Commit   SharedCommit        Debt
    
      ed4 sqlservr.exe                  382856 Kb        2640 Kb        0 Kb
     1174 ReportingServicesService      206076 Kb        2308 Kb        0 Kb
     1410 RSPortal.exe                  184008 Kb        2580 Kb        0 Kb
      ef8 RSHostingService.exe          136896 Kb        2368 Kb        0 Kb
     1130 msdpm.exe                     133196 Kb        2268 Kb        0 Kb


    After that I looked at the locks currently held in memory:

    0: kd> !locks
    **** DUMP OF ALL RESOURCE OBJECTS ****
    KD: Scanning for held locks...
    
    Resource @ 0xffff9d01dbce3cc8    Shared 1 owning threads
         Threads: ffff9d01ded05040-01<*> 
    280727 total locks, 27 locks currently held


    I noticed there were many threads in a wait/lock state. I looked at a few; below are examples.

    Thread 1:
    
    0: kd> !thread ffff9d01ded05040
    THREAD ffff9d01ded05040  Cid 0004.10a4  Teb: 0000000000000000 Win32Thread: 0000000000000000 READY on processor 80000001
    IRP List:
        ffff9d01d5921bb0: (0006,03a0) Flags: 00060043  Mdl: ffff9d0130cee390
    Not impersonating
    DeviceMap                 ffffc505c8a135a0
    Owning Process            ffff9d0129880040       Image:         System
    Attached Process          N/A            Image:         N/A
    Wait Start TickCount      16602307       Ticks: 1185 (0:00:00:18.515)
    Context Switch Count      122321         IdealProcessor: 1             
    UserTime                  00:00:00.000
    KernelTime                00:06:58.281
    Win32 Start Address nt!ExpWorkerThread (0xfffff80671f06ff0)
    Stack Init fffffe8ebff80dd0 Current fffffe8ebff7e1d0
    Base fffffe8ebff81000 Limit fffffe8ebff7b000 Call 0000000000000000
    Priority 15 BasePriority 13 PriorityDecrement 32 IoPriority 2 PagePriority 5
    Child-SP          RetAddr           : Args to Child                                                           : Call Site
    fffffe8e`bff7e210 fffff806`71ec38e7 : ffff9d01`ded05040 ffff9d01`ded05040 00000000`0000600a fffff806`00000005 : nt!KiSwapContext+0x76
    fffffe8e`bff7e350 fffff806`71ec3459 : ffff9d01`ded05040 00000000`00000000 ffff9d01`2c3cbae0 8a000001`31aa9863 : nt!KiSwapThread+0x297
    fffffe8e`bff7e410 fffff806`71ec21e0 : fffffe8e`bff7e800 00000000`00000000 00000000`00000000 fffffe8e`bff7e521 : nt!KiCommitThreadWait+0x549
    fffffe8e`bff7e4b0 fffff806`74e04580 : fffffe8e`bff7e608 fffffe8e`00000000 00000000`cd1f4000 00000000`00000000 : nt!KeWaitForSingleObject+0x520
    fffffe8e`bff7e580 fffff806`74dddc6f : 00000000`00000000 00000000`00000000 ffff9d01`2edfea70 ffff9d01`00000103 : ReFS!RefsMinstoreRead+0x194
    fffffe8e`bff7e690 fffff806`74ddc170 : fffffe8e`bff7e870 ffff9d01`d5921bb0 ffff8587`b6af4001 00000000`00000000 : ReFS!RefsCommonRead+0x16df
    fffffe8e`bff7e850 fffff806`71f093c9 : ffff9d01`d5921bb0 fffff806`72039db0 ffff9d01`30cee2d0 ffff9d01`30cee390 : ReFS!RefsFsdRead+0x440
    fffffe8e`bff7eb70 fffff806`71e157c0 : ffff9d01`2f132030 ffff9d01`d5921bb0 ffff9d01`30cee2d0 ffff9d01`30cee390 : nt!IofCallDriver+0x59
    fffffe8e`bff7ebb0 fffff806`71e686e9 : ffff9d01`30cee280 ffff9d01`30cee2a0 ffff9d01`30cee2e0 ffff9d01`30cee2d0 : nt!IoPageReadEx+0x188
    fffffe8e`bff7ec00 fffff806`71e99d5d : 00000000`00000002 fffffe8e`bff7ec90 fffffe8e`bff7edf8 fffffe7f`2161eda8 : nt!MiIssueHardFaultIo+0xc1
    fffffe8e`bff7ec50 fffff806`71eb0dbf : ffff8000`00000000 00000000`c0033333 ffff8587`b6af4000 ffff8587`b6af4000 : nt!MiIssueHardFault+0x3ed
    fffffe8e`bff7ed00 fffff806`71e090fc : fffffe7f`2161eda8 fffff806`7223e6c0 fffffe7f`2161eda8 ffff8587`b6af4000 : nt!MmAccessFault+0x32f
    fffffe8e`bff7eea0 fffff806`71eaed3f : 00000000`00000001 ffff8587`b6af4000 fffffe42`c3db57a0 fffffe7f`2161eda8 : nt!MiFaultInProbeAddress+0xa4
    fffffe8e`bff7ef50 fffff806`71eadd7e : fffffe8e`bff7f080 fffff806`7223b7c0 fffffe8e`bff7f0c0 00000000`00000000 : nt!MiLockPageLeafPageTable+0x27f
    fffffe8e`bff7efc0 fffff806`71ead099 : ffff9d01`00000000 00000000`00000000 00000000`00000000 ffff9d01`dbce3b00 : nt!MiProbeAndLockPages+0x39e
    fffffe8e`bff7f110 fffff806`74df9992 : fffffe8e`bff7f700 ffff9d01`2f24a000 ffffc505`ce1e2100 00000000`00000000 : nt!MmProbeAndLockPages+0x29
    fffffe8e`bff7f140 fffff806`74dd72b4 : ffff9d01`dbce3b28 00000000`00000000 ffff9d01`dbce3d40 ffff9d02`1984e110 : ReFS!MsKmeLockPages+0x52
    fffffe8e`bff7f190 fffff806`74dd7132 : fffffe8e`bff7f7a0 00000000`00000000 fffffe8e`bff7f700 fffffe8e`bff7f701 : ReFS!CmsVolume::MakePageResident+0xb4
    fffffe8e`bff7f3e0 fffff806`74dd6ff0 : 00000000`00000001 ffff9d01`73aea008 fffff806`74dd6fe0 fffffe8e`bff7f7a0 : ReFS!CmsVolume::ReadPageWorker+0x132
    fffffe8e`bff7f570 fffff806`71eca9f8 : fffffe8e`bff7f750 00000000`00000000 fffffe8e`bff7f780 fffffe8e`bff7fa38 : ReFS!CmsVolume::ReadPageCallout+0x10
    fffffe8e`bff7f5a0 fffff806`71eca96d : fffff806`74dd6fe0 fffffe8e`bff7f7a0 00000000`00000000 00000000`00000000 : nt!KeExpandKernelStackAndCalloutInternal+0x78
    fffffe8e`bff7f610 fffff806`74db709b : ffff9d02`1ee00000 00000000`001991f4 ffff9d01`00000000 00000000`00000240 : nt!KeExpandKernelStackAndCalloutEx+0x1d
    fffffe8e`bff7f650 fffff806`74db59cf : 00000000`00000000 ffff9d01`2f24dc30 ffff7949`1513c850 00000000`00000000 : ReFS!CmsVolume::PinSinglePageWorker+0x112b
    fffffe8e`bff7f870 fffff806`74dc7741 : 00000000`0000002e 00000000`00000000 00000000`00000001 00000000`00000000 : ReFS!CmsVolume::PinCommon+0x14f
    fffffe8e`bff7fb60 fffff806`74dc7114 : ffff8588`09e00058 fffffe8e`bff7fd20 ffff9d01`2c6b9920 ffff8588`09e00ad0 : ReFS!CmsCompositeBase::PinChild+0xb1
    fffffe8e`bff7fc20 fffff806`74da547b : ffff8588`09e04058 fffffe8e`bff80040 ffff9d01`73aea008 00000000`00000000 : ReFS!CmsBPlusTable::AttemptSiblingMerge+0x300
    fffffe8e`bff7ff40 fffff806`74ded107 : ffff9d01`2c6b9920 fffffe8e`bff80299 ffff9d01`b7f35b68 ffff9d01`2c6b9920 : ReFS!CmsBPlusTable::TryToCollapseFromPage+0x53f
    fffffe8e`bff80200 fffff806`74de98fc : 00000000`00000000 00000000`00000000 ffff9d01`2c6b9920 ffff9d01`2c6b9a80 : ReFS!CmsBPlusTable::TryToCollapseTree+0xef
    fffffe8e`bff80300 fffff806`74de95c2 : 00000000`00000000 00000000`00000000 ffff9d01`2c6b9920 fffff806`71f632be : ReFS!CmsBPlusTable::FailableTreeUpdate+0x160
    fffffe8e`bff80370 fffff806`74de6f69 : ffff9d01`00000001 ffff9d01`2f46b920 fffffe8e`bff80550 ffff9d01`2f46b920 : ReFS!CmsBPlusTable::RunFailableTreeUpdateForAllDirtyTrees+0x2a2
    fffffe8e`bff80450 fffff806`74debdce : ffff9d01`73aea008 fffffe8e`bff809d0 ffff9d01`2f24a900 ffff9d01`73aea008 : ReFS!CmsBPlusTable::UpdateBoundTrees+0xf29
    fffffe8e`bff808d0 fffff806`74df0a9b : ffff9d01`2f24a010 ffff9d01`2f24a940 ffff9d01`2f24a930 fffff806`72474088 : ReFS!CmsVolume::Checkpoint+0x622
    fffffe8e`bff80bc0 fffff806`74dfb937 : 00000000`00000000 fffff806`7235d240 00000000`00000000 00000000`00000000 : ReFS!MspCheckpointVolume+0x3b
    fffffe8e`bff80c60 fffff806`71f0715a : ffff9d01`2984e950 ffff9d01`ded05040 fffff806`72238000 00000000`00000000 : ReFS!MspWorkerRoutine+0x47
    fffffe8e`bff80cb0 fffff806`71e793a5 : ffff9d01`ded05040 ffff9d01`29880040 ffff9d01`ded05040 00000000`00000000 : nt!ExpWorkerThread+0x16a
    fffffe8e`bff80d50 fffff806`71fbef2c : ffffb100`62928180 ffff9d01`ded05040 fffff806`71e79350 00000000`00000246 : nt!PspSystemThreadStartup+0x55
    fffffe8e`bff80da0 00000000`00000000 : fffffe8e`bff81000 fffffe8e`bff7b000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x1c
    

    Thread 2:

    0: kd> !thread ffff9d01320a4040
    THREAD ffff9d01320a4040  Cid 0004.0b14  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Non-Alertable
        ffff9d012f24a580  NotificationEvent
    Not impersonating
    DeviceMap                 ffffc505c8a135a0
    Owning Process            ffff9d0129880040       Image:         System
    Attached Process          N/A            Image:         N/A
    Wait Start TickCount      9432095        Ticks: 7171397 (1:07:07:33.078)
    Context Switch Count      102174         IdealProcessor: 0             
    UserTime                  00:00:00.000
    KernelTime                00:00:52.125
    Win32 Start Address nt!ExpWorkerThread (0xfffff80671f06ff0)
    Stack Init fffffe8ec008bdd0 Current fffffe8ec008b270
    Base fffffe8ec008c000 Limit fffffe8ec0086000 Call 0000000000000000
    Priority 13 BasePriority 13 PriorityDecrement 0 IoPriority 2 PagePriority 5
    Child-SP          RetAddr           : Args to Child                                                           : Call Site
    fffffe8e`c008b2b0 fffff806`71ec38e7 : 00000000`00000000 ffff9d01`320a4040 ffff9d01`00000002 fffff806`7235b100 : nt!KiSwapContext+0x76
    fffffe8e`c008b3f0 fffff806`71ec3459 : ffff9d01`320a4040 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSwapThread+0x297
    fffffe8e`c008b4b0 fffff806`71ec21e0 : 00000000`00000000 00000000`00000000 00000000`00000000 fffffe8e`c008b5c1 : nt!KiCommitThreadWait+0x549
    fffffe8e`c008b550 fffff806`74de62a0 : ffff9d01`2f24a580 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KeWaitForSingleObject+0x520
    fffffe8e`c008b620 fffff806`74de9d94 : ffff9d01`320a4140 00000000`00000000 00000000`00000000 00000000`00000000 : ReFS!CmsBPlusTable::UpdateBoundTrees+0x260
    fffffe8e`c008baa0 fffff806`74dd379a : 00000000`00000001 fffff806`74dfb8f0 ffff9d01`2f24a010 fffff806`74dded00 : ReFS!CmsBPlusTable::UpdateTree+0x64
    fffffe8e`c008bb00 fffff806`74dd330a : 00000000`00000001 ffff9d01`2f24a010 ffff9d01`2f497ac8 fffff806`71ec3459 : ReFS!MsUpdateTree+0x32
    fffffe8e`c008bb60 fffff806`74dfb937 : ffff9d01`9c0d1050 fffff806`00000000 ffff9d01`00000000 00000000`00000000 : ReFS!MspLazyWriterWorker+0x21a
    fffffe8e`c008bc60 fffff806`71f0715a : ffff9d01`2984e950 ffff9d01`320a4040 fffff806`7223e300 ffff9d01`00000000 : ReFS!MspWorkerRoutine+0x47
    fffffe8e`c008bcb0 fffff806`71e793a5 : ffff9d01`320a4040 ffff9d01`29880040 ffff9d01`320a4040 00000000`00000000 : nt!ExpWorkerThread+0x16a
    fffffe8e`c008bd50 fffff806`71fbef2c : ffffb100`62928180 ffff9d01`320a4040 fffff806`71e79350 00000000`00000246 : nt!PspSystemThreadStartup+0x55
    fffffe8e`c008bda0 00000000`00000000 : fffffe8e`c008c000 fffffe8e`c0086000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x1c

    Thread 3:

    0: kd> !thread ffff9d01d36f40c0
    THREAD ffff9d01d36f40c0  Cid 0004.1a60  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Non-Alertable
        fffff80f2e470188  NotificationEvent
    Not impersonating
    DeviceMap                 ffffc505c8a135a0
    Owning Process            ffff9d0129880040       Image:         System
    Attached Process          N/A            Image:         N/A
    Wait Start TickCount      11506897       Ticks: 5096595 (0:22:07:14.296)
    Context Switch Count      51552          IdealProcessor: 0             
    UserTime                  00:00:00.000
    KernelTime                00:00:41.812
    Win32 Start Address nt!ExpWorkerThread (0xfffff80671f06ff0)
    Stack Init fffffe8ebf005dd0 Current fffffe8ebf005880
    Base fffffe8ebf006000 Limit fffffe8ebf000000 Call 0000000000000000
    Priority 15 BasePriority 12 PriorityDecrement 32 IoPriority 2 PagePriority 5
    Page 80031ea3 too large to be in the dump file.
    Page 80031ea3 too large to be in the dump file.
    Page 800a91fe too large to be in the dump file.
    Page 800a91fe too large to be in the dump file.
    Child-SP          RetAddr           : Args to Child                                                           : Call Site
    fffffe8e`bf0058c0 fffff806`71ec38e7 : 00000a00`00000000 ffff9d01`d36f40c0 00000000`00000000 00000000`00000000 : nt!KiSwapContext+0x76
    fffffe8e`bf005a00 fffff806`71ec3459 : ffff9d01`d36f40c0 00000000`00000000 00000000`00000000 ffffb100`62928180 : nt!KiSwapThread+0x297
    fffffe8e`bf005ac0 fffff806`71ec21e0 : 00000000`00000000 fffff806`00000000 fffff80f`00000000 fffffe8e`bf005bd1 : nt!KiCommitThreadWait+0x549
    fffffe8e`bf005b60 fffff80f`2e47b3bf : fffff80f`2e470188 00000000`00000000 00000000`00000000 00000000`00010200 : nt!KeWaitForSingleObject+0x520
    fffffe8e`bf005c30 fffff80f`2e4785b5 : ffff9d01`fc2028e0 00000000`0000003c ffff9d02`1984e6c0 00000000`00010246 : pdc!PdcReleaseLock+0xd3
    fffffe8e`bf005c70 fffff806`71f0715a : ffff9d01`d36f40c0 ffff9d02`1984e748 ffff9d01`2985b100 fffff806`00000000 : pdc!PdcpPreRevocationWorker+0xe5
    fffffe8e`bf005cb0 fffff806`71e793a5 : ffff9d01`d36f40c0 ffff9d01`29880040 ffff9d01`d36f40c0 35000034`00003300 : nt!ExpWorkerThread+0x16a
    fffffe8e`bf005d50 fffff806`71fbef2c : fffff806`70566180 ffff9d01`d36f40c0 fffff806`71e79350 b500000c`b000000c : nt!PspSystemThreadStartup+0x55
    fffffe8e`bf005da0 00000000`00000000 : fffffe8e`bf006000 fffffe8e`bf000000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x1c

    Essentially, the KeWaitForSingleObject routine puts the current thread into a wait state until the given dispatcher object is set to the signaled state or (optionally) until the wait times out.

    And if you look closely, the image name shows that it was the System process waiting on the dispatcher object. There were a total of 27 such threads waiting, which caused the system to become unresponsive.

    Suggestion: I think you need to consider upgrading the server's hardware resources. Running SQL and DPM on the same server with 2 CPUs will not help. In my lab I have 4 vCPUs attached, and at times it also struggles during extensive workloads.

    Please upgrade the CPU to at least 6-8 cores and the RAM to 16-24 GB, then monitor the DPM server's performance afterwards. Keep us posted.
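
    If you go that route, the resize itself is quick from the Hyper-V host; a sketch only, with example sizes (the VM must be off to change the vCPU count):

    # Placeholder VM name and example sizes - adjust to your environment
    Stop-VM -Name 'DPM2019'
    Set-VMProcessor -VMName 'DPM2019' -Count 6
    Set-VMMemory -VMName 'DPM2019' -DynamicMemoryEnabled $true -StartupBytes 16GB -MinimumBytes 16GB -MaximumBytes 24GB
    Start-VM -Name 'DPM2019'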

    • Marked as answer by MarkosP Friday, July 17, 2020 9:02 AM
    Tuesday, July 7, 2020 9:08 AM
  • Hi.

    First, to clarify - the HW specs I wrote above were incorrect; it was 2 vCPUs and 8-16 GB dynamic vRAM. The idea was to use lower specs than the other DPM servers, since this DPM only backs up one file server (3 volumes), while the others each back up tens to hundreds of datasources (and also much more data in total).

    Aayoosh, thanks for the analysis. I have increased the specs to 4 vCPUs and 12-16 GB vRAM; however, as I wrote to you in our private conversation, even with the lower specs the expected behavior would be slower job execution, not server hangs. Sadly, the official HW requirements for DPM say nothing about CPUs (there was a 4-CPU recommendation for DPM 2012, but not for newer versions). If this proves to be the culprit, it's just one more sore point with DPM.

    Tuesday, July 7, 2020 9:30 AM
  • Hello MarkosP!

    We have/had exactly the same issue: the DPM VM hangs every day between 00:00 and 00:20. The OS is still running but completely unresponsive: ICMP traffic gets replies, but RDP sessions are dropped, PowerShell remoting doesn't work, and the VM console is black or shows the lock screen clock several hours behind. The Hyper-V console permanently displays 24% CPU usage, when it is normally 0%.

    The Hyper-V host is running two DPM virtual machines. The small one is fine (10 TB repo); it is the bigger DPM (55 TB) that is failing.

    There are some events in the event log pointing to system resource issues, and the Hyper-V console displays high memory demand, some GB over the memory assigned to the guest.

    The ReFS registry fixes were already applied(*), so yesterday we assigned additional memory (64 GB -> 128 GB). This night the VM didn't hang, but we are not confident in this solution, because we had already doubled the memory last Thursday (32 GB -> 64 GB) and that fixed the issue only for that one night.

    (*) https://support.microsoft.com/en-us/help/4016173/fix-heavy-memory-usage-in-refs-on-windows
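
    For reference, applying those values comes down to a couple of registry writes in the guest; a sketch only - take the exact value names and numbers from the KB itself and reboot afterwards:

    # ReFS tuning values described in the linked KB (DWORDs under the FileSystem key);
    # the chunk count below is only an example
    $fs = 'HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem'
    New-ItemProperty -Path $fs -Name RefsEnableLargeWorkingSetTrim -Value 1 -PropertyType DWord -Force
    New-ItemProperty -Path $fs -Name RefsNumberOfChunksToTrim -Value 8 -PropertyType DWord -Force
    Get-ItemProperty -Path $fs | Select-Object Refs*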

    Sunday, July 12, 2020 9:50 AM
    So far, so good - it's been 6 days now without a hang. There were longer periods without a hang in the past, but it looks like increasing the HW specs and moving the backup job to a different time might have helped.

    As I've said above, if this is truly the reason for the hangs, it's sad that "enterprise" software like DPM can't handle it better.

    As for your issue, xBusgosu: you didn't mention how many CPUs your DPM VMs have, but it might be worth trying to increase the count.

    I'll wait a few more days and then close this thread.
    • Edited by MarkosP Wednesday, July 15, 2020 10:52 AM
    Wednesday, July 15, 2020 10:50 AM
    Most slowness is usually caused by under-provisioning (as I mentioned in my initial reply).

    If you do have any feedback, make sure to submit it over at the DPM uservoice page here:
    https://feedback.azure.com/forums/914488-system-center-data-protection-manager


    Blog: https://thesystemcenterblog.com

    Wednesday, July 15, 2020 10:53 AM
  • Hi Markos,

    Glad it has not hung in the last few days.

    Keep in mind that you might need to increase the DPM server's resources in the future to keep up with the CPU and memory requirements during pruning and backups. ReFS is resource-hungry, so as you keep adding new servers to protection, the number of RPs to prune at midnight will grow, and you'll need to find the sweet spot for the DPM server's CPU and memory.

    Wednesday, July 15, 2020 10:57 AM
    I consider the problem resolved (it's been over 7 days without a hang now).

    Thank you both Leon and Aayoosh for your help.

    Friday, July 17, 2020 9:02 AM