Nightly bluescreens on DPM server caused by ReFS

  • I'm not sure if this is the right forum for this, but on one of our DPM servers we are getting nightly blue screens caused by ReFS.sys. It happens between 2 and 10 times a night, and when the minidumps are analyzed they all come back similar to this:

    On Tue 7/18/2017 12:27:47 AM your computer crashed
    crash dump file: C:\Windows\Minidump\071817-18296-01.dmp
    uptime: 07:47:05
    This was probably caused by the following module: refs.sys (ReFS!FsLibCleanupPeriodicPerfData+0xD8A8) 
    Bugcheck code: 0xC2 (0xB, 0xFFFFCB0822945C40, 0x53C0203, 0xFFFFCB0822945C80)
    Error: BAD_POOL_CALLER
    file path: C:\Windows\system32\drivers\refs.sys
    product: Microsoft® Windows® Operating System
    company: Microsoft Corporation
    description: NT ReFS FS Driver
    Bug check description: This indicates that the current thread is making a bad pool request.
    This appears to be a typical software driver bug and is not likely to be caused by a hardware problem. 
    The crash took place in a standard Microsoft module. Your system configuration may be incorrect. Possibly this problem is caused by another driver on your system that cannot be identified at this time. 

    This server is a VMware VM running on ESXi 6.5, and VMware Tools is fully up to date. Is anyone else running into this issue? I've never heard of ReFS causing blue screens before.

    Tuesday, July 18, 2017 3:34 PM

All replies

  • After a bit more research, this seems to be an issue that another very popular backup product also has with ReFS. From what I can tell, it's a problem in ReFS itself that the ReFS team is investigating. From what I've read, it seems to be much more common on volumes formatted with 4K clusters than on volumes formatted with 64K clusters.

    According to the forum I was reading, a patch is being backported and tested and is expected to come out in August or September.

    • Edited by JN1226 Friday, July 21, 2017 7:55 PM
    Friday, July 21, 2017 7:02 PM
  • I opened a support case with Microsoft, and they stated that the bug check we submitted matched an issue they were seeing with DPM, which could (possibly?) be alleviated by turning off Windows telemetry on the server. I followed the instructions below (one registry value and a couple of disabled services, pretty simple), and the problematic server survived the entire night without throwing a SCOM error. Hopefully this will continue, although the server has occasionally gone a night without crashing before.

    To disable Telemetry and Data Collection, do the following:

    Open Registry Editor.

    Go to the following registry key:

    HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\DataCollection

    If the key does not exist, create it.

    Under that key, create a new 32-bit DWORD value named AllowTelemetry and set it to 0.

    Next, disable a couple of Windows services. Right-click the File Explorer item in the Windows 10 Start menu and pick Manage from its context menu.

    Go to Services and Applications -> Services in the left pane. In the services list, disable the following services:

    Diagnostics Tracking Service (named Connected User Experiences and Telemetry on newer builds)

    dmwappushsvc

    Double-click each of these services and set its startup type to "Disabled".

    Restart the server for the changes to take effect.
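    The registry and service steps above can also be scripted with the built-in reg.exe and sc.exe tools. The following is a minimal Python sketch, not something Microsoft support supplied. It assumes the service short names DiagTrack and dmwappushservice (the Services console shows display names, so confirm the short names with `sc query` first) and an elevated prompt. By default it only prints the commands; it runs them only on Windows when given --apply.

```python
"""Hypothetical helper: script the telemetry workaround with reg.exe and sc.exe.

Assumptions (not from the original instructions): the service short names
DiagTrack and dmwappushservice, and an elevated command prompt.
"""
import platform
import subprocess
import sys

POLICY_KEY = r"HKLM\SOFTWARE\Policies\Microsoft\Windows\DataCollection"

# Assumed short names; verify them with "sc query" before applying.
SERVICES = ["DiagTrack", "dmwappushservice"]


def build_commands():
    """Return the workaround as a list of argument lists."""
    commands = [
        # reg add creates the key if it is missing; REG_DWORD is the 32-bit
        # DWORD type, and /f overwrites an existing value without prompting.
        ["reg", "add", POLICY_KEY, "/v", "AllowTelemetry",
         "/t", "REG_DWORD", "/d", "0", "/f"],
    ]
    for name in SERVICES:
        # sc.exe expects "start=" and its value as two separate tokens.
        commands.append(["sc", "config", name, "start=", "disabled"])
    return commands


def main(apply_changes=False):
    for cmd in build_commands():
        print(" ".join(cmd))
        if apply_changes:
            # check=True stops at the first command that fails.
            subprocess.run(cmd, check=True)
    print("Restart the server for the changes to take effect.")


if __name__ == "__main__":
    # Dry run unless --apply is passed on a Windows machine.
    main(apply_changes="--apply" in sys.argv and platform.system() == "Windows")
```

    After running it with --apply, the restart described above is still required.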

    Tuesday, August 1, 2017 7:06 PM
  • I am still experiencing the same issue after applying the fix mentioned above. The DPM server uses a DPM storage disk formatted with ReFS, which sits on a RAID 60 volume.

    The "Connected User Experiences and Telemetry" and "dmwappushsvc" services are stopped and disabled.
    The registry change mentioned above (adding "AllowTelemetry" under HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\DataCollection and setting it to 0) has been applied as well; see https://ibb.co/kkn8nw


    My crash dump analysis is below:

    *******************************************************************************
    *                                                                             *
    *                        Bugcheck Analysis                                    *
    *                                                                             *
    *******************************************************************************

    BAD_POOL_CALLER (c2)
    The current thread is making a bad pool request.  Typically this is at a bad IRQL level or double freeing the same allocation, etc.
    Arguments:
    Arg1: 000000000000000b, Attempt to release quota on a corrupted pool allocation.
    Arg2: ffffd387067f8000, Address of pool
    Arg3: 0000000005c80200, Pool allocation's tag
    Arg4: ffffd387067f8040, Quota process pointer (bad).

    Debugging Details:
    ------------------

    Page 78e00 not present in the dump file. Type ".hh dbgerr004" for details
    Page 78e00 not present in the dump file. Type ".hh dbgerr004" for details
    Page 78e00 not present in the dump file. Type ".hh dbgerr004" for details

    DUMP_CLASS: 1

    DUMP_QUALIFIER: 401

    BUILD_VERSION_STRING:  14393.1794.amd64fre.rs1_release.171008-1615

    SYSTEM_MANUFACTURER:  Supermicro

    SYSTEM_PRODUCT_NAME:  Super Server

    SYSTEM_SKU:  Default string

    SYSTEM_VERSION:  0123456789

    BIOS_VENDOR:  American Megatrends Inc.

    BIOS_VERSION:  2.0a

    BIOS_DATE:  08/01/2016

    BASEBOARD_MANUFACTURER:  Supermicro

    BASEBOARD_PRODUCT:  X10SRi-F

    BASEBOARD_VERSION:  1.01B

    DUMP_TYPE:  1

    BUGCHECK_P1: b

    BUGCHECK_P2: ffffd387067f8000

    BUGCHECK_P3: 5c80200

    BUGCHECK_P4: ffffd387067f8040

    FAULTING_IP: 
    ReFS!FsLibCleanupPeriodicPerfData+cf38
    fffff80a`9f8a36cc 48891d6d04fcff  mov     qword ptr [ReFS!TelemetryGlobalPerfContext+0x120 (fffff80a`9f863b40)],rbx

    BUGCHECK_STR:  0xc2_b

    CPU_COUNT: 10

    CPU_MHZ: 834

    CPU_VENDOR:  GenuineIntel

    CPU_FAMILY: 6

    CPU_MODEL: 4f

    CPU_STEPPING: 1

    CPU_MICROCODE: 6,4f,1,0 (F,M,S,R)  SIG: B00001B'00000000 (cache) B00001B'00000000 (init)

    DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT

    PROCESS_NAME:  System

    CURRENT_IRQL:  0

    ANALYSIS_SESSION_HOST:  XXXX-XXXXX

    ANALYSIS_SESSION_TIME:  12-06-2017 09:32:59.0537

    ANALYSIS_VERSION: 10.0.15063.468 amd64fre

    LAST_CONTROL_TRANSFER:  from fffff8032a44fd97 to fffff8032a356790

    STACK_TEXT:  
    ffffbd80`90479858 fffff803`2a44fd97 : 00000000`000000c2 00000000`0000000b ffffd387`067f8000 00000000`05c80200 : nt!KeBugCheckEx
    ffffbd80`90479860 fffff80a`9f8a36cc : ffffd387`067f8040 fffff80a`9f8a3744 01d36e4a`4015fc88 00000000`000000c5 : nt!ExFreePoolWithTag+0x1d97
    ffffbd80`90479940 fffff80a`9f896739 : fffff803`2a5bc100 fffff80a`9f864028 00000000`00000000 00000000`00000200 : ReFS!FsLibCleanupPeriodicPerfData+0xcf38
    ffffbd80`90479970 fffff80a`9f896706 : 00000000`00000001 ffffffff`d447e880 fffff80a`9f8965d0 fffff80a`9f864028 : ReFS!RefsPeriodicCleanup+0x9
    ffffbd80`904799a0 fffff803`2a26d0e9 : ffff9686`9e4c0800 ffff9686`9e4c0800 fffff803`00000000 00000000`00000200 : ReFS!RefsPeriodicTimerCallback+0x136
    ffffbd80`90479b80 fffff803`2a2bc595 : 00000000`00000000 00000000`00000080 ffff9686`67c1e600 ffff9686`9e4c0800 : nt!ExpWorkerThread+0xe9
    ffffbd80`90479c10 fffff803`2a35bc56 : ffffbd80`83200180 ffff9686`9e4c0800 fffff803`2a2bc554 00000000`00000000 : nt!PspSystemThreadStartup+0x41
    ffffbd80`90479c60 00000000`00000000 : ffffbd80`9047a000 ffffbd80`90474000 00000000`00000000 00000000`00000000 : nt!KiStartSystemThread+0x16


    STACK_COMMAND:  kb

    THREAD_SHA1_HASH_MOD_FUNC:  55c488f73b35be49f877bd7748c042e5f0fbcc2f

    THREAD_SHA1_HASH_MOD_FUNC_OFFSET:  e23091ed42102d5c1977a5acd6b1cbe64980d88e

    THREAD_SHA1_HASH_MOD:  4c78d5a8366235ed698229c39e189c6ea925ad0e

    FOLLOWUP_IP: 
    ReFS!FsLibCleanupPeriodicPerfData+cf38
    fffff80a`9f8a36cc 48891d6d04fcff  mov     qword ptr [ReFS!TelemetryGlobalPerfContext+0x120 (fffff80a`9f863b40)],rbx

    FAULT_INSTR_CODE:  6d1d8948

    SYMBOL_STACK_INDEX:  2

    SYMBOL_NAME:  ReFS!FsLibCleanupPeriodicPerfData+cf38

    FOLLOWUP_NAME:  MachineOwner

    MODULE_NAME: ReFS

    IMAGE_NAME:  ReFS.SYS

    DEBUG_FLR_IMAGE_TIMESTAMP:  59bf2b6f

    IMAGE_VERSION:  10.0.14393.1770

    BUCKET_ID_FUNC_OFFSET:  cf38

    FAILURE_BUCKET_ID:  0xc2_b_ReFS!FsLibCleanupPeriodicPerfData

    BUCKET_ID:  0xc2_b_ReFS!FsLibCleanupPeriodicPerfData

    PRIMARY_PROBLEM_CLASS:  0xc2_b_ReFS!FsLibCleanupPeriodicPerfData

    TARGET_TIME:  2017-12-06T04:25:12.000Z

    OSBUILD:  14393

    OSSERVICEPACK:  0

    SERVICEPACK_NUMBER: 0

    OS_REVISION: 0

    SUITE_MASK:  272

    PRODUCT_TYPE:  3

    OSPLATFORM_TYPE:  x64

    OSNAME:  Windows 10

    OSEDITION:  Windows 10 Server TerminalServer SingleUserTS

    OS_LOCALE:  

    USER_LCID:  0

    OSBUILD_TIMESTAMP:  2017-10-09 03:45:44

    BUILDDATESTAMP_STR:  171008-1615

    BUILDLAB_STR:  rs1_release

    BUILDOSVER_STR:  10.0.14393.1794.amd64fre.rs1_release.171008-1615

    ANALYSIS_SESSION_ELAPSED_TIME:  86ef

    ANALYSIS_SOURCE:  KM

    FAILURE_ID_HASH_STRING:  km:0xc2_b_refs!fslibcleanupperiodicperfdata

    FAILURE_ID_HASH:  {aa38bb81-cdf4-a9b1-94c3-5549ec39dba2}

    Followup:     MachineOwner
    ---------
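    The BAD_POOL_CALLER subcode 0xB arguments in both crash reports in this thread can be decoded a little further. The following Python sketch assumes the usual pool-tag encoding: four ASCII characters stored in memory as consecutive bytes, so a hypothetical tag "Ntfs" reads as the 32-bit little-endian value 0x7366744E. It is an illustration, not a diagnosis. Under that assumption, neither Arg3 value decodes to a readable tag, which is at least consistent with the "corrupted pool allocation" description, and in both crashes Arg4 lies exactly 0x40 bytes past Arg2.

```python
"""Sketch: decode BAD_POOL_CALLER (0xC2) subcode 0xB arguments from this thread.

Assumption: a pool tag is four ASCII characters stored little-endian in a
32-bit value. The argument values are copied from the two crash reports.
"""
import string

# (Arg2 pool address, Arg3 pool tag, Arg4 quota process pointer)
CRASHES = {
    "071817-18296-01.dmp (July 2017)": (0xFFFFCB0822945C40, 0x053C0203, 0xFFFFCB0822945C80),
    "tc_SysAdmin dump (December 2017)": (0xFFFFD387067F8000, 0x05C80200, 0xFFFFD387067F8040),
}

PRINTABLE = set(string.ascii_letters + string.digits + string.punctuation + " ")


def decode_tag(tag):
    """Return the tag as text, or None if any of its four bytes is not printable ASCII."""
    raw = tag.to_bytes(4, "little")
    text = raw.decode("latin-1")  # latin-1 maps every byte to one character
    return text if all(ch in PRINTABLE for ch in text) else None


for name, (pool, tag, quota) in CRASHES.items():
    print(f"{name}: tag={decode_tag(tag)!r}, Arg4 - Arg2 = {quota - pool:#x}")
```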




    Please assist.



    • Edited by tc_SysAdmin Wednesday, December 6, 2017 8:57 AM
    Wednesday, December 6, 2017 8:54 AM
  • Have you installed the latest cumulative update for Windows Server AND DPM?

    If not, do so ASAP.
    Wednesday, December 6, 2017 11:01 AM
  • UR4 for DPM was installed, but not KB4051033, which came out on 22/11/2017. I have installed it now and will keep an eye on the server.
    Wednesday, December 6, 2017 3:05 PM