none
Replica becomes inconsistent - CC jobs failing with error "An unexpected internal error occurred in the StorageManager. (ID 40001)" RRS feed

  • Question

  • The replica becomes inconsistent for all data sources and CC jobs are failing with error id 40001, all REFS registry settings are in place and WMI memory per host configured to 1 GB. But still the issue persists. No disk backups are happening. Even recovery point creation jobs are failing with the same error ID 40001, I googled this error ID looks like no one faced this issue except me.

    Have raised the ticket with Microsoft, and upgraded the MABS V3 with UR1 package, the UR1 not helped us and they are still analyzing the logs. DPM / MABS V3 UR1 sucks !!!! Really Microsoft have to check on their product's reliability before releasing.

    Tuesday, June 30, 2020 6:35 PM

All replies

  • Hi,

    If MS ticket isn't progressing well and if you are interested in trying forum support, then please share the MSDPM logs and let's see what best we can do. Keep in mind that forum support will take time due to its nature of support


    Tuesday, June 30, 2020 8:11 PM
  • Hi Aayoosh,

    Sorry for delay reply. Can you share me link to share the DPM logs

    Thanks,

    Sunday, July 5, 2020 1:46 PM
  • You can upload it in your onedrive and share the link with me. Also, please share your MS ticket as well. 

    Upload MSDPM and DPMRA logs from DPM server and DPMRA logs from PS along with failed Job details. 

    Sunday, July 5, 2020 2:51 PM
  • Aayoosh,

    One drive link is https://1drv.ms/u/s!AmkasGNfxRS1apHJbsAzVf5cRik?e=purWac 

    MS ticket no - 120061921001344

    Let me know if you find anything.

    Update - If, I restart MABS server / DPM services, CC jobs will run without issues. during mid night exactly while the pruning job runs, then healthy replicas will become in consistent.

    Thanks,

    Asrar Hussain

    Sunday, July 5, 2020 5:24 PM
  • Hi Asrar,

    Had a quick look at the MSDPM logs and below is my analysis. I have shared the step

    CC job got triggered in below event: Task ID: 63342D2C-0D06-4F98-B9F3-6E9AEFA9012C

    15F8 128C 07/05 15:16:13.13227 DecideConsistencyCheckNeededBlock.cs(126) 63342D2C-D06- F98-B9F3-6E9AEFA9012C NORMAL Consistency Check (Needed) for datasource=54180ccc-e0a1- eed-ae5f-50d6c10643e4

    PIT creation call sent to initiate the backup process using REFS Clone mechanism:

    15F8 1114 07/05 15:16:13.49111 StorageManager.cs(976) 63342D2C-0D06-4F98-B9F3- E9AEFA9012C NORMAL ==>CreatePIT(RequestId: 63342d2c-0d06-4f98-b9f3-6e9aefa9012c ContainerId: 7b3e90a1-ff4d-49e6-924c-99b92abc5805 StorageId: 43fc5a7d-5321-410b-8fae- b6a840f85ba StorageType: ReFSVolume) 

    Respected VHD got mounted so that backup can start and write data into it.

    15F8 1114 07/05 15:17:30.925 11 VhdContainer.cs(402) 63342D2C-0D06-4F98-B9F3- E9AEFA9012C NORMAL MountRequest with RequestId: CreatePIT ContainerId: 7b3e90a1-ff4d- 9e6-924c-99b92abc5805 PitId: f5ea6535-370a-4192-8361-3ab9aa9a38f9

    VHDx was mounted successfully in the below logs. I dont see we had issues upto this mount

    15F8 1114 07/05 15:19:19.953 11 VhdContainer.cs(404) 63342D2C-0D06-4F98-B9F3- E9AEFA9012C NORMAL MountRequest completed for requestId: CreatePIT ContainerId:7b3e90a1-ff4d-49e6-924c-99b92abc5805 PitId: f5ea6535-370a-4192-8361-3ab9aa9a38f9

    Right after this I am seeing below error: This is where you are getting "An unexpected internal error occurred in the StorageManager. (ID 40001)" Error. The problem lies in the below code:

    partitioninstrumentation.cpp(138) WARNING Failed: Hr: = [0x80070005] : Encountered Failure: : lVal : DeviceIoControl( *m_phPartition, ( ((0x00000007) << 16) | ((( 0x0001 ) | ( 0x0002 )) << 14) | ((0x0013) << 2) | (0) ), &partitionInfo, static_cast<DWORD>(sizeof(SET_PARTITION_INFORMATION_EX)), 0, 0, &dwBytesReturned, 0 )

    partitioninstrumentation.cpp(87) WARNING Failed: Hr: = [0x80070005] : Encountered Failure: : lVal : SetPartitionInformation(partitioInfo)


    After the job failed, VHD unmount request was made which was also completed successfully. 

    15F8 1114 07/05 15:19:20.172 11 VhdContainer.cs(448) 63342D2C-0D06-4F98-B9F3-6E9AEFA9012C NORMAL UnmountRequest with RequestId: CreatePIT ContainerId: 7b3e90a1-ff4d-49e6-924c-99b92abc5805 PitId: f5ea6535-370a-4192-8361-3ab9aa9a38f9

    15F8 1114 07/05 15:19:34.697 11 VhdContainer.cs(450) 63342D2C-0D06-4F98-B9F3-6E9AEFA9012C NORMAL UnmountRequest completed for RequestId: CreatePIT ContainerId: 7b3e90a1-ff4d-49e6-924c-99b92abc5805 PitId: f5ea6535-370a-4192-8361-3ab9aa9a38f9

    Create PIT call now got returned as failure:

    15F8 1114 07/05 15:19:34.713 11 StorageManager.cs(976) 63342D2C-0D06-4F98-B9F3-6E9AEFA9012C NORMAL <--CreatePIT

    And CC job throws the SMInternalError as mentioned below:

    15F8 1758 07/05 15:19:34.713 27 ShadowCopyBlock.cs(271) WARNING Create SC failed with error code = SMInternalError


    Summary: So basically the issue is with the function: SET_PARTITION_INFORMATION_EX ->.  This code is used with the IOCTL IOCTL_DISK_SET_PARTITION_INFO_EX to set information for a specific partition (GPT/MBR). This is a subset of Disk Initialization process. Below is the pictorial diagram of it :






    It is trying to register/Set the correct Partition for the VHDx so that it can read the data/write data to it. Before reading the partition type, its not possible to read/write data to it. 

    Please help me provide some more data so that we can dig more into it.

    1. Eventviewer logs (Application and System) in evtx format.
    2. Enable Analytical Logs in the eventviewer Click View -> Show Analytical and Debug log and Enable it for VHDMP,VDRV, StorePort, StorDiag and Store. 
    3. Collect a ProcMon trace while running a CC job (while reproducing the issue)
    4. Which AV you are using in your DPM server? Have you added the exceptions for DPM? 
    5. Share FLTMC output as well.
    6. What is the hardware resources in DPM server? And how is the utilization at the time of issue? Do you experience any lag or performance issue. 
    7. If your MABs server is 2016 and running on hyper-v Server then please check if event ID 27268 is logged in the Hyper-V-VMSS logs with the description as “Virtual hard disk files must be uncompressed and unencrypted and must not be sparse”.

    8. Try to mount the VHDx manually using below PS cmdlets at the time of issue and see if it works this way. Once mounted, go to disk mgmt and see if it shows the disk partition type. You can Psexec to gain access of the VHD and try to browse.

    $ProtectionGroup = Get-DPMProtectionGroup
    $DataSource = Get-DPMDatasource -ProtectionGroup $ProtectionGroup[0]
    Mount-DPMRecoveryPoint -Datasource $DataSource[0]



    Sunday, July 5, 2020 6:46 PM
  • Thank you Aaayoosh for the detailed analysis and steps,

    What is the hardware resources in DPM server? And how is the utilization at the time of issue? Do you experience any lag or performance issue. 

    Its a physical server the below specs

    CPU - 8 Cores

    Memory - 64 GB

    SAN storage - 150 TB

    Storage pools used in DPM - E:\ with 70 TB, F:\ with 70 TB

    Server OS - Windows server 2019 (Updated from Server 2016 recently, issue started after upgrading the server)

    Recently updated MABS V3 with UR1

    Which AV you are using in your DPM server? Have you added the exceptions for DPM? 

    Windows defender, I have turned off threat protection currently, after seeing the access denied issues.

    OneDrive link for event logs and other logs - https://1drv.ms/u/s!AmkasGNfxRS1bQx8WsgJPo9Y5fs?e=Szln9L


    Monday, July 6, 2020 5:37 AM
  • Thanks for the logs, I would need a few more details:

    Did you collect the Procmon logs at the time of CC job failure? If yes, please share the time. It will help me going though the event logs and Procmon time.

    In the Procmon trace I see that it was collected from Windows Server 2016 (build 17763.2) with 16CPU cores and 64gb RAM. Confirm the OS for MABs server.

    Please make sure the Procmon logs were collected at the time of issue. 

    Monday, July 6, 2020 6:34 AM
  • Please find the current PROCMON logs captured during the issue 

    https://1drv.ms/u/s!AmkasGNfxRS1blFzR0nMt_BRTbE?e=h7EQKV

    Monday, July 6, 2020 5:45 PM
  • Hi Asrar,

    I had a look at the Procmon and below is my finding. I will try to keep it short and precise for better understanding.

    First observation, Seeing too many Random IOPS against the backup drive (Specially against F:\)

    During this time, all I am seeing is READ operation on the DPM Disk pool.

    It seems like, there were few online jobs running at around 9am system time. This was also causing extra IOPS on the disk.

    Next, it is making the same call as explained before i.e. IOCTL_DISK_SET_PARTITION_INFO_EX over and over again. That means it is probably erroring out while querying the partition for the disk.

    After few seconds, it was querying the Storage properties and other functions and it caught an error: That's an IRP call failed with 0x10 hex against the function called: IOCTL_Disk_Is_Writable.

    Error 0x10 converts to below message:

       

    C:\Temp>Err_6.4.5.exe 0x10 | findstr "winerror.h"
    # 18 matches found for "0x10"
    ERROR_CURRENT_DIRECTORY                                        winerror.h
    ERROR_CURRENT_DIRECTORY                                        winerror.h

    And in next event, it again fails against the same function but with different error: i.e. 0x84

       

    C:\Temp>Err_6.4.5.exe 0x84 | findstr "winerror.h"
    # 4 matches found for "0x84"
    ERROR_SEEK_ON_DEVICE                                           winerror.h
    ERROR_SEEK_ON_DEVICE                                           winerror.h

    Potential cause: Looking at the errors and behavior, I am suspecting a bottleneck, probably at the storage side or a bottleneck caused in OS which is resulting a storage error. 

    Please answer below questions:

    * How is your DPM storage connected? Is it Using ISCSI or FC? I am asking because I noticed iscsidcb.exe process running extensively and it is mostly used in data center bridging. Please explain your storage configuration and iscsidcb.exe service role in it.

    * Have your checked your storage controller drivers or firmware? Are they updated?

    * Do you have any hardware diagnostic or monitoring tool available which can show some diagnostic against your server storage?

    * Can you temporarily stop online backups ? I hope it is not interfering with the DPM pool at all.

    * Next time when the issue occurs, stop (Do not restart) the MSDPM and DPM service. And look at the diskmgmt.msc and see how many disks are mounted. Once done, start the dpm serives and use below cmdlets to mount a replica volume manually: Collect Procmon when mounting this manually and share if it fails.

    $ProtectionGroup = Get-DPMProtectionGroup
    $DataSource = Get-DPMDatasource -ProtectionGroup $ProtectionGroup[0]

    Mount-DPMRecoveryPoint -Datasource $DataSource[0]

    Share the data and we will see what is the next step of investigation we can do.  :)

    Monday, July 6, 2020 8:03 PM
  • Storage is connected using FC and not iSCSi

    Storage controllers are good and firmware updated on Feb 2020.


    Also one more thing, I am seeing the below error / Warning in event logs around the same time when this issue happening on MABS.

    Event logs

    Log Name:      Application
    Source:        Microsoft-Windows-PerfNet
    Date:          7/6/2020 3:54:36 AM
    Event ID:      2004
    Task Category: None
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      
    Description:
    Unable to open the Server service performance object. The first four bytes (DWORD) of the Data section contains the status code.
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Microsoft-Windows-PerfNet" Guid="{CAB2B8A5-49B9-4EEC-B1B0-FAC21DA05A3B}" EventSourceName="PerfNet" />
        <EventID Qualifiers="49152">2004</EventID>
        <Version>0</Version>
        <Level>2</Level>
        <Task>0</Task>
        <Opcode>0</Opcode>
        <Keywords>0x80000000000000</Keywords>
        <TimeCreated SystemTime="2020-07-06T10:54:36.749525300Z" />
        <EventRecordID>5769</EventRecordID>
        <Correlation />
        <Execution ProcessID="0" ThreadID="0" />
        <Channel>Application</Channel>
        <Computer></Computer>
        <Security />
      </System>
      <EventData>
        <Binary>220000C0</Binary>
      </EventData>
    </Event>

    Log Name:      Application
    Source:        Microsoft-Windows-Perflib
    Date:          7/6/2020 3:54:36 AM
    Event ID:      1008
    Task Category: None
    Level:         Warning
    Keywords:      
    User:          NETWORK SERVICE
    Computer:  
    Description:
    The Open procedure for service "BITS" in DLL "C:\Windows\System32\bitsperf.dll" failed with error code Access is denied.. Performance data for this service will not be available.
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="Microsoft-Windows-Perflib" Guid="{13b197bd-7cee-4b4e-8dd0-59314ce374ce}" />
        <EventID>1008</EventID>
        <Version>1</Version>
        <Level>3</Level>
        <Task>0</Task>
        <Opcode>0</Opcode>
        <Keywords>0x8000000000000000</Keywords>
        <TimeCreated SystemTime="2020-07-06T10:54:36.425397200Z" />
        <EventRecordID>5770</EventRecordID>
        <Correlation />
        <Execution ProcessID="7464" ThreadID="8000" />
        <Channel>Application</Channel>
        <Computer></Computer>
        <Security UserID="S-1-5-20" />
      </System>
      <EventData>
        <Data Name="Service">BITS</Data>
        <Data Name="Library">C:\Windows\System32\bitsperf.dll</Data>
        <Data Name="Win32Error">5</Data>
      </EventData>
    </Event>

    Monday, July 6, 2020 9:17 PM
  • Hi

    Not sure if this is relevant to the VHD problem which we have identified. But I guess our main lead is the below error after mounting the VHD for new RP/CC jobs:

    ERROR_SEEK_ON_DEVICE

    132 (0x84)

    The file pointer cannot be set on the specified device or file.

    For now, I would suggest you to disable the online backup for at least few days. As online backup is only component which could interfere with the DPM Replica VHDs. I just want to isolate that online backups are not interfering with it at all and also not causing write delay on the VHDx write operation.

    If issue reoccurs, we can try collecting VDS logman trace to understand the VDS part as well. Though I don't have the symbols to convert the ETL traces, but I guess there is a way. But first please disable the online backups completely and observer the DPM server.


    Tuesday, July 7, 2020 2:06 PM