none
Replica becomes inconsistent - CC jobs failing with error "An unexpected internal error occurred in the StorageManager. (ID 40001)" RRS feed

  • Question

  • The replica becomes inconsistent for all data sources and CC jobs are failing with error id 40001, all REFS registry settings are in place and WMI memory per host configured to 1 GB. But still the issue persists. No disk backups are happening. Even recovery point creation jobs are failing with the same error ID 40001, I googled this error ID looks like no one faced this issue except me.

    Have raised the ticket with Microsoft, and upgraded the MABS V3 with UR1 package, the UR1 not helped us and they are still analyzing the logs. DPM / MABS V3 UR1 sucks !!!! Really Microsoft have to check on their product's reliability before releasing.

    Tuesday, June 30, 2020 6:35 PM

All replies

  • Hi,

    If MS ticket isn't progressing well and if you are interested in trying forum support, then please share the MSDPM logs and let's see what best we can do. Keep in mind that forum support will take time due to its nature of support


    Tuesday, June 30, 2020 8:11 PM
  • Hi Aayoosh,

    Sorry for delay reply. Can you share me link to share the DPM logs

    Thanks,

    19 hours 10 minutes ago
  • You can upload it in your onedrive and share the link with me. Also, please share your MS ticket as well. 

    Upload MSDPM and DPMRA logs from DPM server and DPMRA logs from PS along with failed Job details. 

    18 hours 5 minutes ago
  • Aayoosh,

    One drive link is https://1drv.ms/u/s!AmkasGNfxRS1apHJbsAzVf5cRik?e=purWac 

    MS ticket no - 120061921001344

    Let me know if you find anything.

    Update - If, I restart MABS server / DPM services, CC jobs will run without issues. during mid night exactly while the pruning job runs, then healthy replicas will become in consistent.

    Thanks,

    Asrar Hussain

    15 hours 32 minutes ago
  • Hi Asrar,

    Had a quick look at the MSDPM logs and below is my analysis. I have shared the step

    CC job got triggered in below event: Task ID: 63342D2C-0D06-4F98-B9F3-6E9AEFA9012C

    15F8 128C 07/05 15:16:13.13227 DecideConsistencyCheckNeededBlock.cs(126) 63342D2C-D06- F98-B9F3-6E9AEFA9012C NORMAL Consistency Check (Needed) for datasource=54180ccc-e0a1- eed-ae5f-50d6c10643e4

    PIT creation call sent to initiate the backup process using REFS Clone mechanism:

    15F8 1114 07/05 15:16:13.49111 StorageManager.cs(976) 63342D2C-0D06-4F98-B9F3- E9AEFA9012C NORMAL ==>CreatePIT(RequestId: 63342d2c-0d06-4f98-b9f3-6e9aefa9012c ContainerId: 7b3e90a1-ff4d-49e6-924c-99b92abc5805 StorageId: 43fc5a7d-5321-410b-8fae- b6a840f85ba StorageType: ReFSVolume) 

    Respected VHD got mounted so that backup can start and write data into it.

    15F8 1114 07/05 15:17:30.925 11 VhdContainer.cs(402) 63342D2C-0D06-4F98-B9F3- E9AEFA9012C NORMAL MountRequest with RequestId: CreatePIT ContainerId: 7b3e90a1-ff4d- 9e6-924c-99b92abc5805 PitId: f5ea6535-370a-4192-8361-3ab9aa9a38f9

    VHDx was mounted successfully in the below logs. I dont see we had issues upto this mount

    15F8 1114 07/05 15:19:19.953 11 VhdContainer.cs(404) 63342D2C-0D06-4F98-B9F3- E9AEFA9012C NORMAL MountRequest completed for requestId: CreatePIT ContainerId:7b3e90a1-ff4d-49e6-924c-99b92abc5805 PitId: f5ea6535-370a-4192-8361-3ab9aa9a38f9

    Right after this I am seeing below error: This is where you are getting "An unexpected internal error occurred in the StorageManager. (ID 40001)" Error. The problem lies in the below code:

    partitioninstrumentation.cpp(138) WARNING Failed: Hr: = [0x80070005] : Encountered Failure: : lVal : DeviceIoControl( *m_phPartition, ( ((0x00000007) << 16) | ((( 0x0001 ) | ( 0x0002 )) << 14) | ((0x0013) << 2) | (0) ), &partitionInfo, static_cast<DWORD>(sizeof(SET_PARTITION_INFORMATION_EX)), 0, 0, &dwBytesReturned, 0 )

    partitioninstrumentation.cpp(87) WARNING Failed: Hr: = [0x80070005] : Encountered Failure: : lVal : SetPartitionInformation(partitioInfo)


    After the job failed, VHD unmount request was made which was also completed successfully. 

    15F8 1114 07/05 15:19:20.172 11 VhdContainer.cs(448) 63342D2C-0D06-4F98-B9F3-6E9AEFA9012C NORMAL UnmountRequest with RequestId: CreatePIT ContainerId: 7b3e90a1-ff4d-49e6-924c-99b92abc5805 PitId: f5ea6535-370a-4192-8361-3ab9aa9a38f9

    15F8 1114 07/05 15:19:34.697 11 VhdContainer.cs(450) 63342D2C-0D06-4F98-B9F3-6E9AEFA9012C NORMAL UnmountRequest completed for RequestId: CreatePIT ContainerId: 7b3e90a1-ff4d-49e6-924c-99b92abc5805 PitId: f5ea6535-370a-4192-8361-3ab9aa9a38f9

    Create PIT call now got returned as failure:

    15F8 1114 07/05 15:19:34.713 11 StorageManager.cs(976) 63342D2C-0D06-4F98-B9F3-6E9AEFA9012C NORMAL <--CreatePIT

    And CC job throws the SMInternalError as mentioned below:

    15F8 1758 07/05 15:19:34.713 27 ShadowCopyBlock.cs(271) WARNING Create SC failed with error code = SMInternalError


    Summary: So basically the issue is with the function: SET_PARTITION_INFORMATION_EX ->.  This code is used with the IOCTL IOCTL_DISK_SET_PARTITION_INFO_EX to set information for a specific partition (GPT/MBR). This is a subset of Disk Initialization process. Below is the pictorial diagram of it :






    It is trying to register/Set the correct Partition for the VHDx so that it can read the data/write data to it. Before reading the partition type, its not possible to read/write data to it. 

    Please help me provide some more data so that we can dig more into it.

    1. Eventviewer logs (Application and System) in evtx format.
    2. Enable Analytical Logs in the eventviewer Click View -> Show Analytical and Debug log and Enable it for VHDMP,VDRV, StorePort, StorDiag and Store. 
    3. Collect a ProcMon trace while running a CC job (while reproducing the issue)
    4. Which AV you are using in your DPM server? Have you added the exceptions for DPM? 
    5. Share FLTMC output as well.
    6. What is the hardware resources in DPM server? And how is the utilization at the time of issue? Do you experience any lag or performance issue. 
    7. If your MABs server is 2016 and running on hyper-v Server then please check if event ID 27268 is logged in the Hyper-V-VMSS logs with the description as “Virtual hard disk files must be uncompressed and unencrypted and must not be sparse”.

    8. Try to mount the VHDx manually using below PS cmdlets at the time of issue and see if it works this way. Once mounted, go to disk mgmt and see if it shows the disk partition type. You can Psexec to gain access of the VHD and try to browse.

    $ProtectionGroup = Get-DPMProtectionGroup
    $DataSource = Get-DPMDatasource -ProtectionGroup $ProtectionGroup[0]
    Mount-DPMRecoveryPoint -Datasource $DataSource[0]



    14 hours 9 minutes ago
  • Thank you Aaayoosh for the detailed analysis and steps,

    What is the hardware resources in DPM server? And how is the utilization at the time of issue? Do you experience any lag or performance issue. 

    Its a physical server the below specs

    CPU - 8 Cores

    Memory - 64 GB

    SAN storage - 150 TB

    Storage pools used in DPM - E:\ with 70 TB, F:\ with 70 TB

    Server OS - Windows server 2019 (Updated from Server 2016 recently, issue started after upgrading the server)

    Recently updated MABS V3 with UR1

    Which AV you are using in your DPM server? Have you added the exceptions for DPM? 

    Windows defender, I have turned off threat protection currently, after seeing the access denied issues.

    OneDrive link for event logs and other logs - https://1drv.ms/u/s!AmkasGNfxRS1bQx8WsgJPo9Y5fs?e=Szln9L


    3 hours 19 minutes ago
  • Thanks for the logs, I would need a few more details:

    Did you collect the Procmon logs at the time of CC job failure? If yes, please share the time. It will help me going though the event logs and Procmon time.

    In the Procmon trace I see that it was collected from Windows Server 2016 (build 17763.2) with 16CPU cores and 64gb RAM. Confirm the OS for MABs server.

    Please make sure the Procmon logs were collected at the time of issue. 

    2 hours 21 minutes ago