Recovery after ReFS events 133 + 513 (apparent data loss on dual parity)

  • Question

  • Hi,
    I have a single-node Windows Server 2016 machine with a dual-parity storage space, on which a BitLocker-encrypted ReFS volume with file integrity enabled resides. Since its setup half a year ago, this ReFS volume has hosted a ~17TB vhdx file with archive data. This file has now suddenly been removed by ReFS! More precisely, I see the following two events in the system log:

    1. Microsoft-Windows-ReFS Event ID 133 (Error): The file system detected a checksum error and was not able to correct it. The name of the file or folder is "R:\Extended Data Archive@dParity.vhdx".
    2. Immediately followed by Microsoft-Windows-ReFS Event ID 513 (Warning): The file system detected a corruption on a file. The file has been removed from the file system namespace. The name of the file is "R:\Extended Data Archive@dParity.vhdx".


    I have the following questions:

    1. As 26TB are still marked used at the volume level (constant, not decreasing over time), but only 7TB of files are visible, I assume that ReFS has not yet deleted the missing vhdx file. How can I regain read access to the corrupt vhdx file for manual recovery of its internal file system?
    2. If I understand dual parity correctly, at least two physical disks must have failed simultaneously for this to happen. I do not see any useful events in the system log regarding this. How can I get any clues as to which of the physical disks in my array need to be replaced?
      (Their SMART health level is 100%. I plan to run extended SMART self-tests on each individual physical disk, but only after data recovery. Still, Windows or ReFS might have logged some clues somewhere about which physical disks were involved in this checksum error?)
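    For the record, here is the kind of PowerShell sketch I plan to use for pulling the per-disk counters that the storage stack tracks. This assumes the in-box Storage module's Get-PhysicalDisk and Get-StorageReliabilityCounter cmdlets; which counters are actually populated depends on the drives and on HBA pass-through support, so this is a starting point rather than a definitive diagnosis:

```powershell
# Sketch: dump per-disk error/wear counters tracked by the Windows storage stack.
# Counter availability varies by drive and controller; empty values prove nothing.
Get-PhysicalDisk | ForEach-Object {
    $c = $_ | Get-StorageReliabilityCounter
    [PSCustomObject]@{
        Disk                   = $_.FriendlyName
        Health                 = $_.HealthStatus
        ReadErrorsUncorrected  = $c.ReadErrorsUncorrected
        WriteErrorsUncorrected = $c.WriteErrorsUncorrected
        PowerOnHours           = $c.PowerOnHours
        Temperature            = $c.Temperature
    }
} | Format-Table -AutoSize
```

    A nonzero uncorrected-error count would at least point at a candidate disk; the absence of counters, of course, proves nothing.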

    Thanks.


    • Edited by 'Michael G.' Sunday, November 11, 2018 10:11 AM clarified text and formatting.
    Friday, November 9, 2018 11:27 AM

All replies

  • Meanwhile I found documentation at https://docs.microsoft.com/en-us/windows-server/storage/refs/refs-overview confirming that "ReFS removes the corrupt data from the namespace" when salvaging data. However, I have not yet found any documentation explaining how to regain read access to the corrupt vhdx file for manual data recovery via its internal file system. Microsoft, could you please give me some hints? What are ReFS "namespaces", and is there maybe a PowerShell API for controlling/browsing them? Thanks.

    Sunday, November 11, 2018 10:09 AM
  • Hi Michael,

    Thanks for posting in our forum.

    I did some research, but didn't find official documentation on how to read the damaged VHDX file again.

    We will continue to study this issue. If we have any updates or ideas on this issue, we will post them as soon as possible. Thank you for your understanding. If you come across further information in the meantime, you can post it in this thread; that will help us fully understand and analyze the issue.

    Thank you for your cooperation and patience.

    Best Regards,

    Daniel


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Monday, November 12, 2018 10:09 AM
  • Okay, thanks; I'll be waiting, then.


    For reference, here is the PowerShell code that created the dual-parity vDisk containing the missing vhdx half a year ago:

    #controllers = 1x LSI SAS2 9207-8i + 1x LSI SAS2 9217-8i (both in HBA mode with newest firmware v20.00.07.00)
    #$journalSSDs = 3 SSDs (2x Intel SSDSC2BB24 + 1x Intel SSDSC2BB48)
    #$HDDs = 10 HDDs (2x Toshiba HDWE160 6TB CMR + 2x Seagate ST8000DM005 8TB CMR + 6x Seagate ST8000AS0002 8TB SMR)

    $sPool = "BackupHost StoragePool"
    New-StoragePool `
        -StorageSubsystemFriendlyName "Windows Storage*" `
        -FriendlyName $sPool `
        -PhysicalDisks ($HDDs + $journalSSDs)
    Set-StoragePool -FriendlyName $sPool -IsPowerProtected $True #have UPS.

    $sVDiskFriendlyName = "vDisk(jdMirr.1c,dPar.(7+3)c.64GBWBC.16MBiL)).ReFS.4k"
    $WBC4mirrAccParity = 64GB
    $parityVDisk = Get-StoragePool $sPool | New-VirtualDisk -FriendlyName $sVDiskFriendlyName `
        -MediaType HDD `
        -FaultDomainAwareness PhysicalDisk `
        -ResiliencySettingName Parity -PhysicalDiskRedundancy 2 `
        -NumberOfColumns (7+3) `
        -Size 44035GB `
        -ProvisioningType Thin `
        -Interleave 16MB `
        -WriteCacheSize $WBC4mirrAccParity
    #<-Note: a larger write cache and interleave mean longer pauses for individual HDDs in sustained write phases => higher overall throughput, as individual SMR disks utilize these pauses for moving their data from CMR to shingled magnetic recording regions.

    $sVolumeLabel = "SA (WBC-acc dParity.10c)"
    $driveLetter = "R:"
    $parityReFSVol = New-Volume -FriendlyName $sVolumeLabel `
        -DiskUniqueId $parityVDisk.UniqueId `
        -FileSystem ReFS `
        -AccessPath $driveLetter

    Get-Item -Path $driveLetter\ | Set-FileIntegrity -Enable $true
    Get-Item -Path $driveLetter\ | Get-FileIntegrity
    Get-ChildItem -Path $driveLetter\ -Recurse | Set-FileIntegrity -Enable $true
    Get-ChildItem -Path $driveLetter\ -Recurse | Get-FileIntegrity | ft Enabled, Enforced, FileName


    This ReFS volume with enabled file integrity stored the now-missing VHDX containing our measurement data archive. The vhdx's inner file system was also ReFS, but there I left FileIntegrity at its default, i.e. disabled. This vhdx was a shared network drive and worked well for about half a year, until now... :/
    Notably, this storage pool contains another virtual disk (also with ReFS and enabled file integrity, but only single parity). Meanwhile, I backed up its contents (also several TBs) to another computer without any errors, although it resides on the same physical disks...

    Tuesday, November 13, 2018 10:02 AM
  • Hi Michael,

    Thanks for your reply!

    1. Generally speaking, ReFS detects corrupted data by checking metadata and file data against their checksums.

    Detected damage is then automatically repaired using the alternate copy of the data provided by the storage space.

    But when a volume is corrupted and no alternate copy of the corrupted data exists, ReFS removes the corrupted data from the namespace.

    I didn't find official documentation on how to restore such files in ReFS. This link is just for your reference; it mentions a tool called ReclaiMe File Recovery. If you want to try it, please back up your system and all your data first.

    https://social.technet.microsoft.com/Forums/windowsserver/en-US/7abf7f65-1f0f-4766-8894-ae56b85b3700/refs-volume-is-not-accessible-file-system-shows-raw?forum=winserver8gen

    2. From my personal point of view, I suggest you check the RAID card configuration. You can ask your storage vendor and RAID card vendor which physical disk needs to be replaced.

    3. I didn't find any documentation about the ReFS namespace; this link is a general explanation of namespaces, just for your reference:

    https://en.wikipedia.org/wiki/Namespace

    Please note: Since the web site is not hosted by Microsoft, the link may change without notice. Microsoft does not guarantee the accuracy of this information.

    In addition, I suggest you back up your data. When a volume is corrupted and no alternate copy of the corrupted data exists, you can then recover your data from the backup when such a deletion occurs.

    Thanks for your time, if you have any question, please feel free to let me know.

    Best Regards,

    Daniel



    Wednesday, November 14, 2018 8:50 AM
  • Hi Daniel,

    I appreciate your efforts, but, well, this is (or was) our backup server... I purposely chose a dual-parity storage space and ReFS for data reliability. I still can't believe that a checksum corruption in probably only a few blocks, or even a single block, of a ~17TB vhdx file can effectively delete it entirely, without any supported way for admins to regain read access! That would be totally counter-intuitive to the design purpose of ReFS!

    As for point (2): the two LSI 9207/9217-8i controllers were configured in HBA mode from the start, i.e. Windows always had, and still has, direct access to each individual physical drive. There is no hardware RAID layer in between; everything (including SMART and disk errors) is passed through to the OS.

    As for ReclaiMe: I have read the linked forum post about an ReFS drive gone RAW. My ReFS volume still mounts successfully; I just don't see the vhdx file in it any more. And there is a system log event explaining why (and that) ReFS did this on purpose.

    Thanks for the wiki link; I know namespaces in general from C# coding, but I fear that such general info is not very helpful here. I guess it just means an alternate "quarantine file table", but this is useless without a PowerShell command or similar to mount that alternate ReFS namespace of the volume.

    Can't you just search for the origin of your "Microsoft-Windows-ReFS Event ID 513 (Warning)" in your ReFS.sys source code? From there it should be quite easy to find out into which "namespace" ReFS moved the file, and which API can be used to move it back... Thanks!

    Best regards,
    Michael.

    Wednesday, November 14, 2018 10:29 AM
  • Hi Michael,

    Thanks for your reply!

    I have already asked a senior engineer; please wait for the reply.

    Best Regards,

    Daniel



    Thursday, November 15, 2018 9:55 AM
  • Hi Daniel, okay, great; thank you! I'll be waiting, then. Best, Michael.

    Thursday, November 15, 2018 12:10 PM
  • Hi Michael,

    Thanks for your reply!

    We are studying your case; honestly, it is a bit difficult for us.

    Please rest assured that I will contact you as soon as there is progress.

    Thanks for your understanding! If you have any concerns, please let me know.

    Best Regards,

    Daniel



    Friday, November 16, 2018 8:50 AM
  • Hi Michael,

    Thanks for waiting!

    I just got a reply from the other engineer: ReFS is designed to automatically correct corruption and recover from errors. So it now looks like we are not able to recover this vhdx file.

    ReFS introduces new features that can precisely detect corruptions and also fix those corruptions while remaining online, helping provide increased integrity and availability for your data:

    • Integrity-streams - ReFS uses checksums for metadata and optionally for file data, giving ReFS the ability to reliably detect corruptions.
    • Storage Spaces integration - When used in conjunction with a mirror or parity space, ReFS can automatically repair detected corruptions using the alternate copy of the data provided by Storage Spaces. Repair processes are both localized to the area of corruption and performed online, requiring no volume downtime.
    • Salvaging data - If a volume becomes corrupted and an alternate copy of the corrupted data doesn't exist, ReFS removes the corrupt data from the namespace. ReFS keeps the volume online while it handles most non-correctable corruptions, but there are rare cases that require ReFS to take the volume offline.
    • Proactive error correction - In addition to validating data before reads and writes, ReFS introduces a data integrity scanner, known as a scrubber. This scrubber periodically scans the volume, identifying latent corruptions and proactively triggering a repair of corrupt data.  

    Some other posts have discussed this; the URLs below are for your reference.

    https://social.technet.microsoft.com/forums/windowsserver/en-US/171a1808-157e-4ef9-b0dd-8a507ff6fcef/refs-corruption-when-filled-to-capacity

    https://social.technet.microsoft.com/Forums/windowsserver/en-US/12b55468-b556-46ab-96a5-86426a0c9531/recovery-of-corrupt-refs-drives?forum=winserver8gen

    Thanks again for your time. I hope this information can help you; if you have any questions, please let me know.

    Best Regards,

    Daniel



    Friday, November 16, 2018 9:36 AM
  • Hello Daniel,
    first of all, some news from me on the issue: in the meantime I executed extended S.M.A.R.T. self-tests on all HDDs and SSDs. It took ~16h per HDD, but it was possible to test them in parallel. Each of these full self-tests "completed without error".

    Let me summarize our progress so far:
    - The MS senior engineer sent you a snippet of the same documentation that I linked in my second post above. And he marked in red the same sentence about ReFS namespaces that I cited in my second post. :/
    - You conclude: "So it now looks like we are not able to recover this vhdx file."

    I see two possibilities now:
    a) You are correct. This would mean that MS actually allows ReFS to permanently delete a 17TB virtual hard disk without asking the admin, just because a few blocks, or even a single block, of this huge file suffered a corruption that ReFS was unable to fix automatically. I.e., ReFS deleted tons of files inside this vhdx that had no corruption whatsoever! This is hard to believe, as it defies the very design purpose of reliability and robustness. If this were really true, this alone would be a reason to switch back from ReFS to NTFS on all my servers and to recommend everyone in my uni department do likewise.

    b) I think ReFS still has a chance: the volume is 43TB, of which 26.6TB are marked used, but only 7TB of files are visible. So it seems ReFS still protects this file somewhere and just "removed it from the namespace", as the cited documentation suggests (i.e. hid it, rather than deleted it). Or do you have any idea how to reclaim that missing 17TB chunk of volume free space?

    Did the contacted senior engineer read this thread in full, and does he have access to the ReFS.sys source code, in order to do a quick search for the origin of the cited "Microsoft-Windows-ReFS Event ID 513 (Warning)" that *reported* the removal of the vhdx file "from the namespace" in the system log?

    The other two links above either suggest a space problem (I still have 16.3TB free on this volume) or, again, point to the third-party recovery tool ReclaiMe. For what it's worth, I have downloaded the trial version and started scanning the volume. After a while (at 0.56%), it terminated with "Critical error occurred. The program can not continue execution and will be closed." Besides, unhiding a file that was automatically hidden by Windows without asking me (see event ID 513 above) must be supported by Windows! At least give me read access to the file, with whatever corruption ReFS found.

    Sincerely, Michael.

    Saturday, November 17, 2018 5:19 PM
  • Hi Michael,

    Thanks for your reply!

    I have already forwarded your reply to my colleague. In fact, your question is really difficult for us; we need more time to study it.

    Based on your current situation, I suggest you submit a case request with Microsoft to get further support.

    Thanks for your understanding. If you have any questions, please feel free to let me know.

    Best Regards,

    Daniel



    Monday, November 19, 2018 9:12 AM
  • Hi Daniel and everyone,

    now I understand what MSFT CSG means, and that you probably cannot get any access to the ReFS.sys source code; sorry. I tried to file a support incident, as suggested, but got stuck when my personal MS account asked for the subscription/Software Assurance/Access ID. I don't have these for this uni/edu license, and I do not want to choose the "pay $499 for a single incident" option. I thought MS might have a self-interest in solving this "ReFS + corruption + large VHDX data loss" scenario, but well. Maybe some knowledgeable person will still come across this thread and answer it later, but I have to move on.

    I will soon need to change the storage setup to something more reliable. I have already copied all needed and still-visible data, and I am thinking along these lines:

    • Clean all disks, setup a new storage pool and use a mirror layout instead of dual parity. (Less usable capacity, but maybe more reliable?)
    • Most importantly, use NTFS again instead of ReFS for the storage spaces disk that will host the large VHDXs. (This will give me chkdsk back...)
    • Maybe I should again opt for ReFS *within* the VHDX, with file integrity enabled? I would lose any set-it-and-forget-it self-healing (which effectively deleted a 17TB VHDX last time, anyway), but I would still be able to detect corruptions within the VHDX, even at the much more granular file level.
    • If I do so using Set-FileIntegrity, the default setting for the "Enforced" flag is $true. Maybe it would be smart to set the Enforced flag to $false? In case of a corruption, this might prevent ReFS from deleting/hiding the file and just report the corruption in the system log.
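    In code, the non-enforcing variant I have in mind would presumably look like the sketch below. Note the assumption that the Set-FileIntegrity parameter is named -Enforce, while Get-FileIntegrity reports the resulting flag as Enforced; and whether ReFS then really only logs the corruption instead of hiding the file is exactly what I have not been able to verify:

```powershell
# Sketch: keep ReFS checksumming enabled, but ask ReFS not to block access on
# detected corruption. With -Enforce $false, a corruption should (untested!)
# surface as an I/O error and event-log entry rather than a hidden file.
Get-Item -Path R:\ | Set-FileIntegrity -Enable $true -Enforce $false
Get-ChildItem -Path R:\ -Recurse | Set-FileIntegrity -Enable $true -Enforce $false

# Verify: Enabled should be True, Enforced should be False.
Get-ChildItem -Path R:\ -Recurse | Get-FileIntegrity | ft Enabled, Enforced, FileName
```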

    Just for completeness: I have checked the ECC RAM with MemTestX86 in the meantime. No errors, as expected (I also use this machine for long scientific computations and have never had any crashes or failed asserts, despite long and high CPU & memory loads...). The only parts of the system that I could not stress-test/self-test are the LSI storage controllers. I don't see any errors from them, so I wonder if I really need to replace them. (Is a self-test utility for the controllers available?)

    Would the above storage spaces setup be overall more reliable and more recommended than my current "dual-parity + ReFS + file integrity + large VHDX" scenario? Does anyone have improvement suggestions?

    Thanks, Michael.

    Friday, November 23, 2018 1:21 PM
  • You allowed the file system to delete single corrupted files, which is what it did. It's obviously not a good idea on drives with archives or VHDX files. Within the virtual disk itself, yes, it is a good idea.
    Friday, November 23, 2018 1:33 PM
  • Hi Roland,

    >>Within the virtual disk itself, yes, it is a good idea.
    Today, I agree. Originally, I purposely enabled file integrity on the storage-spaces-backed volume to get corruption self-healing. Or is there any way to profit from the dual parity/mirror layout within the VHDX? Isn't ReFS + self-healing also used for Hyper-V storage (tons of VHDXs there...)?

    >>You allowed the file system to delete single corrupted files, which is what it did. It's obviously not a good idea on drives with archives or VHDX files.
    Did I? Two questions come to mind: (a) If ReFS really deleted that 17TB VHDX file, why didn't I get those 17TB back as free space, and how can I reclaim them? (b) Which documentation warns you that if ReFS cannot heal a corruption, it deletes the entire file, without any possibility of read access to the corrupted file for manual recovery? Isn't this a bit counter-intuitive for a resilient file system design?

    Michael.

    Friday, November 23, 2018 2:14 PM
  • Which documentation? The one you found and quoted: "ReFS removes the corrupt data from the namespace".

    You had intentionally activated the file integrity feature. By default, it is off.

    Friday, November 23, 2018 2:20 PM
  • Hey, you are right; it is of course weird that the file is entirely gone while the space is still allocated. I have never had to deal with that, and would not even know how to create intentional corruption in order to test all this.
    Friday, November 23, 2018 2:33 PM
  • I just re-read the documentation; here are the relevant citations:

    1) about integrity streams with self healing from https://docs.microsoft.com/en-us/windows-server/storage/refs/integrity-streams

    • If ReFS is mounted on a resilient mirror or parity space, ReFS will attempt to correct the corruption.
    • If the attempt is successful, ReFS will apply a corrective write to restore the integrity of the data, and it will return the valid data to the application. The application remains unaware of any corruptions.
    • If the attempt is unsuccessful, ReFS will return an error.

    2) about the Enforced flag (which is enabled by default if you use file integrity; I did not proactively set this flag to $true!) from https://docs.microsoft.com/en-us/powershell/module/storage/set-fileintegrity

    • Indicates whether to enable blocking access to a file if integrity streams indicate data corruption.

    So, it returned an error (in the system log) as intended, and it is just "blocking access", not deletion. This design (and also my original setup) seems absolutely reasonable to me; just the PowerShell/admin command for manually UNblocking access is missing (or at least undocumented)!

    Besides, what is the point of advertising ReFS's self-healing capability on storage spaces if it really removes your ability for manual recovery? Don't get me wrong: I want it to throw tons of errors and block access if it finds any corruption, but as a starting point for manual action, not as a "sorry, that was it" endpoint! (Even if it were a tiny docx file and not a huge vhdx.)

    Friday, November 23, 2018 2:38 PM
  • Hi,

    This is Daniel; I wish you all the best!

    Just to confirm the current status of the issue: was Roland's reply helpful to you?

    Best Regards,

    Daniel



    Thursday, November 29, 2018 6:19 AM
  • Hi. No, unfortunately the problem still persists. Let me summarize:

    • ReFS detected a corruption in a 17TB VHDX file on a dual-parity storage spaces disk, but was unable to fix that corruption automatically (ReFS error 133 in the system log, in line with the documentation at https://docs.microsoft.com/en-us/windows-server/storage/refs/integrity-streams).
    • File integrity had been enabled on this ReFS volume (via "Get-ChildItem -Path $driveLetter\ -Recurse | Set-FileIntegrity -Enable $true"). This also means that the "Enforced" flag is set, as $true is this flag's default. Consequently, ReFS blocked access to and hid this VHDX file (ReFS warning 513 in the system log, in line with the documentation at https://docs.microsoft.com/en-us/powershell/module/storage/set-fileintegrity).
    • Still missing: a command or documentation for manually unblocking/unhiding this file, i.e. getting read access to the corrupted file for admin data recovery. As this file is a huge VHDX, it is very likely that most of its inner files could be saved. The data is still there somewhere, as the 17TB are still allocated and have not returned to the volume's free space.

    Best regards, Michael.

    Thursday, November 29, 2018 9:57 AM
  • Hi,

    Thanks for your reply!

    I am sorry to hear that. Please understand that this issue is really difficult for us.

    Based on your current situation, I suggest you submit a case request with Microsoft to get further support.

    Thanks for your understanding. If you have any questions, please feel free to let me know.

    Best Regards,

    Daniel



    Friday, November 30, 2018 2:31 AM
  • This is a terrible situation. I'm in the same boat: I have a 3.2TB Veeam backup file on ReFS that somehow got corrupted and is inaccessible, yet the space is still in use. I am very limited in disk space, and ReFS seems to have no (documented) way to recover the file or make it available again. We do NOT use storage spaces. Given all the issues we have with MS solutions, for the coming years I'd never trust my data to an MS storage solution over a full-blown SAN, which is made for storing data. Still, ReFS has some nice features; for Veeam there is block cloning, which is really nice. However, it now turns out this file system is really not mature enough for production for us. It flags a file as corrupted, which is good, but then the file is never available again. Nor can you reclaim its space.

    Any update on this?

    [edit]

    I found that ReFSutil.exe has some options to recover/analyze/scrub files. However, I can find zero documentation on it. In fact, Google returns fewer than 100 hits on refsutil, Bing even fewer, none of which are useful at all.

    To free the space, I guess the 'triage' option can be used. However, I can't get it to work at all. It keeps telling me the file ID I provide is not valid. I get the file ID with 'fsutil file queryfileid <filename>'.
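    For illustration, the lookup step looks roughly like this (the path below is just a placeholder for my backup file, and I am deliberately not showing a triage invocation, since its exact syntax is what remains undocumented):

```powershell
# Sketch: query the file ID for a path on the ReFS volume (placeholder path);
# this is the ID that refsutil's triage option keeps rejecting as invalid.
fsutil file queryfileid D:\Backups\job1.vbk
```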

    That's the key issue with most MS technology today. They release something to the public with no technical documentation at all. There's a lot of information on how some technology will help you, how great it is, and how it's the next step in human welfare. Until you run into issues. (S2D corruption, anyone? ReFS 4k block-size terror, anyone?) If you are lucky, someone at MS writes a blog about it. Raising tickets has been mostly useless for us: of the roughly 20 tickets we've raised with Pro support, only two were resolved. I have never paid for any of them, meaning they were all accepted as bugs. But they have never been resolved.

    Oh, how I long for the days of VMS/VAX, which came with several kilograms of useful technical documentation.

    Merry Christmas everyone! I asked for stable MS products this year ;-)

    Sunday, December 23, 2018 9:11 AM
  • Hi Robert,

    sorry to hear that you suffer from a similar problem. I agree with you: ReFS should not have been released for production without proper technical documentation and tools covering the special cases (which are obviously coded and even have their own event IDs...).

    ReFSutil sounds promising, though. Maybe it even has an option to make a hidden corrupt file visible again for manual data recovery? I looked for the tool, but it does not seem to be included in WS2016 Datacenter (newest 2018-12 updates installed). I found it on the W10 1809 client, but simply copying it (and the associated ReFSutil.exe.mui file) to WS2016 did not work: I only get "Unable to format message for id 400027ab - 13d". So it seems to have some non-trivial dependencies... :/ (Do you know of any download source for a DISM/cab or similar setup package, maybe?)

    With respect to "It keeps telling me the fileID I provide is not valid": maybe it works if you supply the file ID of the containing parent folder, or of an ancestor folder, instead of the file directly?

    Wednesday, December 26, 2018 10:42 AM
  • Meanwhile, I have installed WS2019 via dual boot for testing the ReFSutil.exe that Robert Gijsen mentioned. Let me summarize:

    • the quick scan (refsutil salvage -QS R: C:\Temp\ReFSUtilWorkingDir -v) did not find the missing file that ReFS "removed from the namespace" after detecting a corruption
    • the quick scan including a search for deleted files (adding -m to the above command) found many files with the same name of older dates that were not salvageable
    • the multi-day full scan *without* looking for deleted files (refsutil salvage -FS R: C:\Temp\ReFSUtilWorkingDir -v) found exactly one match for the missing file:
    Identified File: \Extended Data Archive@dParity.vhdx
    Size (0xd3ee1400000 Bytes) Volume Signature: 0x97053a74 Physical LCN: 0x3e8fb1c = <0x7e43b1c, 0x7e43b1d, 0x7e43b1e, 0x7e43b1f> Index = 0x9
    Last-Modified: 11/08/2018 09:13:26 PM TableId: 0x600'0 VirtualClock: 0x29f83 TreeUpdateClock: 0x0
    • the salvage/copy command on this file (refsutil salvage -SL R: C:\Temp\ReFSUtilWorkingDir T: C:\Temp\ReFSUtilWorkingDir\foundfiles.97053A74.selected4salvage.txt -v) actually started restoring the first 29GB of the missing vhdx file! But then:
    Processing C:\Temp\ReFSUtilWorkingDir\foundfiles.97053A74.selected4salvage.txt
    547271 container table entry pages processed (3 invalid page(s)).
    1 container index table entry pages processed (0 invalid page(s)).
    Copying: \\?\T:\volume_97053a74\Extended Data Archive@dParity.vhdx...Warning: Cannot enumerate file extents for source file!
    Warning: A data integrity checksum error occurred. Data in the file stream is corrupt.
    Warning: Cannot copy data stream!
    Warning: A data integrity checksum error occurred. Data in the file stream is corrupt.
    Command Completed.
    Run time = 1280 seconds.
    Great, we already knew that. And there does not seem to be any option for "restore ignoring any corruptions for manual salvaging/recovery". :/
    Saturday, December 29, 2018 10:40 AM
  • I may have found the origin of the corruption after all: in an older system event log file, many "has been surprise removed" disk events (ID 157) were logged for physical disks of the storage pool containing the above dual-parity storage space, shortly after disk events like "The device, \Device\HarddiskX\DRYYY, is not ready for access yet" (ID 15). All events were logged within a few-second time window after resuming from suspend-to-RAM (triggered previously by a UPS). No hard error like inaccessible files surfaced at the time. I was able to reproduce these events by manually sending the server to sleep and then resuming it. I also found the hardware origin of these surprise-removal events after resume: the hard disks were not directly connected to the two LSI controllers as reported above; instead, one controller had a two-cable connection to an HPE 24-bay SAS expander card. As the controllers had enough ports for a direct connection of the current disk array, I have now removed this expander card and tested several short power outages (sleep/resume cycles): no more surprise-removal events.

    While this may explain the root cause of this corruption instance, the situation detailed above is still "not very advertising". To sum up: ReFS with enabled file integrity (and the unchanged default setting for the "Enforced" flag) currently automatically removes a file in which it has found a corruption that it could not heal automatically. (Given dual-parity redundancy, it should have been able to repair it.) From then on, this file lives in an inaccessible "namespace" according to the cited system log events, still occupying space, but there is no documented way to regain read access for manual recovery. This is especially problematic, and seems to defy the very design purpose of resiliency, for huge vhdx files, where most inner files could probably have been restored easily. (But even small docx files, for example, must not be made permanently inaccessible by a file system decision.)

    The newly added refsutil.exe seemed promising, but it does not yet contain any command for accessing the "corruption namespace" of an ReFS volume, and it is poorly documented. The lack of technically competent MS support in this TechNet forum speaks for itself: obviously the forum staff has the task of either citing the documentation or rerouting to paid support, but lacks the permission to internally escalate important problems to the right people with access to the actual ReFS source code. Given this situation as of January 2019, I sadly can no longer recommend ReFS for production use, not even in a non-commercial uni/edu environment.

    Friday, January 4, 2019 11:05 PM
  • I've been fiddling around with refsutil.exe for a while now, but came to the same conclusion as Michael. While I was able to partly 'recover' a file that was 'removed from the namespace', I still haven't found a way to actually free up the space it uses. I don't have S2D, just a stand-alone ReFS volume. Maybe with a proper mirrored or parity S2D setup the corruption COULD have been repaired; however, the more I search, the more people I find with corruption that's not repairable at all. That's bad, given that S2D is somewhat supposed to be a scale-out replacement for hardware RAID setups. RAID setups are there to recover from disk crashes, provide redundancy and resiliency, and correct flipped bits (if we have enough parity, that is).

    As said I have a stand-alone ReFS volume, which now has several terabytes of space in use by a file I can't delete or access at all. No documentation, no tools, no help from MS whatsoever.

    I've cut the cord though; we've ordered 120TB of storage to replace this 12TB ReFS volume. We'll go with NTFS for as long as it fits. And when it fills up, we'll use NTFS deduplication, which gives us much more control than ReFS block cloning (we are using this as a Veeam repository, for which S2D with parity is overkill cost-wise anyway). With dedupe we can, for example, dedupe only files older than x days, leaving the last backup chain fast because it's mostly sequential.

    No ReFS for us anymore, I've had enough of it for the forthcoming years :-)


    Tuesday, January 8, 2019 12:56 PM
  • So Michael, it's been a year.

    Did you ever get access to your data?

    Did you ever find a Microsoft command-line tool, a PowerShell cmdlet, or a 3rd-party tool to bring damaged files back online?

    Did you ever find a way to actually delete the lost 17 TB file to recover the space?

    Did Microsoft ever respond?

    Did ReFS ever become a resilient file system?


    Sunday, November 17, 2019 2:11 AM
    The first four answers are no, but I stopped searching for a tool this past January, as I killed this storage space after my reply/summary from January 2019 in this thread. With respect to your last question, I can only reply "I hope so" and tell you how I handle it at the moment (see my next post at the end of this thread).
    Sunday, November 17, 2019 1:12 PM
  • So Michael, it's been a year.

    (...)

    Did ReFS ever become a resilient file system?

    I can only reply "I hope so" and tell you how I handle it at the moment:

    • hardware level: on host level, I have 9 HDDs and 3 write-durable SSDs, all direct attached to HBAs (no SAS expander card any more, otherwise the same hardware as above)
    • storage spaces level: all these disks form a single-node storage space (the journal tier is a two-way mirror across the 3 SSDs with 1 column; the capacity tier is a dual parity layout across the 9 HDDs with 9 columns). I trust the redundancy level and disk-loss repair capabilities of storage spaces below the file system level, although the underlying Sp** driver has caused other issues: it has memory-leak problems, which were reported earlier this year in the other thread at spaceportsys-nonpagedpool-memoryleak. (Btw., MS did not seem eager to engage with / make good use of that community bug report either. So MS certainly does not yet seem to be the kind of "learning enterprise" Nadella wants to form, imho...)
    • file system on host: again ReFS, primarily because I wanted to test block-cloning savings together with Veeam Community backup repositories. BUT, importantly, I have FileIntegrity disabled on this volume, i.e. no more integrity streams tracking corruption in 17TB vhdx files... I assume that a similar corruption in a large vhdx file would now simply go undetected on the host level.
    • file system in VM: I have an archive/file server VM with a large simple volume (the new 17TB vhdx...). In this VM, I use ReFS with FileIntegrity streams enabled again, i.e. separately from storage spaces and its (maybe undertested) corruption self-repair logic, and I set the enforce flag to false for all files/folders on this volume. The idea/hope is that if some corruption makes its way through the dual-parity virtual disk on the host level into the VM, ReFS will at least detect it there on the granular small-files level and somehow warn me (probably just log entries). As the enforce flag is false, I hope it would no longer make affected files totally inaccessible.
    • verdict: so far, no more corruption has occurred, so I can't say for certain that this design will work in the error case. But the absence of any corruption, disk, or ReFS problem events since January is at least some good news, especially since I even enabled scheduled suspend-to-RAM and resume on this backup host (originally for reliability testing; now I leave it on for some power saving).
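    To make the FileIntegrity part of the setup above concrete: these are per-file settings controlled via the Storage module cmdlets. The paths below are placeholders for my volumes (V: stands in for the guest volume):

    ```powershell
    # Host volume R:: verify that integrity streams are off for the big vhdx,
    # so no host-level checksum can ever quarantine it again.
    Get-FileIntegrity -FileName 'R:\Extended Data Archive@dParity.vhdx'

    # Guest volume (placeholder V:): integrity on, enforcement off, so a
    # detected checksum error should only be logged instead of making the
    # affected file inaccessible.
    Get-Item 'V:\Archive\*' | Set-FileIntegrity -Enable $true -Enforce $false
    ```

    New files inherit the integrity setting of their parent directory, so setting it on the volume root / top-level folders early is what matters.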
    Sunday, November 17, 2019 1:20 PM
  • It seems we've been hit by a similar problem: after a power outage with automatic shutdown initiated by our UPS 3 days ago, 2 out of 6 VHDX files on our dual parity ReFS virtual disk are inaccessible.

    Storage Pool disks: 11x4TB HDD + 2x500GB SSD DAS

    Virtual Disk: Dual Parity with 9 Columns, 16GB SSD WBC

    Volume on Virtual Disk: ReFS
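    For comparison with the setups above: a dual parity virtual disk like this is typically created with the Storage cmdlets roughly as follows. The pool and disk names are hypothetical, and `-PhysicalDiskRedundancy 2` is what selects dual parity:

    ```powershell
    New-VirtualDisk -StoragePoolFriendlyName 'Pool1' -FriendlyName 'ArchiveVD' `
        -ResiliencySettingName Parity -PhysicalDiskRedundancy 2 `
        -NumberOfColumns 9 -WriteCacheSize 16GB -UseMaximumSize
    ```

    With 9 columns and dual parity, 2 of every 9 column writes are parity, and the disk should survive two simultaneous physical disk failures.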

    Apart from marketing slogans like "ReFS is so amazing" and overstrained forum moderators, there is not much docu about ReFS out there. Don't use ReFS in production unless you're personally connected to the guy at Microsoft who wrote ReFS.sys.

    Wednesday, March 25, 2020 3:01 PM
  • Sorry to hear that. I have been tracking this thread; please reply with any useful recovery info you come across while working on your problem.

    -Derek

     
    Tuesday, March 31, 2020 4:52 AM