Quorum disk NTFS errors

  • Question

  • Hi,

    Just the other day I added a second node to a Hyper-V cluster with a quorum disk.  The cluster appears to be working fine: no errors in Failover Cluster Manager, I can live migrate VMs just fine, and all CSVs are accessible and online.  However, on the new node I'm seeing floods of event ID 55 and 57 errors for just the quorum disk; the CSVs are fine.  The new node is NOT the owner of the quorum disk.  The errors state the following:

    - The file system structure on the disk is corrupt and unusable. Please run the chkdsk utility on the volume

    - The system failed to flush data to the transaction log. Corruption may occur.

    It seems to me that since the other node has a lock on the quorum disk LUN, the second one can't access it properly.  In Disk Management on the second node the disk shows as RAW; on the first node it shows properly as NTFS.  In Failover Cluster Manager the quorum disk shows as online with no problems.  What am I doing wrong?  Should I take the quorum disk offline on the second node so it's not trying to access it?  Do I need to make some change on the SAN?  The failover validation checks came back fine before I added the second node to the cluster.
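
    In case it helps, this is roughly how I've been pulling those events on the new node (just a quick wevtutil sketch that filters the System log on event IDs 55 and 57; adjust the count to taste):

    C:\>wevtutil qe System /q:"*[System[(EventID=55 or EventID=57)]]" /c:20 /rd:true /f:text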

    Thanks,

    Todd

    Monday, February 27, 2012 3:54 PM

Answers

    Something on that node "saw" the disk and is likely trying to access the disk. The host likely has some data in cache that it wants to flush down to disk. I'm guessing that you brought the disks online on that second node while the cluster was up and running on the first node...which you should not have done. Assuming this was the case, some application on the second node "thought" that the disk was accessible and has something in cache that it wants to write to the disk. It was also trying to read the disk, but since there was a SCSI reserve on the disk, it could not read it properly. Well, that's my theory anyway.

    I'd suggest rebooting the second node. If the errors still occur after a reboot, check your HBA drivers to ensure that they are up to date.


    Visit my blog about multi-site clustering

    • Marked as answer by ToddNel1561 Monday, February 27, 2012 11:14 PM
    Monday, February 27, 2012 10:51 PM
    Moderator

All replies

  • Couple of thoughts:

    Windows clustering uses a shared-nothing model, so only the active owner will bring a disk online and write to it.  It's normal to see the disks online on the active owner while the passive node shows them as "reserved", etc.  I have seen cases where this status is not entirely accurate on the passive node because the Explorer shell sometimes caches it.  Some of your observations are by design and are OK.

    When you presented the new quorum disk, did you attempt to bring the disk online and work with it on both nodes before adding it as a disk resource in cluster administrator?  That would cause file corruption.

    You can use the chkntfs command to check volumes and see whether the dirty flag is set.

    For example:

    C:\>chkntfs C:
    The type of the file system is NTFS.
    C: is not dirty.

    Which disk is reporting dirty?  When you fail over the disk in question, or take it offline/online, Windows will attempt to repair the disk by running chkdsk and will write the log to the Application event log.
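
    If the witness volume has a drive letter mounted on the owning node, fsutil can read the same dirty bit.  A quick sketch (Q: here is just a placeholder for your witness volume's letter):

    C:\>fsutil dirty query Q:
    Volume - Q: is NOT Dirty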

    You can move the cluster group containing the quorum\witness disk with this command, run from the owner of that resource:

    cluster . group "cluster group" /move

    If you don't know who owns the resource:

    cluster . res
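
    If you want to see all of the groups and their current owners at a glance, cluster.exe will also list them.  A sketch of what that looks like (node names here are made up):

    C:\>cluster . group
    Listing status for all available resource groups:

    Group                Node            Status
    -------------------- --------------- ------
    Cluster Group        NODE1           Online
    Available Storage    NODE1           Online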


    Dave Guenthner [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights. http://blogs.technet.com/b/davguents_blog

    Monday, February 27, 2012 5:31 PM
  • Running the chkntfs command on the node that owns the quorum disk says it's not dirty.  Running it on the second node that's having the issues says the type is NTFS but it "cannot query the state of the drive".  Both nodes show the disk as "reserved" in windows disk manager.

    This cluster was originally created with just one node in it because I didn't have the second one available to add from the beginning.  I created the quorum disk using the original node, formatted it as NTFS, and added it as a disk resource in failover clustering.  A few weeks later I added the second node when it became available.  So I don't think I introduced any corruption in that process.  And even so, the utility you referenced didn't seem to find any issue.

    So I guess my question is, is this normal?  It doesn't seem like I should be seeing all these errors in the logs, so what do I need to change?

    Thanks,

    Todd

    Monday, February 27, 2012 5:49 PM
    I had meant to run the chkntfs command from the server which owns the disk resources in question. Since the other node does not own the disk resources, access to the disk will be blocked, which is by design.

    Do you have an indication of which disk is raising the alerts? If you look in the Application event log, do you see that chkdsk ran at some point in time?

    Are the errors still being raised in a continuous manner, or are they from a past event where chkdsk remediated corruption or suspected corruption?
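
    One way to check is to query the Application log for chkdsk/autochk results.  A rough sketch (the Chkdsk and Microsoft-Windows-Wininit providers are where I would expect those entries to land):

    C:\>wevtutil qe Application /q:"*[System[Provider[@Name='Chkdsk' or @Name='Microsoft-Windows-Wininit']]]" /c:10 /rd:true /f:text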


    Dave Guenthner [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights. http://blogs.technet.com/b/davguents_blog

    Monday, February 27, 2012 6:43 PM
  • It's definitely the quorum disk, as per my original post.  Event ID 55 states which disk is raising the error.  I see no indication a chkdsk ran, on either node.  The last error was raised over an hour ago.  They seem to happen in rashes where I'll see 100 of them in a 10 minute period then nothing for several hours to a day.  But it definitely hasn't cured itself.

    I'm having a hard time wrapping my head around this.  My understanding is that only one server can reliably access a LUN at any given time, unless you have a technology such as CSV sitting on top to coordinate access.  Otherwise you run the risk of corruption.  But since the quorum drive doesn't use CSV, how is it functioning without corruption?  Is there some MS magic going on behind the scenes to only allow one node to access it at a time?  To me what seems to be happening is what I'd expect: one node has a lock on the quorum drive, and a second node tries to access it but it's locked.  It sees a drive there but can't read any partition information, so it assumes the disk is corrupt and damaged, hence the errors.  So it would seem the errors are erroneous and I could ignore them; I just have a hard time believing MS would allow the logs to flood with erroneous messages.

    Todd

    Monday, February 27, 2012 7:08 PM
    Yes, there is magic happening in the background.  In Windows 2003 the cluster disk driver, clusdisk.sys, was responsible for protecting the disks.  You could view it under hidden devices in Device Manager.  In Windows 2008 and higher, clusdisk.sys communicates with partmgr.sys.  Again, this is by design; there's a pretty good screen shot here:

    Introducing Windows Server 2008 Failover Clustering
    http://technet.microsoft.com/en-us/magazine/2008.07.failover.aspx?pr=blog
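
    If you want to confirm the cluster disk driver is present and running on a node, sc can query it.  A quick check (output trimmed to the interesting lines):

    C:\>sc query clusdisk

    SERVICE_NAME: clusdisk
            TYPE               : 1  KERNEL_DRIVER
            STATE              : 4  RUNNING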

    If the NTFS errors continue after you have verified that chkdsk ran on the volume and the disk has been failed over between nodes, that is not expected.


    Dave Guenthner [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights. http://blogs.technet.com/b/davguents_blog

    Monday, February 27, 2012 7:28 PM
  • The way I see it nothing is corrupt, the server that "owns" the disk in question says everything is fine with the disk and reports none of these errors in its logs.  The server that doesn't own the disk and is therefore not in a position to say it's corrupt is complaining.  Why is the second server (the one that doesn't own the disk) even trying to access the disk in question if the "magic" is telling it not to?

    Where should I go from here?  Should I run a chkdsk via the node that owns the disk even though it says all is fine?  Is there any harm in doing that on a running cluster?

    Thanks,

    Todd

    Monday, February 27, 2012 7:38 PM
    Have you moved the cluster group which contains the disk over to the node that is raising the events? See if the alerts cease once you do that.  If the dirty bit is set on a disk, chkntfs will report it as dirty.  So when the disk is failed over, Windows will poll this attribute; if it's found to be dirty, it will run chkdsk for you.  If the disk is found clean, it will simply be brought online.  Moving the cluster group containing the quorum disk will automate that task.  I do not have an immediate answer on why the alert is being raised.  I am trying to help filter what is normal and expected versus what is not.

    C:\>cluster . group "cluster group" /move


    Dave Guenthner [MSFT] This posting is provided "AS IS" with no warranties, and confers no rights. http://blogs.technet.com/b/davguents_blog

    Monday, February 27, 2012 8:03 PM
    Something on that node "saw" the disk and is likely trying to access the disk. The host likely has some data in cache that it wants to flush down to disk. I'm guessing that you brought the disks online on that second node while the cluster was up and running on the first node...which you should not have done. Assuming this was the case, some application on the second node "thought" that the disk was accessible and has something in cache that it wants to write to the disk. It was also trying to read the disk, but since there was a SCSI reserve on the disk, it could not read it properly. Well, that's my theory anyway.

    I'd suggest rebooting the second node. If the errors still occur after a reboot, check your HBA drivers to ensure that they are up to date.


    Visit my blog about multi-site clustering

    • Marked as answer by ToddNel1561 Monday, February 27, 2012 11:14 PM
    Monday, February 27, 2012 10:51 PM
    Moderator
    Thanks John, I agree and what you're saying makes sense.  I just rebooted the second node and so far no errors, even when I fire up the Disk Management MMC, which previously always set off another flurry of 55 and 57 errors.  I also now see that the quorum disk on the second node shows as offline in the Disk Management MMC, which is what I would have expected.  I see the error of my ways now, but would the cluster have passed validation if the second/new node that I was trying to add could not access any of the shared disks at the time of addition?  My thinking was that the failover clustering system would need to confirm that all the necessary LUNs are accessible by the new node before it would let it join.


    Thanks again for your succinct and accurate diagnosis... Reboot! ha

    And thanks to Dave as well for your insight.

    Todd

    Monday, February 27, 2012 11:16 PM
  • Did you ever find a resolution to this?  I'm running into the very same issue today.

    Thanks,

    Rob

    Tuesday, May 1, 2012 12:26 AM
    Yeah, just read my last post and the one marked as answer by John Toner.  Basically, my mistake was that I brought the quorum disk online on my new, second node while the cluster was active on the first node.  That let the second node try to access the disk directly, but it had problems since the first node had a lock on it.  The fix was simply to reboot the second node.  When it came back up it was part of the cluster and wasn't trying to access the disk directly; it went through the cluster service when necessary.

    Hope this helps,
    Todd
    Tuesday, May 1, 2012 12:10 PM