Cluster Disk always wants to run CHKDSK

  • Question

  • Hello,

    Although I have already found articles on this subject, I am trying to dig deeper and find the root cause of a situation that started happening recently.

    We have a two node cluster (node and file share majority), Windows 2008 R2 Enterprise (64bit) on both nodes.  We also have SQL 2008 R2 Enterprise installed on the cluster as well.  The nodes have SAN storage attached to them, specifically:

    Cluster Disk 1 (R:)

    Cluster Disk 2 (Q:)

    Cluster Disk 3 (P:)

    Cluster Disk 4 (S:)

    All the disks are Simple, Basic, NTFS.   SQL Server has Cluster Disks 1, 2, and 3 as resources.

    My situation:  Every time the SQL Server service is moved to the other node (via the Failover Cluster UI) or if the node that has ownership of the SQL Server service is rebooted, two of the three disks that are resources for SQL go into an Online Pending state while CHKDSK is being run (see image below).  This happens every single time and in each case, no corruption is found.  Fortunately, this only lasts for about a minute, but I am concerned that there is something else out there that needs to be addressed.

    Like I said, this only started happening about 2 weeks ago, and only happens on 2 of the 3 disks that are resources for SQL.  We have performed several cluster rolls without incident prior to this.

    On the SQL side, there are only 9 databases on the instance.  I have run DBCC CHECKDB on all of them, and they have no issues.  We did have mirroring running on 3 of the databases, but today I removed that to see if it would make a difference (and it did not).

    Any ideas would be helpful.  Thanks in advance for your help.

    RT

     

    Wednesday, November 30, 2011 4:55 PM

Answers

  • New information:

    Yesterday, we changed permissions on the root folder of the two drives that we are having the issues with.  When we gave Full permissions to the "Local Users" group, the problem went away.  When we reduced that to Read and Execute, the problem did not resurface.

    The problem is, when we compare the perms on the two drives of this cluster to the same drives on another cluster (same hardware configuration, Windows and SQL installation), there was no difference initially.

    Again, any thoughts would be very helpful at this point.

     

    • Marked as answer by RoryTesta Monday, December 12, 2011 7:41 PM
    Tuesday, December 6, 2011 3:16 PM
  • We opened a ticket with Microsoft.  They took copies of our logs for analysis.   By the time they got back to us, we had stumbled upon the permission changes discussed above.   They added that the "NETWORK SERVICE" user needed to have Full permissions as well.
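
    For anyone wanting to script the permission change described above, here is a rough Python sketch that only builds and prints icacls command lines and does not run them. The drive letters R: and P: are the two affected volumes from this thread; the function name grant_full_control_command and the (OI)(CI) inheritance flags are assumptions added for illustration, since the thread only says that Full permissions were granted on the root folders.

```python
import subprocess

def grant_full_control_command(drive_root, account):
    """Build (but do not run) an icacls command granting Full control.

    The (OI)(CI) inheritance flags are an assumption; the thread only
    states that Full permissions were set on the drive's root folder.
    """
    return ["icacls", drive_root, "/grant", f"{account}:(OI)(CI)F"]

# R: and P: are the two volumes that kept triggering CHKDSK in this thread.
for root in ("R:\\", "P:\\"):
    command = grant_full_control_command(root, "NETWORK SERVICE")
    # Print the command line for review instead of executing it.
    print(subprocess.list2cmdline(command))
```

    Printing the commands first makes it easy to review them, and to compare the current ACLs with a plain "icacls R:\" beforehand, before changing anything on a production cluster.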

     

    • Marked as answer by RoryTesta Monday, December 12, 2011 7:41 PM
    Wednesday, December 7, 2011 7:42 PM

All replies

  • Hi,

     

    The chkdsk process may be re-running because it is most likely being interrupted before it completes.

     

    I would recommend you manually run chkdsk as per http://support.microsoft.com/kb/176970

     

    Best of luck!

    Martin

     


    If you find my information useful, please rate it. :-)
    • Proposed as answer by Martin G. Evans Wednesday, November 30, 2011 8:04 PM
    • Unproposed as answer by RoryTesta Wednesday, November 30, 2011 9:40 PM
    Wednesday, November 30, 2011 8:04 PM
  • If you have an open handle on the disks at the time of failover, this will leave the disks in a "dirty" state and will require a chkdsk to be run. This would indicate that some application outside of the cluster has a handle on the disk when you are failing over. Examples of this are monitoring agents or antivirus software.

    You might consider running a CHKNTFS command before running your failover to see if the volume is dirty before the failover is attempted. If it's not dirty, then you know for sure that some application has a handle on the disk when you are failing over. You'd then just need something like handle.exe from Sysinternals to help identify which process is causing the issue.
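
    A minimal sketch of that pre-failover check, in Python: it only parses text that chkntfs has already printed and reports which volumes are flagged dirty. The output wording it looks for ("R: is not dirty.") and the helper name parse_chkntfs are assumptions for illustration, so compare them against what your own servers actually print.

```python
import re

# Assumed chkntfs wording: "<drive>: is dirty." or "<drive>: is not dirty."
STATE_LINE = re.compile(r"^\s*([A-Za-z]):\s+is\s+(not\s+)?dirty", re.IGNORECASE)

def parse_chkntfs(output):
    """Map each drive letter found in the text to True (dirty) or False."""
    states = {}
    for line in output.splitlines():
        match = STATE_LINE.match(line)
        if match:
            # The volume is dirty when the optional "not" group is absent.
            states[match.group(1).upper()] = match.group(2) is None
    return states

# Sample text in the assumed format, not captured from a real server.
sample = "The type of the file system is NTFS.\nR: is not dirty.\n"
print(parse_chkntfs(sample))
```

    Running this against saved chkntfs output just before each failover gives a simple record of whether the dirty flag was already set beforehand.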

    Hope this helps.


    Visit my blog about multi-site clustering - http://msmvps.com/blogs/jtoner
    Wednesday, November 30, 2011 8:17 PM
    Moderator
  • Hi Martin,

    The article you proposed is for

    • Microsoft Windows Server 2003, Datacenter Edition for Itanium-Based Systems
    • Microsoft Windows Server 2003, Enterprise Edition for Itanium-based Systems
    • Microsoft Windows Server 2003, Datacenter Edition (32-bit x86)
    • Microsoft Windows Server 2003, Enterprise Edition (32-bit x86)
    • Microsoft Windows 2000 Advanced Server
    • Microsoft Windows 2000 Datacenter Server
    • Microsoft Windows NT Server 4.0 Enterprise Edition

    and I am running Windows 2008 R2 Enterprise.

    RT

     

    Wednesday, November 30, 2011 8:55 PM
  • Hi John,

    We suspected that, and we uninstalled anti-virus on both nodes as a precaution.  Unfortunately the situation remained, even after doing a full reboot of both nodes.

    We will try CHKNTFS prior to cluster failover and see if the volume is dirty.

    RT

     

    Wednesday, November 30, 2011 8:58 PM
    So I ran CHKNTFS on both drive R and P (the 2 volumes that want to run CHKDSK every time the cluster is failed over).  It said both drives were not dirty.  I ran handle.exe, and sent the output to a text file.  I searched through the file, and the only open handles on files other than databases or log files were these:

    P:\$Extend\$RmMetadata\$Txf System pid: 4 <unable to open process>
    P:\$Extend\$RmMetadata\$TxfLog\$TxfLog.blf System pid: 4 <unable to open process>
    P:\$Extend\$RmMetadata\$TxfLog\$TxfLogContainer00000000000000000001 System pid: 4 <unable to open process>
    P:\$Extend\$RmMetadata\$TxfLog\$TxfLogContainer00000000000000000002 System pid: 4 <unable to open process>
    P:\MSDTC\034730cb-9638-4720-aa82-4148f8edf20b\MSDTC.LOG msdtc.exe pid: 4280 NT AUTHORITY\NETWORK SERVICE
    R:\$Extend\$RmMetadata\$Txf System pid: 4 <unable to open process>
    R:\$Extend\$RmMetadata\$TxfLog\$TxfLog.blf System pid: 4 <unable to open process>
    R:\$Extend\$RmMetadata\$TxfLog\$TxfLogContainer00000000000000000001 System pid: 4 <unable to open process>
    R:\$Extend\$RmMetadata\$TxfLog\$TxfLogContainer00000000000000000002 System pid: 4 <unable to open process>
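
    The search through the text file can be sketched in Python as follows. It assumes one handle per line in the "path process pid: N owner" layout shown above (raw handle.exe output may be laid out differently), and it treats the .mdf, .ndf and .ldf extensions as the database and log files to skip. The function name and the sqlservr.exe sample line are hypothetical.

```python
import re
from collections import defaultdict

# Assumed layout: "<path> <process> pid: <number> <owner>"
LINE = re.compile(r"^(?P<path>.+?)\s+(?P<process>\S+)\s+pid:\s*(?P<pid>\d+)\s+(?P<owner>.*)$")

# SQL Server data and transaction log extensions to ignore (assumption).
DB_EXTENSIONS = (".mdf", ".ndf", ".ldf")

def non_database_handles(lines):
    """Group open file paths by (process, pid), skipping database files."""
    holders = defaultdict(list)
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # not a handle line in the assumed layout
        path = match.group("path")
        if path.lower().endswith(DB_EXTENSIONS):
            continue
        holders[(match.group("process"), int(match.group("pid")))].append(path)
    return dict(holders)

sample = [
    r"P:\$Extend\$RmMetadata\$Txf System pid: 4 <unable to open process>",
    r"P:\MSDTC\034730cb-9638-4720-aa82-4148f8edf20b\MSDTC.LOG msdtc.exe pid: 4280 NT AUTHORITY\NETWORK SERVICE",
    r"R:\Data\example.mdf sqlservr.exe pid: 1234 EXAMPLE\sqlsvc",  # hypothetical line
]
for holder, paths in non_database_handles(sample).items():
    print(holder, paths)
```

    Grouping by process makes it easier to see at a glance that only System (PID 4) and msdtc.exe hold non-database handles on these volumes.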

    Unfortunately, this does not help me determine why CHKDSK wants to run every time on these drives.

    Any other suggestions would be extremely helpful at this point.

    RT

     

    Wednesday, November 30, 2011 9:38 PM
  • Hi RoryTesta,

    I have had a similar (though not identical) issue, and I decided not to repair the disk but to rebuild it.

    I can give you some advice if you want to try this:

    - Stop the SQL services and stop the SQL cluster resources

    - Use robocopy /e /copyall to copy all the data from these two disks to an external destination (make two copies to be sure)

    - Remove the problematic disks from the cluster storage

    - Format them

    - Re-add them to the cluster

    - Copy the files back to their respective disks (again using robocopy /e /copyall)

    - Start the SQL Server resources and services


    Regards, Samir Farhat Infrastructure Consultant
    Wednesday, November 30, 2011 11:06 PM
  • What process is PID 4? You should be able to see it in Task Manager if you set the PID column. 

    Perhaps a silly question, but do the SQL resources actually depend upon the disks? Does the DTC resource depend on the disks? Double check your dependency tree

    Another thing you might try would be to take SQL resources offline, then manually move the cluster group and see if it still needs the chkdsk. If not, then there's something in SQL that isn't going offline properly. If it does, it just proves that some other application in the environment is causing this.

    Another test might be to disable the ChkDsk test by adjusting the "DiskRunChkDsk" value. Setting this value to 4 will change the behavior so it does not attempt to chkdsk the drive on failover. You can then try a chkntfs and see if it is reporting that the disk is dirty. If it's not dirty, then you might need to engage MSFT to see why cluster is thinking that the disk is dirty. To set this value, you could issue a command similar to:

    cluster res "Disk X" /priv DiskRunChkDsk=4
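
    If several disk resources need the same setting, the command above can be generated per resource. This Python sketch only builds and prints the command lines without running them. The resource names come from earlier in this thread, and the helper name skip_chkdsk_command is made up for illustration.

```python
import subprocess

def skip_chkdsk_command(resource_name, value=4):
    """Build the 'cluster res ... /priv DiskRunChkDsk=<value>' argument list."""
    return ["cluster", "res", resource_name, "/priv", f"DiskRunChkDsk={value}"]

# The two disk resources that ran CHKDSK on every failover in this thread.
for name in ("Cluster Disk 1", "Cluster Disk 3"):
    # list2cmdline quotes arguments containing spaces, as cmd.exe expects.
    print(subprocess.list2cmdline(skip_chkdsk_command(name)))
```

    Remember to set the value back once testing is finished, so that the cluster's normal dirty-volume check is restored.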


    Visit my blog about multi-site clustering - http://msmvps.com/blogs/jtoner
    Thursday, December 1, 2011 2:24 AM
    Moderator
  • More information I uncovered:

    Like I said, every time the SQL Server service is taken offline (either manually through the Failover Cluster UI or during an automatic failover event like a node being rebooted) CHKDSK runs on 2 of the 3 drives that are resources for the SQL Server service.  I also see in the System Log this sequence of messages:

    1. The Distributed Transaction Coordinator (034730cb-9638-4720-aa82-4148f8edf20b) service entered the stopped state.

    2.  The SQL Server Agent (MSSQLSERVER) service entered the stopped state.

    3. The SQL Server (MSSQLSERVER) service entered the stopped state.

    4. The time provider NtpClient is currently receiving valid time data from <servername>

    5. The Transaction (UOW=%1, Description='%3') was unable to be committed, and instead rolled back; this was due to an error message returned by CLFS while attempting to write a Prepare or Commit record for the Transaction.  The CLFS error returned was: %4.

    6.  The Transaction (UOW=%1, Description='%3') was unable to be committed, and instead rolled back; this was due to an error message returned by CLFS while attempting to write a Prepare or Commit record for the Transaction.  The CLFS error returned was: %4.

    7. The Transaction (UOW=%1, Description='%3') was unable to be committed, and instead rolled back; this was due to an error message returned by CLFS while attempting to write a Prepare or Commit record for the Transaction.  The CLFS error returned was: %4.

    8. The time provider NtpClient is currently receiving valid time data from <servername>

    9. Cluster disk resource 'Cluster Disk 1' indicates corruption for volume '\\?\Volume{6f70fa75-04b4-11e1-ab0e-0050569c0075}'. Chkdsk is being run to repair problems. The disk will be unavailable until Chkdsk completes. Chkdsk output will be logged to file 'C:\Windows\Cluster\Reports\ChkDsk_ResCluster Disk 1_Disk1Part1.log'.
     Chkdsk may also write information to the Application Event Log.

    10. Cluster disk resource 'Cluster Disk 3' indicates corruption for volume '\\?\Volume{6f70fa87-04b4-11e1-ab0e-0050569c0075}'. Chkdsk is being run to repair problems. The disk will be unavailable until Chkdsk completes. Chkdsk output will be logged to file 'C:\Windows\Cluster\Reports\ChkDsk_ResCluster Disk 3_Disk3Part1.log'.
     Chkdsk may also write information to the Application Event Log.

    11. The Transaction (UOW=%1, Description='%3') was unable to be committed, and instead rolled back; this was due to an error message returned by CLFS while attempting to write a Prepare or Commit record for the Transaction.  The CLFS error returned was: %4.

    12.  The Transaction (UOW=%1, Description='%3') was unable to be committed, and instead rolled back; this was due to an error message returned by CLFS while attempting to write a Prepare or Commit record for the Transaction.  The CLFS error returned was: %4.

    So we are seeing this "The Transaction (UOW=%1, Description='%3') was unable to be committed" message right before and right after the message about running CHKDSK every single time.  Unfortunately, I can't find much info out there about this message, but it looks like it may be contributing to why CHKDSK keeps running.
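
    To check that pattern across more than one failover, the ordering can be tested mechanically. The Python sketch below takes event messages in log order and reports, for each "Chkdsk is being run" event, whether a CLFS rollback message appears within a few entries before and after it. The message texts are abbreviated from the list above, and the function name and window size are assumptions.

```python
def chkdsk_events_with_clfs_neighbours(messages, window=3):
    """Return (index, clfs_before, clfs_after) for each Chkdsk event."""
    results = []
    for i, message in enumerate(messages):
        if "Chkdsk is being run" not in message:
            continue
        # Look at up to `window` entries on each side of the Chkdsk event.
        before = any("CLFS" in m for m in messages[max(0, i - window):i])
        after = any("CLFS" in m for m in messages[i + 1:i + 1 + window])
        results.append((i, before, after))
    return results

# Abbreviated versions of the twelve System Log entries listed above.
clfs = "The Transaction was unable to be committed; error returned by CLFS"
ntp = "The time provider NtpClient is currently receiving valid time data"
events = [
    "The Distributed Transaction Coordinator service entered the stopped state.",
    "The SQL Server Agent (MSSQLSERVER) service entered the stopped state.",
    "The SQL Server (MSSQLSERVER) service entered the stopped state.",
    ntp, clfs, clfs, clfs, ntp,
    "Cluster disk resource 'Cluster Disk 1' indicates corruption. Chkdsk is being run",
    "Cluster disk resource 'Cluster Disk 3' indicates corruption. Chkdsk is being run",
    clfs, clfs,
]
print(chkdsk_events_with_clfs_neighbours(events))
```

    With the sequence above, both Chkdsk events have CLFS rollback messages on either side, which matches what I see on every failover.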

    Any thoughts?

    RT

     

    Thursday, December 1, 2011 5:23 PM
  • Still more information:

    We stopped the SQL Server application through the Failover Clustering UI.  Then we ran Validate This Cluster.  One of the Storage tests FAILED, giving this failure message for all 3 of the drives that we set up as a resource for SQL:

    "Failed to validate file data on cluster disk 2 partition 1, failure reason: The disk structure is corrupted and unreadable."

    We are in the process of running CHKDSK on the disks to look for corruption.  I doubt that it will find anything however, since all of the instances where the cluster ran it found nothing.

    Any thoughts?

    RT

    Thursday, December 1, 2011 10:37 PM
  • Hi,

    Apologies for the earlier article!

    Could you check whether the dirty bit is set on each volume by running:

    fsutil dirty query x: (where x is the drive in question)

    Is the storage Fibre or ISCSI attached?

     

    Kind Regards,

    Martin

     



    If you find my information useful, please rate it. :-)
    Monday, December 5, 2011 3:18 AM
  • Hi Martin,

    The drives are Fibre Channel connected.  The SAN is NetApp, but unfortunately that is all I know.

     

    Monday, December 5, 2011 1:48 PM
  • Wow, this sounds a bit extreme.  And don't you think that this would fix the symptom, and not the problem?  I guess I am trying to locate just what caused this.  There are several articles that point to why Windows would want to launch CHKDSK, and when trying each of these manually, I can't see where the problem is.
    Monday, December 5, 2011 1:51 PM
  • Hi John,

    Unfortunately, PID 4 displays as "System", and does not seem to give me any info beyond that.

     

    Monday, December 5, 2011 1:52 PM