locked
SQL 2005 Cluster Issue RRS feed

  • Question

  • Hi,
    We have a Active/Passive cluster set up with 7 instances of  SQL 2005 SP4 Enterprise(64 bit).
    Daily at 2 AM the SQL Cluster group is restarting on same node or failing over to the otehr node with errors starting with:

    SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file  in database.  The OS file handle is 0x0000000000000B98.  The offset of the latest long I/O is: 0x000012b0dfe000

    /*there are many occurences of the above message on many instances and particularly msdb database*/

    Event Id 4156 MSDTC - String message: ProcId = 0x2188 CSO: Maintain session; Received E_CM_SERVER_NOT_READY. 

    Event ID 19019
    [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

    [sqsrvres] printODBCError: sqlstate = 08S01; native error = 79; message = [Microsoft][SQL Native Client]TCP Provider: The semaphore timeout period has expired.

    [sqsrvres] printODBCError: sqlstate = 08S01; native error = 79; message = [Microsoft][SQL Native Client]Communication link failure

    [sqsrvres] OnlineThread: QP is not online.

    [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

    [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure

    Above errors are logged in the system logs and all the sql cluster groups are either restarting or failing over to other node every day from past one week  around 2AM.

    Backup jobs are scheduled to run at 2 AM on 4 instances same time every day.Could you please tell me if this is related to the backup jobs.We have different drives(for every instance and also different data/log/tempdb drives).
    If this is not due to these backup jobs,could you please tell me how i can troubleshoot on this.I am suspecting a memory/io bottleneck but why is it failing now only which is running from from past few years.Please tell me which performance counters i can collect to resolve this issue.
    Thursday, July 25, 2013 2:30 PM

Answers

  • Even if the sql server crashes and failover to other node,i have seen that everything will be logged to error log.There are no messages in sql errorlog why it is shutting down(due to a stop request or user) just the last message logged saying backup completed for database and new sql errorlog generates.

    Windows - Delayed Write Failed : Windows was unable to save all the data for the file D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG. The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere. 

    Can this happen without any messages in sql server errorlog.

    Shanu,

    I am almost 100 % sure this is SAN or SAN network issue.I faced same error message previous year and to be double sure locked case with MS and Storage to find out issue and my SQl server was in cluster..MS vindicated my point by pointing out that this is storage issue.My cluster and sql services were fluctuating also and i was getting same error message.

    So now you have to ask ur boss to agree to you or u can open case to MS and storage and finally prove your point.


    Please mark this reply as the answer or vote as helpful, as appropriate, to make it useful for other readers

    • Proposed as answer by Fanny Liu Thursday, August 1, 2013 9:23 AM
    • Marked as answer by Fanny Liu Monday, August 5, 2013 12:24 PM
    Friday, July 26, 2013 1:56 PM

All replies

  • Hi,

    Please, check your storage health ASAP (not just a CHKDSK, run a hardware disk check as well).

    Last time I saw these errors, they were due to failing disks on the SAN.


    Sebastian Sajaroff Senior DBA Pharmacies Jean Coutu

    Thursday, July 25, 2013 2:33 PM
  • In addition to what Sebastian has already mentioned, try changing the schedule of your backup jobs and possibly break them up in smaller chunks. This is to isolate the burst of IO on your storage during those backup schedules that may be causing these issues. If all 7 instances run their backup jobs at 2AM all at the same time, the storage gets saturated immediately. Of course, this is not a substitute for really identifying and correcting the issue with your storage

    Edwin Sarmiento SQL Server MVP
    Blog | Twitter | LinkedIn

    Thursday, July 25, 2013 3:04 PM
  • >>sqsrvres] printODBCError: sqlstate = 08S01; native error = 79; message = [Microsoft][SQL Native Client]Communication link failure

    I cannot say its exactly SAN issue because in my case it was SAN network( HBA controllers )so i think you require through check not only for SAN but Storage Network likeHBA, HBA BIOS/Firmware/driver, Controller firmware, FC Switch firmware etc.


    Please mark this reply as the answer or vote as helpful, as appropriate, to make it useful for other readers

    Thursday, July 25, 2013 3:37 PM
  • What is happening at 2:00 AM.check the server event logs,SQL Server logs.SQL Agent,remote jobs...

    Here Communication link failure you mentioned ,so might be it is a network issue. 

    Thursday, July 25, 2013 3:58 PM
  • I am not seeing any other errors in SQL log except for below mesage:

    SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file  in database.  The OS file handle is 0x0000000000000B98.  The offset of the latest long I/O is: 0x000012b0dfe000

    and then message non-yielding scheduler appears with memory dump,sql instance restarts.In the Agent Log,there were same errors as i mentioned above:

    [298] SQLServer Error: 121, TCP Provider: The semaphore timeout period has expired. [SQLSTATE 08S01]
    [298] SQLServer Error: 121, Communication link failure [SQLSTATE 08S01]
    [298] SQLServer Error: 16389, Communication link failure [SQLSTATE 08S01]

    Thursday, July 25, 2013 5:32 PM
  • Hi,

    Definitely looks like a hardware issue to me.

    Please, check disks and fiber channel between your nodes and your storage.


    Sebastian Sajaroff Senior DBA Pharmacies Jean Coutu

    Thursday, July 25, 2013 5:40 PM
  • Also when the failover happens,Jobs were cancelled due to a restart and data was not saved to SQLAGENT.OUT and errorlog.

    Below errors were  logged only during failover and not  when the instance restarts on same node:

    Events 37(mpio) and Event 57 The system failed to flush data to the transaction log. Corruption may occur.

    Event 50 {Delayed Write Failed} Windows was unable to save all the data for the file . The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.

    Thursday, July 25, 2013 5:53 PM
  • How can i confirm and convince my boss that this is a storage issue.Can you tell me if there are any performance counters that i can collect to support my points or in any other way.
    Thursday, July 25, 2013 5:56 PM
  • Hi,

    Well, the previous error message looks pretty scary to me.

    Most SAN and NAS come with some kind of hardware diagnosis tools, please read their documentation

    and give them a try.

    Check also the fiber channel between your storage and your nodes, could be flaky as well.


    Sebastian Sajaroff Senior DBA Pharmacies Jean Coutu

    Thursday, July 25, 2013 6:01 PM
  • I have collected the performance counters and found that Page life expectancy to a value of "10" at this time with sudden increase in Buffer Manager :Pages/sec to a value of "14500",page IO Latch waits "15",%page file usage "90". Could someone tell me how i can troubleshoot this further.
    Friday, July 26, 2013 12:12 PM
  • I dont think anybody can point out concrete thing without knowing ur CPU usage and RAM but for one thingh I am pretty sure You PLE is very low.Which might indicate memory crunch

    Please mark this reply as the answer or vote as helpful, as appropriate, to make it useful for other readers

    Friday, July 26, 2013 12:32 PM
  • Available memory is dropping from 22 GB to 18 GB(Available MBytes)

    Is this low PLE because of the jobs that are running at this time (or) this might also happen because of malfunctioning storage system?

    Friday, July 26, 2013 12:49 PM
  • Hi,

    Low PLE is usually related to the running jobs, not to hardware issues.


    Sebastian Sajaroff Senior DBA Pharmacies Jean Coutu

    Friday, July 26, 2013 12:50 PM
  • We have only maintenance jobs running at this time(Backup/Index/Stats)  on 4 instances and the size of the DBs are in MB on 2 instances.

    And databases went to RECOVERY_PENDING state with log file corruption(Error 5173) and the below errors logged at the time of failover:

    Events 37(mpio) and Event 57 The system failed to flush data to the transaction log. Corruption may occur.

    Event 50 {Delayed Write Failed} Windows was unable to save all the data for the file . The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere.

    Even if the sql restarts on the same node,some data is getting lost like job history,error log etc.I would like to know why the data is getting lost.

    Friday, July 26, 2013 1:03 PM
  • Hi,

    You're losing data because there's corruption at the file level.

    Where does corruption come from?

    1. SQL crashing in the past because of low memory available (quite rare but not impossible).

    2. Physical issues at storage level?

    3. Both of them?

    Even if PLE is usually related to low memory, the kind of errors you're having point to file corruption,

    which can be logical (SQL crash in the past) or physical (failing disks, controllers, cache, etc)


    Sebastian Sajaroff Senior DBA Pharmacies Jean Coutu

    Friday, July 26, 2013 1:08 PM
  • Even if the sql server crashes and failover to other node,i have seen that everything will be logged to error log.There are no messages in sql errorlog why it is shutting down(due to a stop request or user) just the last message logged saying backup completed for database and new sql errorlog generates.

    Windows - Delayed Write Failed : Windows was unable to save all the data for the file D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG. The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere. 

    Can this happen without any messages in sql server errorlog.

    Friday, July 26, 2013 1:35 PM
  • Even if the sql server crashes and failover to other node,i have seen that everything will be logged to error log.There are no messages in sql errorlog why it is shutting down(due to a stop request or user) just the last message logged saying backup completed for database and new sql errorlog generates.

    Windows - Delayed Write Failed : Windows was unable to save all the data for the file D:\Microsoft SQL Server\MSSQL.1\MSSQL\LOG. The data has been lost. This error may be caused by a failure of your computer hardware or network connection. Please try to save this file elsewhere. 

    Can this happen without any messages in sql server errorlog.

    Shanu,

    I am almost 100 % sure this is SAN or SAN network issue.I faced same error message previous year and to be double sure locked case with MS and Storage to find out issue and my SQl server was in cluster..MS vindicated my point by pointing out that this is storage issue.My cluster and sql services were fluctuating also and i was getting same error message.

    So now you have to ask ur boss to agree to you or u can open case to MS and storage and finally prove your point.


    Please mark this reply as the answer or vote as helpful, as appropriate, to make it useful for other readers

    • Proposed as answer by Fanny Liu Thursday, August 1, 2013 9:23 AM
    • Marked as answer by Fanny Liu Monday, August 5, 2013 12:24 PM
    Friday, July 26, 2013 1:56 PM
  • Wanted to share,some of the issue in My environment I have seen -

    1.There Drivers of the Hardware is not up-to-date.still it is running through your older versions.

    2.The SQL server & databases are in high utilization due to Ad-hoc or your load on the database is high.

    3.your H/W  issues,like where the database file resides.


    Thanks, Rama Udaya.K (http://rama38udaya.wordpress.com) ---------------------------------------- Please remember to mark the replies as answers if they help and UN-mark them if they provide no help,Vote if they gives you information.

    Sunday, July 28, 2013 2:08 PM