2008 R2 Hyper-V Cluster using CSV and EqualLogic SAN - NTFS corruption across 5 VMs

  • Question

  • On Tuesday, January 17th, five of our SQL servers hosted on an EqualLogic PS3000 array stopped working. Upon further investigation, the Windows Application Log and SCOM showed that the NTFS file system had been corrupted.

     

    All servers showed the alert at 11:50 AM: “NTFS file system had been repaired”

    At 11:51, one minute later, the SQL LDF files were totally corrupted on all 5 servers.

    After opening tickets with Microsoft to assist in rebuilding the databases, we determined that too much damage had been done to repair them.

     

    All SQL servers are using Hyper-V running on a 4 node Cluster. 

    All servers are equipped with an E Drive for their data partition.  This partition also contains the SQL Data and log files.

    All servers' E partitions are contained in a single CSV. The entire Hyper-V environment is hosted on 2 EqualLogic PS3000 arrays – RAID 50.

    EqualLogic Array Firmware V5.1.2 (R197668)

    All Servers have DPM performing SQL backups.

    Diskpart – All nodes configured to offline shared

    DPM 2010 is being used for SQL backups and snaps. DPM is using EqualLogic hardware snaps.

     

    This was really a mess and took several days to get things back to normal.   I am very concerned that this could happen again since we have no clue what caused this.  I have never experienced anything like this.

     

    I am looking for any guidance as to what could have corrupted the SQL LDF files on all 5 servers almost instantly. This environment has been running flawlessly for over 6 months. Nothing out of the ordinary was occurring; this happened in the middle of the work day.

     

    I don’t know if this is related to Hyper-V; however, I figured I would post this here in case anyone has any guidance.

     

    I am opening a ticket with EqualLogic as well.



    • Edited by Vincent Hu (Moderator) Friday, January 20, 2012 2:17 AM change font color
    • Edited by Cerw1n Friday, January 20, 2012 3:47 AM
    Friday, January 20, 2012 1:27 AM

All replies

  • To be honest, this actually seems related to some fault in the storage array, such as stale data returned from cache or something similar. Did you find anything in the storage logs?


    /Mat

    Tuesday, May 15, 2012 12:54 PM
  • I second that. It really seems to have been caused by the EQ SAN.

    They have excellent support, so I would open a case with them and have them pull and analyze the logs; I am sure they can figure it out.


    Mohsen Almassud

    Saturday, June 9, 2012 2:50 AM