January 20, 2012 1:27
On Tuesday, January 17th, 5 of our SQL servers hosted on an EqualLogic PS3000 array stopped working. Upon further investigation, the Windows Application Log and SCOM showed that the NTFS file system had been corrupted.
All servers showed the alert at 11:50 AM: “NTFS file system had been repaired”
At 11:51, one minute later, the SQL LDF files were totally corrupted on all 5 servers.
After opening tickets with Microsoft to assist in rebuilding the databases, we determined that too much damage had been done to repair them.
All SQL servers run on Hyper-V on a 4-node cluster.
All servers are equipped with an E Drive for their data partition. This partition also contains the SQL Data and log files.
The E partitions of all servers are contained in a single CSV. The entire Hyper-V environment is hosted on 2 EqualLogic PS3000 arrays (RAID 50).
EqualLogic array firmware: V5.1.2 (R197668)
All Servers have DPM performing SQL backups.
Diskpart – All nodes configured to Offline Shared
DPM 2010 is used for SQL backups and snapshots. DPM uses EqualLogic hardware snapshots.
This was really a mess and took several days to get things back to normal. I am very concerned that this could happen again since we have no clue what caused this. I have never experienced anything like this.
I am looking for any guidance as to what could have corrupted the SQL LDF files on all 5 servers almost instantly. This environment has been running flawlessly for over 6 months. Nothing out of the ordinary was occurring; this happened in the middle of the work day.
I don’t know if this is related to Hyper-V; however, I figured I would post this here in case anyone has any guidance.
I am opening a ticket with EqualLogic as well.
May 15, 2012 12:54
To be honest, this actually seems related to some fault in the storage array, such as stale data returned from cache or something similar. Did you find nothing in the storage logs?
June 9, 2012 2:50
I second that. It really seems to have been caused by the EQ SAN.
They have excellent support, so I would open a case with them and have them pull the logs and analyze them. I am sure they can figure it out.