none
DBCC CHECKDB... error state 1

    Question

  •  

    I am migrating our SQL 2000 databases to a SQL 2005 (9.0.3215) 32-bit Window v5.2: Service Pack 1; 3790; IE v6.0.3790.1830 server. During a job that runs DBCC CHECKDB on each database the logs are recording the following error on a few of the 35 databases on this server. They are all set at 2005 compatibility level, CHECKSUM page verify.

     

     

    DBCC encountered a page with an LSN greater than the current end of log LSN (43066:0:1) for its internal database snapshot. Could not read page (4871:-12646656), database 'OurDB' (database ID 40), LSN = (1996515584:-1962900224:40448), type = 85, isInSparseFile = 1.   Please re-run  this DBCC command.

     

    The above appears 1 to 3 times and is followed by

     

    DBCC CHECKDB (OurDB) executed by PHYSFAC\sqlservice terminated abnormally due to error state 1. Elapsed time: 0 hours 0 minutes 41 seconds.

     

    A search of MSDN reveals that Microsoft Customer Service and Support should be contacted... but I was wondering if anyone here has any suggestions on what to do...

     

    I just ran DBCC CHECKDB(OurDB) again as suggested in the error and got the following:

     

    Msg 8967, Level 16, State 216, Server OurServer, Line 1
    An internal error occurred in DBCC that prevented further processing. Contact Customer Support Services.
    DBCC results for 'OurDB'.
    Msg 8921, Level 16, State 1, Server OurServer, Line 1
    Check terminated. A failure was detected while collecting facts. Possibly tempdb out of space or a system table is inconsistent. Check previous errors.
    CHECKDB found 0 allocation errors and 0 consistency errors in database 'OurDB'.
    NULL

     

    Thanks.

    Friday, April 18, 2008 12:33 PM

Answers

  •  PUdba wrote:

    <snip>

     

    Msg 8928, Level 16, State 1, Line 1

    Object ID 1221579390, index ID 1, partition ID 72057594040745984, alloc unit ID 72057594044874752 (type In-row data): Page (1:143168) could not be processed. See other errors for details.

     

    Msg 8939, Level 16, State 98, Line 1

    Table error: Object ID 1221579390, index ID 1, partition ID 72057594040745984, alloc unit ID 72057594044874752 (type In-row data), page (1:144696). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 46270473 and -4.

     

    There are 693108 rows in 16198 pages for object "DEADFILES".

    CHECKTABLE found 0 allocation errors and 91 consistency errors in table 'DEADFILES' (object ID 1221579390).

    repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKTABLE (OurDB_ss.dbo.DEADFILES).

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

     

    This table is for long term archival purposes and not vital to the application according to the project manager. So I may have some wiggle room - but was wondering if anyone could tell me the next steps to repair something like this. Or is a repair_allow_data_loss our only option?

     

    (Partial quote of PUdba's post)

     

    Cool - the problem is that you have two corrupt data pages in the clustered index of DEADFILES. A row on each of these pages links to a LOB value (stored on another page). Because the data pages are corrupt and the rows are not processed, CHECKDB does not see the links from the data rows to the LOB pages - hence the two 8964 errors.

     

    If you run repair on this table then you will lose the data stored in the table rows on the two corrupt data pages. Do these corruptions exist when you CHECKDB on 2000? If so, you should repair them on 2000 before migrating to 2005. If not, then my guess is that the hardware you've restored to on 2005 is causing the corruptions.

     

    Kind of complicated to say what the best way for you to proceed is without knowing what your potential for data loss with old backups etc is.

     

    Thanks

    Monday, May 05, 2008 5:10 PM
  •  Paul Randal wrote:

     

    If you run repair on this table then you will lose the data stored in the table rows on the two corrupt data pages. Do these corruptions exist when you CHECKDB on 2000? If so, you should repair them on 2000 before migrating to 2005. If not, then my guess is that the hardware you've restored to on 2005 is causing the corruptions.

     

     

    I don't know if I should cry, celebrate or cuss... but thanks Paul. I restored a clean copy of the production database to the SQL 2000 Dev server, ran repair_rebuild and it found no issues. So I restored a copy of the database to the new SQL 2005 QA server and the new SQL 2005 Dev server <--- this is the one that had the error state 1 in the logs.

     

    Anyway, a DBCC CHECKDB on the new QA server ran clean... it was still throwing the error on the new Dev server. So, looks like it must be the hardware causing the corruptions. The server group (oh my!) is looking at it (and yes it took quite a bit of talking to convince them to look), but in the mean time, would you mind elaborating on your hardware guess? Does the error point to anything in particular like hard drives, memory, raid? Would like to give them a more narrow direction to search because they are saying their diagnostic application is not alerting them with a problem. They are looking to see if there may be a firmware upgrade needed on the drives though.

     

    Again, thank you so much for your time!

    Tuesday, May 06, 2008 4:01 PM

All replies

  • DBCC on SQL 2005 places a lot more pressure on tempdb than it did before.

    Increase the size of your tempdb and retry.

    If DBCC runs and confirms corruption of data pages remember that SQL 2005 allows a page level restore of individual pages from a previous backup, as long as you have all the subsequent transaction log backups as well.

    Don't whatever you do, use the dbcc repair_allow_data_loss option as some people are fond of suggesting - this will just wipe all corrupt pages leaving you with a physically fine but logically useless database, especially if multiple pages are wiped.

     

    Friday, April 18, 2008 6:28 PM
  • Rebuilt all indexes then ran checkdb afterwards. Still getting the same errors. Thanks anyway.

     

    Friday, April 18, 2008 8:26 PM
  • Have you checked the system tables via dbcc checkcatalog?

     http://msdn2.microsoft.com/en-us/library/ms186720.aspx

    Friday, April 18, 2008 8:42 PM
  • tempdb is now huge (we have plenty of room) and it still throws the same errors. We are migrating to 2005 and these databases have just been restored from the old 2000 server to the new 2005... we're still in dev by the way - but they are copies of the production databases (ugh). I will probably go back a the 2000 dev server and try to repair the offending databases there, then try the migrate again.

     

    Any advice would be appreciated though. 

     

    Friday, April 18, 2008 8:45 PM
  •  Ajmer Dhariwal wrote:

    Have you checked the system tables via dbcc checkcatalog?

     http://msdn2.microsoft.com/en-us/library/ms186720.aspx

     

    CheckCatalog was run on each of the databases with the above error. It did not report any errors found or repaired.

    Tuesday, April 22, 2008 2:14 PM
  •  SQLUSA wrote:
     PUdba wrote:

     I will probably go back a the 2000 dev server and try to repair the offending databases there, then try the migrate again.

     

    Any advice would be appreciated though. 

     

     

    That is a good idea.

     

    1. Dbreindex all tables

    2. checkdb

     

     

     

    Well... going back to 2000 did not seem to help a bit. After running the reindex and the repair in 2000, backing up one of the databases, restoring it to 2005, and running DBCC CheckDB again in 2005... got the same errors in the SQL error logs. Looks like I will have to Contact Customer Support Services...

    Tuesday, April 22, 2008 8:39 PM
  • For whatever reason, I recently got this kind of error moving a database from 2000 to 2005, and here is what I did to fix it.

     

    1. On a SQL 2000 server, I set the recovery model from Full to Simple.
    2. I shrunk all of the database files with page reorganization to their smallest possible size.
    3. Then I detached the database and copied the files to the 2005 server.
    4. Then I attached the copy of the files to the 2005 server.
    5. Then I rebuilt all indexes using the following code:

     

    1. Code Snippet

      exec sp_msforeachtable 'ALTER INDEX ALL ON ? REBUILD;'

    Then Run CheckDB.

     

    I have no idea why this fixed my problem, but you might give it a shot.  It takes forever though.

    Wednesday, April 23, 2008 1:04 AM
  • I went ahead and sent this post to Paul Randal who wrote DBCC CHECKDB while he worked at Microsoft.  He answers questions on the forums from time to time.  However he is in Orlando speaking at SQL Connections and won't return until the 30th, so it may take him some time to get to my email.  If you have the time, you can try what I did to fix this issue, and see if it helps resolve yours, or you can wait it out and see what he has to say if he gets to this.  He is probably the best source of information outside of an escalation call with Microsoft Support.

     

    Wednesday, April 23, 2008 1:45 AM
  •  Jonathan Kehayias wrote:

    For whatever reason, I recently got this kind of error moving a database from 2000 to 2005, and here is what I did to fix it.

     

    1. On a SQL 2000 server, I set the recovery model from Full to Simple.
    2. I shrunk all of the database files with page reorganization to their smallest possible size.
    3. Then I detached the database and copied the files to the 2005 server.
    4. Then I attached the copy of the files to the 2005 server.
    5. Then I rebuilt all indexes using the following code:

     

    1. Code Snippet

      exec sp_msforeachtable 'ALTER INDEX ALL ON ? REBUILD;'

    Then Run CheckDB.

     

    I have no idea why this fixed my problem, but you might give it a shot.  It takes forever though.

     

    I gave this a try... same errors appeared in the log. I also shot off an email to Paul Randal and got the automated response that he will return April 30. Guess I will try to move on with all of the other databases and come back to this once I hear from him.

     

    Thank you everyone for trying to help - I really appreciate it.

    Wednesday, April 23, 2008 2:10 PM
  •  

    Thanks Jonathan. As he says, I'm down here in Orlando at the SQL Connections conference so I'm not paying such close attention to the forums.

     

    This error is basically a race condition between the database snapshot code and the NTFS sparse file code. It's linked to the size of the files involved which is why changing them in some way can sometimes get rid of the problem. AFAIK there's no fix available until Windows has a fix in. So - not a bug in DBCC CHECKDB :-) You can try altering the database size and see if that removes the problem.

     

    I saw the advice of 'rebuild all indexes and re-run CHECKDB' in the thread - that's strange advice. Any corruption in a 2005 database has a reason behind it and should be investigated. Without looking at the output from CHECKDB, why immediately rebuild all the indexes? What do you think that's going to solve (apart from masking any non-clustered index corruptions temporarily)?

     

    Hope this helps

    Wednesday, April 23, 2008 2:16 PM
  • Just to explain my post a little further:

     

    The reason that I did an index rebuild is two fold.  First, the shring with page reorganization is going to fragment the database heavily, and second it is a recommended best practice after upgrading to SQL 2005 to rebuild all indexes and update statistics as well. 

     

    I did leave a step out and that was to grow the database to a size that should allow the rebuilding of the indexes to not force automatic file growth, but I did that just to reduce the impact of the cost to grow the database over and over during the rebuild operation.  I grew the database to have 25% free space before I ran the index rebuild.  I guess this makes sense for why my problem disappeared given Pauls information. 

    Wednesday, April 23, 2008 2:37 PM
  • I thought we had it. This worked for one of the databases, but not for 2 others. After doing the above and running DBCC checkdb again I still got the following error:

     

    Msg 8967, Level 16, State 216, Line 1

    An internal error occurred in DBCC that prevented further processing. Contact Customer Support Services.

     

    and the error state 1 message in the logs again.

     

    Is there a KB on the NTFS bug and its ill effect on migrating SQL database? I'd like to understand the "race" thing a little better.

    Thursday, April 24, 2008 2:20 PM
  • How much did you grow the file by?  Try doing this with varying growth of the files and see if it can be resolved by different sizes.  I know that this isn't a really good answer, but I don't know that there really is one.
    Sunday, May 04, 2008 1:51 AM
  •  SQLUSA wrote:
     

    Most corruptions occurs in indexes. Therefore rubuilding indexes is the first (brute force) step to resolve the issue. If corruption still exists after index rebuild, further diagnostics can follow.

     

    and what if these corruptions are in the leaf pages (data pages) of a clustered index ? will rebuild repair it ??

    Sunday, May 04, 2008 10:15 AM
  •  SQLUSA wrote:
    That is doubtful.

     

     

    That is what Paul was trying to say and he is absolutely right (as always ). Rebuilding the indexes isn't a rule of thumb, because for it won't help in cases with severe corruptions as in case of physical corruptions (maybe randomly in other logical corruptions)  . So  checking the error messages is vital to determine the nature of the problem instead of wasting some time in rebuilding the indexes immediately.

     

     

    Monday, May 05, 2008 7:41 AM
  •  Jonathan Kehayias wrote:
    How much did you grow the file by?  Try doing this with varying growth of the files and see if it can be resolved by different sizes.  I know that this isn't a really good answer, but I don't know that there really is one.

     

     

    Just to keep you all informed, I wanted to mention that Paul Randal also stated the following in a separate conversation...

     

    There's no KB article that I know of and the only workaround I've seen work sometimes is to grow or shrink the file. It's a Windows issue with sparse files (which database snapshots use). You could also try creating your own database snapshot and running CHECKDB on that - it's essentially the same process CHECKDB uses internally.

     

    I'm going to try the DBCC against a snapshot of one of the databases. If that doesn't work - I'll try another few size changes as you suggested Jonathan.

     

    Other than that, all that is left at this point is to contact Microsoft Customer Service and Support, as suggested in one of the error messages.

     

    I have another unresolved problem with scheduling replication between SQL 2005 servers that they can't seem to fix. The last I heard from the Microsoft Connect team on that was 3/13/08 - "we are investigating this".

     

    And by the way... I am probably a newbie by most of your standards, but I do routinely schedule maintenance of all of our databases each night - Dev, QA, & Prod. They are small enough and our equipment is robust enough that I can run a re-index, update stats and also check their integrity with DBCC - yes, that is nightly. So it really is not a matter of neglect that caused these problems.

     

    It seems, Paul has identified the issue... I just want to move on. I'm supposed to have all of these databases moved into Prod by the end of June... ugh. So much to do... So little time...

    Monday, May 05, 2008 1:44 PM
  • Running CHECKDB against the snapshot gave me the problem table id:

     

    Msg 8964, Level 16, State 1, Server PF3SQ1NEW, Line 1

    Table error: Object ID 1221579390, index ID 1, partition ID 72057594040745984, alloc unit ID 71856176488120320 (type LOB data). The off-row data node at page (1:62517), slot 15, text ID 28428861440 is not referenced.

     

    Yeah! better details... so then a Checktable gave me (I've deleted the repeated similar messages):

     

    Msg 8939, Level 16, State 98, Line 1

    Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID -7349815011223339008 (type Unknown), page (12594:808792112). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 46270473 and -4.

     

    Msg 8909, Level 16, State 1, Line 1

    Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 3473773127924908032 (type Unknown), page ID (1:144698) contains an incorrect page ID in its page header. The PageId in the page header = (18249:1394628944).

     

    CHECKTABLE found 0 allocation errors and 16 consistency errors not associated with any single object.

    DBCC results for 'DEADFILES'.

     

    Msg 8964, Level 16, State 1, Line 1

    Table error: Object ID 1221579390, index ID 1, partition ID 72057594040745984, alloc unit ID 71856176488120320 (type LOB data). The off-row data node at page (1:34417), slot 5, text ID 24951586816 is not referenced.

     

    Msg 8928, Level 16, State 1, Line 1

    Object ID 1221579390, index ID 1, partition ID 72057594040745984, alloc unit ID 72057594044874752 (type In-row data): Page (1:143168) could not be processed. See other errors for details.

     

    Msg 8939, Level 16, State 98, Line 1

    Table error: Object ID 1221579390, index ID 1, partition ID 72057594040745984, alloc unit ID 72057594044874752 (type In-row data), page (1:144696). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 46270473 and -4.

     

    There are 693108 rows in 16198 pages for object "DEADFILES".

    CHECKTABLE found 0 allocation errors and 91 consistency errors in table 'DEADFILES' (object ID 1221579390).

    repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKTABLE (OurDB_ss.dbo.DEADFILES).

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

     

    This table is for long term archival purposes and not vital to the application according to the project manager. So I may have some wiggle room - but was wondering if anyone could tell me the next steps to repair something like this. Or is a repair_allow_data_loss our only option?

    Monday, May 05, 2008 3:49 PM
  •  SQLUSA wrote:
     Mawla wrote:
    So  checking the error messages is vital to determine the nature of the problem instead of wasting some time in rebuilding the indexes immediately.

     

     

     

    That depends.

     

    When I am helping somebody remotely(likely a newbie), not having access to the server, that is the very first thing I recommend.  Solves the problem in 60% of the cases (my own ballpark).

     

    On my server, obviously, I investigate first.

     

    Wow - that's terrible advice to give out IMHO. Rebuilding the indexes causes lots of transaction log growth, probable data file growth and does nothing to address the cause of the corruption. All you do is mask the problem until the corruption occurs again - and in your 40% of cases, wastes a bunch of people's time. You should only rebuild indexes IF the only problems are with non-clustered indexes and you can be sure the index rebuild will succeed (i.e. no duplicate rows, no corruptions at the leaf level that would break an online index rebuild, etc, etc).

     

    Unless your strategy is to always go straight for restoring from a backup, you should always look at the CHECKDB output before taking any action.

     

    Thanks

    Monday, May 05, 2008 5:02 PM
  •  PUdba wrote:

    <snip>

     

    Msg 8928, Level 16, State 1, Line 1

    Object ID 1221579390, index ID 1, partition ID 72057594040745984, alloc unit ID 72057594044874752 (type In-row data): Page (1:143168) could not be processed. See other errors for details.

     

    Msg 8939, Level 16, State 98, Line 1

    Table error: Object ID 1221579390, index ID 1, partition ID 72057594040745984, alloc unit ID 72057594044874752 (type In-row data), page (1:144696). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 46270473 and -4.

     

    There are 693108 rows in 16198 pages for object "DEADFILES".

    CHECKTABLE found 0 allocation errors and 91 consistency errors in table 'DEADFILES' (object ID 1221579390).

    repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKTABLE (OurDB_ss.dbo.DEADFILES).

    DBCC execution completed. If DBCC printed error messages, contact your system administrator.

     

    This table is for long term archival purposes and not vital to the application according to the project manager. So I may have some wiggle room - but was wondering if anyone could tell me the next steps to repair something like this. Or is a repair_allow_data_loss our only option?

     

    (Partial quote of PUdba's post)

     

    Cool - the problem is that you have two corrupt data pages in the clustered index of DEADFILES. A row on each of these pages links to a LOB value (stored on another page). Because the data pages are corrupt and the rows are not processed, CHECKDB does not see the links from the data rows to the LOB pages - hence the two 8964 errors.

     

    If you run repair on this table then you will lose the data stored in the table rows on the two corrupt data pages. Do these corruptions exist when you CHECKDB on 2000? If so, you should repair them on 2000 before migrating to 2005. If not, then my guess is that the hardware you've restored to on 2005 is causing the corruptions.

     

    Kind of complicated to say what the best way for you to proceed is without knowing what your potential for data loss with old backups etc is.

     

    Thanks

    Monday, May 05, 2008 5:10 PM
  •  Paul Randal wrote:

     

    If you run repair on this table then you will lose the data stored in the table rows on the two corrupt data pages. Do these corruptions exist when you CHECKDB on 2000? If so, you should repair them on 2000 before migrating to 2005. If not, then my guess is that the hardware you've restored to on 2005 is causing the corruptions.

     

     

    I don't know if I should cry, celebrate or cuss... but thanks Paul. I restored a clean copy of the production database to the SQL 2000 Dev server, ran repair_rebuild and it found no issues. So I restored a copy of the database to the new SQL 2005 QA server and the new SQL 2005 Dev server <--- this is the one that had the error state 1 in the logs.

     

    Anyway, a DBCC CHECKDB on the new QA server ran clean... it was still throwing the error on the new Dev server. So, looks like it must be the hardware causing the corruptions. The server group (oh my!) is looking at it (and yes it took quite a bit of talking to convince them to look), but in the mean time, would you mind elaborating on your hardware guess? Does the error point to anything in particular like hard drives, memory, raid? Would like to give them a more narrow direction to search because they are saying their diagnostic application is not alerting them with a problem. They are looking to see if there may be a firmware upgrade needed on the drives though.

     

    Again, thank you so much for your time!

    Tuesday, May 06, 2008 4:01 PM
  • ok - one word of advice - never run a repair option as your default way of running CHECKDB. Firstly, you want to make your own decisions about fixing corruption rather than just always repairing it - even though REPAIR_REBUILD will rebuild non-clustered indexes, you will lose the corrupt data and hence the ability to do root-cause analysis of the corruption. Secondly, it requires the database to be in single-user mode and CHECKDB will run single-threaded - i.e. much slower.

     

    My hardware guess is based on the fact that your 2000 database is clean, and the corruption errors are not something I'd expect an Engine bug to produce but look like an IO subsystem corruption. I'd focus on the controllers and drives. The diagnostic utilities supplied by most hardware vendors will not find stress-related problems as they don't place a sustained high load on the IO subsystem. There's a utility you can download from MS called SQLIOsim which will allow you to induce high stress loads and see if anything breaks - checkout http://www.sqlskills.com/blogs/paul/CategoryView,category,SQLIOSim.aspx for more details.

     

    Also, you should enable page checksums on your 2005 databases too to aid in early discovery of IO problems. See http://www.sqlskills.com/blogs/paul/CategoryView,category,IO%2BSubsystems.aspx for lots more info. There's also coverage of page checksums in the upcoming June TechNet Magazine SQL Q&A column I wrote.

     

    Hope this helps!

     

    Tuesday, May 06, 2008 6:05 PM
  • The server group discovered that a firmware upgrade on 3 of the hard drives had not been applied. They just applied the upgrades and DBCC CHECKDB is no longer throwing errors. Wow. Am very, very glad it is fixed but am stewing quite a bit over the time I spent tracking it all down. On the bright side - I learned quite a bit more about SQL Server and the help provided on this forum. Thank you Paul - it is amazing you have the time and inclination to focus on issues such as mine. Thank you also to everyone offering suggestions.

     

    Rest assured my default in SQL 2005 is set to Checksum page verify and the default Checkdb is just that - a check... the research I have done over the past year had already made me decide those two things.

     

    Thanks again... and hope to be able to attend another conference and take some more sessions given by you and Kimberly.

    Tuesday, May 06, 2008 6:58 PM
  • You're welcome! Even though it takes time to hang-out on the forums, it's a great way to learn about issues people are hitting (and I just like helping people ).

     

    Glad you got it sorted out - look forward to seeing you soon. We're doing a bunch of stuff at TechEd in Orlando in June, TechEd EMEA in Barcelona in November, SQL Connections in Vegas in November, and PASS in Seattle in November - plenty to choose from!

     

    Cheers

     

    Tuesday, May 06, 2008 7:23 PM
  • Turns out the original 8967 error isn't a race condition as I thought, but has now been found to be a bug in CHECKDB (not mine ). See http://www.sqlskills.com/blogs/paul/2008/05/22/CHECKDBBugThatPeopleAreHittingMsg8967Level16State216.aspx

     

    Thanks

    Thursday, May 22, 2008 10:29 PM
  • Interesting... thanks for the follow-up!

     

    Tuesday, May 27, 2008 2:09 PM