Questions on Exchange corruption

  • Question

  • Hi

    Environment: Exchange 2007 SP2 on Windows 2008 Server SP2. CCR/SCC clustering in place with SCR.

    Example scenario: Due to a driver issue, we receive 10xx alerts on our Exchange servers indicating that there is some corruption within the database. From my understanding, CCR is better in this scenario, as the corruption can't be replicated to the passive node because of the Inspector directory. With SCC, on the other hand, there is only one set of data, so my possible options are:

    i. Move mailboxes on affected DB to another new DB

    ii. Restore from backup

    iii. Run ESEUtil /p to hard repair the database (last resort).

    I was hoping someone could answer some queries I had on this:

    1. Am I correct that most types of data corruption cannot be replicated with CCR/SCR technology? In that case, if the error happened on a CCR cluster, the best option would be to fail over to the passive node.

    2. If this happened on an SCC cluster, is the 'move mailbox' idea the best? However, how can we ensure that we don't carry any corruption across or have any data loss when the mailboxes are moved?

    3. If we did decide to use ESEUtil /p, it would be safest to back up the database first. However, I always thought backups would fail when attempting to back up a database that had corruption?

    As I mentioned, this is an example scenario.

    Saturday, September 10, 2011 9:20 PM

All replies

  • Logical corruption can certainly be replicated; physical, not so much.

    Regardless, my first step for logical corruption would be isinteg, followed by a reseed.

    For physical, assuming the store was still up, I would be moving mailboxes to a new store that has already been replicated.

    Restoring from backups would be the next step if the first two weren't possible.

    eseutil /p would be pretty much off the table, but if I did run it, I would move mailboxes from the repaired store to a new one that has been replicated.

    Saturday, September 10, 2011 10:01 PM
    Moderator
  • Hi Andy

    Thanks! Some more questions :-)

    "Logical corruption can certainly be replicated, physical, not so much." > How can we tell if the corruption is logical or physical? I guess physical corruption is caused by a hardware fault (driver etc), whereas logical? Looking at http://support.microsoft.com/kb/314917 ,it seems to me that

    1018/1019: Generally physical

    1022: Database error

    Would I be correct? How can we be sure? And if the error was 1018/1019, you don't recommend failing over to the passive CCR node to try and resolve?

    "....I would be moving mailboxes to a new store that has already been replicated."

    Do you know a command or switch in Exchange 2007 that we can use that will prevent data loss for any mailbox moves? That is, if there is likely to be any data loss for a particular mailbox, then exclude that mailbox from any moves?

    Finally, how are we alerted for these errors? I know there are errors in the Event Log, but how often do they appear? 

     

     

    Saturday, September 10, 2011 10:17 PM
  • Yes, for physical corruption, I would fail over to the other node and mount. Sorry, I was probably typing that too fast!

    If that didn't work, then I would be thinking about moving mailboxes.

    You'll know when it's physical: those -1018 etc. errors are exactly the kind of thing you will see.

    Logical corruption is probably a little harder to diagnose. It could range from nothing but event log errors that you can ignore all the way to store crashes. Here is an example of logical corruption that gets replicated to both copies of the store in 2007:

    http://support.microsoft.com/kb/959135

    In this case, the quick fix was to run isinteg against the store after moving the mailbox that was causing the problem to another isolated store. Once isinteg was run, a reseed of the passive node was required.

    I don't know of any switch that does that, but to be honest, if there are items that are corrupt you don't want them anyway. Those are generally calendar items.

    • Marked as answer by Sophia Xu Thursday, September 15, 2011 5:34 AM
    Sunday, September 11, 2011 1:10 PM
    Moderator
  • Oh, to your question:

    "Do you know a command or switch in Exchange 2007 that we can use that will prevent data loss for any mailbox moves? That is, if there is likely to be any data loss for a particular mailbox, then exclude that mailbox from any moves?"

    If you set the allowed data loss to 0 on mailbox moves and there are corrupt items, then the mailbox won't be moved. So I guess, in a roundabout way, that accomplishes what you want. But it doesn't identify the corrupt items, and ExMerge/export to a PST won't export those corrupt items either, so typically you set moves to allow for some item corruption; otherwise you may never be able to move those mailboxes.
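
    As a rough illustration of that trade-off, here is a minimal Python sketch (not an Exchange API). The function name `plan_moves`, the mailbox names, and the per-mailbox corrupt-item counts are all hypothetical; in practice the counts are not known up front and only surface when a move runs into them. The sketch assumes the "data loss" setting behaves like a per-mailbox bad item limit, where a move fails once the number of corrupt items exceeds the limit.

```python
def plan_moves(corrupt_items_by_mailbox, bad_item_limit):
    """Split mailboxes into those that would move and those that would fail.

    corrupt_items_by_mailbox: hypothetical mapping of mailbox name -> number
    of corrupt items the move would encounter.
    bad_item_limit: how many corrupt items a single move is allowed to skip.
    """
    moved, failed = [], []
    for mailbox, corrupt in sorted(corrupt_items_by_mailbox.items()):
        if corrupt <= bad_item_limit:
            moved.append(mailbox)   # any corrupt items are left behind
        else:
            failed.append(mailbox)  # mailbox stays on the source store
    return moved, failed


# Hypothetical example data, for illustration only.
mailboxes = {"alice": 0, "bob": 3, "carol": 12}
strict = plan_moves(mailboxes, bad_item_limit=0)
lenient = plan_moves(mailboxes, bad_item_limit=5)
```

    With a limit of 0 only the clean mailbox moves; raising the limit lets more mailboxes move, at the cost of the corrupt items that get skipped.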

    • Marked as answer by Sophia Xu Thursday, September 15, 2011 5:34 AM
    Sunday, September 11, 2011 1:16 PM
    Moderator
  • Thanks Andy.

    In terms of logical v physical corruption, are the 10xx events always physical corruption then?

    Sunday, September 11, 2011 1:44 PM
  • "Thanks Andy.

    In terms of logical v physical corruption, are the 10xx events always physical corruption then?"

    Yes, it's pretty safe to say that if you ever see those -1018, -1019 or -1022 errors, it's physical.

    (Physical errors are pretty rare with today's hardware, but can still happen.)

    Error correction was added beginning with Exchange 2003 SP1:

    http://support.microsoft.com/kb/867626
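
    If you want to flag these codes automatically, a small sketch along the following lines could scan event text for them. The function name and the sample message are hypothetical, and the sketch only matches the three codes discussed in this thread; it does not read the Windows Event Log itself.

```python
import re

# The three ESE error codes treated as physical corruption in this thread.
PHYSICAL_ESE_ERRORS = {
    -1018: "JET_errReadVerifyFailure (page checksum mismatch)",
    -1019: "JET_errPageNotInitialized (page is unexpectedly blank)",
    -1022: "JET_errDiskIO (disk I/O failure)",
}

_CODE_PATTERN = re.compile(r"-(1018|1019|1022)\b")


def physical_errors_in(message):
    """Return the sorted physical-corruption codes mentioned in a message."""
    return sorted({-int(digits) for digits in _CODE_PATTERN.findall(message)})


# Hypothetical event text, for illustration only.
sample = "Read of database page failed verification with error -1018."
codes = physical_errors_in(sample)
```

    Anything this returns would be a cue to look at the hardware and at the failover option, not proof of which pages are damaged.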

     


    Sunday, September 11, 2011 1:57 PM
    Moderator
  • Thanks and final question :-)

    If we did see those errors you mention in the Event Log, does that mean there definitely IS corruption and we should take action (failover, move mailboxes to another store etc) immediately rather than wait for users to notice?

    Sunday, September 11, 2011 7:12 PM
  • You mean ones like those mentioned in

    http://support.microsoft.com/kb/959135 ?

    In that case the store crashes, and failing over to the passive node crashes as well, since the problem is replicated.

    If the store doesn't crash, you could fail over, yes, to see if the errors follow.

     

    Sunday, September 11, 2011 8:04 PM
    Moderator
  • The link you mentioned above is for logical corruption isn't it? From your comments before, I take it that simply failing over (if we're using CCR) is not going to help with that situation, and we're stuffed if we're using SCC, so the only option is to move mailboxes out/ restore? (And call PSS of course :-) )

    I was more referring to the 10## errors mentioned in http://support.microsoft.com/kb/314917 which I believe are physical. How do we know there is actually corruption there? Is there a command we can run to tell us for definite so we don't move mailboxes or fail over for no reason (ESEUtil /k for instance) preferably without having to dismount the store to run it? 

    Sunday, September 11, 2011 8:49 PM
  • If there is physical corruption that can't be corrected, the store won't mount. That's when you would fail over and attempt to mount the other copy.

     

    • Marked as answer by Sophia Xu Thursday, September 15, 2011 5:35 AM
    Sunday, September 11, 2011 9:56 PM
    Moderator
  • Sure, but it is possible to have physical corruption without the store dismounting, isn't it? Is the only way to verify that there was actually some corruption (as opposed to the alert being a false alarm) by using ESEUtil /k?
    Sunday, September 11, 2011 10:04 PM
  • Yep, it's possible to have -1018/19/22 errors and the store will remain up. No need to verify with eseutil; those errors aren't lying.

    That goes back to the failover option. If you fail over and there are no errors, then you fix the hardware issue on the node that was throwing errors and then reseed from the active copy.
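
    Pulling the advice in this thread together, the order of options could be sketched roughly as below. This is only a summary in Python form, not an official procedure; the enum and function names are made up for illustration, and each entry is an option to try in order, falling back to the next if it is not possible or does not help.

```python
from enum import Enum


class Corruption(Enum):
    PHYSICAL = "physical"  # e.g. -1018 / -1019 / -1022 errors
    LOGICAL = "logical"


class Cluster(Enum):
    CCR = "CCR"  # active and passive copies of the data
    SCC = "SCC"  # a single shared copy of the data


def recovery_options(corruption, cluster):
    """Return the options discussed in the thread, in order of preference."""
    if corruption is Corruption.LOGICAL:
        # Logical corruption is replicated, so failing over does not help.
        options = ["run isinteg against the affected store"]
        if cluster is Cluster.CCR:
            options[0] += ", then reseed the passive copy"
        return options

    options = []
    if cluster is Cluster.CCR:
        options.append(
            "fail over and mount the other copy; if the errors do not "
            "follow, fix the faulty node's hardware and reseed it"
        )
    options += [
        "move mailboxes to a new store",
        "restore from backup",
        "eseutil /p as a last resort, then move mailboxes off the repaired store",
    ]
    return options


ccr_physical = recovery_options(Corruption.PHYSICAL, Cluster.CCR)
scc_physical = recovery_options(Corruption.PHYSICAL, Cluster.SCC)
```

    The only difference between the two physical cases is the first step: SCC has no second copy to fail over to, so it starts at moving mailboxes.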

     

    • Marked as answer by Sophia Xu Thursday, September 15, 2011 5:35 AM
    Sunday, September 11, 2011 10:42 PM
    Moderator