Answered Lagged database copy filling disk with logs suddenly.

  • 2012年4月27日 上午 10:41
     
     

    Hello,

    I'm hoping someone can shed some light on an issue I'm experiencing with a lagged database copy on one of our four mailbox servers.

    I have 11 dbs in a four node DAG, Exchange 2010 SP2 RU1:

    MBX01 - 5 active db's, 6 passive db copies

    MBX02 - 6 active db's, 5 passive db copies

    MBX03 - 11 passive db copies

    MBX04 - 11 passive lagged copies - 14 day lag replay and truncation.

    One of the 14 day lagged db copies on the MBX04 server has suddenly started eating through disk space on it's respective disk at a rate of 10GB and hour, to the point where the disk filled up yesterday, and the db copy failed.

    I removed the affected copy from Exchange, deleted the folder structure from the disk, and added a new db copy.

    The new copy seeded successfully and is currently in a healthy state. 

    The problem is the disk space is decreasing again this morning, at a similar rate.

    The database size is around 90GB, but there currently is around 130GB of log files, which is obviously not right.

    

    As you can see from the attached disk space graph up until 24/04 the free disk space on this disk held steady at around 50%. This then rapidly decreased until the disk was full on the 26th. The db files and logs were then deleted, and the disk space replenished. The decrease is happening again today however as illustrated.

    Can anyone assist with troubleshooting this problem?

    Regards

    Matt


    Matt

所有回覆

  • 2012年4月27日 下午 05:05
     
     

    We had a bug , which has been fixed in Sp2 Ru1 ,

    But you already have that.

    ----

    User is not giving High CPU to the server. As we have only lagged copies in the server

    ---

    Check for CPU usage  ?

    Any Email Archive Setup in your ORG ?

    Antivirus Exclustions set on your DB locations correctly ?

    Any recent changes made ?

    When Backup runs ? 

    Any Extra applications on MB04 ? 

    1 thing to clarify . Only the Copy increase in size , Not the primary ?

     or its something different


    Satheshwaran Manoharan | Exchange 2003/2007/2010 | Blog:http://www.careexchange.in | Please mark it as an answer if it really helps you

  • 2012年4月27日 下午 06:33
     
     

    Hi Satheshwaran,

    thanks for the reply.

    To answer the points above:

    CPU usage is fine

    No email archives in the org

    No AV running on the MB servers

    No recent changes made

    No backup running on the servers - backup less solution using JBOD array and lagged copies.

    No extra apps on the MBX04 box

    The primary db has not increased in size, nor have the log files increased to an abnormal amount.

    I tried seeding the new db copy from MBX03, instead of the primary db on MBX02 to see if this made a difference, however the disk space is still decreasing at the same rate.

    Matt


    Matt

  • 2012年4月27日 下午 06:52
     
     

    Am Suspecting only the Drive Here.

    Can you Check Bytes Per Sector Remains the same on the drive . (Compare Source And Target )

    Cmd -

    fsutil fsinfo ntfsinfo C:


    Satheshwaran Manoharan | Exchange 2003/2007/2010 | Blog:http://www.careexchange.in | Please mark it as an answer if it really helps you

  • 2012年4月28日 上午 07:53
     
     

    OK, I was thinking drive initially, but there is another lagged db copy on this disk which is fine.

    The storage disk (and five other disks per server) are mounted folders within the E:\ of each server (The E:\ is a mirrored pair of disks within the server). The actual disks are SAS attatched disks in a storage array directly attached to each server. The Dell Openmanage software that controls each server and attached storage array is reporting no issues with the physical or virtual disks.

    With regards to the fsutil command, how would I direct this toward the mounted folder - Storage Disk 1?


    Matt

  • 2012年4月28日 上午 08:24
     
     

    If its stored E drive run the below command

    fsutil fsinfo ntfsinfo E:

    ------------

    Matt , Isolating things will only help in our situation .

    Cause we don't have clue what is happening to the lagged copy.

    If its something by design .. The primary copy size should  also increase . In our case its something un usual.

    ------------

    Approx how many mailboxes , you got on that Db ?


    Satheshwaran Manoharan | Exchange 2003/2007/2010 | Blog:http://www.careexchange.in | Please mark it as an answer if it really helps you

  • 2012年4月28日 上午 09:09
     
     

    OK the data from both primary and lagged copy mailbox servers is below:

    MBX02 - primary active copy:

    NTFS Volume Serial Number :       0x429ead549ead40f1
    Version :                         3.1
    Number Sectors :                  0x000000000877e7ff
    Total Clusters :                  0x00000000010efcff
    Free Clusters  :                  0x0000000000bc4c14
    Total Reserved :                  0x0000000000000000
    Bytes Per Sector  :               512
    Bytes Per Physical Sector :       <Not Supported>
    Bytes Per Cluster :               4096
    Bytes Per FileRecord Segment    : 1024
    Clusters Per FileRecord Segment : 0
    Mft Valid Data Length :           0x0000000000f00000
    Mft Start Lcn  :                  0x00000000000c0000
    Mft2 Start Lcn :                  0x0000000000000002
    Mft Zone Start :                  0x00000000000c0f00
    Mft Zone End   :                  0x00000000000cc820
    RM Identifier:        EFDB6E03-5F67-11E0-B5A5-001C23DB0066

    MBX04 - lagged copy

    NTFS Volume Serial Number :       0x845e53c45e53ada4
    Version :                         3.1
    Number Sectors :                  0x000000000877e7ff
    Total Clusters :                  0x00000000010efcff
    Free Clusters  :                  0x0000000000b8b7ea
    Total Reserved :                  0x0000000000000000
    Bytes Per Sector  :               512
    Bytes Per Physical Sector :       <Not Supported>
    Bytes Per Cluster :               4096
    Bytes Per FileRecord Segment    : 1024
    Clusters Per FileRecord Segment : 0
    Mft Valid Data Length :           0x0000000000f80000
    Mft Start Lcn  :                  0x00000000000c0000
    Mft2 Start Lcn :                  0x0000000000000002
    Mft Zone Start :                  0x00000000000c0f80
    Mft Zone End   :                  0x00000000000cc820
    RM Identifier:        16892C21-5788-11E0-AD5D-001C23C18EFC

    There are 49 mailboxes on that database

    Matt


    Matt

  • 2012年4月28日 上午 09:54
     
     

    There is no format issues..

    --

    There might be a problem with the database itself.

    As there is just 49 mailboxes.. 

    Is it possible to move it over to a different one..

    I don't think even taking a trace won't help in our situation.


    Satheshwaran Manoharan | Exchange 2003/2007/2010 | Blog:http://www.careexchange.in | Please mark it as an answer if it really helps you

  • 2012年4月30日 上午 07:23
     
     

    OK, I will look at moving the mailboxes to a new database.

    Out of interest when a new database copy is created, which database is used to seed from if there are multiple copies? Will the active database always be used as the source, or will another passive copy be used if available?

    Over the weekend I activated the "problem" db on another server, and created a new db lagged copy assuming the active copy would be used as the source.

    Using the Update-MailboxDatabaseCopy cmdlet allows you to specify a source server, however there is no such switch when the Add-MailboxDatabaseCopy is used as far as I can tell.

    Matt


    Matt

  • 2012年4月30日 下午 05:47
     
     
    With Add-MailboxDatabaseCopy there's the -SeedingPostponed switch will then allow you to run Update-MDC at a later time. There could be a different couple of reasons to explain the issue you're seeing. You say that logs aren't being generated at an abnormal rate, what about log truncation? When you get the stats for a database copy, does the log truncation generation number increase? It's possible that a log file on the source became corrupt before getting copied to the last database copy, is no longer getting truncated and is being recopied every time you re-add the copy.
  • 2012年4月30日 下午 06:08
     
     

    Hi Jared

    Thanks for the reply. I've just gathered some stats using get-mailboxdatabasecopystatus cmdlet, on the server with the issue:

    Identity                         : Mailbox Database 2 - DEF\MBX04
    Name                             : Mailbox Database 2 - DEF\MBX04
    DatabaseName                     : Mailbox Database 2 - DEF
    Status                           : Healthy
    MailboxServer                    : MBX04
    ActiveDatabaseCopy               : mbx01
    ActivationSuspended              : False
    ActionInitiator                  : Administrator
    ErrorMessage                     : 
    ErrorEventId                     : 
    ExtendedErrorInfo                : 
    SuspendComment                   : 
    SinglePageRestore                : 0
    ContentIndexState                : Healthy
    ContentIndexErrorMessage         : 
    CopyQueueLength                  : 0
    ReplayQueueLength                : 118071
    LatestAvailableLogTime           : 30/04/2012 19:05:00
    LastCopyNotificationedLogTime    : 30/04/2012 19:05:00
    LastCopiedLogTime                : 30/04/2012 19:01:07
    LastInspectedLogTime             : 30/04/2012 19:01:07
    LastReplayedLogTime              : 29/04/2012 19:04:18
    LastLogGenerated                 : 1721178
    LastLogCopyNotified              : 1721178
    LastLogCopied                    : 1721178
    LastLogInspected                 : 1721178
    LastLogReplayed                  : 1603107
    LogsReplayedSinceInstanceStart   : 0
    LogsCopiedSinceInstanceStart     : 53598
    LatestFullBackupTime             : 
    LatestIncrementalBackupTime      : 
    LatestDifferentialBackupTime     : 
    LatestCopyBackupTime             : 
    SnapshotBackup                   : 
    SnapshotLatestFullBackup         : 
    SnapshotLatestIncrementalBackup  : 
    SnapshotLatestDifferentialBackup : 
    SnapshotLatestCopyBackup         : 
    LogReplayQueueIncreasing         : False
    LogCopyQueueIncreasing           : False
    OutstandingDumpsterRequests      : {}
    OutgoingConnections              : 
    IncomingLogCopyingNetwork        : 
    SeedingNetwork                   : 
    ActiveCopy                       : False


    Matt

  • 2012年5月2日 下午 06:29
     
     

    To see the state of truncation requires the perf counters. If you grab the truncation counter on all of the copies for the database, at least one of them should be progressing.

  • 2012年5月3日 上午 01:53
     
     

    From the status above, can see that

    ReplayQueueLength                : 118071

    and

    LastLogGenerated                 : 1721178
    LastLogCopyNotified              : 1721178
    LastLogCopied                    : 1721178
    LastLogInspected                 : 1721178
    LastLogReplayed                  : 1603107

    Subtracting 1603107 from 1721178 = 118071

    That's about 118 GB of log data - and is quite a lot for a 90 GB DB  :)

    So those numbers are making sense.

    What I think is happening is that you have an issue with this DB, but you are not seeing the symptoms on the active copy since you are probably using CRCL (continuous replication circular logging) as you are running backup less.  Since this is a lagged copy, ESE will keep the logs until the time specified for the replay, which is why this is visible on the lagged copy.

    The question is why are the logs being generated.....

    I'd start with running ExMon and seeing what users you can identify as causing high load - and then seeing what they are up to.  Some reading is also in Mike's awesome blog here:  http://blogs.technet.com/b/mikelag/archive/2009/07/12/troubleshooting-store-log-database-growth-issues.aspx


    Cheers, Rhoderick NOTICE: My posts are provided “AS IS” without warranty of any kind, either expressed or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.

  • 2012年5月3日 上午 08:08
     
     

    Hello,

    So since I last posted, I've been doing some more investigation.

    Long story short....ExMon has allowed me to identify that there is an issue with one users mailbox in the active database.

    An extremely high CPU usage, along with a huge log bytes number led me believe there may be some corruption in this individual mailbox.

    As a result of these findings, I then moved the mailbox to another database hoping any corrupt items would be weeded out. The mailbox was moved successfully, but no corrupt items were found according to the New-MoveRequest log.

    This mailbox is definitely the cause of the problem with the lagged copy however, as the disk space problem has now moved to the disk of the database the mailbox has been moved to.

    So, the next thing I looked into was the mailbox folder structure, specifically the recoverable items folder and sub folders, as I thought some item(s) in here mail be causing the excessive log generation. We use Single Item Recovery and Litigation Hold here, but compared to other mailboxes the amount of data held in this users Recoverable items folders was excessive (13GB).

    Again... long story short, MFCMAPI allowed me to empty the Recoverable Items, Deletions, Purges, Versions folders.

    The users mailbox is now a "reasonable" size of 4GB.

    The disk space is still being eaten up at the same rate on the lagged copy server, and the users mailbox is still generating high CPU and log bytes within ExMon, so my question is - is this continued behaviour due to the lagged data still being "held" within Exchange's databases, even though potentially the "live" version of the mailbox is OK, or is there still an issue with this users mailbox?

    The user has had a new Outlook profile created on his laptop, and I'm now going to pause the lagged copy of the database and see if there is a correlation between this copy and the high usage in ExMon.

    Any other troubleshooting tips anyone can offer are welcome......

    Regards


    Matt

  • 2012年5月3日 下午 12:11
     
     

    Hi Matt

    Great to hear you are making progress. 

    ExMon monitors what the end user is doing, and that's against the active copy.  The lagged is just receiving the changes that are made there.  So something is up with the mailbox...

    I'd really want to see what this user is doing - do they have any special Outlook addins configured?  What is making them different from your other users?

    Does the behavior continue when they are out of Outlook and use on OWA for example?


    Cheers, Rhoderick NOTICE: My posts are provided “AS IS” without warranty of any kind, either expressed or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.

  • 2012年5月8日 上午 09:35
     
     

    OK, so I'm not really any further down the road in terms of solving this issue.

    The user has had a new Outlook profile created on his laptop, but this has made no difference to the situation.

    Exmon is still showing high CPU etc... below is a an exported set of stats:

    

    In "ClientMon" view, the mailbox frequently has the most "Sessions" open - frequently being above 50. Again this is more than other mailboxes. W3wp.exe seems to be the process open all of the time, however other mailboxes have this process open as well.

    The user has a standard Outlook set up, with no extra add-ins, connectors, or mailboxes within his profile.

    Are there any suggestions for further troubleshooting tactics from anyone?

    I'm going to try and gather some ActiveSync usage stats to see if there is any useful leads from that perspective.....

    Regards

    Matt


    Matt

  • 2012年5月8日 下午 01:24
     
     

    Certainly see what devices they have partnered with the mailbox and look at the firmware versions of the devices too please.

    Also are you getting any warnings/errors logged into the app log for this user?

    What is in the RPC Client Access Logs for this person?  Look in the install folder and check this on all CAS unless you are hardcoding them to go to a specific CAS.


    Cheers, Rhoderick NOTICE: My posts are provided “AS IS” without warranty of any kind, either expressed or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.

  • 2012年5月8日 下午 07:44
     
     

    Thanks for the reply Rhoderick

    The most recent RPC Client Access logs for all four CAS servers show nothing untoward as far as I can see regarding the user in question.

    The nearest thing to an error I could find was an entry regarding the public folder home server being different to the server on which the users mailbox resides.

    I've looked at the ActiveSync usage logs using LogParser, and again, nothing untoward as far as I can tell. The user is in the top three users with a large number of "Hits", but other than that i can't see any overly high stats compared to other users.

    I got one of our service desk engineers to run a virus scan on his laptop earlier, and whilst the scan was running the user's mailbox stats in ExMon were normal - which was strange in itself, and for awhile after the scan had finished. The scanned picked up nothing in the way of viruses, and Outlook was open throughout the scan. The stats ramped up again at around 16:30 and have stayed the same more or less ever since. Not sure if there is any correlation between these events, but thought it was worth mentioning...

    regards


    Matt

  • 2012年5月9日 下午 02:43
     
     

    Can you if possible, disable Active-Sync for this user and monitor.

    Also is it possible for this user to use OWA for a while and/or OLK in safe mode?


    Sukh

    • 已標示為解答 Matt_Pollock 2012年5月10日 上午 08:54
    • 已取消標示為解答 Matt_Pollock 2012年5月10日 上午 08:54
    •  
  • 2012年5月10日 上午 08:54
     
     已答覆

    !!!PROBLEM SOLVED!!!...

    ....Having watched the stats in ExMon for this user' mailbox for over a week, I was fairly sure there was no correlation between his mailbox being open in Outlook on his laptop and the high I/O being generated on the server. With this in mind I went back to the thought that this must be an issue with his mobile device, and ActiveSync.

    I re-ran an ActiveSync top 20 users report, using the Log Parser Studio tool, to try an see if there were any anomalies in the results.

    After a while I found that the users mailbox had logged 40 entries in the "IIS_507" column. No other users had any entries in this column. I figured further investigation would be needed to determine what these entries meant. A quick search on the internet found the following descriptions of the status codes, including the 507 code:

    http://blogs.technet.com/b/exchange/archive/2012/01/31/a-script-to-troubleshoot-issues-with-exchange-activesync.aspx

    With this in mind I was pretty certain the issue was with the mobile device. 

    I managed to get hold of the users Nokia E5-00 phone, which uses "Mail for Exchange" to access his corporate mailbox.
    I switched the device off, and immediately his Exmon stats fell to a normal level. I then rebooted the device, removed his mailbox from MFE and set up the profile again. Once the initial sync had taken place, his Exmon stats hit the previous high levels noted before.

    It's worth noting at this point, that although the users mailbox is a "reasonable" 4GB in size, most of that data is sat in the "Inbox" folder, and he has over 1000 unread emails in there. As this is the default folder for synchronisation on most ActiveSync / MFE devices, I figure this to be the issue.

    My assumption at this point is that there is just too much data in the inbox folder for the mailbox server and the phone to keep synchronised in an "always on" scenario - the default sync setting on a MFE device.

    I dropped the sync schedule to "every 30 mins" to ease the situation, and so far his Exmon stats have been normal, and the space gradually reclaimed on the lagged database copy server. I've advised the user he do some housekeeping in his Inbox to try and mitigate this issue from occurring again.

    I'll monitor the situation from here on in.........

    Thanks to everyone who's replied and helped with the troubleshooting process.

    Matt


    Matt

    • 已標示為解答 Matt_Pollock 2012年5月10日 上午 08:55
    •  
  • 2012年5月10日 上午 09:39
     
     
    Good to know it got resolved, after everyone's input here, it seem like the issue got resolved, 3rd part device eh:-)

    Sukh