Higher than normal rate of change on a LUN that contains Exchange and SQL VMs

  • Question

  • Hi all,

    Not sure that this is the right forum for the question, but Exchange 2007 is a possible culprit, so here goes.

    For about four years now, we've had a virtualized server environment that includes the same two VMs, both living on a VMFS partition on the same LUN, which resides on our EqualLogic SAN.  One VM is Windows Server 2008 SP2 running Exchange Server 2007 SP3 with rollup 3-v2; the other is Windows Server 2003 SP2 running SQL 2000.  The EqualLogic SAN replicates at a block level to another EqualLogic SAN at our DR site.  The link is low bandwidth, so we are very much in touch with the size of the replication jobs.

    For the first 2+ years, the jobs were of a manageable size.  We would complete replication cycles as quickly as every 30 minutes, with larger jobs taking as much as 2-3 hours.  The replication jobs were anywhere from a few hundred MB to maybe 5-8 GB. Now, this particular LUN is taking much longer than it once did, while other LUNs with different VMs are not.  Job sizes are anywhere from 5-65 GB.  Replication windows can take as long as 24 hours.  In extreme situations, we have to replicate manually (to physical media which is then driven to the DR site) in order to catch up.

    We can increase the amount of bandwidth used for replication by maybe 30-40%, and also upgrade the WAN optimizers (Riverbed) used on the connection, but I question whether this will be adequate.  Also, I would like to understand why this change has occurred.  We have fewer users than we once did.  Workload has not increased.  There have been few changes to the servers involved aside from service packs and hotfixes. 
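    As a rough sanity check on the bandwidth question, the arithmetic can be sketched in a few lines of Python. This is only an illustration under loud assumptions: it pairs the largest job size mentioned above (65 GB) with the longest window (24 hours), which may not have occurred together; it uses decimal gigabytes; and it ignores any data reduction from the WAN optimizers. The function names are made up for this example.

```python
def implied_mbps(job_gb: float, hours: float) -> float:
    """Sustained throughput in Mbit/s needed to move job_gb (decimal GB) in `hours`."""
    megabits = job_gb * 8 * 1000          # 1 GB = 8000 Mbit (decimal units)
    return megabits / (hours * 3600)


def replication_hours(job_gb: float, link_mbps: float) -> float:
    """Hours needed to move job_gb (decimal GB) over a link sustaining link_mbps."""
    megabits = job_gb * 8 * 1000
    return megabits / link_mbps / 3600


if __name__ == "__main__":
    # Assumption: a 65 GB job took the full 24-hour window.
    baseline = implied_mbps(65, 24)
    print(f"Implied effective throughput: {baseline:.2f} Mbit/s")

    # Effect of the 30-40% bandwidth increase on that same job.
    for uplift in (0.30, 0.40):
        hours = replication_hours(65, baseline * (1 + uplift))
        print(f"+{uplift:.0%} bandwidth: {hours:.1f} hours")
```

    Under those assumptions, a 30-40% increase would shorten a 24-hour window only to roughly 17 to 18.5 hours, which is why reducing the churn itself looks like the more promising path.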

    Whatever changes have been made have been charted against the trend in replication job size, and there doesn't appear to be a correlation there.  Really the only change has been the introduction of Exchange journaling, but this ran for months without any increase in the size of replication jobs.

    We've examined some possible Exchange related culprits - archiving of old email, Exchange maintenance windows, etc., but no changes or upticks in volume there.  We even went so far as to suspend archiving and extend the Exchange maintenance frequency to a longer period, but it had no effect.  Antivirus (Forefront Protection 2010 for Exchange) is only doing transport level and "realtime" mailbox access scanning and not doing passes against the entire store.

    At times, the environment has been shut down in anticipation of extreme weather.  After these shutdowns, the replication job sizes will frequently decrease dramatically for several days.  Given that the primary site storage is also shut down in these instances, I don't think this temporary relief can be attributed to the replication "catching up".

    The SQL server is very low use, with a small database that is used by our accounting department.

    Can anyone suggest another area on which to focus attention, or possibly some tools to identify where all the churn is coming from?  Given that replication is at a block level, Dell / EqualLogic cannot tell us.

    Thanks!

    Tuesday, November 27, 2012 4:02 PM

Answers

  • Hi,

    You can use the performance monitor to capture the log for analyzing:

    Monitoring Server Performance

    http://technet.microsoft.com/en-us/library/bb266984(v=exchg.80).aspx

    Monitoring Network Performance

    http://technet.microsoft.com/en-us/library/bb232196(v=exchg.80).aspx

    Performance Counters

    http://technet.microsoft.com/en-us/library/aa996329(v=exchg.80).aspx

    Thanks,


    Simon Wu
    TechNet Community Support

    Wednesday, November 28, 2012 7:53 AM
  • Thanks for the information, Simon.  As it turns out, most of the tools that really focused my attention in the right place are a bit beyond the scope of this forum.  While the links you provided gave me some validation that the Exchange server was the "chattiest" of the two servers living on this particular LUN, I was able to run some scripts on the VMware end of things to get a handle on changed blocks over time, with a fair degree of specificity:

    http://www.vmguru.com/articles/powershell/23-cbt-tracker-powershell-script-now-with-more-zombie

    While I did not immediately see any performance counters that would capture information at a process level, I did do a little unscientific "watching" in Reliability and Performance Monitor.  I looked in the Disk area, sorted by Write (B/min), and saw that the file pagefile.sys was being pushed to the top of the list more than once in a while.  This got me thinking.  Ultimately, I moved a lot of unnecessary paging, both at a Windows level and at a virtual infrastructure level, off of the replicated storage, establishing mappings to workable equivalent resources at the remote site instead: 

    http://vmguy.com/wordpress/index.php/archives/1525
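    For anyone who wants something a bit less unscientific than watching the list, the same "sort by Write (B/min)" idea can be sketched in Python. This is a minimal illustration, not the method used here: it assumes the per-file write rates have already been noted down by hand or exported somehow, and the function name, file paths, and numbers below are all hypothetical.

```python
from collections import defaultdict


def rank_writers(samples):
    """Total the sampled write bytes per file and return (path, total) pairs,
    largest writer first.

    `samples` is an iterable of (path, bytes_written) tuples, one per
    observation interval. The format is an assumption made for this sketch.
    """
    totals = defaultdict(int)
    for path, bytes_written in samples:
        totals[path] += bytes_written
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical observations, one tuple per file per one-minute sample.
    observed = [
        (r"C:\pagefile.sys", 900_000_000),
        (r"E:\hypothetical\mailbox.edb", 400_000_000),
        (r"C:\pagefile.sys", 750_000_000),
        (r"E:\hypothetical\logs\current.log", 120_000_000),
    ]
    for path, total in rank_writers(observed):
        print(f"{total / 1_000_000:10.1f} MB  {path}")
```

    Summing over several intervals rather than eyeballing a single one makes it easier to tell a file that is occasionally at the top from one that accounts for most of the writes.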

    I also got some virtual disks associated with System State backup processing (which has been very large since Windows Server 2008) off the replicated storage as well. 

    It's only been a few days, but things have been much, much better since then.  I'm going to give it a couple of weeks, as I have seen temporary improvements before.  After that, I may call this fixed.

    • Marked as answer by sgravel Thursday, December 6, 2012 2:23 PM
    Thursday, December 6, 2012 2:23 PM
