Performance degradation after adding Storage Spaces pool into Failover Cluster

  • General discussion

  • Good day!

    We have run into performance degradation after adding a Storage Spaces pool to a Failover Cluster. Outside the Failover Cluster, the same pool works just fine.

    We have two servers running Windows Server 2012 R2 Standard with a JBOD connected by SAS. On one of them we created a storage pool of 72 SAS drives (12 SAS SSDs of 800 GB and 60 SAS HDDs of 1.2 TB). The pool contains 4 Virtual Disks (Spaces) with the same configuration: 2-way mirror with tiering, 1 GB write-back cache, 4 columns, 64 KB interleave. The pool also contains a quorum Virtual Disk (witness disk) configured as a 3-way mirror with neither tiering nor write-back cache, 4 columns, 64 KB interleave.
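
    A tiered space with this layout could be created roughly as follows. This is a sketch, not the exact commands we ran: the pool name "Pool1", the tier names and the disk name "VD1" are placeholders, and the tier sizes are the approximate 1 TB + 8 TB figures given later in this post.

    ```powershell
    # Placeholder names, not the ones used in our environment.
    $poolName = "Pool1"

    # One tier definition per media type in the pool.
    $ssdTier = New-StorageTier -StoragePoolFriendlyName $poolName -FriendlyName "SSDTier" -MediaType SSD
    $hddTier = New-StorageTier -StoragePoolFriendlyName $poolName -FriendlyName "HDDTier" -MediaType HDD

    # 2-way mirror, 4 columns, 64 KB interleave, 1 GB write-back cache.
    New-VirtualDisk -StoragePoolFriendlyName $poolName -FriendlyName "VD1" `
        -StorageTiers $ssdTier, $hddTier -StorageTierSizes 1TB, 8TB `
        -ResiliencySettingName Mirror -NumberOfDataCopies 2 `
        -NumberOfColumns 4 -Interleave 64KB -WriteCacheSize 1GB
    ```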

    We tested the performance of both Virtual Disk tiers (SSDTier and HDDTier) with Iometer. The results were as good as expected: high IOPS and low latencies. To target each tier in the tests, we pinned the test files with Set-FileStorageTier and then ran Optimize-Volume -TierOptimize.
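
    The pinning step looks roughly like this. The disk name, drive letter and test file path are placeholders for illustration only; the same pattern with the HDD tier selected targets the HDDTier instead.

    ```powershell
    # Placeholder disk name, drive letter and test file path.
    $ssdTier = Get-VirtualDisk -FriendlyName "VD1" | Get-StorageTier | Where-Object MediaType -eq "SSD"

    # Ask for the file to be placed on the SSD tier.
    Set-FileStorageTier -FilePath "E:\iobw.tst" -DesiredStorageTier $ssdTier

    # The file is only moved when tier optimization runs.
    Optimize-Volume -DriveLetter E -TierOptimize

    # Check the placement status of the pinned files on the volume.
    Get-FileStorageTier -VolumeDriveLetter E
    ```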

    We then created a Failover Cluster from the two servers. The Cluster Validation Tests reported no problems. We added all the Virtual Disks to the cluster and assigned the witness disk.
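
    For anyone who wants to reproduce these steps from PowerShell rather than Failover Cluster Manager, they might look like the sketch below. The node names, cluster name, IP address and witness resource name are all hypothetical.

    ```powershell
    # Placeholder node names, cluster name, IP address and disk resource name.
    $nodes = "Node1", "Node2"

    # Run the validation tests, then build the cluster.
    Test-Cluster -Node $nodes
    New-Cluster -Name "Cluster1" -Node $nodes -StaticAddress "192.0.2.10"

    # Add every disk that is visible to all nodes and not yet clustered.
    Get-ClusterAvailableDisk | Add-ClusterDisk

    # Use the small 3-way mirror space as the disk witness.
    Set-ClusterQuorum -DiskWitness "Cluster Virtual Disk (Quorum)"
    ```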

    With Failover Cluster Manager we added four “File Server for general use” roles (not Scale-Out File Server) and assigned a separate Virtual Disk to each file server.

    Running the same Iometer tests, we saw noticeable performance degradation on all Virtual Disks. Analysis of the results showed that the root cause is much higher HDDTier latency (2 to 5 times higher, starting at queue depth = 1) for both read and write operations.

    We decided to take the cluster apart and completely clear its Storage Spaces pool configuration. We then rebuilt the pool with Virtual Disks of the same configuration, and the new Iometer results were fine. Next we recreated the Failover Cluster and added the disks to it, this time without adding the File Server roles. Again the tests showed increased latencies (the same 2 to 5 times).

    We repeated the experiment several times with the same result: performance degraded right after the pool and Virtual Disks were added to the Failover Cluster. It became obvious that the cluster is the cause of the degradation.

    We ran a full hardware test with the PowerShell script ValidateStorageHardware.ps1 (https://gallery.technet.microsoft.com/scriptcenter/Storage-Spaces-Physical-7ca9f304), and it did not find any problems.

    We switched the testing tool to diskspd. Its results were slightly better than Iometer's, so we concluded that Iometer does not work correctly with clustered drives.
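
    A queue-depth sweep with diskspd can be scripted along these lines. The file path, block size, read/write mix and duration here are illustrative assumptions, not our actual test parameters, and older diskspd builds use -h where newer ones use -Sh.

    ```powershell
    # Placeholder path and test parameters; adjust to the workload being modelled.
    $target = "E:\diskspd-test.dat"

    foreach ($depth in 1, 2, 4, 8, 16, 32) {
        # -b8K block size, -r random I/O, -w30 30% writes, -d60 60 s run,
        # -t1 one thread, -o outstanding I/Os per thread,
        # -Sh no software or hardware write caching, -L latency statistics,
        # -c20G create a 20 GB test file if it does not exist
        & .\diskspd.exe -b8K -r -w30 -d60 -t1 "-o$depth" -Sh -L -c20G $target |
            Out-File -FilePath ".\diskspd-qd$depth.txt"
    }
    ```

    Running the same sweep before and after adding the disks to the cluster gives directly comparable latency figures per queue depth.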

    We decided to test the cluster under high load in the production environment. As soon as the SSDTier filled up and the HDDTier came into use, we began to receive complaints from our clients. Perfmon showed high latencies (20 ms and more) although the workload was not very high.
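
    The same latency counters can be sampled from PowerShell instead of the Perfmon GUI. This is a sketch: the sampling interval and count are arbitrary, and the counter names assume an English-language system.

    ```powershell
    # Sample read/write latency for every disk instance: 12 samples, 5 s apart.
    $counters = "\PhysicalDisk(*)\Avg. Disk sec/Read", "\PhysicalDisk(*)\Avg. Disk sec/Write"

    # Keep only samples at or above 20 ms (counter values are in seconds).
    Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 12 |
        ForEach-Object { $_.CounterSamples } |
        Where-Object { $_.CookedValue -ge 0.020 } |
        Select-Object Timestamp, Path, @{ Name = "LatencyMs"; Expression = { [math]::Round($_.CookedValue * 1000, 1) } }
    ```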

    In our production environment we have another Storage Spaces file server that is not part of a cluster. It is based on a 30-drive pool (10 SATA SSDs plus 20 SAS HDDs) and works just fine: HDDTier latencies never rise above 6-8 ms even though its workload is much higher than on the new one. The workload pattern is the same.

    Our new pool and Virtual Disks were created according to best-practice recommendations:

    -          drive count not more than 80 (we have 72)

    -          Virtual Disk capacity not more than 10 TB (each of our Virtual Disks is about 9 TB: 1 TB SSDTier + 8 TB HDDTier)

    The Virtual Disks were created with fast rebuild in mind (1 SSD and 2 HDDs were reserved).

    The write-back cache is 1 GB. The disks' own caching options are disabled.

     

    Can anybody help us with this cluster situation?

    Things we have already tried:

    -          checked the effect of the write-back cache size: no effect

    -          checked the SAS HDD MPIO policy (RR by default; we also tried LB and FOO): no effect

    -          checked the disks' own write-back policy settings (it is now turned off on all our disks): also no effect
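
    For reference, the MPIO policy can be inspected and switched roughly as follows. This sketch assumes the disks are claimed by the Microsoft DSM; the global default may only apply to devices claimed after the change, so the per-disk view should be checked afterwards.

    ```powershell
    # Show the current default policy of the Microsoft DSM.
    Get-MSDSMGlobalDefaultLoadBalancePolicy

    # Switch the default, e.g. to Fail Over Only (other values include RR and LB).
    Set-MSDSMGlobalDefaultLoadBalancePolicy -Policy FOO

    # Per-disk view of the policy actually in effect.
    mpclaim.exe -s -d
    ```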

    The cluster and the pool configurations were cleaned with Clear-SdsConfig.ps1 (https://gallery.technet.microsoft.com/scriptcenter/Completely-Clearing-an-ab745947).


    • Changed type iNosorev Monday, February 29, 2016 10:54 AM
    Monday, February 29, 2016 10:50 AM

All replies

  • Did you ever get your issues straightened out?

    Philip Elder Microsoft High Availability MVP Blog: http://blog.mpecsinc.ca Twitter: @MPECSInc

    Sunday, February 4, 2018 6:55 PM