S2D Downtime when Controlling Node fails?

  • Question

  • Hi All,

    I am continuing my testing of S2D.  So far a disk failure behaves as I would expect - a slight drop in performance, but no noticeable downtime for active data transfers.  Bringing down one node of the cluster is much the same, unless that node is the owner of the clustered disk, in which case all IO to that disk seems to stop for quite some time.  This causes active workloads to fail and need to be restarted.  Once the disk is assigned to a new node (this happens automatically) workloads can be restarted and seem to work fine, but some data loss of active transfers seems to occur.
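
    For reference, I am checking disk/CSV ownership and repair activity with something like the following (standard FailoverClusters and Storage module cmdlets; object names are just from my test setup):

        # Which node currently coordinates the clustered disk / CSV
        Get-ClusterSharedVolume | Select-Object Name, OwnerNode, State

        # Health of the mirrored space and any rebuild jobs running after a failure
        Get-VirtualDisk | Select-Object FriendlyName, HealthStatus, OperationalStatus
        Get-StorageJob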

    Is downtime expected if the owner node fails?  Is there any way to prevent this?

    I would obviously like near-100% uptime through almost any type of failure...

    EDIT: For clarity, I am testing this with 3 nodes, each with 2 disks, in a single mirror config with enclosure awareness on.
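
    Roughly how the pool and mirror space were set up, in case it matters (a sketch only - friendly names and the size are placeholders, not my exact commands):

        # Pool over all poolable disks from the three nodes, with enclosure awareness as the default
        New-StoragePool -FriendlyName "S2DPool" -StorageSubSystemFriendlyName "*Cluster*" `
            -PhysicalDisks (Get-PhysicalDisk -CanPool $true) -EnclosureAwareDefault $true

        # Single (two-copy) mirror space across the 6 disks
        New-VirtualDisk -StoragePoolFriendlyName "S2DPool" -FriendlyName "CSV01" `
            -ResiliencySettingName Mirror -PhysicalDiskRedundancy 1 `
            -IsEnclosureAware $true -Size 500GB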


    • Edited by Thildemar Friday, May 13, 2016 5:21 PM
    Friday, May 13, 2016 5:21 PM

All replies

  • Hi Sir,

    >>Bringing down one node of the cluster is much the same unless that node is the owner of the clustered disk, in which case all IO to that disk seems to stop for quite some time.

    Yes, this is normal behavior.

    "
    So how does the CSV handle failures?
    Suppose I/O to the CSV fails on one of the nodes; then the following steps take place.

    1. All volumes across the nodes related to this CSV are signaled to start draining.
    2. All volumes go into a paused state.
    3. The disk goes offline.
    4. An attempt is made to bring the disk online.
    5. Registration and reservation are established to take control of the disk.
    6. The volume is mounted and the instance is attached to the CSV filter driver.
    7. CSVFS allows about 120 seconds for the volume transition before timing out.
    "

    https://www.linkedin.com/pulse/cluster-shared-volume-csv-deep-dive-himanshu-sonkar
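
    If you want to watch this transition from each node's point of view while it happens, something along these lines may help (Windows Server 2016 FailoverClusters module; the destination path is a placeholder):

        # Per-node view of the CSV (Direct vs Redirected vs Paused access)
        Get-ClusterSharedVolumeState -Node (Get-ClusterNode).Name |
            Select-Object Name, Node, StateInfo, VolumeFriendlyName

        # Dump the cluster log afterwards to see the pause/offline/online sequence and its timing
        Get-ClusterLog -Destination C:\Temp -UseLocalTime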

    >> but some data loss of active transfers seems to occur.

    Based on my understanding, any data in the cache that has not yet been written to disk will be lost.
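
    Just as an illustration (not specific to your workload), an application that cannot tolerate losing acknowledged writes can open its file handles with write-through, so Windows pushes each write past its write cache to stable storage. For example, from PowerShell via .NET (the path and sizes are placeholders):

        # Hypothetical test file on the CSV
        $fs = [System.IO.FileStream]::new(
            'C:\ClusterStorage\Volume1\test.dat',
            [System.IO.FileMode]::Create,
            [System.IO.FileAccess]::Write,
            [System.IO.FileShare]::None,
            64KB,
            [System.IO.FileOptions]::WriteThrough)   # ask Windows not to leave these writes in its cache
        $data = [byte[]]::new(64KB)
        $fs.Write($data, 0, $data.Length)
        $fs.Dispose()

    This of course trades throughput for durability of in-flight data.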

    Best Regards,

    Elton


    Please remember to mark the replies as answers if they help and unmark them if they provide no help. If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com.

    Monday, May 16, 2016 8:22 AM
  • Hi Claus,

    I saw downtime of 60-120 seconds, similar to what Elton mentioned, when the node that had ownership of the shared virtual disk failed.  So:

    3 hosts in the cluster, each with two physical disks added to the storage pool
    1 virtual disk in a single mirror across the 6 physical disks (enclosure awareness on)
    Each host has one VM running on the cluster shared volume
    Node 1 is the owner of the virtual disk

    If Node 2 or 3 fails, its VM moves and starts back up as expected
    If Node 1 fails, all nodes pause access to the CSV and their VMs freeze or crash; a minute or two later one of the remaining nodes takes over ownership of the virtual disk and things start back up

    Am I missing something in the config?
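
    For what it is worth, for a planned outage the pause can apparently be avoided by moving CSV ownership and draining the node first, e.g. (resource and node names are placeholders):

        # Hand the coordinator role to another node before touching Node 1
        Move-ClusterSharedVolume -Name "Cluster Virtual Disk (CSV01)" -Node Node2

        # Drain roles off Node 1, do maintenance, then bring it back
        Suspend-ClusterNode -Name Node1 -Drain -Wait
        Resume-ClusterNode -Name Node1 -Failback Immediate

    My question is specifically about the unplanned case where Node 1 fails hard.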

    Monday, May 16, 2016 7:24 PM