Guest file server cluster constant crashes

  • Question

  • Hi

    I have set up a guest file server cluster with Windows Server 2019. The cluster crashes constantly: it becomes very slow and finally crashes all my hypervisor servers....

    Hypervisor infrastructure:

    • 3 hosts running Windows Server 2019 LTSC Datacenter
    • 10 Gb iSCSI storage with 11 LUNs
    • cluster validation passes all tests

    Guest file server cluster: 2 VMs with the same configuration:

    • Generation 2 VM with Windows Server 2019 LTSC
    • 4 virtual CPUs
    • 8GB of non-dynamic RAM
    • 1 SCSI controller
    • primary hard drive: VHDX format, SCSI Controller, ID 0
    • empty DVD drive on SCSI controller, ID 1
    • 10 VHDS disks on SCSI controller, ID 2 to 11, same ID on each node
    • 1 network card on a virtual switch connected to 4 teamed physical network cards.
    • Cluster validation passes all tests except the network test, which reports a single point of failure due to the lack of redundancy.


    After some time, the cluster becomes very slow, crashes, and brings down all my hypervisors. The only errors returned by Hyper-V say that some LUNs became unavailable due to a timeout, with this message:

    Cluster Shared Volume 'VSATA-04' ('VSATA-04') has entered a paused state because of 'STATUS_IO_TIMEOUT(c00000b5)'. All I/O will temporarily be queued until a path to the volume is reestablished.

    I have checked every single parameter of the VM and Hyper-V configuration and searched on every hint the logs gave me, but found nothing, and the crashes remain....

    And sorry for my poor language; English is not my strongest skill.
    Thursday, July 25, 2019 9:21 AM

All replies

  • Hiya,

    A few things to monitor on your hosts:

    1: Disk latency. Basically, your disk response times shouldn't be higher than 15 ms, with not too many spikes.

    2: Without knowing your disk setup, it is often the one resource that gets drained fastest, because virtual environments often focus on capacity and not IOPS :)

    So it could plain and simple be disk IOPS exhaustion. And afterwards, when you troubleshoot, everything looks fine and dandy (because the reboot flushes everything in that regard).

    Then if you find out your response times are too high, it's usually one of two things: a design that doesn't prioritize IOPS, OR a scheduled batch job killing the disks :)
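
    If it helps, here is a minimal PowerShell sketch for sampling host disk latency from the PhysicalDisk counters; the 15 ms threshold is just the rule of thumb above, not an official limit:

    # Sample read/write latency every 5 seconds, 12 times, and flag anything over 15 ms
    $counters = '\PhysicalDisk(*)\Avg. Disk sec/Read',
                '\PhysicalDisk(*)\Avg. Disk sec/Write'
    Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 12 | ForEach-Object {
        $_.CounterSamples | Where-Object { $_.CookedValue -gt 0.015 } |
            Select-Object Path, @{ n = 'LatencyMs'; e = { [math]::Round($_.CookedValue * 1000, 1) } }
    }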

    Friday, July 26, 2019 6:25 AM
  • Hi,

    Thanks for your question.

    1) Please check the state of the CSV resources and see which resource under the CSV is offline. Right-click it and show critical events to diagnose the incident (see the PowerShell sketch after step 5).

    2) Meanwhile, please continue to collect the System logs from Event Viewer on both nodes.

    3) Please also check the network connectivity between the nodes and the cluster shared storage. We can refer to this blog (https://techcommunity.microsoft.com/t5/Failover-Clustering/Troubleshooting-Cluster-Shared-Volume-Auto-Pauses-8211-Event/ba-p/371994): one of the common auto-pause reasons is STATUS_IO_TIMEOUT, caused by intra-cluster communication over the network. This happens when the SMB client observes that an IO is taking over 1-4 minutes (depending on the IO type). If the IO times out, the SMB client attempts to fail the IO over to another channel in a multichannel configuration, or, if all channels are exhausted, it fails the IO back to the caller.

    So we can focus on the SMB events on the nodes and the storage server. Please check for any error messages in the SMBClient and SMBServer event logs, as below.

    Applications and services logs > Microsoft > Windows > SMBclient

    Applications and services logs > Microsoft > Windows > SMBServer

    4) Regarding the shared storage issue, make sure the problematic CSV can come back online. If it cannot, remove it from CSV so that it becomes available storage and put it into maintenance mode, then switch it to the storage server that owns the disk.

    On the storage server, please check, in the Disk Management console, the disk that backs the virtual disk the cluster uses.

    5) Meanwhile, we suggest patching your hypervisor servers with the latest updates.
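
    To make steps 1 and 3 concrete, here is a minimal PowerShell sketch run on a cluster node, assuming the standard FailoverClusters cmdlets and the default SMBClient/SMBServer event channel names:

    # Step 1: CSV state as seen from every node (look for Paused / Redirected states)
    Get-ClusterSharedVolumeState |
        Format-Table Name, Node, StateInfo, VolumeFriendlyName -AutoSize

    # Step 3: recent error events from the SMB client/server channels
    $logs = 'Microsoft-Windows-SMBClient/Connectivity',
            'Microsoft-Windows-SMBClient/Operational',
            'Microsoft-Windows-SMBServer/Operational'
    Get-WinEvent -FilterHashtable @{ LogName = $logs; Level = 2 } -MaxEvents 50 |
        Format-Table TimeCreated, LogName, Id, Message -AutoSize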

    Hope the above information helps you.

    Highly appreciate your effort and time. If you have any question or concern, please feel free to let me know.

    Best regards,

    Michael


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com

    Friday, July 26, 2019 6:40 AM
    Moderator
  • Hi

    Thanks to both of you for your tips. I will check it on Monday; the weekend is off at my school.

    I have managed to stabilize it by deleting all the VMs (configuration only) and recreating them. I've been running both guest cluster nodes all day on a standby Hyper-V node, without including them in the Hyper-V cluster. No issue all day, on either the Hyper-V or the guest cluster. A new mystery. I'm touching wood (a French expression for knocking on wood).

    For the IOPS question, the VMs and VHD Sets are stored on a Dell PS4210 with 24 SATA 10k, 900 GB hard drives, over 10 Gb/s iSCSI on a pair of stacked Dell N4230 switches. All the other LUNs on the PS4210 are under 25 ms of read+write latency according to my Veeam ONE monitoring; I will check it on Monday.

    I also have another clue: a "new" security product installed on the guest cluster which can disturb normal operations, Trend Micro Deep Security.

    Friday, July 26, 2019 8:22 PM
  • I also have another clue: a "new" security product installed on the guest cluster which can disturb normal operations, Trend Micro Deep Security.

    Kill it, kill it with fire!!!!!

    On the less stupid side of things: it wouldn't be the first time I've seen issues with products like that, especially if it's scanning from the host / storage side of things.

    Monday, July 29, 2019 6:19 AM
  • so!!

    I've checked IOPS: no issues there, the metrics are normal compared to last month and last week.

    But I've found something in the Hyper-V iSCSI initiator logs and in the storage logs.

    First, the iSCSI logs:

    PINFRA2 7 Error iScsiPrt System 29/07/2019 18:23:17

    The initiator could not send an iSCSI PDU. Error status is given in the dump data.

    PINFRA2 20 Error iScsiPrt System 29/07/2019 18:23:27

    Connection to the target was lost. The initiator will attempt to retry the connection.

    along with errors 27, 39, 48 and 63.
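
    For what it's worth, a minimal Get-WinEvent sketch to pull all of those iScsiPrt events from a host's System log (assuming 'iScsiPrt' is the provider name exactly as shown in the events):

    Get-WinEvent -FilterHashtable @{
        LogName      = 'System'
        ProviderName = 'iScsiPrt'
        Id           = 7, 20, 27, 39, 48, 63
        StartTime    = (Get-Date).AddDays(-1)
    } | Format-Table TimeCreated, Id, Message -AutoSize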

    On the storage side, I have this:

    Severity  Date and Time        Member  ID                        Message                                                                                                                                                                                                                                                                                     
    --------  -------------------  ------  ------------------------  ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
     Info     29/07/2019 18:37:26  BEQLA   7.2.15 | 7.2.24 | 7.2.26  iSCSI session to target '10.1.3.23:3260, iqn.2001-05.com.equallogic:4-42a846-f41398a2d-3d3ccc75fd55c62e-vsata-03' from initiator '10.1.3.3:59869, iqn.1991-05.com.microsoft:pinfra2.infraserv.local' was closed. | iSCSI initiator connection failure. | Reset received on the connection.  

    And in the same second:

    Severity  Date and Time        Member  ID      Message                                                                                                                                                                                                                                                                                                    
    --------  -------------------  ------  ------  ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
     Info     29/07/2019 18:37:26  BEQLA   7.2.47  iSCSI login to target '10.1.3.24:3260, iqn.2001-05.com.equallogic:4-42a846-f41398a2d-3d3ccc75fd55c62e-vsata-03' from initiator '10.1.3.3:60017, iqn.1991-05.com.microsoft:pinfra2.infraserv.local' successful using standard-sized frames.  NOTE: More than one initiator is now logged in to the target.  

    So it seems it's an iSCSI storage connectivity issue....

    I have shut down the guest cluster in order to see whether the issues are caused by it or whether they come from a deeper source...
    Will see tomorrow morning....

    Monday, July 29, 2019 4:44 PM
  • Hi,

    How are things going on?

    Best regards,

    Michael


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com

    Thursday, August 1, 2019 8:48 AM
    Moderator
  • Hi

    Very busy these last few days, sorry.

    The crash seems to appear only when the guest cluster VMs are brought into the Hyper-V cluster. When they run outside of it, no crash occurs....

    So it seems my Hyper-V iSCSI configuration has a big issue, but I can't begin to see it.

    Also, I've not found any interesting clues about the iSCSI issue mentioned above.

    Also, I have a ton of errors in the SMBClient/Security log, like this:

    Error 01/08/2019 13:22:15 SMBClient 31013 None

    The signature validation failed.

    Error: The remote user session has been deleted.

    Server name: \fe80::7c84:7274:250a:f078%23
    Session ID: 0x5C0748000089
    Tree ID: 0x0
    Message ID: 0x244AB0
    Command: Session setup

    Guidance:
    This error indicates that SMB messages are being modified in transit on the network from the server to the client. This can be caused by the session ending on the server, a network issue, a problem with a third-party SMB server, or a man-in-the-middle compromise attempt.

    Packet fragment: 0

    Or this:

    Error 01/08/2019 13:27:00 SMBClient 31013 None

    The signature validation failed.

    Error: {Still Busy}
    The specified I/O request packet (IRP) cannot be disposed of because the I/O operation is not complete.

    Server name: \fe80::7c84:7274:250a:f078%12
    Session ID: 0x5C0004000039
    Tree ID: 0x0
    Message ID: 0x22CE1
    Command: Session setup

    Guidance:
    This error indicates that SMB messages are being modified in transit on the network from the server to the client. This can be caused by the session ending on the server, a network issue, a problem with a third-party SMB server, or a man-in-the-middle compromise attempt.

    Packet fragment: 0
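
    Since those 31013 events are signature validation failures, it might be worth comparing the SMB signing settings on both guest nodes and pulling the recent events themselves; a minimal sketch with the built-in SMB cmdlets (the channel name and event ID are taken from the log above):

    # SMB signing settings on this node (client and server side)
    Get-SmbClientConfiguration | Select-Object RequireSecuritySignature, EnableSecuritySignature
    Get-SmbServerConfiguration | Select-Object RequireSecuritySignature, EnableSecuritySignature

    # Recent signature validation failures from the SMBClient/Security channel
    Get-WinEvent -FilterHashtable @{
        LogName = 'Microsoft-Windows-SMBClient/Security'
        Id      = 31013
    } -MaxEvents 20 | Format-Table TimeCreated, Message -AutoSize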

    Thursday, August 1, 2019 11:43 AM
  • Hi,

    Thanks for your detailed information.

    For now, we can focus on the iSCSI connectivity and the network connection. Please check the connectivity, the firewall, and the anti-virus between the nodes and the clustered storage.
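
    As a quick first check from each node, a minimal sketch testing TCP reachability of the iSCSI portals on port 3260 (the addresses below are simply the two portals that appear in the storage log earlier in this thread):

    '10.1.3.23', '10.1.3.24' | ForEach-Object {
        Test-NetConnection -ComputerName $_ -Port 3260 |
            Select-Object ComputerName, RemotePort, TcpTestSucceeded, PingSucceeded
    }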

    Highly appreciate your effort and time. If you have any question or concern, please feel free to let me know.

    Best regards,

    Michael


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com

    Friday, August 2, 2019 7:24 AM
    Moderator
  • Hi

    The firewall is currently disabled on all nodes, both the Windows firewall and the Trend Micro one.
    The antivirus is on, but all scans are disabled for C:\ClusterStorage.

    Concerning the network, I have 4 different ones:

    • 1 for Hyper-V management: dedicated NIC, no VLAN tag in Windows, but the VLAN is untagged on the switch
    • 1 for live migration: dedicated NIC, no VLAN tag in Windows, but the VLAN is untagged on the switch
    • 2 for the iSCSI network: dedicated NICs, no VLAN tag in Windows, but the VLAN is untagged on the switch
    • 4 teamed NICs for the vSwitch, dedicated to VM use

    The management VLAN is on Netgear equipment.

    iSCSI and live migration are on a Dell 4032 stack, on 2 different VLANs. iSCSI works fine, everyone pings everyone, and the same goes for live migration. Both of them are totally isolated from the rest of the network, and the stack is managed via an OOB NIC. The iSCSI LAN and the Live Migration LAN are on completely different subnets, one 10.1.3.0/27 and one 172.16.0.0/27, while our core network follows a 192.168.X.X scheme. The iSCSI and live migration networks have no gateways and do not leave the Dell stack. But every 10 minutes, without explanation, some packets destined for these 2 networks go out via the Hyper-V management network.
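
    One thing that could explain traffic for those subnets leaving through the management NIC is the cluster network role assignment. A minimal sketch to check (and, if needed, change) the roles, assuming the standard FailoverClusters cmdlets; the network names 'iSCSI1' / 'iSCSI2' are placeholders for whatever the first command shows:

    # Role 0 = not used by the cluster, 1 = cluster communication only, 3 = cluster and client
    Get-ClusterNetwork | Format-Table Name, Role, Address, AddressMask -AutoSize

    # Hypothetical example: exclude the iSCSI networks from cluster communication
    (Get-ClusterNetwork -Name 'iSCSI1').Role = 0
    (Get-ClusterNetwork -Name 'iSCSI2').Role = 0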

    On my second site, I have another guest cluster configured similarly to this one, but with SAS direct-attached storage for the Hyper-V hosts. No problem or lag on that one.

    A little clarification: I'm out of the office for 4 weeks, back on September 2nd, so I can't try anything before that date. I have shut down the cluster and gone back to the old configuration for now, because we can't start the new school year like that. I will continue working on it once I'm back at the office.

    Sunday, August 4, 2019 7:37 PM
  • Hi all

    Even during the holidays, I'm thinking ^^'

    I've seen this in the Server 2019 documentation on the Microsoft website: Using Storage Spaces Direct in a virtual machine | Microsoft Docs.
    It says we should increase the I/O timeout delay in order to cope with storage latency.

    Could this be the beginning of an answer? Have you already tried this solution?
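
    For reference, that article's recommendation boils down to raising the in-guest I/O timeout through the registry. A minimal sketch, assuming the spaceport HwTimeout key and the 30-second value given in that Storage Spaces Direct guidance (please verify against the article, and note that it targets S2D guests rather than a VHD Set guest cluster like this one); the VM needs a reboot afterwards:

    # Raise the in-guest storage I/O timeout to 30 seconds (0x7530 milliseconds)
    Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\spaceport\Parameters' `
                     -Name 'HwTimeout' -Value 0x00007530 -Type DWord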

    Wednesday, August 21, 2019 1:51 PM