Hyper-V Cluster - STATUS_CONNECTION_DISCONNECTED issue

    Question

  • Hi,

    I hope someone can help. I'm having a serious problem with one of my servers in a 3-server cluster. All the servers are identical Dell R710s connected to an Oracle/Sun 7110 iSCSI SAN.
    I have two volumes (CSVs) shared between the 3 nodes, with all my VMs stored on them. I have set up all 3 nodes identically (as best I can); however, on my latest server (which is 6 months newer) I keep getting a problem where it loses its connection to the SAN, which causes the cluster services to fail and migrate the VMs to a sister box.

    It's very strange, as it isn't consistent: it can take a week before it happens again, or just a couple of days.

    I receive a number of errors in the event log:

     

    #1 - Cluster Shared Volume 'Volume1' ('Cluster Disk 2') is no longer available on this node because of 'STATUS_CONNECTION_DISCONNECTED(c000020c)'. All I/O will temporarily be queued until a path to the volume is reestablished.

    #2 - Cluster Shared Volume 'Volume1' ('Cluster Disk 2') is no longer available on this node because of 'STATUS_WAIT_0(0)'. All I/O will temporarily be queued until a path to the volume is reestablished.

    #3 - Cluster Shared Volume 'Volume1' ('Cluster Disk 2') is no longer accessible from this cluster node because of error 'ERROR_TIMEOUT(1460)'. Please troubleshoot this node's connectivity to the storage device and network connectivity.
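
    When the same events recur over days or weeks, it can help to tally them by volume and status code to see whether one volume or one error dominates. The following is only a minimal sketch, assuming the event text has already been exported to plain strings in the wording of the three messages above; the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches the two CSV message wordings quoted above:
# "...is no longer available on this node because of 'NAME(code)'" and
# "...is no longer accessible from this cluster node because of error 'NAME(code)'".
CSV_EVENT = re.compile(
    r"Cluster Shared Volume '(?P<volume>[^']+)' \('(?P<disk>[^']+)'\) "
    r"is no longer (?:available on this node|accessible from this cluster node) "
    r"because of (?:error )?'(?P<status>[A-Z0-9_]+)\((?P<code>[0-9a-fA-F]+)\)'"
)


def parse_csv_event(message):
    """Return volume, disk, status and code from one message, or None."""
    match = CSV_EVENT.search(message)
    return match.groupdict() if match else None


def count_by_status(messages):
    """Count parsed events per (volume, status) pair, skipping other text."""
    counts = Counter()
    for message in messages:
        parsed = parse_csv_event(message)
        if parsed:
            counts[(parsed["volume"], parsed["status"])] += 1
    return counts


sample = (
    "Cluster Shared Volume 'Volume1' ('Cluster Disk 2') is no longer "
    "available on this node because of "
    "'STATUS_CONNECTION_DISCONNECTED(c000020c)'."
)
print(parse_csv_event(sample))
```

    Messages that do not match the pattern are ignored rather than raising an error, so a full event export can be passed in unfiltered.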

     

    This can happen on both volumes; I've just highlighted the latest errors I received. It is now happening even when no VMs are running on that server: I was just browsing the CSV to select a location for a new VM when everything halted, and I then noticed the errors on the cluster.

    I have run the cluster validation tool and everything is reported as ok.

    The iSCSI traffic has its own 2 dedicated ports (MPIO) to the SAN with only QoS and IPv4 enabled. Hyper-V has its own 2 ports for all VM traffic, and there is a single port for management and cluster network traffic with all the usual services enabled.

    All servers are running Windows Server 2008 R2, McAfee 8.8 (with all the needed exclusions for clusters and Windows), and DPM 2010 agents.

    Thank you for any help, as I'm at my wits' end trying to figure out what's going on.

    Rob Fuller
    Great Baddow High School


    Saturday, February 12, 2011 2:44 PM

Answers

  • Update: After rebuilding the server I have tracked the problem down to McAfee and some of its framework drivers. I guess that when I tried uninstalling McAfee before, it hadn't removed everything and left some network drivers in place. I'm now investigating an alternative AV solution for that server. It's interesting that my other Hyper-V cluster servers weren't affected by this.

     

    Many thanks for your suggestions and help.

    Rob

    • Marked as answer by Rob Fuller Monday, May 16, 2011 10:00 AM
    Monday, May 16, 2011 10:00 AM

All replies

  • Also, bit more information:

    I now have a situation where I can browse cluster Volume 2 but not Volume 1. iSCSI is all connected fine and I can connect to both SANs' web interfaces (using the same network adaptor) without issue. Is this a problem with communication between the nodes on the cluster network rather than a connection issue to the SAN directly?

    That server isn't the current owner of the disk, by the way.

    A reboot fixes everything until the next time...

     

     

    Update: Just a thought, could a faulty NIC be causing this?

     

    Thanks,
    Rob

     

    Saturday, February 12, 2011 2:53 PM
  • Yesterday I ran Windows Update on my 2 Hyper-V hosts (Datacenter). Everything was OK. 8 hours later my host1 was no longer able to access the cluster disks. They show as offline in Disk Management. iSCSI connects OK. I tried updating my network card driver, as Windows Update showed available drivers.

    Any clue?

     

    Thanks,

    Ghislain

    Tuesday, February 15, 2011 6:30 PM
  • That server is the most up-to-date with Windows patches.....hmm

    Do you know which patches you installed, Ghislain? Perhaps I can compare.

    Wednesday, February 16, 2011 3:42 PM
  • Another update: I've set up a test server running from a standard iSCSI LUN mounted only on that server, and that's working fine.
    Now I'm at the stage where as soon as any I/O for the CSV is created (i.e. I migrate a VM to that server or attempt to browse the CSV via Explorer), I get the errors posted above, to the extent that it will crash the iSCSI Initiator and I have to reboot to get access to anything related to disk management.

    The normally mounted LUN keeps working fine throughout all this, though I did notice, before the iSCSI Initiator completely crashed, that one of the MPIO paths was 'reconnecting'.

    I've also found some updated firmware for my Broadcom NICs, so I've applied that, but it hasn't helped thus far.

    Does anyone have any advice please?

    Friday, February 18, 2011 7:45 PM
  • Hi Rob,

    I'd be glad to help.  First, I would suggest running a full cluster validation report against all nodes selecting all tests.  We need to make sure that the cluster, and all nodes, and the hardware are solid and not the source of the problem.  One item of interest is to make sure that the driver and firmware of the network adapter in the new node is up to date.  And compare that driver to the one on the other nodes.  Let's see what we're dealing with here.

    Also, on the newer node we'll want to look in the event logs around the time of the last problem occurrence.  Look for any errors or warnings, either network or storage related.

    ________________________________________________________________________________

    Best Regards, Mike Briggs [MSFT] – Posting is provided "AS IS" with no warranties, and confers no rights

    Sunday, February 20, 2011 5:33 AM
    Moderator
  • Hi Mike,

    I've run a full cluster validation and all is reporting fine, nothing unusual to report there. I can post it if helpful.

    I've made sure the drivers and firmware are the latest from Dell. I have had no problems running a virtual machine on a plain iSCSI LUN for days now, but as soon as a VM in the CSV is running, that's when all the problems start, almost instantly now.

    The iSCSI NICs are Broadcom BCM5709C NetXtreme II GigE, TOE and iSCSI Offload enabled. The other servers are running driver version 5.0.15.0; the latest one, on this server, is 6.0.29.0. (Note: I did start with the same driver as the other servers, then upgraded to try to resolve the problems.)
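
    A quick way to spot a node whose driver differs from the rest is to compare the dotted version strings against the most common one. This is only a rough sketch: the node names are hypothetical placeholders, the two version numbers are the ones quoted above, and the versions are assumed to have been collected by hand from each server.

```python
from collections import Counter


def parse_version(text):
    """Turn a dotted version such as '5.0.15.0' into a comparable tuple."""
    return tuple(int(part) for part in text.strip().split("."))


def find_outliers(versions_by_node):
    """Return the nodes whose version differs from the most common one."""
    parsed = {node: parse_version(v) for node, v in versions_by_node.items()}
    majority, _ = Counter(parsed.values()).most_common(1)[0]
    return {
        node: versions_by_node[node]
        for node, version in parsed.items()
        if version != majority
    }


# Hypothetical node names; versions are the two mentioned in the post.
versions = {"node1": "5.0.15.0", "node2": "5.0.15.0", "node3": "6.0.29.0"}
print(find_outliers(versions))  # {'node3': '6.0.29.0'}
```

    A mismatch found this way is only a lead to follow up, not proof that the driver is at fault.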

    The errors I see in Event Viewer are the ones reported above; I can't see any other warnings prior to them.

     

    Thanks for your help,
    Rob

    Wednesday, February 23, 2011 3:45 PM
  • Hi Rob,

    If the cluster validation report came back clean then I don't need to see it.  Thanks for running it though.  I'm curious, did you have the new server connected to the storage prior to adding it to the cluster?  I wonder if the same errors occurred before as well.

    1) You could always remove the new node from the cluster, connect it to the same SAN but obviously different storage and monitor to see if the same issue occurs.  If so, then you know that you're dealing with a problem on that server, maybe with the initiator or the connectivity.

    2) You could remove the cluster from management (VMM) while leaving the new node in the cluster.  If the problem still occurred, it would rule out VMM as the source of the problem.

    ________________________________________________________________________________________________________

    Best Regards, Mike Briggs [MSFT] – Posting is provided "AS IS" with no warranties, and confers no rights

     

    Thursday, February 24, 2011 12:04 AM
    Moderator
  • 1) I've removed the node from the cluster and I am now running a single VM on a new lun connected to the same SAN storage, no problems!

    2) I also tried doing this as well, the fault still happens so I guess it isn’t VMM.

    Is there anything I can do to reset the cluster install on this server and try joining again?

     

    Thursday, February 24, 2011 1:50 PM
  • The thing is, I'm not so sure that the issue is actually related to the cluster or the instance of failover clustering.  To determine whether or not there are any issues with your cluster, on the main screen of the FCmgr click the link for "recent cluster events".  That is where to look for recent problems.  If there were a problem with the cluster, then the cluster validation would've caught it.

    It sounds like you want to evict the node from the cluster.  However, I would caution you to do so only as a last resort.  That is generally not a troubleshooting step that we recommend, except in the most dire circumstances.

    _____________________________________________________________________________________

    Best Regards, Mike Briggs [MSFT] – Posting is provided "AS IS" with no warranties, and confers no rights

    Saturday, February 26, 2011 5:47 AM
    Moderator

  • Hi Mike,

    The errors I reported originally are the same I'm seeing in the FCmgr events.

    I tried evicting the node, uninstalled clustering, then re-installed clustering and joined the node back. All looked OK. I then went to check the CSV by just browsing to the volume with Explorer to make sure all the files were there. I couldn't even open the first volume, and the problem started straight away with STATUS_CONNECTION_DISCONNECTED errors etc.

    Does this count as a dire circumstance?

    I want to report this as a hardware fault, but can it be one if a normal VM runs without any problems on a normal LUN on the same SAN...

     

    Monday, February 28, 2011 10:38 AM
  • Are you doing any iSCSI offloading on the network adapters?

    --
    Hope this helps...
     
    Kurt Roggen [BE] - MVP
    Blog: http://trycatch.be/blogs/roggenk
     

    Wednesday, March 02, 2011 6:40 AM
    Moderator
  • Yes, when I purchased the machine it came with the iSCSI offloading licence. However, looking at the output of the iscsicli sessionlist command, and according to Dell (http://support.euro.dell.com/support/edocs/network/BroadCom/R125875/en/iscsi.htm#wp392203), an iSCSI offloaded connection will display an Initiator Name entry beginning with "B06BDRV...", while a non-offloaded connection will display an entry beginning with "Root...".

    Mine all start with Root...

    Note: the other servers did not have the iSCSI offloading licences enabled when I purchased the box.
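
    With several sessions per node, the prefix check described above gets tedious to do by eye. The following is a minimal sketch, assuming the Initiator Name values have already been copied out of the iscsicli sessionlist output by hand; it does not parse the command output itself. The function names and the sample strings are placeholders for illustration, and matching the prefix case-insensitively is an assumption.

```python
from collections import Counter


def classify_initiator(initiator_name):
    """Classify one Initiator Name string by its prefix."""
    prefix = initiator_name.strip().upper()
    if prefix.startswith("B06BDRV"):
        return "offloaded"
    if prefix.startswith("ROOT"):
        return "not offloaded"
    return "unknown"


def summarize(initiator_names):
    """Count how many sessions fall into each category."""
    return Counter(classify_initiator(name) for name in initiator_names)


# Placeholder values only; paste the real Initiator Name entries here.
names = ["Root\\ISCSIPRT\\0000_0", "Root\\ISCSIPRT\\0000_0"]
print(summarize(names))
```

    If every session comes back as "not offloaded", as reported here, the sessions are running without iSCSI offload even though the licence is present.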

    Wednesday, March 02, 2011 11:04 AM
  • OK, I've tried an Intel adaptor I had in the server, which was connected and working fine; however, exactly the same error happens with this adaptor. I'm guessing this rules out a hardware fault.

    Is the only thing left to do to rebuild the server?

    Thursday, March 03, 2011 9:34 AM
  • Before rebuilding your server, could you disable all offloading technologies on the NIC, revert jumbo frames to the default (if configured), and report back?
    Many thanks!

    --
    Hope this helps...
     
    Kurt Roggen [BE] - MVP
    Blog: http://trycatch.be/blogs/roggenk
     

    Sunday, March 06, 2011 9:44 AM
    Moderator

  • Hi Kurt,

    Sorry for the belated reply, I have been on some other projects. Thanks for your suggestions; I have tried both but it's still failing.

     

    Rob

    Thursday, March 10, 2011 5:43 PM