Complete Cluster failure after initiating live migration?!

    Question

  • We have a Windows Server 2008 R2 SP1 two-node failover cluster. The storage is on two HP LeftHand P4500 SANs, connected over MPIO to both nodes.

    We have dedicated 1 Gb networks for iSCSI, CSV traffic, management and VM access, but not for live migration. The VMs are configured to use the management network for live migration. (We considered implementing QoS policies to restrict live migration bandwidth, but did not notice any impact on management access during live migration tests.)

    Only two VMs are running, without any load, because the project is still in the deployment phase. We had done several live migrations without any problem until last Thursday. The live migration of a VM started as usual, but at the moment of switching the VM to the new node, everything turned red in Failover Cluster Manager.

    The disaster started with event 1135 (Cluster node 'NODE1' was removed from the active failover cluster membership), followed by three 1069 events (Cluster resource 'Quorum Disk' in clustered service or application 'Cluster Group' failed.) and then events 1177, 7024, 7031 and 1038. On the other node, it started with a "QLNDNIC" 272 error, which indicates that the physical interface of the management network will be reset because the device was not responding. Six seconds later, that node was removed from the cluster in the same second as the other node, with the same events: 1135...

    • The logs from the network switches and from both P4500 SANs show no abnormal events during this outage.
    • The very first event in the cluster.log file indicates a problem with the CSV volume that the VMs are running on: 000011f0.00000a60::2011/12/08-13:24:48.723 ERR   [RHS] Error 5023 from ResourceControl for resource CSV 1.
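When comparing cluster.log entries against the Windows event log, it can help to split each line into its fields first. The following is a minimal Python sketch, not an official parser: the regular expression is inferred only from the single line quoted above (process ID and thread ID in hex, timestamp, level, component, message), the function name `parse_cluster_log_line` is hypothetical, and the small error table holds just the one code mentioned in this thread (5023 is the Win32 code ERROR_INVALID_STATE).

```python
import re
from datetime import datetime

# Layout inferred from the one line quoted in this thread; other
# cluster.log lines may not match this pattern.
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]{8})\.(?P<tid>[0-9a-fA-F]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<component>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)

# Only the code that appears in this thread; extend as needed.
WIN32_ERRORS = {5023: "ERROR_INVALID_STATE"}


def parse_cluster_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["pid"] = int(record["pid"], 16)   # hex process ID
    record["tid"] = int(record["tid"], 16)   # hex thread ID
    record["ts"] = datetime.strptime(record["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    code = re.search(r"\bError (\d+)\b", record["message"])
    record["error_code"] = int(code.group(1)) if code else None
    record["error_name"] = WIN32_ERRORS.get(record["error_code"])
    return record


sample = ("000011f0.00000a60::2011/12/08-13:24:48.723 ERR   [RHS] "
          "Error 5023 from ResourceControl for resource CSV 1.")
record = parse_cluster_log_line(sample)
print(record["ts"], record["level"], record["component"],
      record["error_code"], record["error_name"])
```

Note that cluster.log timestamps are normally written in GMT, so they may need shifting before they line up with the local times shown in Event Viewer.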

    Our customer has spent a lot of money precisely so that this scenario never happens! And I can't imagine that a failure of the dedicated management network can bring down a whole cluster.

    Thank you all in advance for any help, or any pointer on where to start troubleshooting this issue!

    Franz



    • Edited by FranzSchenk Monday, 12 December 2011 15:48
    Monday, 12 December 2011 10:16

All Replies

  • Hi Franz,

    Error 5023 indicates that the group or resource is not in the correct state to perform the requested operation.

    I would suggest making sure the resources on CSV 1 are "Online", running a cluster validation process for failover, and retrying the failover once that is complete.

    If you post the cluster log along with a snippet of the Windows event log from the time of the failover attempt, I can look at this further for you.

     

    Useful Link: http://technet.microsoft.com/en-us/library/cc731844%28WS.10%29.aspx#BKMK_Step3

     

    Kind Regards,

    Martin

     


    If you find my information useful, please rate it. :-)


    • Edited by MEIRL Tuesday, 13 December 2011 00:03
    • Proposed as answer by MEIRL Thursday, 29 December 2011 18:24
    • Unproposed as answer by FranzSchenk Sunday, 15 January 2012 23:38
    Monday, 12 December 2011 23:55
  • I'm experiencing a very similar issue.  Seems to be network related given the QLNDNIC 272 event, but switch ports are all clean.  Any update?
    Friday, 30 December 2011 16:08
  • I am having the exact same issue, with the exact same hardware: three HP DL380 G7 servers connected to an HP P4500 SAN using MS MPIO. I am running the very latest driver and firmware on the NC375T NICs.

    Event 272, QLNDNIC

    DEVICE: HP NC375T PCI Express Quad Port Gigabit Server Adapter #12

    PROBLEM: Resetting the device because the device is not responding.

    ACTION: Adapter recovers from this error automatically.

    For now, I have removed the NIC information from the HP Management Agents console in the Control Panel. I will have to wait and see if this works.

     

    Saturday, 28 January 2012 01:05
  • I had the exact same issue yesterday. The server is a ProLiant DL980 G7; the OS is Windows Server 2008 R2 Enterprise.

    Event 272, QLNDNIC

    DEVICE: HP NC375i Integrated Quad Port Multifunction Gigabit Server Adapter
    PROBLEM: Resetting the device because the device is not responding.
    ACTION: Adapter recovers from this error automatically.

    DEVICE: HP NC375i Integrated Quad Port Multifunction Gigabit Server Adapter #4
    PROBLEM: Resetting the device because the device is not responding.
    ACTION: Adapter recovers from this error automatically.

    mmihm21, I couldn't find the HP Management Agents console in the Control Panel, only the HP Network Config Utility.

    Could you tell me where else I can find it, or what I need to install?

    Friday, 03 February 2012 09:31
  •  Fred B. _,

    We already have NIC firmware version 4.0.556. If this article is to be trusted, that version is not affected; it lists:

    NC375i Integrated Quad Port Multifunction Gigabit Server Adapter
    EARLIER than firmware version 4.0.556
    Tuesday, 07 February 2012 05:11
  • "This package contains firmware version 4.0.579 (which supersedes firmware version 4.0.556), as well as the driver."

    I don't know whether you should read it as "earlier than or equal to 4.0.556". The fix describes your issue. Our DL380s have the 382i and we have never seen that problem.
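The two readings of the advisory text can be made concrete with a numeric version comparison. This is only a sketch in Python: the function names `is_affected` and `is_superseded` are hypothetical, and the default thresholds are simply the two version numbers quoted in this thread. Versions are compared as integer tuples because a plain string comparison would misorder values such as 4.0.1000 and 4.0.556.

```python
def parse_version(text):
    """Turn '4.0.556' into (4, 0, 556) so parts compare numerically."""
    return tuple(int(part) for part in text.split("."))


def is_affected(installed, first_unaffected="4.0.556"):
    """Strict reading of the advisory: only versions EARLIER than
    first_unaffected are affected."""
    return parse_version(installed) < parse_version(first_unaffected)


def is_superseded(installed, latest="4.0.579"):
    """True if a newer firmware than the installed one is quoted."""
    return parse_version(installed) < parse_version(latest)


for version in ("4.0.555", "4.0.556", "4.0.579"):
    print(version, is_affected(version), is_superseded(version))
```

Under the strict reading, 4.0.556 is not flagged as affected, but it is still superseded by 4.0.579, which is why updating can be worth trying even when the advisory wording seems to exclude the installed version.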

    Tuesday, 07 February 2012 08:04
  • Thank you very much for your help. The HP advisory describes the problem that we have.

    But we cannot install the new firmware for the NC375T (SP55519.exe). When running the commands described in the installation instructions, we always get the error "NetUserGetInfo:failed ," on our Server Core installation. Installation with HPSUM doesn't work either. We have just opened a support call with HP.

    Franz

    Wednesday, 15 February 2012 14:42
  • Hello friends

      We have two DL580 G7 servers in a Microsoft cluster. Two months ago we had the problem described here, and we solved it by flashing firmware 4.0.579 and driver version 4.4.8.812 for the NC375i integrated quad-port card. We had no problems until today.

      This morning the problem started again. All four ports went down together and all services on the node became inoperative.

    DEVICE: HP NC375i Integrated Quad Port Multifunction Gigabit Server Adapter #4

    PROBLEM: Resetting the device because the device is not responding.

    ACTION: Adapter recovers from this error automatically.

       Any idea what could be happening? We have reviewed the firmware and driver versions, and they are correct.

       Regards

    Friday, 02 March 2012 15:23
  • Hi, All ~

    Have you disabled Chimney offload?

    First configure Chimney offload with the "netsh" command, then configure it in "HP Network manager" as well.
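The operating-system half of that suggestion can be sketched as follows. This is an illustrative Python wrapper, not a tested fix for the NC375 problem: the function names `chimney_command` and `apply_chimney` are hypothetical, the list of accepted states is a deliberately conservative subset, and the underlying command is `netsh int tcp set global chimney=disabled`, which needs an elevated prompt. The matching adapter-side setting in the HP utility is not covered here. By default the sketch only returns the command line without running it.

```python
import subprocess
import sys

# Conservative subset of states accepted by this sketch.
ALLOWED_STATES = {"enabled", "disabled", "automatic"}


def chimney_command(state="disabled"):
    """Build the netsh argument list for the global TCP Chimney setting."""
    if state not in ALLOWED_STATES:
        raise ValueError(f"unsupported chimney state: {state!r}")
    return ["netsh", "int", "tcp", "set", "global", f"chimney={state}"]


def apply_chimney(state="disabled", dry_run=True):
    """Return the command line; run it only on Windows when dry_run is False."""
    command = chimney_command(state)
    if not dry_run and sys.platform == "win32":
        # Raises CalledProcessError if netsh reports a failure.
        subprocess.run(command, check=True)
    return " ".join(command)


print(apply_chimney("disabled"))
```

The current value can be checked afterwards with `netsh int tcp show global`.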

    Thanx.

    Wednesday, 14 March 2012 07:56
  • Hello Alberto

    We had exactly the same issue with our HP DL580 G7. Did you solve the problem?

    Thank you for the response.

    regards

    Monday, 07 May 2012 12:22
  • Hi,

    We have the same issue with a four-node HP DL580 Hyper-V cluster. I have already updated to the latest firmware and driver and am still facing the same issue. I opened a support ticket with HP and have not received any feedback so far!

    Saturday, 12 May 2012 12:06
  • I had the same issue with an HP DL580 G7.

    I've tried firmware updates and Chimney offload settings, but no luck so far.

    However, I heard that HP has had issues with G7 model servers and is aware of a problem with the SPI board in some G7 models.

    I replaced the SPI board last week, and the event has not been logged since. (I know I cannot yet say that the problem is cleared.)

    Guys, just call HP or your vendor and have it checked.

    Good Luck!!

    Tuesday, 22 May 2012 00:29
  • Anyone with a NC375T or NC375i should read these:
    http://communities.vmware.com/thread/391045?start=0&tstart=0
    http://wahlnetwork.com/2011/08/16/identifying-and-resolving-netxen-nx_nic-qlogic-nic-failures/

    I have an ongoing case with HP; I have experienced the issue even with the latest hardware and firmware.
    It looks like it might be a problem with the SPI boards.

    Wednesday, 30 May 2012 12:58