Server 2012 hosts crashed during live migration: very serious issue, nearly took out production cluster

    Question

  • Hi, I have a 12-node Windows Server 2012 Hyper-V cluster that is fully validated and has been running fine.

    I have recently applied all the critical updates.

    I wanted to do some maintenance on one of the nodes, so I selected all the virtual servers and live migrated them to the best possible node. This started OK, but one of the nodes that machines were being migrated to crashed. (I checked, and it came back up with the blue recovery screen, writing a crash dump.) This is where the issue gets serious: it seemed to cause a cascade effect. As the machines on the crashed node were being quick migrated to other nodes, those nodes crashed as well. Eventually 5 nodes out of my 6 crashed with the same error before it finally stopped, and it almost took down all of production.

    I have opened a Premier support case, but has anyone else seen this, or does anyone have any other ideas? I have seen this happen on my lab cluster too.

    Many thanks

    Mark Green

    Monday, November 25, 2013 10:28 AM

Answers

All replies

  • Update: after analysing the bug check, it looks like netft.sys caused the bug check on all the nodes. It might be that, even though I have a dedicated network for live migration, the migration still caused a cluster heartbeat failure.

    Monday, November 25, 2013 11:15 AM
  • *******************************************************************************
    *                                                                             *
    *                        Bugcheck Analysis                                    *
    *                                                                             *
    *******************************************************************************

    USER_MODE_HEALTH_MONITOR (9e)
    One or more critical user mode components failed to satisfy a health check.
    Hardware mechanisms such as watchdog timers can detect that basic kernel
    services are not executing. However, resource starvation issues, including
    memory leaks, lock contention, and scheduling priority misconfiguration,
    may block critical user mode components without blocking DPCs or
    draining the nonpaged pool.
    Kernel components can extend watchdog timer functionality to user mode
    by periodically monitoring critical applications. This bugcheck indicates
    that a user mode health check failed in a manner such that graceful
    shutdown is unlikely to succeed. It restores critical services by
    rebooting and/or allowing application failover to other servers.
    Arguments:
    Arg1: fffffa805145a980, Process that failed to satisfy a health check within the
           configured timeout
    Arg2: 000000000000003c, Health monitoring timeout (seconds)
    Arg3: 0000000000000000
    Arg4: 0000000000000000

    Debugging Details:
    ------------------

    PROCESS_OBJECT: fffffa805145a980
    DEFAULT_BUCKET_ID:  WIN8_DRIVER_FAULT
    BUGCHECK_STR:  0x9E
    PROCESS_NAME:  System
    CURRENT_IRQL:  2
    ANALYSIS_VERSION: 6.3.9600.16384 (debuggers(dbg).130821-1623) amd64fre
    DPC_STACK_BASE:  FFFFF803ACC0DFB0
    LAST_CONTROL_TRANSFER:  from fffff88007087845 to fffff803ac460440

    STACK_TEXT:
    fffff803`acc07938 fffff880`07087845 : 00000000`0000009e fffffa80`5145a980 00000000`0000003c 00000000`00000000 : nt!KeBugCheckEx
    fffff803`acc07940 fffff880`07087516 : 00000000`00000004 fffff803`acc07c50 fffff803`acc07a79 00000000`00000000 : netft!NetftProcessWatchdogEvent+0xdd
    fffff803`acc07980 fffff803`ac4891ea : 00000000`00000004 00000000`00000000 fffff803`acc07c58 fffffa80`51bdc540 : netft!NetftWatchdogTimerDpc+0x36
    fffff803`acc079b0 fffff803`ac487655 : fffff803`acc07bf0 fffff803`ac488cff fffff803`ac700f00 fffff803`ac701920 : nt!KiProcessExpiredTimerList+0x22a
    fffff803`acc07ae0 fffff803`ac489668 : fffff803`ac6fe180 fffff803`ac700f80 00000000`00000003 00000000`0a621925 : nt!KiExpireTimerTable+0xa9
    fffff803`acc07b80 fffff803`ac488a06 : 00000c70`00000000 00001f80`00d60050 00000000`00000000 00000000`00000002 : nt!KiTimerExpiration+0xc8
    fffff803`acc07c30 fffff803`ac4899ba : fffff803`ac6fe180 fffff803`ac6fe180 00000000`00183de0 fffff803`ac758880 : nt!KiRetireDpcList+0x1f6
    fffff803`acc07da0 00000000`00000000 : fffff803`acc08000 fffff803`acc02000 00000000`00000000 00000000`00000000 : nt!KiIdleLoop+0x5a

    STACK_COMMAND:  kb

    FOLLOWUP_IP:
    netft!NetftProcessWatchdogEvent+dd
    fffff880`07087845 cc              int     3

    SYMBOL_STACK_INDEX:  1
    SYMBOL_NAME:  netft!NetftProcessWatchdogEvent+dd
    FOLLOWUP_NAME:  MachineOwner
    MODULE_NAME: netft
    IMAGE_NAME:  netft.sys
    DEBUG_FLR_IMAGE_TIMESTAMP:  5010aa07
    BUCKET_ID_FUNC_OFFSET:  dd
    FAILURE_BUCKET_ID:  0x9E_netft!NetftProcessWatchdogEvent
    BUCKET_ID:  0x9E_netft!NetftProcessWatchdogEvent
    ANALYSIS_SOURCE:  KM
    FAILURE_ID_HASH_STRING:  km:0x9e_netft!netftprocesswatchdogevent
    FAILURE_ID_HASH:  {fc992d70-4714-ccd6-c6b5-601c2a57cb6c}

    Followup: MachineOwner
    ---------

    Monday, November 25, 2013 12:10 PM
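
For anyone reading the dump above, the hex fields are easier to reason about once decoded. The following is only a rough Python sketch, not part of the original post: the `summarize_bugcheck` helper and the `DUMP_EXCERPT` constant are hypothetical names, the excerpt is copied from the analysis output in this thread, and it assumes `DEBUG_FLR_IMAGE_TIMESTAMP` holds the driver's build time as seconds since the Unix epoch.

```python
import re
from datetime import datetime, timezone

# Excerpt of the analysis output quoted in this thread (values copied verbatim).
DUMP_EXCERPT = """\
USER_MODE_HEALTH_MONITOR (9e)
Arg1: fffffa805145a980, Process that failed to satisfy a health check within the
Arg2: 000000000000003c, Health monitoring timeout (seconds)
BUGCHECK_STR:  0x9E
MODULE_NAME: netft
IMAGE_NAME:  netft.sys
DEBUG_FLR_IMAGE_TIMESTAMP:  5010aa07
"""


def summarize_bugcheck(text: str) -> dict:
    """Extract a few fields from the analysis text and decode the hex values."""

    def field(pattern: str):
        # Return the first captured group, or None if the field is absent.
        match = re.search(pattern, text, re.MULTILINE)
        return match.group(1) if match else None

    timeout_hex = field(r"^\s*Arg2:\s*([0-9a-fA-F]+)")
    stamp_hex = field(r"^\s*DEBUG_FLR_IMAGE_TIMESTAMP:\s*([0-9a-fA-F]+)")
    return {
        "bugcheck": field(r"^\s*BUGCHECK_STR:\s*(\S+)"),
        "image": field(r"^\s*IMAGE_NAME:\s*(\S+)"),
        # Arg2 is documented in the dump itself as a timeout in seconds.
        "timeout_seconds": int(timeout_hex, 16) if timeout_hex else None,
        # Assumption: the image timestamp is seconds since the Unix epoch.
        "image_built_utc": (
            datetime.fromtimestamp(int(stamp_hex, 16), tz=timezone.utc)
            if stamp_hex
            else None
        ),
    }


if __name__ == "__main__":
    for key, value in summarize_bugcheck(DUMP_EXCERPT).items():
        print(f"{key}: {value}")
```

Arg2 decodes to a 60-second health-check timeout. Under the epoch assumption above, the netft.sys timestamp decodes to a date in July 2012, which may be worth comparing against the driver version on a fully patched node when working the support case.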
  • Wednesday, November 27, 2013 2:51 AM
  • Brilliant, thanks. Yes, they are HP BL460G7.

    This has resolved my issue, many thanks.

    Mark

    Thursday, December 12, 2013 3:34 PM