VMs Fail Randomly on 2012 Cluster

    Question

  • Our 2-node Server 2012 Hyper-V cluster is having an issue where VMs seem to fail at random for no apparent reason.  We are using Dell R900s with an MD3200i SAN, and we have separate networks for iSCSI and heartbeat.  On one of the nodes we get error 1069: "Cluster Resource 'Virtual Machine VM200X32' of type 'Virtual Machine' in clustered role 'VM200X32' failed."  Below I have listed the relevant cluster log events from around the time it fails.  The cluster has passed validation and is running 50+ test VMs just fine; only a few of them seem to have this issue.

    Just wondering if anyone else might have some input on what the problem could be.

    Cluster event log:

    2013/08/25-16:41:18.448 WARN  [RHS] Resource Virtual Machine VM200X32 IsAlive has indicated failure.
    2013/08/25-16:41:18.463 INFO  [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'Virtual Machine VM200X32', gen(0) result 1/0.
    2013/08/25-16:41:18.463 INFO  [RCM] Res Virtual Machine VM200X32: Online -> ProcessingFailure( StateUnknown )
    2013/08/25-16:41:18.463 INFO  [RCM] TransitionToState(Virtual Machine VM200X32) Online-->ProcessingFailure.
    2013/08/25-16:41:18.463 INFO  [RCM] rcm::RcmGroup::UpdateStateIfChanged: (VM200X32, Online --> Pending)
    2013/08/25-16:41:18.463 ERR   [RCM] rcm::RcmResource::HandleFailure: (Virtual Machine VM200X32)
    2013/08/25-16:41:18.463 INFO  [RCM] resource Virtual Machine VM200X32: failure count: 0, restartAction: 0 persistentState: 1.
    2013/08/25-16:41:18.463 INFO  [RCM] Will queue immediate restart (500 milliseconds) of Virtual Machine VM200X32 after terminate is complete.
    2013/08/25-16:41:18.463 INFO  [RCM] Res Virtual Machine VM200X32: ProcessingFailure -> WaitingToTerminate( DelayRestartingResource )
    2013/08/25-16:41:18.463 INFO  [RCM] TransitionToState(Virtual Machine VM200X32) ProcessingFailure-->[WaitingToTerminate to DelayRestartingResource].
    2013/08/25-16:41:18.463 INFO  [RCM] Res Virtual Machine VM200X32: [WaitingToTerminate to DelayRestartingResource] -> Terminating( DelayRestartingResource )
    2013/08/25-16:41:18.463 INFO  [RCM] TransitionToState(Virtual Machine VM200X32) [WaitingToTerminate to DelayRestartingResource]-->[Terminating to DelayRestartingResource].
    2013/08/25-16:41:18.463 INFO  [RES] Virtual Machine <Virtual Machine VM200X32>: Current state 'Online', event 'Terminate'
    2013/08/25-16:41:18.463 INFO  [RES] Virtual Machine <Virtual Machine VM200X32>: State change 'Online' -> 'Terminated'
    2013/08/25-16:41:18.463 INFO  [RCM] ignored non-local state Pending for group VM200X32
    2013/08/25-16:41:18.479 INFO  [RCM] HandleMonitorReply: LOCKEDMODE for 'Virtual Machine Configuration VM200X32', gen(0) result 0/0.
    2013/08/25-16:41:18.479 INFO  [RCM] Virtual Machine Configuration VM200X32: Flags 1 added to StatusInformation. New StatusInformation 1
    2013/08/25-16:41:18.479 INFO  [RCM] VM200X32: Added Flags 1 to StatusInformation. New StatusInformation 1
    2013/08/25-16:41:18.479 INFO  [RCM] HandleMonitorReply: INMEMORY_NODELOCAL_PROPERTIES for 'Virtual Machine VM200X32', gen(1) result 0/0.
    2013/08/25-16:41:19.275 INFO  [RCM] HandleMonitorReply: LOCKEDMODE for 'Virtual Machine Configuration VM200X32', gen(0) result 0/0.
    2013/08/25-16:41:19.275 INFO  [RCM] Virtual Machine Configuration VM200X32: Flags 1 removed from StatusInformation. New StatusInformation 0
    2013/08/25-16:41:19.275 INFO  [RCM] VM200X32: Removed Flags 1 from StatusInformation. New StatusInformation 0
    2013/08/25-16:41:19.275 INFO  [RCM] HandleMonitorReply: LOCKEDMODE for 'Virtual Machine VM200X32', gen(1) result 0/0.
    2013/08/25-16:41:19.275 INFO  [RES] Virtual Machine <Virtual Machine VM200X32>: Current state 'Terminated', event 'VmStopped'
    2013/08/25-16:41:19.306 INFO  [RCM] HandleMonitorReply: INMEMORY_NODELOCAL_PROPERTIES for 'Virtual Machine VM200X32', gen(1) result 0/0.
    2013/08/25-16:41:19.836 INFO  [RCM] HandleMonitorReply: LOCKEDMODE for 'Virtual Machine VM200X32', gen(1) result 0/0.
    2013/08/25-16:41:19.836 INFO  [RCM] HandleMonitorReply: INMEMORY_NODELOCAL_PROPERTIES for 'Virtual Machine VM200X32', gen(1) result 0/0.
    2013/08/25-16:41:19.836 INFO  [RES] Virtual Machine <Virtual Machine VM200X32>: State change 'Terminated' -> 'Offline'
    2013/08/25-16:41:19.836 INFO  [RCM] HandleMonitorReply: TERMINATERESOURCE for 'Virtual Machine VM200X32', gen(1) result 0/0.
    2013/08/25-16:41:19.836 INFO  [RCM] Res Virtual Machine VM200X32: [Terminating to DelayRestartingResource] -> DelayRestartingResource( StateUnknown )
    2013/08/25-16:41:19.836 INFO  [RCM] TransitionToState(Virtual Machine VM200X32) [Terminating to DelayRestartingResource]-->DelayRestartingResource.
    2013/08/25-16:41:19.836 WARN  [RCM] Queueing immediate delay restart of resource Virtual Machine VM200X32 in 500 ms.
    2013/08/25-16:41:20.351 INFO  [RCM] Delay-restarting Virtual Machine VM200X32 and any waiting dependents.
    2013/08/25-16:41:20.351 INFO  [RCM-rbtr] giving default token to group VM200X32
    2013/08/25-16:41:20.351 INFO  [RCM-rbtr] giving default token to group VM200X32
    2013/08/25-16:41:20.351 INFO  [RCM] Res Virtual Machine VM200X32: DelayRestartingResource -> OnlineCallIssued( StateUnknown )
    2013/08/25-16:41:20.351 INFO  [RCM] TransitionToState(Virtual Machine VM200X32) DelayRestartingResource-->OnlineCallIssued.
    2013/08/25-16:41:20.351 INFO  [RES] Virtual Machine <Virtual Machine VM200X32>: Current state 'Offline', event 'Online'
    2013/08/25-16:41:20.351 INFO  [RES] Virtual Machine <Virtual Machine VM200X32>: State change 'Offline' -> 'OnlinePending'
    2013/08/25-16:41:20.351 INFO  [RCM] HandleMonitorReply: ONLINERESOURCE for 'Virtual Machine VM200X32', gen(1) result 997/0.
    2013/08/25-16:41:20.351 INFO  [RCM] Res Virtual Machine VM200X32: OnlineCallIssued -> OnlinePending( StateUnknown )
    2013/08/25-16:41:20.351 INFO  [RCM] TransitionToState(Virtual Machine VM200X32) OnlineCallIssued-->OnlinePending.
    2013/08/25-16:41:20.351 INFO  [RCM] HandleMonitorReply: INMEMORY_NODELOCAL_PROPERTIES for 'Virtual Machine VM200X32', gen(1) result 0/0.
    2013/08/25-16:41:20.351 INFO  [RCM] HandleMonitorReply: LOCKEDMODE for 'Virtual Machine Configuration VM200X32', gen(0) result 0/0.
    2013/08/25-16:41:20.351 INFO  [RCM] Virtual Machine Configuration VM200X32: Flags 1 added to StatusInformation. New StatusInformation 1
    2013/08/25-16:41:20.351 INFO  [RCM] VM200X32: Added Flags 1 to StatusInformation. New StatusInformation 1
    2013/08/25-16:41:20.367 INFO  [RCM] HandleMonitorReply: INMEMORY_NODELOCAL_PROPERTIES for 'Virtual Machine VM200X32', gen(1) result 0/0.
    2013/08/25-16:41:20.694 INFO  [RCM] HandleMonitorReply: INMEMORY_NODELOCAL_PROPERTIES for 'Virtual Machine VM200X32', gen(1) result 0/0.
    2013/08/25-16:41:21.911 INFO  [RCM] HandleMonitorReply: LOCKEDMODE for 'Virtual Machine Configuration VM200X32', gen(0) result 0/0.
    2013/08/25-16:41:21.911 INFO  [RCM] Virtual Machine Configuration VM200X32: Flags 1 removed from StatusInformation. New StatusInformation 0
    2013/08/25-16:41:21.911 INFO  [RCM] VM200X32: Removed Flags 1 from StatusInformation. New StatusInformation 0
    2013/08/25-16:41:21.911 INFO  [RCM] HandleMonitorReply: LOCKEDMODE for 'Virtual Machine VM200X32', gen(1) result 0/0.
    2013/08/25-16:41:21.911 INFO  [RES] Virtual Machine <Virtual Machine VM200X32>: Current state 'OnlinePending', event 'VmRunning'
    2013/08/25-16:41:21.942 INFO  [RCM] HandleMonitorReply: INMEMORY_NODELOCAL_PROPERTIES for 'Virtual Machine VM200X32', gen(1) result 0/0.
    2013/08/25-16:41:21.942 INFO  [RES] Virtual Machine <Virtual Machine VM200X32>: 'Virtual Machine VM200X32' successfully started the virtual machine.
    2013/08/25-16:41:21.958 INFO  [RES] Virtual Machine <Virtual Machine VM200X32>: State change 'OnlinePending' -> 'Online'
    2013/08/25-16:41:21.958 INFO  [RHS] Resource Virtual Machine VM200X32 has come online. RHS is about to report status change to RCM
    2013/08/25-16:41:21.958 INFO  [RCM] HandleMonitorReply: ONLINERESOURCE for 'Virtual Machine VM200X32', gen(1) result 0/0.
    2013/08/25-16:41:21.958 INFO  [RCM] Res Virtual Machine VM200X32: OnlinePending -> Online( StateUnknown )
    2013/08/25-16:41:21.958 INFO  [RCM] TransitionToState(Virtual Machine VM200X32) OnlinePending-->Online.
    2013/08/25-16:41:21.958 INFO  [RCM] rcm::RcmGroup::UpdateStateIfChanged: (VM200X32, Pending --> Online)
    2013/08/25-16:41:21.958 INFO  [RCM] HandleMonitorReply: INMEMORY_NODELOCAL_PROPERTIES for 'Virtual Machine VM200X32', gen(1) result 0/0.
    2013/08/25-16:41:21.958 INFO  [RCM] ignored non-local state Online for group VM200X32
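
For anyone who wants to scan a full cluster log for this pattern rather than reading it by eye, here is a minimal sketch. It assumes every entry follows the layout shown above (timestamp, level, bracketed component, message). The function name `find_isalive_failures` and the sample list are illustrative only and not part of any cluster tooling.

```python
import re
from datetime import datetime

# Assumed entry layout, taken from the excerpt above:
# 2013/08/25-16:41:18.448 WARN  [RHS] Resource Virtual Machine VM200X32 IsAlive has indicated failure.
LINE_RE = re.compile(
    r"^\s*(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>\w+)\s+\[(?P<component>[^\]]+)\]\s+(?P<message>.*)$"
)
ISALIVE_RE = re.compile(r"Resource (?P<resource>.+?) IsAlive has indicated failure")


def find_isalive_failures(lines):
    """Return a (timestamp, resource name) tuple for every IsAlive failure entry."""
    failures = []
    for line in lines:
        parsed = LINE_RE.match(line)
        if not parsed:
            continue  # skip anything that is not a cluster log entry
        hit = ISALIVE_RE.search(parsed.group("message"))
        if hit:
            ts = datetime.strptime(parsed.group("ts"), "%Y/%m/%d-%H:%M:%S.%f")
            failures.append((ts, hit.group("resource")))
    return failures


# Two entries copied from the excerpt above, used as sample input.
sample = [
    "2013/08/25-16:41:18.448 WARN  [RHS] Resource Virtual Machine VM200X32 IsAlive has indicated failure.",
    "2013/08/25-16:41:18.463 INFO  [RCM] Res Virtual Machine VM200X32: Online -> ProcessingFailure( StateUnknown )",
]

for ts, resource in find_isalive_failures(sample):
    print(ts.isoformat(), resource)
```

Running this over the log from each node would give a list of failure times per VM, which might make it easier to see whether the failures cluster around particular times or machines.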

    Monday, August 26, 2013 4:36 PM

All replies

  • Hi,

    Are you running anti-virus on your hosts? Configure exclusions.

    Are the VMs that fail all on the same node?

    Run the BPA to find out any problems. http://technet.microsoft.com/en-us/library/hh831400.aspx

    What does the Hyper-V event log tell you?

    Monday, August 26, 2013 7:15 PM
  • 1.  Double checked the exclusions, they are in place.  (We are running FEP 2010.)

    2.  Currently VMs that are failing are on one node.  This has happened in the past on both nodes though.

    3.  BPA did not come up with anything significant.

    4.  "the Hyper-V event log"?  Which one?

    Monday, August 26, 2013 8:18 PM
  • Hi,

    Does the host computer have enough system resources to run all these VMs? From the log, it seems the virtual machine is waiting for sufficient system resources before it is turned back on.


    TechNet Subscriber Support in forum |If you have any feedback on our support, please contact tnmff@microsoft.com.

    Wednesday, August 28, 2013 1:06 PM
  • It's a Xeon E7340 (4 CPUs, 16 cores) running at 2.4GHz with 128 GB of RAM and (2) dedicated 1 Gb connections to the MD3200i iSCSI SAN that only serves this one cluster.  It shouldn't be taxed by the current workload.
    Wednesday, August 28, 2013 1:11 PM
  • We are having the same issue, with the difference that we are using IBM hardware and FC. 

    We are running VDI, and all machines are installed in the same way and from the same image, but as far as I can see this is not happening on all of the VMs, only on specific ones. Some of them fail every day, others once a week, others never.

    The hosts are newly installed (1-2 months ago). We have only just started using them and their workload is light.

    I've read this blog http://blogs.msdn.com/b/clustering/archive/2013/01/24/10388009.aspx, trying to understand what is happening.

    According to this, and after reviewing the log, I noticed that the failure happens at the exact time of the IsAlive check.

    It's strange that the cluster restarts the VM immediately after IsAlive is unsuccessful, instead of waiting 5 minutes: (Get-ClusterResource "Resource Name").DeadlockTimeout

    In this KB http://support.microsoft.com/kb/914458, Microsoft describes what IsAlive checks for, but VMs are not covered (I suppose because the KB is about Windows Server 2003).

     


    • Edited by Georgi_M Wednesday, August 28, 2013 2:06 PM
    Wednesday, August 28, 2013 2:05 PM
  • It looked very much to me as if it was something to do with the keepalive/isalive thing but I am not familiar enough with how that should look to know for sure.  I thought it could be network related but we should be good there with our config I would think:

    Cluster Network 1 | Enabled | Connected to our internal network
    Cluster Network 2 | Internal | Heartbeat, connected to dedicated switch
    Cluster Network 3 | Disabled | iSCSI SAN, connected to dedicated switch
    Cluster Network 4 | Disabled | iSCSI SAN, connected to dedicated switch
    Cluster Network 5 | Enabled | Storage Network, connected to dedicated switch

    None of these are used to connect to the Hyper-V virtual switches that provide connectivity to the guests, they have their own NICs that are not shared with the operating system.

    Wednesday, August 28, 2013 2:48 PM
  • Today I looked at the VMs that are having this problem. They all have minidumps. I analyzed them and found the following, which is the same in every dump:

    ------------

    Use !analyze -v to get detailed debugging information.

    BugCheck 3B, {c0000005, fffff96000bb7644, fffff88001fc9d90, 0}

    Probably caused by : RDPUDD.dll ( RDPUDD+7644 )

    Followup: MachineOwner
    ---------

    2: kd> !analyze -v
    *******************************************************************************
    *                                                                             *
    *                        Bugcheck Analysis                                    *
    *                                                                             *
    *******************************************************************************

    SYSTEM_SERVICE_EXCEPTION (3b)
    An exception happened while executing a system service routine.
    Arguments:
    Arg1: 00000000c0000005, Exception code that caused the bugcheck
    Arg2: fffff96000bb7644, Address of the instruction which caused the bugcheck
    Arg3: fffff88001fc9d90, Address of the context record for the exception that caused the bugcheck
    Arg4: 0000000000000000, zero.

    Debugging Details:
    ------------------


    EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

    FAULTING_IP:
    RDPUDD+7644
    fffff960`00bb7644 498b5f20        mov     rbx,qword ptr [r15+20h]

    CONTEXT:  fffff88001fc9d90 -- (.cxr 0xfffff88001fc9d90)
    rax=0000000000000003 rbx=fffff900c201ccd8 rcx=f96000bd21570000
    rdx=0000000000000000 rsi=fffff88001fcafe0 rdi=fffff900c00e6020
    rip=fffff96000bb7644 rsp=fffff88001fca770 rbp=fffff88001fca7f0
     r8=0000000000000001  r9=0000000000000000 r10=0000000000000003
    r11=fffff88001fca768 r12=fffff900c26a9ca0 r13=fffff88001fcb0a0
    r14=0000000000000001 r15=0000000000000000
    iopl=0         nv up ei pl zr na po nc
    cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00210246
    RDPUDD+0x7644:
    fffff960`00bb7644 498b5f20        mov     rbx,qword ptr [r15+20h] ds:002b:00000000`00000020=????????????????
    Resetting default scope

    CUSTOMER_CRASH_COUNT:  1

    DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

    BUGCHECK_STR:  0x3B

    PROCESS_NAME:  chrome.exe

    CURRENT_IRQL:  0

    LAST_CONTROL_TRANSFER:  from 0000000000000000 to fffff96000bb7644

    STACK_TEXT: 
    fffff880`01fca770 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : RDPUDD+0x7644


    FOLLOWUP_IP:
    RDPUDD+7644
    fffff960`00bb7644 498b5f20        mov     rbx,qword ptr [r15+20h]

    SYMBOL_STACK_INDEX:  0

    SYMBOL_NAME:  RDPUDD+7644

    FOLLOWUP_NAME:  MachineOwner

    MODULE_NAME: RDPUDD

    IMAGE_NAME:  RDPUDD.dll

    DEBUG_FLR_IMAGE_TIMESTAMP:  50363a76

    STACK_COMMAND:  .cxr 0xfffff88001fc9d90 ; kb

    FAILURE_BUCKET_ID:  X64_0x3B_RDPUDD+7644

    BUCKET_ID:  X64_0x3B_RDPUDD+7644

    Followup: MachineOwner
    ---------

    I was digging around and found this site: http://answers.microsoft.com/en-us/windows/forum/windows_7-system/windows-7-sp1-remote-desktop/6385b422-39eb-4de8-a404-b7eb015bf107?msgId=59a055c2-d74b-4917-99c8-d1fc9c753b02

    So it seems to be related to RDP 8.0.

    Can you verify whether you are using RDP 8.0 too?
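
If there are many minidumps to compare, the debugger output can be saved as text and summarized. The sketch below assumes the output looks like the listing above (a "BugCheck" line, a "Probably caused by" line and a "PROCESS_NAME" line). The function name `summarize_dump` is made up for illustration; it only reads text and does not open dump files itself.

```python
import re

# Patterns based on the debugger output format shown in this post.
BUGCHECK_RE = re.compile(r"BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}")
CAUSE_RE = re.compile(r"Probably caused by\s*:\s*(\S+)")
PROCESS_RE = re.compile(r"PROCESS_NAME:\s*(\S+)")


def summarize_dump(text):
    """Extract bugcheck code, arguments, suspected module and process from saved debugger output."""
    bugcheck = BUGCHECK_RE.search(text)
    cause = CAUSE_RE.search(text)
    process = PROCESS_RE.search(text)
    return {
        "bugcheck": bugcheck.group(1).upper() if bugcheck else None,
        "arguments": [a.strip() for a in bugcheck.group(2).split(",")] if bugcheck else [],
        "module": cause.group(1) if cause else None,
        "process": process.group(1) if process else None,
    }


# Lines copied from the analysis above, used as sample input.
sample_output = """
BugCheck 3B, {c0000005, fffff96000bb7644, fffff88001fc9d90, 0}
Probably caused by : RDPUDD.dll ( RDPUDD+7644 )
PROCESS_NAME:  chrome.exe
"""

print(summarize_dump(sample_output))
```

Grouping the summaries by module and process across all affected VMs would show quickly whether every crash points at the same driver.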

    Thursday, August 29, 2013 8:56 AM
  • These files would normally be in C:\minidump right?  I checked the VMs that got restarted last time this happened and none of them even have a C:\minidump folder.
    Thursday, August 29, 2013 12:29 PM
  • No, they should be in C:\Windows\Minidump
    Thursday, August 29, 2013 12:37 PM
  • My bad, that's what I meant.  No c:\windows\minidump folder.
    Thursday, August 29, 2013 3:02 PM
  • Hi,

    For minidump files, first check that minidump creation is enabled in the settings:

    http://support.microsoft.com/kb/315263/en-us

    And if it still does not exist, you may not have the same issue as Georgi_M. 



    Sunday, September 01, 2013 10:29 AM
  • I checked that, it's turned on.  I don't think I'm having the same issue. 

    Any other ideas?

    Tuesday, September 03, 2013 12:09 PM
  • We have the same problem. We have 2 guest operating systems running terminal services. Both of them randomly turn off 1 or 2 times per day. When I look at the Windows logs, I see that the machine was turned off. In the failover cluster logs the message is:

    [RHS] Resource Virtual Machine <servername> IsAlive has indicated failure.

    We have 15+ guest systems, but only the two terminal servers give these errors.

    Friday, September 06, 2013 12:47 PM
  • When you say you see from the logs that the machine was turned off what do you mean?  Are you seeing "The previous shutdown...was unexpected" or do you see an event indicating that a process initiated a shutdown?
    Friday, September 06, 2013 12:51 PM
  • Just turned off, as in the picture below.

    The failover cluster log shows the messages below.

    [RHS] Resource Virtual Machine NCOMPUTING5 IsAlive has indicated failure.

    [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'Virtual Machine NCOMPUTING5', gen(0) result 1/0.

    [RCM] Res Virtual Machine NCOMPUTING5: Online -> ProcessingFailure( StateUnknown )

    [RCM] rcm::RcmResource::HandleFailure: (Virtual Machine NCOMPUTING5)

    Friday, September 06, 2013 1:57 PM
  • That does sound like what I'm experiencing.  On the host machine in the Hyper-V Worker/Admin log I get this:

    9/3/2013 11:48:38 PM 'VM935X29' was turned off.
    9/3/2013 11:48:43 PM 'VM935X29' started successfully.

    On the guest in the System log I see this:

    9/3/2013 11:49:21 PM The operating system started at system time 2013-09-04T04:49:20...Z.
    9/3/2013 11:49:35 PM The previous system shutdown at 11:48:21 PM on 9/3/2013 was unexpected.
    9/3/2013 11:49:25 PM The system has rebooted without cleanly shutting down first.

    The cluster log shows these events:

    2013/09/03-23:48:37.625 WARN  [RHS] Resource Virtual Machine VM935X29 IsAlive has indicated failure.
    2013/09/03-23:48:37.626 INFO  [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'Virtual Machine VM935X29', gen(0) result 1/0.
    2013/09/03-23:48:37.626 INFO  [RCM] Res Virtual Machine VM935X29: Online -> ProcessingFailure( StateUnknown )
    2013/09/03-23:48:37.626 INFO  [RCM] TransitionToState(Virtual Machine VM935X29) Online-->ProcessingFailure.
    2013/09/03-23:48:37.626 INFO  [RCM] rcm::RcmGroup::UpdateStateIfChanged: (VM935X29, Online --> Pending)
    2013/09/03-23:48:37.627 ERR   [RCM] rcm::RcmResource::HandleFailure: (Virtual Machine VM935X29)
    2013/09/03-23:48:37.628 INFO  [RCM] Will queue immediate restart (500 milliseconds) of Virtual Machine VM935X29 after terminate is complete.
    2013/09/03-23:48:37.628 INFO  [RCM] Res Virtual Machine VM935X29: ProcessingFailure -> WaitingToTerminate( DelayRestartingResource )
    2013/09/03-23:48:37.628 INFO  [RES] Virtual Machine <Virtual Machine VM935X29>: Current state 'Online', event 'Terminate'
    2013/09/03-23:48:37.629 INFO  [RES] Virtual Machine <Virtual Machine VM935X29>: State change 'Online' -> 'Terminated'
    2013/09/03-23:48:38.814 INFO  [RES] Virtual Machine <Virtual Machine VM935X29>: Current state 'Terminated', event 'VmStopped'
    2013/09/03-23:48:39.397 INFO  [RES] Virtual Machine <Virtual Machine VM935X29>: State change 'Terminated' -> 'Offline'
    2013/09/03-23:48:39.397 WARN  [RCM] Queueing immediate delay restart of resource Virtual Machine VM935X29 in 500 ms.
    2013/09/03-23:48:39.897 INFO  [RCM] Delay-restarting Virtual Machine VM935X29 and any waiting dependents.
    2013/09/03-23:48:39.897 INFO  [RCM] Res Virtual Machine VM935X29: DelayRestartingResource -> OnlineCallIssued( StateUnknown )
    2013/09/03-23:48:39.897 INFO  [RCM] TransitionToState(Virtual Machine VM935X29) DelayRestartingResource-->OnlineCallIssued.
    2013/09/03-23:48:39.899 INFO  [RES] Virtual Machine <Virtual Machine VM935X29>: Current state 'Offline', event 'Online'
    2013/09/03-23:48:39.899 INFO  [RES] Virtual Machine <Virtual Machine VM935X29>: State change 'Offline' -> 'OnlinePending'
    2013/09/03-23:48:39.911 INFO  [RCM] Res Virtual Machine VM935X29: OnlineCallIssued -> OnlinePending( StateUnknown )
    2013/09/03-23:48:39.911 INFO  [RCM] TransitionToState(Virtual Machine VM935X29) OnlineCallIssued-->OnlinePending.
    2013/09/03-23:48:43.972 INFO  [RES] Virtual Machine <Virtual Machine VM935X29>: Current state 'OnlinePending', event 'VmRunning'
    2013/09/03-23:48:43.980 INFO  [RES] Virtual Machine <Virtual Machine VM935X29>: 'Virtual Machine VM935X29' successfully started the virtual machine.
    2013/09/03-23:48:43.984 INFO  [RES] Virtual Machine <Virtual Machine VM935X29>: State change 'OnlinePending' -> 'Online'
    2013/09/03-23:48:43.984 INFO  [RHS] Resource Virtual Machine VM935X29 has come online. RHS is about to report status change to RCM
    2013/09/03-23:48:43.984 INFO  [RCM] Res Virtual Machine VM935X29: OnlinePending -> Online( StateUnknown )
    2013/09/03-23:48:43.984 INFO  [RCM] TransitionToState(Virtual Machine VM935X29) OnlinePending-->Online.
    2013/09/03-23:48:43.984 INFO  [RCM] rcm::RcmGroup::UpdateStateIfChanged: (VM935X29, Pending --> Online)
    Friday, September 06, 2013 2:53 PM
  • You are going in the right direction.

    The IsAlive test checks for the presence of the vmwp.exe process for the VM. It would be a good idea to update vmclusres.dll as well.

    Do you see an event ID 1230 in the system event logs about an RHS deadlock?

    But first things first: install http://support.microsoft.com/kb/2784261. It is a MUST for a 2012 cluster.


    Mayank Sharma Support Engineer at Microsoft working in Enterprise Platform Support.

    Friday, September 06, 2013 3:10 PM
  • "The IsAlive test checks for the presence of the vmwp.exe process for the VM. It would be a good idea to update vmclusres.dll as well."

    Update what, and how?

    "Do you see an event ID 1230 in the system event logs about an RHS deadlock?"

    On the host?  When you say "logs" which logs should I be looking at?  I went ahead and created a custom view for all logs with the event ID 1230 on the host and it was empty.

    "But first things first: install http://support.microsoft.com/kb/2784261. It is a MUST for a 2012 cluster."

    That article lists 6 updates, 4 of which I have installed.  The ones I don't have are KB2869923, which deals with CSV outages during backup, and KB976424, which is a Server 2008 fix that needs to be installed on domain controllers.  I will go ahead and install the CSV one, KB2869923.


    • Edited by Matt Br Friday, September 06, 2013 4:05 PM
    Friday, September 06, 2013 3:58 PM
  • These updates are already installed on my system. My problem happens only on my terminal server guests, so I think it is related to terminal services. I have two terminal servers. One of them is on the local disks of the host (because I wanted to test whether the problem is related to CSV) and the other one is on CSV disks. Both of them have the same problem.

    Friday, September 06, 2013 7:52 PM
  • So did you solve the problem, Matt?
    Friday, September 06, 2013 8:04 PM
  • "So did you solve the problem, Matt?"

    The issue I'm having doesn't happen all the time.  I'll let it simmer over the weekend and see if any VMs fail.
    Friday, September 06, 2013 8:25 PM
  • I have this problem at least once a day. The 35 RDP users are very angry :(
    Friday, September 06, 2013 8:32 PM
  • Matt, do the guest machines that go down have terminal services installed?
    Friday, September 06, 2013 8:42 PM
  • "Matt, do the guest machines that go down have terminal services installed?"

    It's several VMs that fail actually but none of them have any other RD roles installed other than what the server normally uses for remote administration.
    Monday, September 09, 2013 12:45 PM
  • I am also having the same exact issue.

    Scenario:

    2 Dell PowerEdge R720s with Windows 2012 Datacenter acting as Hyper-V hosts in a newly created failover cluster. 1 VM out of 10 is a 2008 R2 terminal server, and only that one crashes.

    The VM crashes about once a week (this is random though, so we could have 3 weeks that are fine and then one week where it crashes twice).

    We get the same "IsAlive" failure messages just before the cluster decides to shut down the VM and restart it.

    Any ideas?

    Monday, September 09, 2013 6:37 PM
  • "2 Dell PowerEdge R720 with Windows 2012 Datacenter acting as hyper-v hosts in a newly created Failover Cluster. 1 VM out of 10 is a terminal server 2008 R2 and only that one crashes."

    What are you using for cluster storage?
    Monday, September 09, 2013 8:08 PM
  • Dell EqualLogic (mix of PS40XX, PS60XX and PS61XX) iSCSI SANs.
    Monday, September 09, 2013 8:10 PM
  • Had a VM fail this morning at 6:42 so I guess the CSV/backup hotfix didn't fix it.  Any MVPs out there that could open up a case with MS for us?
    Tuesday, September 10, 2013 12:06 PM
  • I know it is a bit late, but:

    1. You need to install this update: http://support.microsoft.com/kb/2855336/en-us

    2. It should be in the system event log; this article describes it well: http://blogs.msdn.com/b/clustering/archive/2009/06/27/9806160.aspx

    3. It is good that you have installed it now.


    Mayank Sharma Support Engineer at Microsoft working in Enterprise Platform Support.

    Friday, September 13, 2013 8:25 PM
  • OK, so is your SAN ODX-compatible? It is worth trying to disable ODX on the hosts as per: http://technet.microsoft.com/en-us/library/jj200627.aspx#DeployODX_Step3Establishaperformancebaseline

    Mayank Sharma Support Engineer at Microsoft working in Enterprise Platform Support.

    Friday, September 13, 2013 8:31 PM
  • "1. You need to install this update: http://support.microsoft.com/kb/2855336/en-us"

    We do have that update installed already.
    Monday, September 16, 2013 2:57 PM
  • Have you also tried disabling ODX?

    Mayank Sharma Support Engineer at Microsoft working in Enterprise Platform Support.

    Friday, September 20, 2013 9:04 PM
  • Hi,

    Is there any update on this?

    We have only 1 VM in our cluster with this exact issue. It can go days or weeks without going down, but as of this morning it has rebooted 5 times.

    The VM is our Exchange 2010 CAS so it is an issue for us.

    We have all the above updates installed and I have just tried disabling ODX (although our SAN is not compatible anyway).

    Monday, December 02, 2013 11:40 PM
  • Unfortunately no, it's still unresolved.  Been working with Microsoft premier (paid) support since 9/13/13 and I've been through them closing the incident when it wasn't resolved and being passed around to several support people who required me to totally re-explain myself from scratch and finally to the latest suggestion which is disabling heartbeat monitoring.  (I imagine an alarm company telling you to just disable the security system to fix the issue you're having with false alarms and what kind of response that would get...)

    I asked to be escalated to a Hyper-V and Failover Clustering expert and I was told on 11/1/13 that my call had been "escalated to tier3" and that I couldn't go any higher.  I was also told that the method we have been using to create new VMs (Blogged about by Microsoft here) is unsupported and is probably causing the failures.  Never mind that we've been doing this for years and it worked perfectly fine on our 2008R2 cluster and it continues to be fine on our standalone 2012 Hyper-V hosts.

    We have a really good team here, we only call Microsoft as a last resort.  Needless to say I'm very disappointed in the experience I've had so far and I'm about an inch away from just telling them thanks for nothing and having them close the call.

    Tuesday, December 03, 2013 4:07 PM
  • Just curious, is your cluster on similar or dissimilar hardware? If dissimilar, have you checked the Processor > NUMA settings and made sure each VM is using the current host's hardware topology? I've seen issues like this (and others) when a VM has been migrated between hosts in a cluster with dissimilar hardware. Fixing the NUMA settings has resolved many of these issues for me.
    Wednesday, December 04, 2013 3:54 PM
  • The 2 cluster nodes are on identical hardware, both Dell PowerEdge R900 models with (4) Xeon E7340 CPUs and 128G of RAM.  This is interesting though, I'm going to have to look into the NUMA thing closer.

    Dell just recently released a bunch of firmware updates and I've heard that they are to address Server 2012 specific issues.  I went ahead and put them on our cluster nodes so we'll see if that helps at all.  Unfortunately it takes around 10 days for the VMs to start failing so it's time to hurry up and wait.

    Wednesday, December 04, 2013 4:43 PM
  • Have you received an update from Microsoft regarding these VM restarts? We are in the same position and have been trying to resolve the issue for the last two months. We migrated our VMs from Windows 2008 to Windows 2012 hypervisors using the export/import method. After the migration we started seeing these unexpected VM restarts. We are still sending cluster logs to Microsoft and waiting for an update. Please let us know if you have a workaround for this issue.

    Monday, January 13, 2014 6:11 AM
  • Microsoft didn't do anything for us except use up a lot of my time and our premier support hours.  We ended up retiring the Dell R900s and replacing them with Dell R710s with the same MD3200i SAN and it's been working flawlessly ever since.

    Wish I had something better to tell you.

    Monday, January 13, 2014 7:21 PM
  • That specific Bugcheck 3B of RDPUDD is actually fixed by this hotfix:

    http://support.microsoft.com/kb/2846226

    We had two or three random Win 7 VDI VMs crashing each day in our Hyper-V cluster, and applying this fix instantly resolved the crashes.  Hope that helps.  I suspect others on the thread may have different issues, but I'm positive that this specifically resolves the 3B bugcheck.

    Thanks,

    Janssen


    Janssen Jones - Virtual Machine MVP -http://www.janssenjones.com - Please remember to mark answers as answers. :)

    Thursday, February 27, 2014 9:58 PM
  • We had this problem for a while and I had to check all the hosts for patch levels. I first updated all the hosts to have the same patches. Then I went and got all the Failover Clustering and Hyper-V updates that were applicable from these sites:

    Recommended hotfixes and updates for Windows Server 2012-based failover clusters
    Hyper-V: Update List for Windows Server 2012

    There's also a script that you can run that checks your server for missing Failover Cluster and Hyper-V updates here:

    https://github.com/it-praktyk/Get-WindowsHotfixes

    I used it at first, but found that I preferred just reviewing the hotfixes and getting the ones I knew were applicable for our environment.

    These patches are NOT part of normal Windows Updates checks you can do from the OS update checker.  It's a good idea to review these patches and make sure you have everything necessary applied to your hosts.

    Once I downloaded and installed all the patches, I re-verified patch levels across my hosts to make sure I didn't miss anything on any host. We haven't had any VM failures since. One of the patches may have fixed the issue, or simply getting all the hosts updated to the same patch level fixed it. Either way, this seems to have solved our issue.
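
The "same patch level on every host" check described above can be sketched as a simple set comparison. The host names and the assignment of KB numbers to hosts below are placeholder data invented for illustration (the KB numbers themselves are ones mentioned in this thread), and the function name `missing_hotfixes` is hypothetical. How the installed lists are collected from each host is left out.

```python
def missing_hotfixes(installed_by_host):
    """Map each host to the hotfixes that some other host has but it lacks."""
    all_hotfixes = set().union(*installed_by_host.values())
    return {
        host: sorted(all_hotfixes - set(installed))
        for host, installed in installed_by_host.items()
    }


# Placeholder inventory: host names and per-host lists are made up.
inventory = {
    "NODE1": ["KB2855336", "KB2869923"],
    "NODE2": ["KB2855336"],
}

for host, missing in missing_hotfixes(inventory).items():
    print(host, "is missing:", ", ".join(missing) or "nothing")
```

This only shows differences between hosts. It does not say whether a hotfix is applicable to a given environment, which is why reviewing the lists by hand, as described above, is still worthwhile.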


    • Edited by CE-OPierce Friday, February 28, 2014 11:43 PM
    Friday, February 28, 2014 11:41 PM