RPC errors on VM hosts on Cluster

  • Question

  • I have a Node and Disk Majority Server 2008 R2 cluster, and I'm getting RPC errors all over the place.  If one of the host servers tries to access a VM that is on it, it fails with an RPC error.  If I live migrate the VM to the other host server, I can then access the files from the previous host server.  Live migration itself also sometimes fails...yet if I go to the other server, I can then live migrate it.  It's really strange, and I think it is causing some issues during the backup: if all the VMs aren't on the same host server during the backup, it kills those VMs.

    The error I get is Event ID 10009, Source DCOM:

    DCOM was unable to communicate with the computer queenhost.queencity.local using any of the configured protocols.
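    A quick first check for this class of error is whether the RPC endpoint mapper (TCP port 135) on the host named in the event is reachable at all, since DCOM runs over RPC.  A minimal sketch in Python, with the hostname taken from the event text above:

        # Test whether the RPC endpoint mapper on the named host answers.
        # If this connection fails, DCOM cannot connect either.
        import socket

        HOST = "queenhost.queencity.local"  # hostname from the 10009 event
        PORT = 135                          # RPC endpoint mapper

        try:
            with socket.create_connection((HOST, PORT), timeout=5):
                print(f"TCP {PORT} on {HOST} is reachable")
        except OSError as exc:
            print(f"Cannot reach {HOST}:{PORT} -> {exc}")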

    any help would be appreciated!

    Thanks.

    Friday, January 28, 2011 3:41 AM

Answers

  • Hi Ramzi,

    Were you able to run a full cluster validation report against all nodes, selecting all tests?  If so, what were the results?  Were there any Errors or Warnings?  Running validation against a live cluster is a good way to identify and troubleshoot problems with it.  Another suggestion would be to look at the recent cluster events in the FCMgr snap-in.

    Also, I would recommend looking in the event logs on all nodes for other Errors or Warnings and investigating those.  From what you described, the problem sounds more like a misconfiguration between the nodes than a problem with the cluster itself.
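    If you would rather script the validation than run it from the snap-in, here is a minimal sketch that drives the FailoverClusters PowerShell module from Python (the node names are placeholders for your own):

        # Run a full cluster validation (all tests) by shelling out to the
        # FailoverClusters module that ships with Server 2008 R2.
        import subprocess

        nodes = "host1,host2"  # hypothetical node names; substitute your own
        cmd = ("Import-Module FailoverClusters; "
               f"Test-Cluster -Node {nodes}")  # all tests run when none are excluded
        result = subprocess.run(
            ["powershell.exe", "-NoProfile", "-Command", cmd],
            capture_output=True, text=True,
        )
        print(result.stdout)  # Test-Cluster prints the path of the saved report
        print(result.stderr)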

    ________________________________________________________________________________________________________

    Best Regards, Mike Briggs [MSFT] – Posting is provided "AS IS" with no warranties, and confers no rights

    • Marked as answer by mlbriggs Saturday, February 26, 2011 5:51 AM
    Wednesday, February 2, 2011 3:52 PM

All replies

  • Make sure that the DCOM settings on all nodes are consistent, and that the Everyone group has Launch and Activation permissions.  If you still see the problem, drop VMM out of the picture and try the live migration using the FCMgr snap-in.  Does the same problem still occur?  Then run a full cluster validation report against all nodes, selecting all tests, and investigate any warnings or failures.  These could be the source of the problem.
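    One way to check that consistency mechanically is to compare the machine-wide DCOM values stored under HKLM\SOFTWARE\Microsoft\Ole on both nodes.  A minimal sketch using Python's winreg over the remote registry (the host names are placeholders, and the Remote Registry service must be running on both targets):

        # Compare machine-wide DCOM settings across two nodes. The two
        # *Restriction values are binary security descriptors encoding the
        # machine-wide Launch/Activation and Access ACLs, so a byte-for-byte
        # match is a reasonable consistency check.
        import winreg

        HOSTS = (r"\\host1", r"\\host2")  # hypothetical node names
        VALUES = ("EnableDCOM", "MachineLaunchRestriction",
                  "MachineAccessRestriction")

        def read_ole(host):
            hive = winreg.ConnectRegistry(host, winreg.HKEY_LOCAL_MACHINE)
            settings = {}
            with winreg.OpenKey(hive, r"SOFTWARE\Microsoft\Ole") as key:
                for name in VALUES:
                    try:
                        settings[name] = winreg.QueryValueEx(key, name)[0]
                    except FileNotFoundError:  # value not set on this node
                        settings[name] = None
            return settings

        a, b = (read_ole(h) for h in HOSTS)
        for name in VALUES:
            print(f"{name}: {'MATCH' if a[name] == b[name] else 'DIFFERS'}")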

    ________________________________________________________________________________________________________

    Best Regards, Mike Briggs [MSFT] – Posting is provided "AS IS" with no warranties, and confers no rights

    Friday, January 28, 2011 4:47 PM
  • I actually do use the FCMgr snap-in for this.  DCOM settings are the same on both hosts: Everyone has both local and remote launch/activation permissions.  It still will not let me browse the host from a VM on that host, or vice versa.  I can browse the other host fine from the VM, and the other host can browse the VM if it is on the other host.

    I also have inconsistent Backup Exec 2010 R2 performance.  I have to keep all VMs on one node, or it will briefly disconnect the SAN connection of the other host.  For example, I have the tape drive on host2.  As long as I move all VMs to host2, I can back up without a crash.  But if I leave even one VM on host1, that VM will crash.  The same thing happens if I have the VMs spread across both hosts and I back up a VM on host1 from host2's Backup Exec: host2's VMs will crash.  It's weird...the host with the VM being backed up is fine; the other host's VMs crash (they report losing connection with the cluster storage).  Symantec says it has to do with the RPC errors, but I'm not sure.

    Additionally, the speed varies greatly.  Sometimes it runs at 3,500 MB/sec, other times less than 1,000.  I know you can choose the network you back up on...do you know the best one for that?
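    One way to see whether those crashes actually line up with the RPC failures is to pull both event IDs from the System log on each host and compare their timestamps.  A minimal sketch using the built-in wevtutil tool (5120 is the CSV "lost connection to cluster storage" event, 10009 the DCOM error above):

        # Pull the CSV-loss events (5120) and DCOM/RPC events (10009) from
        # the local System log, newest first, to correlate backup-time
        # crashes with the RPC failures.
        import subprocess

        QUERY = "*[System[(EventID=5120 or EventID=10009)]]"
        out = subprocess.run(
            ["wevtutil", "qe", "System", f"/q:{QUERY}",
             "/f:text", "/c:20", "/rd:true"],
            capture_output=True, text=True,
        )
        print(out.stdout or out.stderr)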

    I've read that destroying the cluster and re-creating it can clear up some errors, but I wasn't sure how safe that was for the data.  Also, just so you know, I do have the SAN network alone on a separate switch; each host has two NICs with MPIO enabled hitting the IP of the drive array.  I have two cables directly between the two hosts, one for LM/CSV and one for the heartbeat.  All gigabit.
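    On the question of which network the backup traffic uses: on 2008 R2 the cluster routes CSV (and redirected backup) I/O over the cluster-enabled network with the lowest metric, so it is worth confirming how the cluster has classified each of those networks.  A minimal sketch, again shelling out to the FailoverClusters module from Python:

        # List the cluster networks with their roles and metrics. On 2008 R2
        # the lowest-metric cluster-enabled network carries CSV/redirected I/O.
        import subprocess

        cmd = ("Import-Module FailoverClusters; "
               "Get-ClusterNetwork | "
               "Format-Table Name, Role, Metric, Address -AutoSize")
        out = subprocess.run(
            ["powershell.exe", "-NoProfile", "-Command", cmd],
            capture_output=True, text=True,
        )
        print(out.stdout or out.stderr)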

    Thanks,

    Ramzi

    Saturday, January 29, 2011 4:09 AM