Issue

Unable to reach/ping Cluster role VIP

 

The VIP of one SQL Server failover cluster instance (FCI) could not be pinged after the role failed over to the second node. From both cluster nodes, however, the SQL cluster VIP could still be reached and pinged.

 

Setup

  • Windows cluster with two nodes VM01 and VM02
  • Two SQL FCIs (2016) are installed
  • Each node has two NICs, one for the LAN and management network, and one for the heartbeat network
  • The cluster has three network resources: a cluster IP address and two SQL instance addresses, which float between the two nodes depending on which one is active.

Steps

  • Check Windows logs - nothing clear or related to the issue
  • Check SQL logs
  • Patch Windows and SQL to the latest updates - still can't ping
  • Disable Symantec EP Firewall - still can't ping
  • Run Windows failover cluster validation - all tests passed

What happens if the File Server role is failed over to a different node? Does the issue affect SQL FCIs only?

 

We then failed over the File Server role to the second node, and the file server IP immediately became unreachable. So the issue affects all Windows failover cluster roles at the customer site.

 

A senior network engineer started checking the network switches and firewalls. He found that the MAC address associated with the cluster IP addresses was not changing to the MAC address of node VM02 when we failed the role over from VM01 to VM02, which is what we would expect as a result of the failover operation.

 

Commands used during his troubleshooting:

  • Show ip arp 10.10.2.x - "SQL Cluster IP"
  • Clear ip arp 10.10.2.x - "SQL Cluster IP"
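
The check behind those commands is a comparison between the MAC address the switch holds for the cluster IP and the MAC address of the node that currently owns the role. A minimal Python sketch of that comparison follows. The function names and the sample MAC addresses are hypothetical; the only assumption about the tools is that Cisco IOS prints MAC addresses in dotted form (aabb.ccdd.eeff) while Windows prints them with hyphens (AA-BB-CC-DD-EE-FF).

```python
import re


def normalize_mac(mac: str) -> str:
    """Reduce a MAC address in any common notation to 12 lowercase hex digits."""
    hex_digits = re.sub(r"[^0-9A-Fa-f]", "", mac)
    if len(hex_digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return hex_digits.lower()


def arp_entry_is_stale(switch_mac: str, active_node_mac: str) -> bool:
    """Return True when the switch still maps the VIP to a different MAC
    than the one used by the node that currently owns the role."""
    return normalize_mac(switch_mac) != normalize_mac(active_node_mac)


if __name__ == "__main__":
    # Hypothetical values: the MAC from the switch ARP table for the VIP,
    # and the MAC of the LAN NIC on the node that now owns the role.
    mac_on_switch = "0050.56aa.0001"
    mac_of_active_node = "00-50-56-AA-00-02"
    print("stale ARP entry:", arp_entry_is_stale(mac_on_switch, mac_of_active_node))
```

If the entry is stale after a failover, the switch never learned the new MAC, which is the symptom described above.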

Resolution

It appears there is a registry entry in Windows that controls whether gratuitous Address Resolution Protocol (GARP) requests are sent out when a failover occurs. By default this entry does not exist in Server 2012 R2 or 2016. When I looked at the registry of node VM02, the entry was there but set to 0, which means "don’t send GARP". I set the value to 3 and rebooted the node. Once the node was accessible again, I carried out another failover test, and voilà: this time there was only a single ping drop before all 3 cluster IP addresses were accessible again.

So to get this working, the Windows Server registry value “ArpRetryCount” needs to be added, or updated if it already exists, as follows:

 

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters > ArpRetryCount (REG_DWORD)

 

Values:

  • 0 : don't send garp
  • 1 : send garp once only
  • 2 : send garp twice
  • 3 : send garp three times (The Default Value)
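
As a hedged illustration, the registry change could be scripted rather than made by hand in regedit. The Python sketch below only builds and prints a `reg add` command line for the key and value named above; the function name is hypothetical, and the accepted range 0 to 3 simply mirrors the values listed here. The printed command would need to be run from an elevated prompt on each node, followed by a reboot as described in the resolution.

```python
TCPIP_PARAMETERS_KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"


def build_reg_add_command(retry_count: int = 3) -> list[str]:
    """Build the argument list for `reg add` that sets ArpRetryCount.

    Only the values listed in this post (0 to 3) are accepted.
    """
    if retry_count not in (0, 1, 2, 3):
        raise ValueError("ArpRetryCount must be 0, 1, 2 or 3")
    return [
        "reg", "add", TCPIP_PARAMETERS_KEY,
        "/v", "ArpRetryCount",   # value name
        "/t", "REG_DWORD",       # value type
        "/d", str(retry_count),  # value data
        "/f",                    # overwrite an existing value without prompting
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(" ".join(build_reg_add_command(3)))
```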

 

On the network side, make sure gratuitous ARP reply is enabled:

To enable it on Juniper EX and SRX platforms, use the following command:

 

set interfaces interface_name/number gratuitous-arp-reply


The interface can be a physical interface, logical interface, interface group, SVI or IRB.

To enable GARP on Cisco IOS, use the following command:

ip gratuitous-arps 

Note: Enabling GARP is for troubleshooting purposes only. We normally disable GARP on the server side. In a VMware environment (virtual machines hosted on ESXi), GARP must be disabled if you have Active-Active or Active-Passive sites, in order to send L2 packets to the core switches.
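
To make concrete what a node announces after a failover, here is a minimal Python sketch of a gratuitous ARP request frame: an ordinary ARP request, sent to the broadcast address, in which the sender IP and the target IP are both the address being announced. The function name and the sample MAC and IP are hypothetical (192.0.2.10 is a documentation address, not one from this cluster), and the sketch only builds the bytes; it does not send anything.

```python
import socket
import struct


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build the 42-byte Ethernet frame of a gratuitous ARP request."""
    sender_mac = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(sender_mac) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    sender_ip = socket.inet_aton(ip)

    # Ethernet header: broadcast destination, our source MAC, EtherType ARP.
    ethernet_header = b"\xff" * 6 + sender_mac + struct.pack("!H", 0x0806)

    # ARP header: Ethernet (1), IPv4 (0x0800), address lengths 6 and 4, request (1).
    arp_header = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)

    # Sender MAC/IP, then target MAC (zeroed) and target IP.
    # Target IP equals sender IP: that is what makes the ARP gratuitous.
    arp_body = sender_mac + sender_ip + b"\x00" * 6 + sender_ip

    return ethernet_header + arp_header + arp_body


if __name__ == "__main__":
    frame = build_gratuitous_arp("00:50:56:aa:00:02", "192.0.2.10")
    print(len(frame), frame.hex())
```

Switches and hosts that accept this frame update their ARP entry for the announced IP to the sender MAC, which is the update that was missing when ArpRetryCount was 0.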

 

References

https://icookservers.blog/2016/07/19/windows-2012-r2-cluster-wont-send-gratuitous-arp-garp-packets-by-default