Unable to reach/ping Cluster role VIP
I was trying to fix an issue on one of the customer's SQL Server Failover Cluster Instances (FCIs): the customer was unable to ping the FCI virtual IP (VIP) after the role failed over to the second node, while from both cluster nodes the SQL cluster VIP was still reachable/pingable.
What happens if the File Server role fails over to a different node? Is the issue affecting the SQL FCI only?
We then failed the File Server role over to the second node, and the file server IP immediately became unreachable. So the issue affects all Windows failover cluster roles at the customer site.
A senior network engineer started checking the network switches and firewalls and realized that the MAC address associated with the cluster IP addresses was not changing to the MAC address of node VM02 when we failed the role over from VM01 to VM02, which is what we would expect as a result of the failover operation.
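What the network engineer expected to see is the new owner node announcing the cluster IP with its own MAC address. That announcement is a gratuitous ARP (GARP): a broadcast ARP request in which the sender and target IP are both the announced address, so every switch and router on the VLAN can refresh its IP-to-MAC mapping. The sketch below only builds the raw frame bytes to show the layout; it does not send anything (that would need raw sockets and admin rights), and it is an illustration of the protocol, not of what the Windows cluster service does internally. The MAC and IP values are hypothetical.

```python
import ipaddress
import struct

ETH_TYPE_ARP = 0x0806
BROADCAST_MAC = b"\xff" * 6
ZERO_MAC = b"\x00" * 6


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' (or dash-separated) to 6 raw bytes."""
    parts = mac.replace("-", ":").split(":")
    if len(parts) != 6:
        raise ValueError(f"invalid MAC address: {mac!r}")
    return bytes(int(p, 16) for p in parts)


def build_garp(sender_mac: str, ip: str) -> bytes:
    """Build a gratuitous ARP request frame (Ethernet header + ARP, no FCS).

    The frame says "ip is at sender_mac": sender and target protocol
    addresses are both the announced IP, and it is sent to broadcast.
    """
    mac = mac_to_bytes(sender_mac)
    ip_bytes = ipaddress.IPv4Address(ip).packed
    ethernet = struct.pack("!6s6sH", BROADCAST_MAC, mac, ETH_TYPE_ARP)
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,         # hardware type: Ethernet
        0x0800,    # protocol type: IPv4
        6,         # hardware address length
        4,         # protocol address length
        1,         # operation: request
        mac,       # sender hardware address (new owner node)
        ip_bytes,  # sender protocol address (cluster IP)
        ZERO_MAC,  # target hardware address (unused in a GARP)
        ip_bytes,  # target protocol address == sender IP, hence "gratuitous"
    )
    return ethernet + arp


if __name__ == "__main__":
    frame = build_garp("00:50:56:aa:bb:02", "10.0.0.50")  # hypothetical VM02 MAC / VIP
    print(len(frame), frame.hex())
```

If switches keep forwarding to VM01's MAC after failover, it is a sign that no such frame arrived from VM02 or that the network device ignored it.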
Commands used during his troubleshooting:
It turns out there is a registry entry in Windows that enables gratuitous Address Resolution Protocol (GARP) requests to be sent when a failover occurs. By default this entry doesn't exist in Windows Server 2012 R2 or 2016. When I looked at the registry of node VM02, the entry was there but set to 0, which means "don't send GARP". So I set the value to 3 and rebooted the node. Once the node was accessible again, I carried out another failover test and, voila, only a single ping was dropped this time before all 3 cluster IP addresses were reachable again. So, to get this working, the Windows Server registry value "ArpRetryCount" needs to be added, or updated if it already exists, as follows:
Key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
Value: ArpRetryCount (REG_DWORD)
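Values: 0 = do not send GARP (what was found on VM02); 3 = the value set here, which restored connectivity after failover.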
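To apply the same change consistently on every cluster node, it can help to generate it rather than edit the registry by hand. The sketch below produces either a .reg file or the equivalent reg.exe argument list for ArpRetryCount. The helper names are hypothetical, and the 0 to 3 range check is an assumption based on Microsoft's TCP/IP parameter documentation; adjust it if your Windows version documents something different.

```python
TCPIP_PARAMS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
)


def _check(value: int) -> None:
    # Assumption: ArpRetryCount is documented with a valid range of 0-3.
    if not 0 <= value <= 3:
        raise ValueError(f"ArpRetryCount should be 0-3, got {value}")


def arp_retry_count_reg_file(value: int = 3) -> str:
    """Return .reg file text that creates/updates ArpRetryCount as a REG_DWORD."""
    _check(value)
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{TCPIP_PARAMS_KEY}]\n"
        f'"ArpRetryCount"=dword:{value:08x}\n'
    )


def arp_retry_count_reg_command(value: int = 3) -> list[str]:
    """Return the equivalent reg.exe argument list (run elevated on each node)."""
    _check(value)
    return [
        "reg", "add", r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters",
        "/v", "ArpRetryCount", "/t", "REG_DWORD", "/d", str(value), "/f",
    ]
```

For example, write the file with `open("ArpRetryCount.reg", "w", encoding="utf-16", newline="\r\n")` and import it with regedit, or pass the argument list to `subprocess.run` on the node. Either way, the node still needs the reboot described above before the new value takes effect.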
On the network side, make sure garp-reply is enabled:
To enable it on Juniper EX and SRX platforms, use the following command:
The interface can be a physical interface, logical interface, interface group, SVI or IRB.
To enable GARP on Cisco IOS, use the interface command:
ip gratuitous-arps
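Both halves matter: the server has to send the GARP (ArpRetryCount), and the switch or firewall has to be willing to update its ARP table from it. The toy model below illustrates the second half. It is a hypothetical simplification, not how any particular vendor implements ARP handling: `accept_gratuitous` stands in for garp-reply or equivalent GARP handling being enabled, and the VIP and MAC values are made up. With ArpRetryCount set to 0, the `on_gratuitous_arp` call would simply never happen, and the result would be the same stale entry.

```python
from dataclasses import dataclass, field


@dataclass
class ArpCache:
    """Toy model of a switch/router ARP table (hypothetical, not vendor code)."""

    accept_gratuitous: bool  # stands in for garp-reply / GARP handling enabled
    entries: dict[str, str] = field(default_factory=dict)

    def learn(self, ip: str, mac: str) -> None:
        # Normal resolution: the device ARPed for `ip` and got a reply.
        self.entries[ip] = mac

    def on_gratuitous_arp(self, ip: str, mac: str) -> bool:
        """Handle an unsolicited GARP; return True if the entry was updated."""
        if not self.accept_gratuitous:
            # Ignored: traffic keeps going to the old MAC until the entry ages out.
            return False
        changed = self.entries.get(ip) != mac
        self.entries[ip] = mac
        return changed


def simulate_failover(accept_gratuitous: bool) -> str:
    """Return the MAC the device uses for the VIP after a VM01 -> VM02 failover."""
    vip = "10.0.0.50"               # hypothetical cluster VIP
    vm01_mac = "00:50:56:00:00:01"  # hypothetical MAC of VM01
    vm02_mac = "00:50:56:00:00:02"  # hypothetical MAC of VM02
    cache = ArpCache(accept_gratuitous)
    cache.learn(vip, vm01_mac)              # role is online on VM01
    cache.on_gratuitous_arp(vip, vm02_mac)  # role moves, VM02 announces the VIP
    return cache.entries[vip]


if __name__ == "__main__":
    print("GARP ignored :", simulate_failover(False))
    print("GARP accepted:", simulate_failover(True))
```

The "GARP ignored" case reproduces the symptom described above: the VIP keeps pointing at VM01's MAC after the role has moved to VM02.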
Note: This is just for troubleshooting purposes. Normally we disable GARP on the server side. In VMware environments (virtual machines hosted on ESXi), disabling it is mandatory if you have Active-Active or Active-Passive sites, in order to send L2 packets to the core switches.
Additional validation:
https://icookservers.blog/2016/07/19/windows-2012-r2-cluster-wont-send-gratuitous-arp-garp-packets-by-default