We have a 4 node cluster, A/A/A/P and each node is giving error 18456 severity 14 state 11.
The error is slightly different depending which node you are on.
If its an active node it gives
Login failed for user 'NT AUTHORITY\ANONYMOUS LOGON', Reason: Token-based server access validation failed with an infrastructure error. [CLIENT: xxx.xxx.xxx.xxx]
If its the passive node you get
Login failed for user 'DOMAIN\SERVER$', Reason: Token-based server access validation failed with an infrastructure error. [CLIENT: <named pipe>]
The errors are happening on all nodes
So node 1 gets login errors from 2,3,4, node 2 from 1,3,4, node 3 from 1,2,4 etc.
I have run the below as suggested in one post but to no joy.
where type_desc LIKE 'WINDOWS%'
fetchcLogins into @login
declare @sql nvarchar(max)
set @sql = 'alter login ' + quotename(@login,'[') + ' with name = ' + quotename(@login,'[')
if ERROR_NUMBER() = 15098
print @login + ' failed validation with the system SID. This user/group account has been recreated. You will need to drop and recreate the login and associated database user accounts'
print @login + ' - ' + error_message() + ' (' + cast(error_number() as varchar(10)) + ')'
fetch cLogins into @login
There are no Windows scheduled tasks which span all servers, no SSIS packages which span the servers, no SQL jobs which span servers, no linked servers.
The only thing I see strange is that the MS Cluster Service is running as local system, but NT SERVICE\ClusSvc has public access only which is all it needs.
Any help would be appreciated.
We can connect to SSMS via Windows or SQL authentication successfully, its only when the nodes try to speak to other nodes.
As a side note, I have run a trace for Audit Logon Failed and its coming back with something trying to run SQLCMD as local system as its trying to run as NT AUTHORITY\ANONYMOUS LOGON. Just trying to find out what as I can see SQLCMD being spawned in task manager, unless its happening to fast for it to keep up.
I recently started this new job only a week ago so cannot comment on any changes.
One thing I have noticed is that as its a 3rd party who provide our hosting, there is a service called WOTS which runs SrvAny.exe and in the folder there is a load of SQL scripts, now I think this might be some custom monitoring the 3rd party has, think I might log a call to see if it calls SQLCMD and if we can disable the service to see if it does stop the errors and then tell them to make it use a domain account which we can give the nessesary rights. But not holding my breath on that one.
If you have SQLCMD running under the LocalSystem account, it means the original process must have local administrator rights and must be using some kind of impersonation technique such as PSexec to impersonate the LocalSystem account and then run SQLCMD. If your instance has the "NT AUTHORITY\SYSTEM" login with sysadmin privileges (which is the installation default), then that process will gain control over your instance. Be careful.
You should work with your hosting provider to solve the server to server authentication problem. It is something related to the domain infrastructure and/or account permission configurations. Also, you should check out when the problem started happening, and then try to correlate with any changes done to your domain infrastructure. Trying to find the cause with no history whatsoever will be more difficult.
SQL Server Database Administrator