I am having an intermitent problem with service broker where the intiator's sys.transmission queue will fill up with outbound messages. I've looked at the transmission queue message status and it is blank on the sender's db. I've profiled the target db and i'm getting a duplicate message error.
When i run select get_transmission_status('93C9FBF6-F951-DE11-9F2F-0022198CADC8') with the conversation handle i get the following message back,
A previously existing connection with the same peer was found after DNS lookup. This connection will be closed. All traffic will be redirected to the previously existing connection. This is an informational message only. No user action is required. State 116.
Now i would assume that it should be restarting the tcp connection between the 2 servers and then sending the ack messages back to the sender without problem but it never does. The only way i've found to fix the problem is to restart the sql server service on the sender's machine which closes and reopens the tcp connection and the messages will continue to process. Does anyone have any suggestions on what I should try to do to prevent this in the future?
I've almost exactly the same problem as you, although one element of the situation is reversed. This may be a side effect of the way we use Service Broker however - we send return messages from the target back to the initiator.
On the initiator, I have nothing in the transmission_status column, and nothing is returned when I run select get_transmission_status.
On the target, both the transmission_status column and get_transmission_status have exactly the same message as yours.
On both target and initiator we have messages stuck in the queue (usually many more on the initiator due to the nature of our application).
I have tried running SQL Profiler using a trace on all Broker events (and the broker-related Security events). Almost all the events are of class Broker:Remote Message Acknowledgement, Subclass 1 - Message with acknowledgement sent. We get very occasional pairs of Broker:Conversation (1 - SEND Message) with Broker:Message Classify (2 - Remote) but that's all. There are no errors visible in the trace - just no complete messages.
SSBDiagnose gives no further information:
D 29835 220.127.116.11 EmisMessageFarm Messages on conversation handle 441997cb-fd3b-de11-8296-005056a31b43 (conversation id 694d7e27-ebd1-4302-990d-51abf4d4950f) from service //e-mis.com/MonitoringService/SUP/Server to service //e-mis.com/MonitoringService/SUP/Client/869 cannot be transmitted dues to the following problem: A previously existing connection with the same peer was found after DNSlookup. This connection will be closed. All traffic will be redirected to the previously existing connection. This is an informational message only. No user action is required. State 116.
Like you we work around the issue by restarting SQL Server on the initiator (we leave the target running when we do that) and the messages start flowing again.
We're continuing to investigate and if I can find anything I'll post here but for the moment I'm as eager as you for suggestions on what else to look for.
There should be a better way of fixing the problem than restarting Sql Server. Could you please provide the contents of Broker:Connection events from the profiler traces you gathered (from both servers)?
The message you mentioned (about DNS lookup) is informational only (just as it says) and shouldn't be a reason for any concern nor for a communication failure.
Thanks for your reply. I agree absolutely that there should be a better way of handling this issue - we're only restarting SQL Server as a workaround until we can work out exactly what the issue is.
We had a recurrence of the issue this morning. I ran a trace on Broker:Connection events on both target and initiator.
There were no Broker:Connection events whatsoever on the initiator while the problem was still outstanding.
On the target (where the messages in sys.transmission_queue had the message in the transmission_status column), there were several groups of three events:
Broker:Connection 1 26142136 3 - Connect Failed 124bf21b-7aaf-4300-af7c-8b63833e8fb3 1 0X01 unknown 20 18.104.22.168 2009-08-18 09:00:03.937 A previously existing connection with the same peer was found after DNS lookup. This connection will be closed. All traffic will be redirected to the previously existing connection. This is an informational message only. No user action is required. State 116.
Broker:Connection 1 26142137 4 - Closing 124bf21b-7aaf-4300-af7c-8b63833e8fb3 1 0X01 unknown 20 22.214.171.124 2009-08-18 09:00:03.937 A previously existing connection with the same peer was found after DNS lookup. This connection will be closed. All traffic will be redirected to the previously existing connection. This is an informational message only. No user action is required. State 116.
Broker:Connection 1 26142138 5 - Closed 124bf21b-7aaf-4300-af7c-8b63833e8fb3 1 0X01 unknown 20 126.96.36.199 2009-08-18 09:00:03.937
There were many repetitions of this pattern in the trace (always with a different GUID, and with 'unknown' as the ObjectName. I may be wrong but I'm interpreting these as repeated attempts to open connections which are unsuccessful so the connection is immediately closed.
As I stopped SQL on the initiator, the trace on the target reported a connection closing:
Broker:Connection 1 26140977 4 - Closing 2186db21-8917-400d-b0c0-c473dc5e851a 1 0X01 tcp://188.8.131.52:33953 18 184.108.40.206 2009-08-18 08:58:54.557 An error occurred while receiving data: '64(failed to retrieve text for this error. Reason: 1815)'.
Broker:Connection 1 26140978 5 - Closed 2186db21-8917-400d-b0c0-c473dc5e851a 1 0X01 tcp://220.127.116.11:33953 18 18.104.22.168 2009-08-18 08:58:54.557
Note that there are no other references to that GUID in the trace - presumably the connection was originally created long before the trace started.
Unfortunately my trace was stopped on the initiator at that time.
Once I restarted SQL Server on the initiator, a connection was accepted on the target. For obvious reasons I was unable to start a trace on the initiator in time to capture connection events on that side:
Broker:Connection 1 26141160 6 - Accept 0bcbdab7-9dba-4238-aa0e-999049c18d05 1 0X01 tcp://22.214.171.124:0 20 126.96.36.199 2009-08-18 08:59:07.277
Hopefully this information is useful - if there is any further information or diagnosis that will help please let me know. We get an instance of this problem every 1-2 days so it shouldn't take too long before I can look again.
The messages you see at the target mean that before the target server tries to connect to the initiator, the initiator has already connected to the target. Therefore, after the target obtains the initiator's IP address, it says "Oh, I already have an open connection to that guy, there's no need to establish a new one". Just after these 3 events appear in the profiler trace, the target should have sent whatever it had to send, using the connection previously established by the initiator. One thing I would check is whether your DNS is working properly, so that it doesn't confuse the target server. The event you see when you stop the initiator server is also pretty obvious and expected.
Since my last message we've been doing some more diagnosis. We've had a couple of further occurrences and we're beginning to see a pattern.
Most of the time (day or night) we have a reasonably steady stream of messages being queued from the initiator to the target (busier during the day, but at least every few seconds during the night).
On an affected system last night there was a gap in messages to send at 00:13:02 of over 15 minutes. We were tracing TCP packets between the initiator and target.
At 00:13:56, a TCP Keepalive is sent from target to initiator. Initiator sends a TCK Keepalive ACK back immediately.
At 00:14:26 and 00:14:55, further keepalives are sent from target to initiator, with the initiator sending acknowledgements back immediately.
At 00:14:56 (300ms later), the initiator sends a TCP reset to the target.
At 00:20:14, the initiator opens a new connection to the target. However, when the next message is enqueued on the target (at 00:31) it remains in the queue.
Incidentally, no DNS is involved in our deployment because we are connecting purely using IP addresses.
I'm afraid it's impossible to investigate this kind of problem remotely, with only very limited information. How about trying the other way around? If you can post the scripts you used to create SSB objects on both instances, then people can review them and hopefully find what is wrong.
We had the exact same findings. We expect sql to start a new tcp connection but it never does and just sits there waiting for one to fire up. If there was a way to kill the connection without a restart or fix the root problem that would be better.
I closed all the existing conversations and then cleanup them. You need to run the below script and wait for a few minutes and try again. This work for me. First I tried with the CLEANUP only, but doesn't work, then I thought this may work. I don't know why.
declare @handleDelete UNIQUEIDENTIFIERWHILE EXISTS (SELECT conversation_handle From sys.transmission_queue s )
SET @handleDelete = (SELECT TOP 1 conversation_handle From sys.transmission_queue s)
END CONVERSATION @handleDelete;
END CONVERSATION @handleDelete WITH CLEANUP;