Exchange 2010 slow smart host connector delivery

    Question

  • Hi,

    We have a network security appliance, which can be used for SMTP AV scanning external mail, inbound & outbound.

    Some of our customers also want internal mail scanned.

    For those with Exchange 2007 and 2010, our solution is a custom transport agent which routes internal mail out to our appliance's smtp server. The appliance scans and passes the mail on, back into the organization.


    Problem
    One of our customers reports delays in internal mail. We need help to find out why.

    Details

    1. Their Exchange configuration appears to be in pretty good shape. All mail flow is fine. There's one Hub Transport (HT) server in the organization, running on a VMware VM with 4 x 2 GHz CPUs and 16 GB RAM, usually ~1 GB free. BPA Health Check reports two critical issues which are probably benign: "empty server container" and "unrecognized Exchange signature".

    2. Customer installs our transport agent. All external and internal mail flow is still great, initially.

    3. After an hour or two, or even less if it's a busy time in the organization, message latency for internal mail increases to several minutes. Exchange Performance Monitor shows a 2 or 3 minute latency before the relevant delivery queue is flushed. Exchange is still "managing", but at certain times of the day even a 2 or 3 minute latency is hurting the customer.

    4. By 24-36 hours the latency is sometimes massive - we have seen well over 10 hours on some messages. Delays are only occurring on internal mail.

    5. We've examined the logs of our transport agent, Exchange pipeline tracing, Exchange protocol logging (i.e. on send/receive connectors), Exchange message tracking logs, Windows event viewer logs, Exchange Performance Monitoring, and the Exchange Message Queue viewer. None of these show any errors. They show the transport agent receives messages from earlier stages in the transport pipeline very soon after submission. The transport agent finishes processing messages very quickly.

    There then seems to be an unexplained delay in Exchange commencing delivery of re-routed messages to our appliance.

    Once Exchange 'decides to do so', it connects fine to our appliance, sends X number of messages to our appliance, which scans and returns the message quickly. Exchange then has no issues completing delivery to recipient mailboxes.

    6. Disabling Forefront does not affect the problem. The only factor I have determined so far is the active presence of our transport agent.


    Here are some simple diagrams of problem/non-problem scenarios:

    Diagrams

    A. Internal Mail Delayed
    Here is a diagram of mail flow when a user sends internal mail. It shows unexplained delays in the smart host queue from Exchange to the appliance.

     1.[ Internal Mail ]
      |
      |
      V
     
     2.[ Hub Transport ]
     
      2.1. Transport Agent
       --> SetRoutingOverride(smtp.smart.host)
       --> Windows Event Log

      2.2. Exchange adds re-routed mail to smtp.smart.host delivery queue.

      2.3. We watch the smtp.smart.host queue in Exchange Queue Viewer
       --> INCREASING DELAYS OBSERVED HERE
       --> Messages remain in "Active" state with no Last Error
     
      2.4. Exchange connects to smart host
       --> sends some of the outstanding messages from smtp.smart.host queue
           using our custom send-connector for the smtp.smart.host routing domain.
       --> logs to Exchange Protocol Log
      |
      |
      V

     3.[ Smart Host ]
      Scans & routes mail.
      Pass through setting sends mail back to Hub Transport.
      Smart host logs show this process is quick.
      Exchange Hub Transport quickly receives and routes mail to original recipient.


    B. External Mail (outbound OK)
    Here is a diagram of mail flow when a user sends mail outside the organization. There's no problem.

     1.[ External Mail, outbound ]
      |
      |
      V

     2.[ Hub Transport ]
     
      2.1. Transport Agent
       --> Allows email to pass
       --> Windows Event Log

      2.2. Exchange connects to smart host
       --> Uses default send-connector for External Mail.
       --> logs to Exchange Protocol Log
      |
      |
      V

     3.[ Smart Host ]
      Scans & routes mail --> Internet
      Smart host logs show this process is quick.

    C. External Mail (inbound OK)
    Here is a diagram of mail flow when a user receives a mail from outside the organization. There's no problem.

     1.[ External Mail, inbound ]
      |
      |
      V
     
     2.[ Smart Host ]
      Scans & routes mail into the organization.
      |
      |
      V

     3.[ Hub Transport ]
     
      3.1. Receives mail on receive connector from the appliance

      3.2. Transport Agent
       --> Allows email to pass
       --> Windows Event Log

      3.3. Exchange delivers mail to internal user as resolved. -- User

     

    Attempted diagnosis/solutions

    We have examined Exchange's protocol logs to see if there are any smtp errors or problems connecting Exchange to our appliance. No issues. No error codes.

    We have tried generating load in our test environment. Exchange LoadGen is hard to manage - it often crashes or fails to initialise tests.

    I have written a simple powershell script to use the Send-MailMessage cmdlet to generate load. This certainly generates load, but Exchange handles it fine for a while, and then starts reporting errors/delays, as you would normally expect. These errors and delay notifications etc. do show up in our logs (Exchange + appliance). When I have generated high load myself, Exchange's Performance Monitor shows very high RPC latency.
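
    For anyone wanting to reproduce this kind of load, here is a minimal sketch of the same idea. It is written in Python with smtplib rather than the PowerShell Send-MailMessage script described above, and the host name, addresses, message count, and batch size are all placeholders, not values from the customer's environment.

```python
import smtplib
from email.message import EmailMessage

def build_messages(count, sender, recipient):
    """Build `count` small test messages with distinct subjects."""
    messages = []
    for i in range(count):
        msg = EmailMessage()
        msg["From"] = sender
        msg["To"] = recipient
        msg["Subject"] = f"load test {i + 1} of {count}"
        msg.set_content("Synthetic internal mail used only for load testing.")
        messages.append(msg)
    return messages

def send_all(host, port, messages, per_connection=20):
    """Send messages in batches, opening one SMTP connection per batch.

    per_connection is an arbitrary client-side batch size for this sketch.
    """
    sent = 0
    for start in range(0, len(messages), per_connection):
        batch = messages[start:start + per_connection]
        with smtplib.SMTP(host, port, timeout=30) as smtp:
            for msg in batch:
                smtp.send_message(msg)
                sent += 1
    return sent

# Placeholder addresses; nothing is sent until send_all() is called.
test_messages = build_messages(50, "loadtest@example.test", "user1@example.test")
print(len(test_messages), "messages built")
# send_all("hub-transport.example.test", 25, test_messages)
```

    The send call is left commented out so the sketch can be read and run without touching a mail server; point it at a test Hub Transport only.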

    In the customer's case, when the problem is occurring, RPC latency is fine. The queue to the appliance is the only alarming stat. It creeps up over 20, 40, even up to 100. When Exchange sends messages it chips away at the queue according to the SmtpMaxMessagesPerConnection setting but there are often some messages which don't get through - they just stay "Active" with no explanation of why. We haven't ever seen the queue go over 250, so Exchange's Mail Flow troubleshooter thinks there's no problem.

    Exchange's various settings for controlling rates / throttling etc look fine. I tried decreasing and increasing the smart host send connector's SmtpMaxMessagesPerConnection setting from its default of 20 down to 5 and up to 100. This setting took effect but didn't affect the latency problem.

    I also thought maybe the transport server is "worried" that there are so many messages to one domain (smtp.smart.host) so increased the HT's MaxPerDomainOutboundConnections setting to 1000 - no difference.

    To give you an idea of the number of messages/users, in one 30 hour period there were approximately 1250 internal mails sent, just over 1000 got processed and delivered, most of which took 5 hours or more. In the meantime about 4000 external mails passed successfully in or out (practically no latency). The customer is a school. All groups (admin/teachers/students) experience the same problem.
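
    To put those figures in perspective, here is a rough back-of-envelope calculation in Python. The counts are the rounded numbers quoted above, and the 20 messages per connection is the default SmtpMaxMessagesPerConnection mentioned earlier; treating each connection as carrying a full batch is a simplifying assumption, not a description of how Exchange actually schedules delivery.

```python
import math

HOURS = 30                     # observation window from the post
internal_sent = 1250           # "approximately 1250"
internal_delivered = 1000      # "just over 1000", rounded down
external_passed = 4000         # "about 4000"
messages_per_connection = 20   # default SmtpMaxMessagesPerConnection

internal_rate = internal_sent / HOURS          # internal messages per hour
external_rate = external_passed / HOURS        # external messages per hour
undelivered = internal_sent - internal_delivered
delivered_fraction = internal_delivered / internal_sent

# Full-batch connections per hour needed just to keep up with internal mail.
connections_per_hour = math.ceil(internal_rate / messages_per_connection)

print(f"internal: {internal_rate:.1f} msg/h, external: {external_rate:.1f} msg/h")
print(f"undelivered internal: {undelivered} ({1 - delivered_fraction:.0%})")
print(f"connections/hour needed at {messages_per_connection} per connection: "
      f"{connections_per_hour}")
```

    On these figures the internal stream is roughly a third of the external volume, and about three full connections per hour would keep up with it, so raw message volume alone does not obviously explain the backlog.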

    I can't see any evidence Exchange is using the Retry mechanism. I have tried manually retrying the queue and adjusting the transient failure settings to make Exchange attempt more often. However, neither the Exchange nor the appliance logs show any TCP connection or SMTP failures anyway.

    Cycling MSExchangeTransport resets the problem (it all just starts again).

    More questions

    So, in the customer's scenario what else can I try to find out why Exchange is taking ages to decide to flush its queue to the smart host? Why does it wait 2 or 3 minutes to flush the queue and why doesn't it flush it fully? Where can I find out why messages are just staying active? Why does the condition get worse? Are there more logs I can activate to see what Exchange is doing?
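
    One way to put numbers on how long each message waited before Exchange sent it to the smart host is to pair events in the message tracking logs already being collected. The sketch below is Python; it reads the column names from the log's #Fields header and measures the gap between the first RECEIVE and the first SEND event for each message-id. The sample rows are invented and trimmed to four columns (real logs carry many more fields), and the sketch assumes the timestamp format shown in the sample.

```python
import csv
from datetime import datetime

def parse_tracking_log(lines):
    """Yield one dict per event row, using the '#Fields:' header for column names."""
    fields = None
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("#Fields:"):
            fields = [f.strip() for f in line[len("#Fields:"):].split(",")]
            continue
        if not line or line.startswith("#") or fields is None:
            continue  # skip blanks, other comment lines, and rows before the header
        row = next(csv.reader([line]))
        yield dict(zip(fields, row))

def parse_time(stamp):
    # Assumed format, e.g. 2011-02-21T07:16:05.000Z
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")

def queue_delays(events):
    """Seconds between the first RECEIVE and the first SEND, per message-id."""
    first_receive, delays = {}, {}
    for ev in events:
        mid, kind = ev["message-id"], ev["event-id"]
        when = parse_time(ev["date-time"])
        if kind == "RECEIVE" and mid not in first_receive:
            first_receive[mid] = when
        elif kind == "SEND" and mid in first_receive and mid not in delays:
            delays[mid] = (when - first_receive[mid]).total_seconds()
    return delays

# Invented sample rows, for illustration only.
sample = [
    "#Fields: date-time,event-id,source,message-id",
    "2011-02-21T07:16:05.000Z,RECEIVE,STOREDRIVER,<a@example.test>",
    "2011-02-21T07:16:06.000Z,RECEIVE,STOREDRIVER,<b@example.test>",
    "2011-02-21T07:16:08.000Z,SEND,SMTP,<b@example.test>",
    "2011-02-21T07:19:05.000Z,SEND,SMTP,<a@example.test>",
    "2011-02-21T07:19:07.000Z,RECEIVE,SMTP,<a@example.test>",
]

delays = queue_delays(parse_tracking_log(sample))
for mid, seconds in sorted(delays.items(), key=lambda kv: -kv[1]):
    print(f"{mid}: {seconds:.0f} s between RECEIVE and SEND")
```

    Sorting by delay and comparing the slow messages against the fast ones (time of day, connector, recipient count) may show whether the wait grows steadily or hits particular messages. This only works if the appliance leaves the Message-ID header untouched.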

     

    Thanks for reading! If you need any more details on anything let me know.

    • Edited by Trent Davis Friday, February 25, 2011 3:19 AM BPA Health Check info added to Details point 1.
    Monday, February 21, 2011 7:16 AM

All replies

  • Hi,

    Might be of no use whatsoever, however would suggest that you look at how the send connectors are configured and validate that both are configured to use the smart host in exactly the same manner - eg make sure that they are using the same IP or hostname.

     

    This may be relevant for you:

    http://msmvps.com/blogs/acefekay/archive/2010/10/11/edns0-extension-mechanisms-for-dns.aspx

     

    Also make sure that any firewall appliances are configured correctly - eg if using a Cisco ASA/PIX disable ESMTP inspection as it is known to cause issues.

     

    Cheers, Chris.

     

    Sunday, February 27, 2011 9:55 AM
  • Have you considered using Linked Connectors instead of your Transport Agent?

    http://technet.microsoft.com/en-us/library/bb201724.aspx

     


    MCP, MCSE 2000 , MCSA 2000 ,MCSA 2003 , MCITP , MCTS , MCT
    Sunday, February 27, 2011 3:02 PM
  • Hi Mohamed,

    Thanks for replying.

    I did look at the possibility of using Linked Connectors, but dismissed it when I saw the scenario mentioned in the link didn't match our case, where messages from the Internet are actually received by Exchange from the smart host. (So there would be a clash between Receive Connector A and B). The scenario doesn't include internal mail scanning either.

    Looking at it again I can see another approach: Change ReceiveConnectorA to receive messages from Exchange instead of the Internet. This should work because messages from the Internet already come from the Smart Host. I'll try that.

    Derek

     

    Monday, February 28, 2011 12:06 AM
  • Hi Chris,

    Thanks for the reply.

    The "smart host" send connector uses the smart host's domain name, and the "external mail" send connector uses the smart host's IP. So there's a difference there. That's one thing to try - setting both to IP in case there's an extended DNS related load/performance problem?

    Cisco is not in this picture. I'll check whether our smart host handles extended DNS OK.

    Derek

     

    Monday, February 28, 2011 12:28 AM
  • I've shelved the Linked Connectors approach (changing Receive Connector A to receive messages from Exchange) because I can't get Receive Connector A to receive internal mail from Exchange.
    Monday, February 28, 2011 9:19 AM