Exchange 2003 Mailbox Server - 6.6 million X-Link2States per day

  • Question

  • Hello all,

    We are currently running a six-server native Exchange 2003 organisation with two Active-Active mailbox (MB) clusters and two front-end servers. One routing group is configured with all six servers as members. MBCL01 is the routing group master; all other servers are members.

    We recently changed the routing topology by adding an additional connector to route mail out of a remote site, which meant the master server would now be handling more SMTP traffic, along with OWA01 (the front-end server at the site in question). We've noticed a reduction in performance, mainly delays in sending/receiving mail and poor Outlook performance for mailboxes hosted on MBCL01. We identified that the inetinfo service is consuming ~20% of processor resources and carrying out around 2 billion I/O reads per day. The SMTP log files have also grown dramatically and are now around 2 GB a day (whereas they used to be around 200 MB).

    After analysing the SMTP logs we found that the MB server was receiving 6.6 million link state updates (via the X-Link2State verb) from itself every day.  WinRoute shows that all 6 members are unable to connect to the master.
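
    The log analysis described above can be sketched roughly as follows (a minimal, illustrative script - the field layout and verb spelling are assumptions about the W3C-format IIS SMTP logs, and the sample lines are made up; check the `#Fields:` header in your own logs for the real column order):

    ```python
    # Count X-LINK2STATE commands per client IP in an IIS SMTP log,
    # to confirm which peer is generating the flood. Assumes whitespace-
    # delimited W3C fields with c-ip in the third column (an assumption).
    from collections import Counter

    def count_link_state(lines):
        counts = Counter()
        for line in lines:
            if line.startswith("#"):       # skip W3C header/comment lines
                continue
            fields = line.split()
            if "X-LINK2STATE" in fields:
                counts[fields[2]] += 1     # fields[2] assumed to be c-ip
        return counts

    sample = [
        "#Fields: date time c-ip cs-method cs-uri-stem",
        "2011-05-10 09:00:01 10.0.0.5 X-LINK2STATE -",
        "2011-05-10 09:00:01 10.0.0.5 X-LINK2STATE -",
        "2011-05-10 09:00:02 10.0.0.9 MAIL -",
    ]
    print(count_link_state(sample))   # Counter({'10.0.0.5': 2})
    ```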

    Other points worth noting: all servers are on the LAN (i.e. ports 691 and 25 are open on all boxes). I've got a feeling this problem has existed for a while and has only manifested itself since we made the routing change; the security logs on the server have always recorded hundreds of successful authentication attempts per second from the SYSTEM account.

    I've checked most of the points in this article: http://support.microsoft.com/kb/832281 - everything looks ok.

    Any help would be appreciated.

    Thanks, Gareth.


    Tuesday, May 10, 2011 9:21 AM

Answers

  • On Tue, 10 May 2011 09:21:00 +0000, Gaz Jones wrote:
     
    >We are currently running a 6 server native Exchange 2003 organisation with 2 Active-Active MB Clusters and two front end servers. One routing group is configured with all six servers as members. MBCL01 is the routing group master, all other servers are members.
    >
    >We recently changed the routing topology by adding an additional connector to route mail out of a remote site, this meant that the Master server would now be handling more SMTP traffic, along with OWA01 (the FE server at the site in question). We've noticed a reduction in performance, mainly around delays in sending/receiving mail, and outlook performance for mailboxes hosted on MBCL01. We identified that the inetinfo service is consuming ~20% processor resource and carrying out around 2billion IO reads per day. Also the SMTP log file sizes have increased dramatically and are now around 2GB a day (whereas they used to be around 200MB).
    >
    >After analysing the SMTP logs we found that the MB server was receiving 6.6 million link state updates (via the X-Link2State verb) from itself every day. WinRoute shows that all 6 members are unable to connect to the master.
    >
    >Other points worth noting - all servers are on the LAN (i.e. ports 691 and 25 are open on all boxes). I've got a feeling this problem has existed for a while, it's just manifested itself when we made the routing change, the security logs on the server have always logged hundreds of successful authentication attempts per second from the system account.
    >
    >I've checked most of the points in this article: http://support.microsoft.com/kb/832281 - everything looks ok.
     
    The last time I had to deal with a problem like that had to be at
    least seven or eight years ago!
     
    If you can't get the members of the RG to connect to the master, try
    moving the master to another machine. If that doesn't work, stop the
    RESvc services on each server in the RG and then restart them.
     
    Have you changed the FQDN on the SMTP Virtual Server? Is there a
    corresponding A record for the name in your internal DNS? Is there a
    SPN for the name?
     
    If the problem's caused by a stale route (or multiple stale routes)
    then the surest way to remove those routes is to shut down ALL the
    Exchange servers in the organization. Then restart the FE servers,
    then the BE servers. Because the link-state information is kept in
    memory you can't just reboot the machines one at a time. If you do
    that, and the member/master communication starts to work, you'll just
    replicate the stale routes from another machine. ALL the machines have
    to be stopped before you restart any of them.
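     
    The ordering constraint above can be sketched as a simple plan (illustrative Python only - the second FE/BE names below are hypothetical, and the actual stops and starts would of course be done through Cluster Administrator and the Services console, not a script):

    ```python
    # Illustrative sketch of the sequencing described above: every server
    # must be stopped before any restarts, then FE servers come back
    # before BE servers, so no machine can re-seed stale link-state.
    def restart_plan(fe_servers, be_servers):
        plan = [("stop", s) for s in fe_servers + be_servers]  # stop ALL first
        plan += [("start", s) for s in fe_servers]             # FE servers next
        plan += [("start", s) for s in be_servers]             # BE servers last
        return plan

    # OWA01/MBCL01 are from the thread; OWA02/MBCL02 are hypothetical names.
    plan = restart_plan(["OWA01", "OWA02"], ["MBCL01", "MBCL02"])
    for action, server in plan:
        print(action, server)
    ```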
     
    Just be thankful you have only six servers. At that time we had 120
    machines and they were spread over every continent except Antarctica.
     
    ---
    Rich Matheisen
    MCSE+I, Exchange MVP
     

    • Marked as answer by Gaz Jones Monday, May 16, 2011 8:10 AM
    Tuesday, May 10, 2011 9:30 PM

All replies

  • Excellent, thanks for the reply Rich. I'll get some downtime scheduled in ASAP and let you know how I get on.

    Re. the SPN - we have an A record for the address in DNS; I'll have to check the SPNs though. Do you know whether the SPN should be assigned to the computer account of the Exchange Virtual Server (i.e. the clustered name) or the computer account of the cluster node?

    I'd expect to see more Kerberos issues/authentication failures in the logs if it were a missing SPN, but it's worth checking all the same.

    Thanks again, Gareth.

    Wednesday, May 11, 2011 7:20 AM
  • On Wed, 11 May 2011 07:20:34 +0000, Gaz Jones wrote:
     
    >
    >
    >Excellent thanks for the reply Rich, I'll get some downtime scheduled in asap and let you know how I get on.
    >
    >Re. the SPN - we have an A record for the address in DNS, I'll have to check the SPN's though. Do you know whether the SPN should be assigned to the computer account of the Exchange Virtual Server (i.e. the clustered name) or the computer account of the cluster node?
     
    The SPNs are in a multi-valued property of the server. The names
    should reflect whatever the SMTP VS uses to identify itself. Usually
    you'll have two SPNs for each name:
     
    SMTPSVC/<hostname>
    SMTPSVC/<fqdn>
     
    The setspn tool should tell you what SPNs are assigned to the machine
    with "setspn -L <servername>". You can add SPNs with "setspn -A <spn>
    <servername>".
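     
    As a quick sanity check, the pair of names described above can be derived from whatever FQDN the SMTP virtual server advertises (illustrative Python only - the real registration is done with setspn as shown, and the FQDN below is a hypothetical example):

    ```python
    # Build the two expected SPNs (SMTPSVC/<hostname> and SMTPSVC/<fqdn>)
    # from the name the SMTP virtual server uses to identify itself, to
    # compare against the output of "setspn -L <servername>".
    def expected_smtp_spns(fqdn):
        hostname = fqdn.split(".", 1)[0]   # short host name before first dot
        return [f"SMTPSVC/{hostname}", f"SMTPSVC/{fqdn}"]

    print(expected_smtp_spns("mbcl01.example.local"))
    # ['SMTPSVC/mbcl01', 'SMTPSVC/mbcl01.example.local']
    ```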
     
    >I'd expect to see more kerberos issues/authentication failures in the logs if it was a missing SPN but it's worth checking all the same.
     
    You'd only have a problem with the machines that use Kerberos. E2K3
    should offer NTLM and LOGIN in addition to GSSAPI, but Exchange 2010
    wants to use GSSAPI and you need Kerberos for that.
     
    ---
    Rich Matheisen
    MCSE+I, Exchange MVP
     

    Thursday, May 12, 2011 2:16 AM
  • Hi Rich - thanks for your help with this. Stopping the routing service on all servers and starting it back up one by one resolved the issue. We also found that both Active-Active cluster EVSs were running on the same node, which probably didn't help.

    Cheers, Gareth.

    Monday, May 16, 2011 8:13 AM