Event ID 2153 is logged on database availability group member

    Question

  • Hi all,

    I found a similar topic in the forum, but it doesn't answer my question.
    My environment: 2 Exchange Server 2013 CU19 members (CAS + Mailbox roles on both, Windows Server 2008 R2) in 1 DAG. server2 is newer and has better hardware than server1; each server has 128 GB RAM, and both perform well.
    5 mailbox databases (1 to 5): 4 are active on server2, 1 is active on server1.

    server1 frequently logs the error below:

    Log Name:      Application
    Source:        MSExchangeRepl
    Date:          4/24/2018 9:22:28 AM
    Event ID:      2153
    Task Category: Service
    Level:         Error
    Keywords:      Classic
    User:          N/A
    Computer:      server1.mydomain.com
    Description:
    The log copier was unable to communicate with server 'server2.mydomain.com'. The copy of database 'Mailbox Database 02\server1' is in a disconnected state. The communication error was: An error occurred while communicating with server 'server2'. Error: Unable to write data to the transport connection: An established connection was aborted by the software in your host machine. The copier will automatically retry after a short delay.
    Event Xml:
    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="MSExchangeRepl" />
        <EventID Qualifiers="49156">2153</EventID>
        <Level>2</Level>
        <Task>1</Task>
        <Keywords>0x80000000000000</Keywords>
        <TimeCreated SystemTime="2018-04-24T02:22:28.000000000Z" />
        <EventRecordID>15439691</EventRecordID>
        <Channel>Application</Channel>
        <Computer>server1.mydomain.com</Computer>
        <Security />
      </System>
      <EventData>
        <Data>Mailbox Database 02\server1</Data>
        <Data>server2.mydomain.com</Data>
        <Data>An error occurred while communicating with server 'server2'. Error: Unable to write data to the transport connection: An established connection was aborted by the software in your host machine.</Data>
      </EventData>
    </Event>

    When this error appears, the database copy status in the ECP shows:

    Servers 
    
    server2
    server1
    
    Database copies 
    
    Mailbox Database db02\server2
    Active Mounted
    Copy queue length:  0
    Content index state:  Healthy 
    
    View details 
    
    Mailbox Database db02\server1
    Passive Disconnected and Healthy
    Copy queue length:  5
    Content index state:  Healthy 

    It lasts only a very short time and then everything goes back to normal, but it happens very frequently, and it happens with all 4 databases that are active on server2. There are no errors in "Recent Cluster Events".
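
    Because the disconnected state is so brief, it is easy to miss in the ECP. A minimal sketch of how it could be caught from the Exchange Management Shell by polling Get-MailboxDatabaseCopyStatus follows. The 5-second interval and 5-minute duration are arbitrary assumptions, and server1 is the server name used in this thread.

    ```powershell
    # Poll the database copies hosted on server1 every 5 seconds for about 5 minutes.
    # Print only copies that are neither Healthy (passive) nor Mounted (active),
    # or that have logs waiting to be copied.
    1..60 | ForEach-Object {
        Get-MailboxDatabaseCopyStatus -Server server1 |
            Where-Object {
                ('Healthy', 'Mounted' -notcontains $_.Status) -or ($_.CopyQueueLength -gt 0)
            } |
            Select-Object @{ n = 'Time'; e = { Get-Date } },
                          Name, Status, CopyQueueLength, ReplayQueueLength
        Start-Sleep -Seconds 5
    }
    ```

    Rows appear only while a copy is unhealthy or behind, so an empty result for the whole run means no disconnect fell on one of the polling instants.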

    Failover worked fine the last time I did a switchover: I recently upgraded both servers from CU7 to CU19, and when I put one server into maintenance mode and mounted all databases on the remaining server, everything worked (I could send and receive mail while one server was down).

    When both servers were still on CU7, the same errors occurred but less frequently, about once every few weeks; now they occur every few minutes.

    Should I worry about it? Please give me some advice. Thank you very much.


    • Edited by Jack Chuong Tuesday, April 24, 2018 2:48 AM
    Tuesday, April 24, 2018 2:42 AM

All replies

  • All I can suggest is to examine everything in the network between the two servers.

    Ed Crowley MVP "There are seldom good technological solutions to behavioral problems."
    Celebrating 20 years of providing Exchange peer support!

    Tuesday, April 24, 2018 5:50 PM
    Moderator
  • Agree with Ed.

    In addition, run the Test-ReplicationHealth cmdlet to check all aspects of replication and replay.

    Regards,

    Manu Meng



    Wednesday, April 25, 2018 10:07 AM
    Moderator
  • Thank you for your replies,

    Here is some information:

    Get-DatabaseAvailabilityGroupNetwork | fl
    RunspaceId         : 45e96b14-c77e-44a9-91c1-061716f53a7f
    Name               : MapiDagNetwork
    Description        :
    Subnets            : {{192.168.2.0/24,Up}}
    Interfaces         : {{server1,Up,192.168.2.6}, {server2,Up,192.168.2.8}}
    MapiAccessEnabled  : True
    ReplicationEnabled : False
    IgnoreNetwork      : False
    Identity           : DAG01\MapiDagNetwork
    IsValid            : True
    ObjectState        : New
    
    RunspaceId         : 45e96b14-c77e-44a9-91c1-061716f53a7f
    Name               : ReplicationDagNetwork01
    Description        :
    Subnets            : {{10.10.10.0/24,Up}}
    Interfaces         : {{server1,Up,10.10.10.6}, {server2,Up,10.10.10.8}}
    MapiAccessEnabled  : False
    ReplicationEnabled : True
    IgnoreNetwork      : False
    Identity           : DAG01\ReplicationDagNetwork01
    IsValid            : True
    ObjectState        : New
    
    RunspaceId         : 45e96b14-c77e-44a9-91c1-061716f53a7f
    Name               : ReplicationDagNetwork02
    Description        :
    Subnets            : {{fe80::/64,Misconfigured}}
    Interfaces         : {}
    MapiAccessEnabled  : False
    ReplicationEnabled : False
    IgnoreNetwork      : True
    Identity           : DAG01\ReplicationDagNetwork02
    IsValid            : True
    ObjectState        : New

    ReplicationDagNetwork01: DAG01 uses subnet 10.10.10.0/24 for replication between the 2 servers.
    ReplicationDagNetwork02: ignored, but I cannot delete it because the DAG recreates it automatically, so I just leave it there.

    Test-ReplicationHealth results are good on both servers:

    Server          Check                      Result     Error
    ------          -----                      ------     -----
    server2       ClusterService             Passed
    server2       ReplayService              Passed
    server2       ActiveManager              Passed
    server2       TasksRpcListener           Passed
    server2       TcpListener                Passed
    server2       ServerLocatorService       Passed
    server2       DagMembersUp               Passed
    server2       MonitoringService          Passed
    server2       ClusterNetwork             Passed
    server2       QuorumGroup                Passed
    server2       FileShareQuorum            Passed
    server2       DatabaseRedundancy         Passed
    server2       DatabaseAvailability       Passed
    server2       DBCopySuspended            Passed
    server2       DBCopyFailed               Passed
    server2       DBInitializing             Passed
    server2       DBDisconnected             Passed
    server2       DBLogCopyKeepingUp         Passed
    server2       DBLogReplayKeepingUp       Passed
    
    Server          Check                      Result     Error
    ------          -----                      ------     -----
    server1       ClusterService             Passed
    server1       ReplayService              Passed
    server1       ActiveManager              Passed
    server1       TasksRpcListener           Passed
    server1       TcpListener                Passed
    server1       ServerLocatorService       Passed
    server1       DagMembersUp               Passed
    server1       MonitoringService          Passed
    server1       ClusterNetwork             Passed
    server1       QuorumGroup                Passed
    server1       FileShareQuorum            Passed
    server1       DatabaseRedundancy         Passed
    server1       DatabaseAvailability       Passed
    server1       DBCopySuspended            Passed
    server1       DBCopyFailed               Passed
    server1       DBInitializing             Passed
    server1       DBDisconnected             Passed
    server1       DBLogCopyKeepingUp         Passed
    server1       DBLogReplayKeepingUp       Passed

    I also tried the command "Get-DatabaseAvailabilityGroup | Select -ExpandProperty:Servers | Test-ReplicationHealth | Where {$_.Result.Value -ne "Passed"} | Format-List" on both servers. It returns nothing (as expected). However, when event ID 2153 occurs, for a very short time as I said, I get this result on both servers:
    Server          Check                      Result     Error
    ------          -----                      ------     -----
    server1       DBDisconnected             *FAILED*   Continuous Replication for database 'Mailbox Database 02..
    Can you guide me on how to "examine everything in the network between the two servers"? What should I do?
    Thursday, April 26, 2018 2:28 AM
  • This error occurs only on server1, never on server2. It occurs frequently (every few minutes) and only during working hours, in other words when mail traffic is heavy.



    Wednesday, May 09, 2018 2:07 AM
  • That sounds like the network is saturated, or close to saturated, during your peak hours. You may need to look in Perfmon at the performance counters for the NICs and see whether one of them is dropping packets.
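
    The same NIC counters can also be sampled from PowerShell with Get-Counter instead of the Perfmon GUI. This is only a sketch: the 5-second interval and 12 samples are arbitrary assumptions, the server names are the ones used in this thread, and remote collection assumes the account has permission to read performance counters on both servers.

    ```powershell
    # Throughput, queue length, and discard/error counters for every NIC.
    $counters = '\Network Interface(*)\Bytes Total/sec',
                '\Network Interface(*)\Output Queue Length',
                '\Network Interface(*)\Packets Received Discarded',
                '\Network Interface(*)\Packets Outbound Discarded',
                '\Network Interface(*)\Packets Received Errors',
                '\Network Interface(*)\Packets Outbound Errors'

    # Take 12 samples, 5 seconds apart, from both DAG members and flatten them
    # into one row per counter instance per sample.
    Get-Counter -ComputerName server1, server2 -Counter $counters -SampleInterval 5 -MaxSamples 12 |
        ForEach-Object { $_.CounterSamples } |
        Select-Object Timestamp, Path, CookedValue
    ```

    Discard and error counts that keep rising between samples during peak hours, on the interface carrying the 10.10.10.0/24 replication network, would support the saturation theory.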
    Monday, May 21, 2018 8:47 PM
  • The problem is fixed, I think.

    The root cause was very poor disk I/O on server1. I noticed it after looking at the disk performance counters in Perfmon: server1's disk I/O performance was about 1/10 of server2's. That caused serious problems during peak hours: the cluster went down, mailbox databases dismounted, and messages got stuck in the queue or were delivered slowly.
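
    A comparison like this can be sketched with Get-Counter. The counters chosen here (disk latency and queue length) are an assumption about what is worth comparing, not necessarily the ones that were checked, and the sampling interval is arbitrary.

    ```powershell
    # Disk latency and queue length for every logical disk.
    $counters = '\LogicalDisk(*)\Avg. Disk sec/Read',
                '\LogicalDisk(*)\Avg. Disk sec/Write',
                '\LogicalDisk(*)\Current Disk Queue Length'

    # Sample both servers for about a minute, drop the _Total instance, and
    # average each counter per disk. The Path includes the server name,
    # so server1 and server2 end up on separate rows.
    Get-Counter -ComputerName server1, server2 -Counter $counters -SampleInterval 5 -MaxSamples 12 |
        ForEach-Object { $_.CounterSamples } |
        Where-Object { $_.InstanceName -ne '_total' } |
        Group-Object Path |
        Select-Object Name,
                      @{ n = 'Average'; e = { ($_.Group | Measure-Object CookedValue -Average).Average } } |
        Sort-Object Name
    ```

    The latency counters are reported in seconds, so 0.020 means 20 ms. Running this during peak hours makes the difference between the two servers easier to see.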

    So I removed it from the system completely, reconfigured the RAID (at the hardware level), reinstalled it, and then joined it to the system again.

    Its performance is now much better. Event ID 2153 still appears, but less frequently: several times a day instead of thousands of times every hour. It has run smoothly for 2 weeks. I keep monitoring it, but so far so good.
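
    The event frequency can be tracked with something like the sketch below, which counts event ID 2153 per hour from the Application log on server1. The 7-day window is an arbitrary assumption, and remote access to the event log is assumed to be allowed.

    ```powershell
    # Count MSExchangeRepl event 2153 per hour over the last 7 days.
    Get-WinEvent -ComputerName server1 -ErrorAction SilentlyContinue -FilterHashtable @{
            LogName      = 'Application'
            ProviderName = 'MSExchangeRepl'
            Id           = 2153
            StartTime    = (Get-Date).AddDays(-7)
        } |
        Group-Object { $_.TimeCreated.ToString('yyyy-MM-dd HH:00') } |
        Sort-Object Name |
        Select-Object Name, Count
    ```

    Hours with no events do not appear in the output, so a short list with small counts matches "several times a day".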




    Monday, May 28, 2018 7:44 AM