none
Unhandled exception is encountered. [Exception Code = 3221225477] During Replay event dispatch RRS feed

  • Question


  • Dear all,

    i have been tasked to make a replay of production workload on the test environment.

    We have built a replay topology of One db server (31T1) one controller server 36T1 and four replay clients (36T2 -36T5) 

    The replay (3hour collection interval) itself is rather big, 71871432 replayable events.

    We found replay failed with following issue (it seems that one of the client didn't responded)

    From Controller log:

    Event replay in progress. Detailed options:

    2018-07-22 17:42:09:695 OPERATIONAL  [Controller Service]       Target DB Server: [31T].

    2018-07-22 17:42:09:695 OPERATIONAL  [Controller Service]       Controller Working Directory: [P:\Dreplay].

    2018-07-22 17:42:09:695 OPERATIONAL  [Controller Service]       Generate Result Trace: [Yes].

    2018-07-22 17:42:09:695 OPERATIONAL  [Controller Service]       Sequencing Mode: [SYNC].

    2018-07-22 17:42:09:711 OPERATIONAL  [Controller Service]       Connect Time Scale: [100].

    2018-07-22 17:42:09:711 OPERATIONAL  [Controller Service]       Think Time Scale: [100].

    2018-07-22 17:42:09:711 OPERATIONAL  [Controller Service]       Healthmon Polling Interval: [60].

    2018-07-22 17:42:09:727 OPERATIONAL  [Controller Service]       Query Timeout: [3600].

    2018-07-22 17:42:09:727 OPERATIONAL  [Controller Service]       Data Provider Type: [ODBC].

    2018-07-22 17:42:09:727 OPERATIONAL  [Controller Service]       Threads Per Client: [300].

    2018-07-22 17:42:09:727 OPERATIONAL  [Controller Service]       Record Row Count: [Yes].

    2018-07-22 17:42:09:727 OPERATIONAL  [Controller Service]       Record Result Set: [No].

    2018-07-22 17:42:09:727 OPERATIONAL  [Controller Service]       Connection Pooling Enabled: [Yes].

    2018-07-22 17:42:09:727 OPERATIONAL  [Controller Service]       Stress Scale Granularity: [Connection].

    2018-07-22 17:42:09:727 OPERATIONAL  [Controller Service]       Replay Clients: [36T2, 36T3, 36T4, 36T5].

    2018-07-22 17:42:10:070 OPERATIONAL  [Controller Service]  Event dispatch in progress.

    2018-07-23 11:51:49:361 CRITICAL     [Controller Service]  [0xC8230012] No heartbeat has been received from client 36T4 within 9 seconds. Cancelling replay.

    2018-07-23 11:51:49:361 CRITICAL     [Controller Service]  **** Critical Error ****

    2018-07-23 11:51:49:361 CRITICAL     [Controller Service]  Machine Name: 36T1

    2018-07-23 11:51:49:361 CRITICAL     [Controller Service]  Error Code: 0xC8502301

    2018-07-23 11:51:50:392 CRITICAL     [Controller Service]  Fail to send stop command to client with error code 0x800706BA

    2018-07-23 11:51:50:392 OPERATIONAL  [Controller Service]  Event replay completed.

    2018-07-23 11:51:50:392 OPERATIONAL  [Controller Service]  Elapsed time: 0 day(s), 18 hour(s), 9 minute(s), 40 second(s).

    2018-07-23 11:54:55:724 OPERATIONAL  [Controller Service]  Windows service "Microsoft SQL Server Distributed Replay Controller" is stopping...

    From Client log:

    2018-07-18 10:34:25:875 OPERATIONAL  [Client Service]      Windows service "Microsoft SQL Server Distributed Replay Client" has started under service account "Anonymized". Process ID is 1472.

    2018-07-18 10:34:25:875 OPERATIONAL  [Client Service]      Time Zone: Central Europe Standard Time.

    2018-07-18 10:34:25:891 OPERATIONAL  [Client Service]      Controller name is "36T1".

    2018-07-18 10:34:25:891 OPERATIONAL  [Client Service]      Working directory is "C:\Program Files (x86)\Microsoft SQL Server\110\Tools\DReplayClient\WorkingDir".

    2018-07-18 10:34:25:891 OPERATIONAL  [Client Service]      Result directory is "C:\Program Files (x86)\Microsoft SQL Server\110\Tools\DReplayClient\ResultDir".

    2018-07-18 10:34:25:891 OPERATIONAL  [Client Service]      Heartbeat Frequency(ms): 3000

    2018-07-18 10:34:25:891 OPERATIONAL  [Client Service]      Heartbeats Before Timeout: 3

    2018-07-18 10:34:25:969 OPERATIONAL  [Client Service]      Registered with controller "36T1".

    2018-07-22 17:42:11:438 OPERATIONAL  [Client Replay]       Event dispatch in progress.

    2018-07-23 11:51:35:542 OPERATIONAL  [Common]              Unhandled exception is encountered. [Exception Code = 3221225477]

    2018-07-23 11:51:35:542 OPERATIONAL  [Common]              Invoking dump.

    2018-07-23 11:51:35:542 OPERATIONAL  [Common]              Dump is being written to C:\Program Files (x86)\Microsoft SQL Server\110\Tools\DReplayClient\\Log

    2018-07-23 11:51:39:151 OPERATIONAL  [Common]              Dump request sent to sqldumper.

    2018-07-23 11:51:39:151 OPERATIONAL  [Common]              Service terminating.

    From mdump was able to get this:

    Loading unloaded module list

    .

    This dump file has an exception of interest stored in it.

    The stored exception information can be accessed via .ecxr.

    (5c0.1590): Access violation - code c0000005 (first/second chance not available)

    eax=00000000 ebx=0071dd98 ecx=7770b160 edx=7770b160 esi=00000000 edi=0000030c

    eip=776ced0c esp=466cf0d4 ebp=466cf144 iopl=0         nv up ei pl nz na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000206

    ntdll!NtWaitForSingleObject+0xc:

    776ced0c c20c00          ret     0Ch

    0:324> .ecxr

    eax=2fe89000 ebx=45950000 ecx=7770b160 edx=7770b160 esi=45950000 edi=7770b160

    eip=7770b173 esp=466cfb34 ebp=466cfb48 iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010246

    ntdll!RtlpQueryProcessDebugInformationRemote+0x13:

    7770b173 ff7620          push    dword ptr [esi+20h]  ds:002b:45950020=????????

    0:324> kC 1000

     #

    00 ntdll!RtlpQueryProcessDebugInformationRemote  <-- This seems to be the point of failure

    01 kernel32!BaseThreadInitThunk

    02 ntdll!__RtlUserThreadStart

    03 ntdll!_RtlUserThreadStart

    To be honest I am not good at dump analysis :-(

    We have observed that several hours before failure, controller stopped dispatching events.

    2018-07-23 14:40:03:189 Info DReplay    29074079 events have been dispatched.

    2018-07-23 14:41:03:196 Info DReplay    29074079 events have been dispatched.

    2018-07-23 14:42:03:197 Info DReplay    29074079 events have been dispatched.

    2018-07-23 14:43:03:198 Info DReplay    29074079 events have been dispatched.

    2018-07-23 14:44:03:199 Info DReplay    29074079 events have been dispatched.

    2018-07-23 14:45:03:200 Info DReplay    29074079 events have been dispatched.

    2018-07-23 14:46:03:201 Info DReplay    29074079 events have been dispatched.

    2018-07-23 14:47:03:202 Info DReplay    29074079 events have been dispatched.

    2018-07-23 14:48:03:203 Info DReplay    29074079 events have been dispatched.

    2018-07-23 14:49:03:204 Info DReplay    29074079 events have been dispatched.

    Controller and Client config is default. Replay cfg looks as follows:

    <?xml version='1.0'?>  
    <Options>
          <ReplayOptions>
              <Server>31T</Server>
              <SequencingMode>synchronization</SequencingMode>
      <ConnectTimeScale>100</ConnectTimeScale>
              <ThinkTimeScale>100</ThinkTimeScale>
              <HealthmonInterval>600</HealthmonInterval>
              <QueryTimeout>3600</QueryTimeout> <-- changed just for the test, but no success original value was 60
      <ThreadsPerClient>300</ThreadsPerClient>   
              <EnableConnectionPooling>Yes</EnableConnectionPooling>
      <StressScaleGranularity>SPID</StressScaleGranularity>
            </ReplayOptions>
          <OutputOptions>
              <ResultTrace>
                  <RecordRowCount>Yes</RecordRowCount>
                  <RecordResultSet>No</RecordResultSet>
              </ResultTrace>
          </OutputOptions>
      </Options>  

    Is there anybody who could guide me where could be an issue? There is not too much places where to look :-(

    Thanks a lot, in forward!





    Monday, July 23, 2018 12:56 PM