none
Create Procedure sys.sp_MSget_repl_Commands in running status for hours.

    Question

  • We have  a database on SQL Server 2012 r2 replicating to another database in the same machine. We were doing some table changes so I am not sure if we messed anything. Our CPU has been close to 100%.

    When I run this query 

    SELECT sqltext.TEXT,
    req.session_id,
    req.status,
    req.command,
    req.cpu_time,
    req.total_elapsed_time
    FROM sys.dm_exec_requests req
    CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS sqltext
    order by total_elapsed_time desc 

    I see that total_elapsed_time = 22278963 and cpu_time = 1360

    Is this normal? 

    Wednesday, July 04, 2018 12:43 AM

All replies

  • Hi Pakshirajan

    If the same problem occur, please execute a query below, and show me [current_running_stmt] and [last_wait_type] and [wait_time_ms]

    SELECT TOP 100
         der.session_id
        ,DB_NAME(der.database_id) AS database_name
        ,des.host_name
        ,des.program_name
        ,der.status -- Status of the request. (background / running / runnable / sleeping / suspended)
        ,dest.text as command_text
        ,SUBSTRING(dest.text, der.statement_start_offset / 2, (
            CASE 
              WHEN der.statement_end_offset = - 1
                THEN LEN(CONVERT(NVARCHAR(MAX), dest.text)) * 2
              ELSE der.statement_end_offset
              END - der.statement_start_offset
         ) / 2) AS current_running_stmt
        ,datediff(s, der.start_time, GETDATE()) as time_sec
        ,wait_resource
        ,last_wait_type
        ,der.wait_time  as wait_time_ms
        ,der.blocking_session_id
        ,der.open_transaction_count
        ,der.command
        ,der.cpu_time
        ,deqp.query_plan
    FROM
        sys.dm_exec_requests der
    --JOIN sys.dm_exec_connections dec ON der.connection_id = dec.connection_id
    JOIN sys.dm_exec_sessions des ON des.session_id = der.session_id
    OUTER APPLY sys.dm_exec_sql_text(sql_handle) AS dest
    OUTER APPLY sys.dm_exec_query_plan(plan_handle) AS deqp
    WHERE
        des.is_user_process = 1
    AND datediff(s, der.start_time, GETDATE()) >= 1
    ORDER BY
        datediff(s, der.start_time, GETDATE()) DESC

    If you update/insert/delete many records of tables which are replication target, sys.sp_MSget_repl_Commands may be slow and use a lot of CPU resource.

    I doubt query plan of sys.sp_MSget_repl_Commands. Because it may use query hint.


    Wednesday, July 04, 2018 3:32 AM
  • Thanks for your help. Here is the only running statement.

    Others are in suspended or runnable status. Three queries in suspended status has a higher value for wait_time_ms but they all are in my primary database. This query in 'running' status is in distribution database. Because of the high CPU% everything seems to go very slow as you would expect. 


     select rc.xact_seqno, rc.partial_command, rc.type,           rc.command_id, rc.command, rc.hashkey,          -- extra columns for the PeerToPeer resultset    -- originator_id, srvname, dbname, originator_publication_id, originator_db_version, originator_lsn    NULL, NULL, NULL, NULL, NULL, NULL          from                              MSrepl_commands rc with (nolock, INDEX(ucMSrepl_commands))           JOIN dbo.MSsubscriptions s with (INDEX(ucMSsubscriptions))           -- At end, we use the FASTFIRSTROW option which tends to force           -- a nested inner loop join driven from MSrepl_commands           ON (rc.article_id = s.article_id)          where          s.agent_id = @agent_id and          rc.publisher_database_id = @publisher_database_id and          rc.xact_seqno > @last_xact_seqno and          rc.xact_seqno <= @max_xact_seqno and          (rc.type & @snapshot_bit) <> @snapshot_bit and    (rc.type & ~@snapshot_bit) not in ( 37, 38 )    and (@compatibility_level >= 9000000       or (rc.type & ~@postcmd_bit) not in (47))          order by rc.xact_seqno, rc.command_id asc          OPTION (FAST 1)    

    last_wait_type

    PREEMPTIVE_OS_WAITFORSINGLEOBJECT

    wait_time_ms

    1136

    Wednesday, July 04, 2018 5:12 AM
  • Thank you for your reply.

    I have experienced the same problem.(same current_running_stmt, but different last_wait_type.)

    That wait type may mean [that SQL Server was finished processing its data, but it was waiting for the calling external application to accept/consume the results.]

    (please refer this url) https://community.spiceworks.com/topic/2097788-preemptive_os_waitforsingleobject-during-ssis

    My suggestion is to get rid of query hint [OPTION (FAST 1)] from query above.(so alter sys.sp_MSget_repl_Commands)

    But I have not tried it yet, and it is system stored procedure.

    So altering system stored procedure(sys.sp_MSget_repl_Commands) is risky.

    If you try it, please be careful, and i don't take responsibility.

    Wednesday, July 04, 2018 5:42 AM
  • It is giving me Invalid object name 'sys.sp_MSget_repl_commands'.

    If I drop sys. then it would change the owner of the object, correct? I am in distribution database.

    Wednesday, July 04, 2018 6:19 AM
  • Oh, I'm sorry. I've tried and got same error.

    I found We can't modify system stored procedure.

    https://stackoverflow.com/questions/19599264/is-it-possible-to-modify-system-stored-procedure

    I give up, sorry.

    Wednesday, July 04, 2018 7:16 AM
  • Hi Pakshirajan,

     

    According to your description, my understanding is that the procedure sys.sp_MSget_repl_Commands is running for a long time. If anything is misunderstand, please tell me.

     

    It is not a normal situation. It seems that your issue is about the drive of your distribution database, please check whether the drive with your distribution database is a bottleneck. Stored procedure sp_MSget_repl_commands is related to reader thread. And the distribution tables are too large also may be the reason for the issue. Besides you could also check your indexes on the replication tables for fragmentation.

     

    Best Regards,

    Emily


    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com

    Wednesday, July 04, 2018 9:25 AM
  • This stored procedure is part of the log reader agent. It should be running for long periods of time and does not have a high resource footprint.

    If you are very concerned about this, just schedule your log reader agent. Ensure you have it run frequently - ie every 5 or 10 minutes as it has an impact on your transaction log size.

    Wednesday, July 04, 2018 1:08 PM
    Moderator
  • Emily - I started rebuilding indexes on the subscriber and was able to rebuild all the indexes fine.

    In one of the tables in the Publisher all the rebuild indexes are failing with "Lock request timeout period exceeded"

    I do not know if this is the cause or the effect.

    Hilary - The reason I started looking this is because everyone was complaining about slowness. The SQL CPU usage was 90 to 100%. 


    • Edited by Pakshirajan Wednesday, July 04, 2018 4:55 PM
    Wednesday, July 04, 2018 4:29 PM
  • I executed the below query to check the status of replication and noticed that und.UndelivCmdsInDistDB [UndistCom] has a value of 1315369. How to proceed?

    SELECT 
    (CASE  
        WHEN mdh.runstatus =  '1' THEN 'Start - '+cast(mdh.runstatus as varchar)
        WHEN mdh.runstatus =  '2' THEN 'Succeed - '+cast(mdh.runstatus as varchar)
        WHEN mdh.runstatus =  '3' THEN 'InProgress - '+cast(mdh.runstatus as varchar)
        WHEN mdh.runstatus =  '4' THEN 'Idle - '+cast(mdh.runstatus as varchar)
        WHEN mdh.runstatus =  '5' THEN 'Retry - '+cast(mdh.runstatus as varchar)
        WHEN mdh.runstatus =  '6' THEN 'Fail - '+cast(mdh.runstatus as varchar)
        ELSE CAST(mdh.runstatus AS VARCHAR)
    END) [Run Status], 
    mda.subscriber_db [Subscriber DB], 
    mda.publication [PUB Name],
    right(left(mda.name,LEN(mda.name)-(len(mda.id)+1)), LEN(left(mda.name,LEN(mda.name)-(len(mda.id)+1)))-(10+len(mda.publisher_db)+(case when mda.publisher_db='ALL' then 1 else LEN(mda.publication)+2 end))) [SUBSCRIBER],
    CONVERT(VARCHAR(25),mdh.[time]) [LastSynchronized],
    und.UndelivCmdsInDistDB [UndistCom], 
    mdh.comments [Comments], 
    'select * from distribution.dbo.msrepl_errors (nolock) where id = ' + CAST(mdh.error_id AS VARCHAR(8)) [Query More Info],
    mdh.xact_seqno [SEQ_NO],
    (CASE  
        WHEN mda.subscription_type =  '0' THEN 'Push' 
        WHEN mda.subscription_type =  '1' THEN 'Pull' 
        WHEN mda.subscription_type =  '2' THEN 'Anonymous' 
        ELSE CAST(mda.subscription_type AS VARCHAR)
    END) [SUB Type],

    mda.publisher_db+' - '+CAST(mda.publisher_database_id as varchar) [Publisher DB],
    mda.name [Pub - DB - Publication - SUB - AgentID]
    FROM distribution.dbo.MSdistribution_agents mda 
    LEFT JOIN distribution.dbo.MSdistribution_history mdh ON mdh.agent_id = mda.id 
    JOIN 
        (SELECT s.agent_id, MaxAgentValue.[time], SUM(CASE WHEN xact_seqno > MaxAgentValue.maxseq THEN 1 ELSE 0 END) AS UndelivCmdsInDistDB 
        FROM distribution.dbo.MSrepl_commands t (NOLOCK)  
        JOIN distribution.dbo.MSsubscriptions AS s (NOLOCK) ON (t.article_id = s.article_id AND t.publisher_database_id=s.publisher_database_id ) 
        JOIN 
            (SELECT hist.agent_id, MAX(hist.[time]) AS [time], h.maxseq  
            FROM distribution.dbo.MSdistribution_history hist (NOLOCK) 
            JOIN (SELECT agent_id,ISNULL(MAX(xact_seqno),0x0) AS maxseq 
            FROM distribution.dbo.MSdistribution_history (NOLOCK)  
            GROUP BY agent_id) AS h  
            ON (hist.agent_id=h.agent_id AND h.maxseq=hist.xact_seqno) 
            GROUP BY hist.agent_id, h.maxseq 
            ) AS MaxAgentValue 
        ON MaxAgentValue.agent_id = s.agent_id 
        GROUP BY s.agent_id, MaxAgentValue.[time] 
        ) und 
    ON mda.id = und.agent_id AND und.[time] = mdh.[time] 
    where mda.subscriber_db<>'virtual' -- created when your publication has the immediate_sync property set to true. This property dictates whether snapshot is available all the time for new subscriptions to be initialized. This affects the cleanup behavior of transactional replication. If this property is set to true, the transactions will be retained for max retention period instead of it getting cleaned up as soon as all the subscriptions got the change.
    --and mdh.runstatus='6' --Fail
    --and mdh.runstatus<>'2' --Succeed
    order by mdh.[time]

    Wednesday, July 04, 2018 9:54 PM
  • When I executed the below query, I see. the output given below

    So it seems this publisher is not replicating to 3 different subscribers?

    name article undelivcmdsindistdb delivcmdsindistdb
    One Table1 1354446 3040
    One Table2 18769 0
    Two Table3 7808 0
    Three Table3 7808 0
    Three Table4 2631 0
    Two Table4 2631 0
    One Table5 1835 0
    One Table6 961 0
    One Table7 331 0
    One Table8 299 0
    One Table9 233 0

    ***************************************

    if exists (select * from tempdb.sys.tables where name = '##msdstatus')

    drop table ##msdstatus

    if exists (select * from tempdb.sys.tables where name = '##msdagents')
    drop table ##msdagents

    if exists (select * from tempdb.sys.tables where name = '##msdarticles')
    drop table ##msdarticles

    use Distribution
    go
    select * into ##msdstatus from msdistribution_status  
    select * into ##msdagents from msdistribution_agents
    select * into ##msdarticles from msarticles
    select ag.name, ar.article, s.undelivcmdsindistdb, s.delivcmdsindistdb
    from ##msdstatus s,
        ##msdagents ag,
        ##msdarticles ar
    where s.agent_id = ag.id and s.article_id = ar.article_id
    order by undelivcmdsindistdb desc

    Thursday, July 05, 2018 1:49 AM
  • High CPU in the distribution database can be a result of having a high retention period. It should be set to 2 or 3 days.

    It can also be caused by large numbers of replication commands in your distribution database which are not being cleaned up.

    Can you verify that you are not using the immediate sync option? If you are - turn it off.

    I don't believe your query will be helpful in determining what is undistributed in the queue. Use replication monitor to get an idea of this.

    Large numbers of rows from a single table can mean that this table has text/image/varchar(max)/varbinary(max) columns which will consume many rows.

    Finally there could be a missing index on one of the replication system tables, or even an index on the replication system tables which is causing the high cpu.

    Thursday, July 05, 2018 1:49 PM
    Moderator
  • In one of the tables in the Publisher all the rebuild indexes are failing with "Lock request timeout period exceeded"

    Hi Pakshirajan,

    The error message "Lock request timeout period exceeded" means the table is accessed and the rebuild index task can't  get exlcusive lock on the table. You could try to rebuild the index when the table is not accessed.

    Best Regards,
    Emily

    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com

    Thursday, July 05, 2018 2:51 PM
  • Shut down the log reader agent and the distribution agent. disable the distribution clean up task.

    Rebuild your index and then re-enable all of these jobs.

    Thursday, July 05, 2018 5:01 PM
    Moderator
  • I disabled the following 2 jobs

    1. Agent history cleanup: distribution and
    2. Distribution cleanup: distribution

    Stopped the log reader, stopped the distribution agent. Rebuilt the indexes on the publisher articles.

    Created an index on the subscription database for one of the replicated tables because another replicated table had a trigger referring to the table (for which I created the index). The index was present on the Publisher but not on the subscriber.

    After creating it restarted the log reader and distribution agent and noticed the latencey go down from 1.18.09:30 to 0.0.0:10 in 10 minutes.

    After that I re-enabled the 2 disabled jobs mentioned above.

    Thanks for all the help.


    Friday, July 06, 2018 2:25 AM
  • Hi Pakshirajan,

     

    We are glad to hear that the latencey has gone down. Could you please kindly mark helpful replies as answers. By doing so, it will benefit all community members who are having this similar issue.  Your contribution is highly apprecaited.

     

    Best Regards,

    Emily


    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com


    Friday, July 06, 2018 2:39 AM