locked
CPU Usage 100% on SQL OpsMgr DB and DW server RRS feed

Answers

  • 31552 is a failure of standarddataset maintenance.

    The thing you are seeing - is that htis is failing.... because it is timing out (normal).  This job runs every 10 minutes I believe, and has a 5 minute timeout.... the problem is, it is behind and trying to catch up but cant.

    This can be caused by poor performance of the SQL server, backlog of data in stagin tables, failure of the RMS to perform maintenance due to a prolonged outage, excessively large tables in the warehouse cause by bugs or grooming failures, excessive overcollection of events, perf, statechanges, etc...

    I recommend opening a support case with PSS on this - because it will likely not fix itself. 

    Generally the fix is to disable the automatic maintenance for the specific dataset that is failing, then running it maunally... which will HAMMER the DB and take several hours.... then enabling the built in maintenance again.  However, I recommend getitng PSS to assist with this because they can dig deeper into root cause and perhaps identify other issues.

    • Marked as answer by Nicholas Li Thursday, May 6, 2010 3:24 AM
    Wednesday, May 5, 2010 3:34 AM

All replies

  • Hi,

    Just need more precision :

    Which version of OpsMgr, version of SQL Server, whera are Database files ?

    Regards,

     


    JF BERENGUER MVP System Center Operations Manager MCSE, MCT, MCSA messaging, NEXTEC SYSTEMS http://actelia.spaces.live.com/
    Wednesday, April 21, 2010 10:28 PM
  • Hey Jeff, Thanks for the reply,

     

    Here is a basic run down of my config;

    ·         Root Management Server : scom-1a

    o    ( 2xQuad Core ,14GB RAM ,278 GB Mirrored Local Disk)

    o    Windows Enterprise 2008 R2 x64

    o    SCOM 2007 R2 Enterprise x64

    ·         RMS Database Server : scomdb-1a

    o    OperationsManagerDB

    o    OperationsManagerDW

    o    F5 DB

    o    Windows Enterprise  2003 x64

    o    SQL 2005 SP3 x64

     

    Thursday, April 22, 2010 3:03 PM
  • Hi,

     

    Not sure it will help you, I've got the same problem few monthes ago

    with SQL 2005 SP3 on a SQL Server with HP SAN

    due to an update with driver for storage system

    Regards,


    JF BERENGUER MVP System Center Operations Manager MCSE, MCT, MCSA messaging, NEXTEC SYSTEMS http://actelia.spaces.live.com/
    Thursday, April 22, 2010 5:44 PM
  • Hey Jeff,

    I can have a look at that.  The SQL server has attached storage via SAS 5E, so it's possible the drivers were recently updated.  Although we use this SQL server for other production data and it runs fine, it's just when SCOM is accessing the SQL Server that it causes the CPU to pin at 100%.  The process that is running when the CPU pins is OperationsManagerDW.dbo.StandardDatasetMaintenance;1.

     

    Thx,

    Jeff

    Thursday, April 22, 2010 6:50 PM
  • Quick update, the drivers have not been updated and are the same, so not sure if that'd be the issue. However I'll keep this in mind incase it's a situation where we need to update the drivers.. Thx
    Thursday, April 22, 2010 6:54 PM
  • Hi,

    Same problem with OperationsManagerDW.dbo.StandardDatasetMaintenance

    Resolution was to reinstall SQL server 2005 with SP2

    Regards

     


    JF BERENGUER MVP System Center Operations Manager MCSE, MCT, MCSA messaging, NEXTEC SYSTEMS http://actelia.spaces.live.com/
    Thursday, April 22, 2010 7:08 PM
  • Hmm, That's super interesting.  I'll look at trying this.  Will take me a bit to do so as we'll need to take the server down during our block time... As well I am away next week, but will be dialing in to check on things, so will let you know!
    Thursday, April 22, 2010 7:46 PM
  • Hi,

     

    Additionally, please also try the following updates and see how it works:

     

    FIX: CPU usage increases when you run a query that uses a string comparison function on a computer that has many processors after you upgrade to SQL Server 2005 Service Pack 3 or to SQL Server 2008

    http://support.microsoft.com/kb/970823

     

    System Center Operations Manager 2007 R2 Cumulative Update 1 Release Notes

    http://support.microsoft.com/kb/974144

     

    Hope this helps. Thanks.


    Nicholas Li - MSFT
    Friday, April 23, 2010 6:50 AM
  • Hey Nicholas,

    Actually I think I'll go this route first as our SQL server has 16 cores on it.  And I checked and we haven't applied the 970823 hotfix.

    Will try this, this Sunday and let you know,

    Thx a bunch

    Friday, April 23, 2010 2:48 PM
  • Actually, the specified update, 970823, has been applied, there is a newer accumulative update for this which we've applied.

    As for the SCOM R2 Cumulative update 1 we have not, so looking in to this one,

    Thx

    Friday, April 23, 2010 3:26 PM
  • Hey Nicholas,

    Okay, so I've now applied the SCOM R2 Cumulative update 1.  It actually applied without having reboots, however I rebooted my RMS server anyways. 

    Still no luck, the problem still exists with the CPU pinning at 100%.

    I have tried the following too; http://social.technet.microsoft.com/Forums/en/operationsmanagergeneral/thread/729734c3-6780-4af3-989e-319e7d02592b

    Again this did not help.  I am getting the errors listed in the above link however....  Either way I am stumped, and I have a support call open with Premier Support and they aren't getting anywhere either.  Worries me because we've invested a lot of money in to this to help monitor some critical apps within our organization.

    Still willing to try anything, let me know.

    PS I am away next week, but will dial in periodically to check.

    Thx everyone.

    Saturday, April 24, 2010 4:04 PM
  • Note I have found this error within ops manager event log.

    Source of Health Service Module and an Event Number of 10103.

    I did some browsing on this error and it typically will occur after you've moved your Data Warehouse DB to another server.  We haven't moved the Database, but wondering if anyone has experienced this error and has fixed it?

    Thx,

    Jeff

    Sunday, April 25, 2010 11:01 PM
  • Hi,

     

    Thank you for your update. Regarding the Event ID 10103, please try the following:

     

    1.    Please check SQL Reporting Services’s configuration; or you can reconfigure SRS with default settings and restart OpsMgr services.

     

    2.    Verify the Data Warehouse Run As Profiles settings, they should be default and configured properly.

     

    Event ID 2115 is logged, and a management server generates an "unable to write data to the Data Warehouse" alert in System Center Operations Manager 2007

    http://support.microsoft.com/kb/945946

     

    Thanks.


    Nicholas Li - MSFT
    Monday, April 26, 2010 3:52 AM
  • Hey Nicholas,

    Thanks for the reference.  I checked the SRS and its running fine.  I also went through point number 2 and it is configured as it should.

    Again, a bit of history that may help things...

    everything had been running fine since November of 2009 till roughly two weeks ago.  About two weeks ago we had a DC go astray with regards to it's time being off by 5 minutes.  This affected SCOM as the SCOM server was off by 5 minutes (same as the broken DC), however the SQL server was the appropriate time.  With these two servers being off SCOM was throwing some time specific errors, so I sorted the time issues we had, and those errors went away.  With the time difference sorted SCOM started reporting new errors, the 31552 events and that is where we are today.

    Note, the time difference happened over the weekend so the time difference issue occured over 2 days.

    Maybe this will help, but not sure...

    Thx

    Monday, May 3, 2010 9:40 PM
  • 31552 is a failure of standarddataset maintenance.

    The thing you are seeing - is that htis is failing.... because it is timing out (normal).  This job runs every 10 minutes I believe, and has a 5 minute timeout.... the problem is, it is behind and trying to catch up but cant.

    This can be caused by poor performance of the SQL server, backlog of data in stagin tables, failure of the RMS to perform maintenance due to a prolonged outage, excessively large tables in the warehouse cause by bugs or grooming failures, excessive overcollection of events, perf, statechanges, etc...

    I recommend opening a support case with PSS on this - because it will likely not fix itself. 

    Generally the fix is to disable the automatic maintenance for the specific dataset that is failing, then running it maunally... which will HAMMER the DB and take several hours.... then enabling the built in maintenance again.  However, I recommend getitng PSS to assist with this because they can dig deeper into root cause and perhaps identify other issues.

    • Marked as answer by Nicholas Li Thursday, May 6, 2010 3:24 AM
    Wednesday, May 5, 2010 3:34 AM
  • Hey Kevin,

    Thanks for getting back to me on this.  You are right in what you say.  I've been working with PSS since it broke and they've determined it to possibly be a disk IO issue on the SQL server, and due to the backlog from that weekend of events, the SQL server is having troubles keeping up.

    What we are doing through PSS' suggestion, is building a new SQL server specific for SCOM and going to move the databases over.

    Through this whole experience I am a bit dissappointed as we had some one from Microsoft come in to assist and approve our design, but it appears our SQL server didn't really match the proper requirement.

    Anyways, thanks again for everyones input and help, I really do appreciate it!

    Jeff

    Tuesday, May 11, 2010 2:51 PM