Answered by:
CPU Usage 100% on SQL OpsMgr DB and DW server

Question
-
My OpsMgrDb and DW SQL Server maxes the sqlservr.exe service at 100% for about 5 minutes, then drops down 0%-5% for 1 minute, then back to 100% for 5 minutes. Constantly goes through this cycle.
If I run exec sp_who2 I get about 20 Runnable Insert process calls to the OperationsManagerDW, and one SUSPENDED insert command.
OpsMgr also logs a 31552 event.
Help!
Since this problem has been happening, I've reduced the Grooming Ops DB as per the following article to see if that'd help, but no luck,
Wednesday, April 21, 2010 5:51 PM
Answers
-
31552 is a failure of standarddataset maintenance.
The thing you are seeing - is that htis is failing.... because it is timing out (normal). This job runs every 10 minutes I believe, and has a 5 minute timeout.... the problem is, it is behind and trying to catch up but cant.
This can be caused by poor performance of the SQL server, backlog of data in stagin tables, failure of the RMS to perform maintenance due to a prolonged outage, excessively large tables in the warehouse cause by bugs or grooming failures, excessive overcollection of events, perf, statechanges, etc...
I recommend opening a support case with PSS on this - because it will likely not fix itself.
Generally the fix is to disable the automatic maintenance for the specific dataset that is failing, then running it maunally... which will HAMMER the DB and take several hours.... then enabling the built in maintenance again. However, I recommend getitng PSS to assist with this because they can dig deeper into root cause and perhaps identify other issues.
- Marked as answer by Nicholas Li Thursday, May 6, 2010 3:24 AM
Wednesday, May 5, 2010 3:34 AM
All replies
-
Hi,
Just need more precision :
Which version of OpsMgr, version of SQL Server, whera are Database files ?
Regards,
JF BERENGUER MVP System Center Operations Manager MCSE, MCT, MCSA messaging, NEXTEC SYSTEMS http://actelia.spaces.live.com/Wednesday, April 21, 2010 10:28 PM -
Hey Jeff, Thanks for the reply,
Here is a basic run down of my config;
· Root Management Server : scom-1a
o ( 2xQuad Core ,14GB RAM ,278 GB Mirrored Local Disk)
o Windows Enterprise 2008 R2 x64
o SCOM 2007 R2 Enterprise x64
· RMS Database Server : scomdb-1a
o OperationsManagerDB
o OperationsManagerDW
o F5 DB
o Windows Enterprise 2003 x64
o SQL 2005 SP3 x64
Thursday, April 22, 2010 3:03 PM -
Hi,
Not sure it will help you, I've got the same problem few monthes ago
with SQL 2005 SP3 on a SQL Server with HP SAN
due to an update with driver for storage system
Regards,
JF BERENGUER MVP System Center Operations Manager MCSE, MCT, MCSA messaging, NEXTEC SYSTEMS http://actelia.spaces.live.com/Thursday, April 22, 2010 5:44 PM -
Hey Jeff,
I can have a look at that. The SQL server has attached storage via SAS 5E, so it's possible the drivers were recently updated. Although we use this SQL server for other production data and it runs fine, it's just when SCOM is accessing the SQL Server that it causes the CPU to pin at 100%. The process that is running when the CPU pins is OperationsManagerDW.dbo.StandardDatasetMaintenance;1.
Thx,
Jeff
Thursday, April 22, 2010 6:50 PM -
Quick update, the drivers have not been updated and are the same, so not sure if that'd be the issue. However I'll keep this in mind incase it's a situation where we need to update the drivers.. ThxThursday, April 22, 2010 6:54 PM
-
Hi,
Same problem with OperationsManagerDW.dbo.StandardDatasetMaintenance
Resolution was to reinstall SQL server 2005 with SP2
Regards
JF BERENGUER MVP System Center Operations Manager MCSE, MCT, MCSA messaging, NEXTEC SYSTEMS http://actelia.spaces.live.com/Thursday, April 22, 2010 7:08 PM -
Hmm, That's super interesting. I'll look at trying this. Will take me a bit to do so as we'll need to take the server down during our block time... As well I am away next week, but will be dialing in to check on things, so will let you know!Thursday, April 22, 2010 7:46 PM
-
Hi,
Additionally, please also try the following updates and see how it works:
FIX: CPU usage increases when you run a query that uses a string comparison function on a computer that has many processors after you upgrade to SQL Server 2005 Service Pack 3 or to SQL Server 2008
http://support.microsoft.com/kb/970823
System Center Operations Manager 2007 R2 Cumulative Update 1 Release Notes
http://support.microsoft.com/kb/974144
Hope this helps. Thanks.
Nicholas Li - MSFTFriday, April 23, 2010 6:50 AM -
Hey Nicholas,
Actually I think I'll go this route first as our SQL server has 16 cores on it. And I checked and we haven't applied the 970823 hotfix.
Will try this, this Sunday and let you know,
Thx a bunch
Friday, April 23, 2010 2:48 PM -
Actually, the specified update, 970823, has been applied, there is a newer accumulative update for this which we've applied.
As for the SCOM R2 Cumulative update 1 we have not, so looking in to this one,
Thx
Friday, April 23, 2010 3:26 PM -
Hey Nicholas,
Okay, so I've now applied the SCOM R2 Cumulative update 1. It actually applied without having reboots, however I rebooted my RMS server anyways.
Still no luck, the problem still exists with the CPU pinning at 100%.
I have tried the following too; http://social.technet.microsoft.com/Forums/en/operationsmanagergeneral/thread/729734c3-6780-4af3-989e-319e7d02592b
Again this did not help. I am getting the errors listed in the above link however.... Either way I am stumped, and I have a support call open with Premier Support and they aren't getting anywhere either. Worries me because we've invested a lot of money in to this to help monitor some critical apps within our organization.
Still willing to try anything, let me know.
PS I am away next week, but will dial in periodically to check.
Thx everyone.
Saturday, April 24, 2010 4:04 PM -
Note I have found this error within ops manager event log.
Source of Health Service Module and an Event Number of 10103.
I did some browsing on this error and it typically will occur after you've moved your Data Warehouse DB to another server. We haven't moved the Database, but wondering if anyone has experienced this error and has fixed it?
Thx,
Jeff
Sunday, April 25, 2010 11:01 PM -
Hi,
Thank you for your update. Regarding the Event ID 10103, please try the following:
1. Please check SQL Reporting Services’s configuration; or you can reconfigure SRS with default settings and restart OpsMgr services.
2. Verify the Data Warehouse Run As Profiles settings, they should be default and configured properly.
Event ID 2115 is logged, and a management server generates an "unable to write data to the Data Warehouse" alert in System Center Operations Manager 2007
http://support.microsoft.com/kb/945946
Thanks.
Nicholas Li - MSFTMonday, April 26, 2010 3:52 AM -
Hey Nicholas,
Thanks for the reference. I checked the SRS and its running fine. I also went through point number 2 and it is configured as it should.
Again, a bit of history that may help things...
everything had been running fine since November of 2009 till roughly two weeks ago. About two weeks ago we had a DC go astray with regards to it's time being off by 5 minutes. This affected SCOM as the SCOM server was off by 5 minutes (same as the broken DC), however the SQL server was the appropriate time. With these two servers being off SCOM was throwing some time specific errors, so I sorted the time issues we had, and those errors went away. With the time difference sorted SCOM started reporting new errors, the 31552 events and that is where we are today.
Note, the time difference happened over the weekend so the time difference issue occured over 2 days.
Maybe this will help, but not sure...
Thx
Monday, May 3, 2010 9:40 PM -
31552 is a failure of standarddataset maintenance.
The thing you are seeing - is that htis is failing.... because it is timing out (normal). This job runs every 10 minutes I believe, and has a 5 minute timeout.... the problem is, it is behind and trying to catch up but cant.
This can be caused by poor performance of the SQL server, backlog of data in stagin tables, failure of the RMS to perform maintenance due to a prolonged outage, excessively large tables in the warehouse cause by bugs or grooming failures, excessive overcollection of events, perf, statechanges, etc...
I recommend opening a support case with PSS on this - because it will likely not fix itself.
Generally the fix is to disable the automatic maintenance for the specific dataset that is failing, then running it maunally... which will HAMMER the DB and take several hours.... then enabling the built in maintenance again. However, I recommend getitng PSS to assist with this because they can dig deeper into root cause and perhaps identify other issues.
- Marked as answer by Nicholas Li Thursday, May 6, 2010 3:24 AM
Wednesday, May 5, 2010 3:34 AM -
Hey Kevin,
Thanks for getting back to me on this. You are right in what you say. I've been working with PSS since it broke and they've determined it to possibly be a disk IO issue on the SQL server, and due to the backlog from that weekend of events, the SQL server is having troubles keeping up.
What we are doing through PSS' suggestion, is building a new SQL server specific for SCOM and going to move the databases over.
Through this whole experience I am a bit dissappointed as we had some one from Microsoft come in to assist and approve our design, but it appears our SQL server didn't really match the proper requirement.
Anyways, thanks again for everyones input and help, I really do appreciate it!
Jeff
Tuesday, May 11, 2010 2:51 PM