An experience with unexplained high CPU utilization on a server with an Intel E5 CPU.

A client had two identical Hyper-V servers running almost identical VMs. Out of the blue, one of the servers started showing high CPU utilization: the host was bouncing between 35% and 50%, and the guests were at 99%. Turning off the guests and rebooting the server made no difference; the host still sat at 35-50% utilization. I made sure any unnecessary hardware was disabled or disconnected, again with no change. While experimenting with one of the guest machines, I noticed that the CPU utilization would sometimes show system interrupts at 99%, which would go away for a bit and then come back with whatever process was active taking over the 99% utilization. That led to checking system interrupts on each host machine and comparing them.

The previous tool of choice was KernView for 32-bit machines; however, it does not work on modern 64-bit machines. After some digging around on the internet, it turns out KernRate works on 64-bit machines and is included in the Windows Driver Development Kit 7. If you choose the default install path, the files can be found in C:\WinDDK\7600.16385.1\Tools\Other\amd64.

The goal is to log the output for a fixed time in order to allow for comparison. The proper command was 'kernrate -s 30 -yo filename.txt', which takes a 30-second sample and writes it to a file with the chosen name in the same path. I ran the command on both the host that was not having issues and the one that was. For brevity, only the relevant output is shown below.
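To collect the same sample on each host without retyping the command, the call can be wrapped in a small script. This is only a sketch: the function names are made up for illustration, the executable name kernrate.exe is an assumption, and the flags are copied as-is from the command above rather than from any KernRate documentation.

```python
import subprocess
from pathlib import PureWindowsPath

# Default WDK 7 install path mentioned above; adjust if installed elsewhere.
KERNRATE_DIR = PureWindowsPath(r"C:\WinDDK\7600.16385.1\Tools\Other\amd64")


def build_kernrate_command(seconds, output_name, kernrate_dir=KERNRATE_DIR):
    """Build the argument list for 'kernrate -s <seconds> -yo <output_name>'.

    Assumption: the executable is named kernrate.exe inside kernrate_dir.
    """
    exe = str(kernrate_dir / "kernrate.exe")
    return [exe, "-s", str(seconds), "-yo", output_name]


def run_kernrate(seconds, output_name, kernrate_dir=KERNRATE_DIR):
    """Run one sample; the output file is written next to the executable."""
    cmd = build_kernrate_command(seconds, output_name, kernrate_dir)
    # check=True raises CalledProcessError if kernrate exits with an error.
    subprocess.run(cmd, cwd=str(kernrate_dir), check=True)


# Example (Windows only, likely needs an elevated prompt):
# run_kernrate(30, "host1.txt")
```

Using a different output name per host (host1.txt, host2.txt and so on) keeps the samples apart when the files are later copied to one place for comparison.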

Server specs (both servers are the same):
Dell 320
32GB RAM
Intel E5-2420 CPU (6 hyper-threaded cores)
Server 2012 with Hyper-V role installed

Server with issues:

Results for Kernel Mode:

OutputResults: KernelModuleCount = 147

Percentage in the following table is based on the Total Hits for the Kernel

ProfileTime              276703 hits,           10002 events per hit --------

Module          Hits      msec    %Total    Events/Sec
NTOSKRNL      138197     30074      49 %      45961508
HAL           126880     30074      45 %      42197704
WIN32K          7230     30026       2 %       2408394
NTFS            1030     30055       0 %        342773

Server without issues:

Results for Kernel Mode:

OutputResults: KernelModuleCount = 145
Percentage in the following table is based on the Total Hits for the Kernel

ProfileTime            289145 hits,          10009 events per hit --------

Module          Hits      msec    %Total    Events/Sec
NTOSKRNL      244130     29999      84 %      81452620
HAL            41760     29999      14 %      13932992
WIN32K          1650     29999       0 %        550513
IPMIDRV          625     30000       0 %        208520

On the server with issues, 45% of the kernel-mode hits land in the HAL, compared with 14% on the server without issues. HAL is short for Hardware Abstraction Layer, the piece of the operating system that lets other parts of the operating system interact with the physical hardware of the computer. Modern versions of Windows automatically select the HAL based on the processor type, but I still verified that both servers were using the same one. I disabled any unnecessary hardware, turned the guest machines off and updated drivers, running KernRate between each step, all with very similar results.
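With a run saved after every step, comparing the numbers by eye gets tedious. Below is a rough sketch of pulling the HAL share out of a saved report. The function names are hypothetical, the row format is assumed to match the tables above, and the text passed in is assumed to contain only the kernel-mode section.

```python
import re

# A module row looks like: NAME  hits  msec  NN %  events_per_sec
ROW = re.compile(r"^\s*(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*%\s+(\d+)\s*$")
# The summary line looks like: ProfileTime  276703 hits,  10002 events per hit
TOTAL = re.compile(r"ProfileTime\s+(\d+)\s+hits")


def parse_report(text):
    """Return (total_hits, {module: hits}) from kernel-mode output text."""
    total = None
    modules = {}
    for line in text.splitlines():
        m = TOTAL.search(line)
        if m:
            total = int(m.group(1))
            continue
        m = ROW.match(line)
        if m:
            modules[m.group(1)] = int(m.group(2))
    if total is None:
        raise ValueError("no ProfileTime line found")
    return total, modules


def module_share(text, module):
    """Percentage of the total kernel hits that landed in one module."""
    total, modules = parse_report(text)
    return 100.0 * modules.get(module, 0) / total


# Rows copied from the two runs above (top two modules only).
PROBLEM = """\
ProfileTime   276703 hits,   10002 events per hit --------
Module      Hits     msec   %Total   Events/Sec
NTOSKRNL    138197   30074  49 %     45961508
HAL         126880   30074  45 %     42197704
"""
HEALTHY = """\
ProfileTime   289145 hits,   10009 events per hit --------
Module      Hits     msec   %Total   Events/Sec
NTOSKRNL    244130   29999  84 %     81452620
HAL         41760    29999  14 %     13932992
"""

for label, text in (("problem", PROBLEM), ("healthy", HEALTHY)):
    print(f"{label}: HAL share = {module_share(text, 'HAL'):.1f}%")
```

The share is computed against the ProfileTime total rather than the sum of the listed rows, since only the top few modules are shown here.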

After many hours of testing, there was one last resort before declaring a bad CPU or motherboard and calling Dell for warranty service: the BIOS was updated and the server rebooted. All previously disabled devices were left disabled and the guest machines were left off, in order to limit the changes and allow a baseline test. A few minutes after rebooting, Task Manager showed a pleasant 10% CPU utilization. After re-enabling all devices and turning the guests back on, everything seemed nice and fast, including the guest performance. One final run with KernRate was made to see if there was any difference in the results.

After BIOS update on bad machine:

OutputResults: KernelModuleCount = 144
Percentage in the following table is based on the Total Hits for the Kernel

ProfileTime                 341514 hits,            10009 events per hit --------

Module          Hits      msec    %Total    Events/Sec
NTOSKRNL      332831     29999      97 %     111047217
HAL             6673     29999       1 %       2226409
IPMIDRV          835     29999       0 %        278593
NTFS             395     29999       0 %        131789
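As a sanity check on how the columns relate, the short sketch below recomputes %Total and Events/Sec from the raw hit counts of this last run. The formulas are inferred from the numbers in the table, not taken from KernRate documentation, and the function names are made up for illustration. For these four rows they reproduce both columns exactly, with values rounded down.

```python
def percent_of_total(hits, total_hits):
    """Whole-number percentage of the kernel's total hits, rounded down."""
    return hits * 100 // total_hits


def events_per_second(hits, events_per_hit, msec):
    """Events attributed to a module per second of sampling, rounded down."""
    return hits * events_per_hit * 1000 // msec


# Post-BIOS-update run: 341514 total hits, 10009 events per hit.
TOTAL_HITS = 341514
EVENTS_PER_HIT = 10009
ROWS = [
    # (module, hits, msec)
    ("NTOSKRNL", 332831, 29999),
    ("HAL", 6673, 29999),
    ("IPMIDRV", 835, 29999),
    ("NTFS", 395, 29999),
]

for name, hits, msec in ROWS:
    print(name,
          percent_of_total(hits, TOTAL_HITS),
          events_per_second(hits, EVENTS_PER_HIT, msec))
```

Because the percentages are rounded down, the HAL's "1 %" here is really just under 2% of the hits, still far below the 45% seen before the update.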

The links below have more information, including a document that contains the errata for this issue. If you have a Dell, HP or other server with E5 processor(s), please be sure to update your BIOS for the best experience. Please be sure to take backups first in case anything goes wrong.

Information from Intel: Intel E5 CPU Errata
Related article: windows bugchecks on vmware esxi with xeon e5-2670 cpus