In-place update of web service causes CPU spike

  • Question

  • I've done many in-place updates of my production service with no issues. Today when I ran an update the CPU spiked on both instances to around 99.6% for almost an hour. The CPU eventually came down to its normal range while under the expected load from the service. I didn't see any requests fail, but response times were up significantly.

    I ran another in-place update later in the day and the same thing happened again. My production service has now been pegged at over 99% for more than an hour. Thankfully this is a slow time for the service, otherwise this could be a real mess for me.

    Before each in-place update I deploy to the staging environment and everything works as expected (the staging environment is empty beforehand, since I delete it every time the production environment is successfully updated). As a test I ran the same in-place update on the staging environment, and that environment has now been at 50% utilization for an hour. The spike starts as soon as the update takes effect on each instance (one instance pegs a couple of minutes before the other).

    I have profiling turned on for this service, and when I go to look at the profiling report I generally get these messages:

    5:51:24 PM - Connecting to the Profiling agent on the VM instance
    5:51:27 PM - The Profiling agent is preparing to process the request
    5:51:28 PM - Preparing the Profiling logs on the VM instance
    5:51:42 PM - Uploading the logs from the VM instance to the Windows Azure storage account.
    5:51:43 PM - Profiling log request failed. Details: Profiling agent failed to upload log snapshot to Windows Azure storage account.

    I'm not sure if it's related to profiling or not, but I'm wondering if there is something broken with the in-place update process in Azure?

    Any help is appreciated.

    Thursday, June 19, 2014 11:56 PM

Answers

  • As further investigation I deleted the staging environment, re-deployed to it, and that worked just fine, CPU levels were normal. I did the VIP swap, and my production environment is back to normal.

    The staging environment (the old production environment) continued to spike on one of its instances. I checked, and the instance that isn't spiking is the one I successfully got a profiling report on. I'm going to wait an hour or so and run a profiling report on the other instance to see if that also brings the instance back to normal CPU levels.
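
    For reference, the VIP swap step can also be triggered programmatically through the classic Service Management API's Swap Deployment operation instead of the portal. The sketch below is only an illustration of that call, not what was actually run here: the subscription ID, cloud service name, deployment names, certificate path, and API version are all placeholders, and it assumes the Python requests library.

    # Rough sketch: promote the freshly deployed staging deployment to production
    # via the classic Service Management "Swap Deployment" operation.
    # All identifiers below are placeholders, not values from this thread.
    import requests

    SUBSCRIPTION_ID = "<subscription-id>"                 # placeholder
    SERVICE_NAME = "<cloud-service-name>"                 # placeholder
    PROD_DEPLOYMENT = "<production-deployment-name>"      # placeholder
    STAGING_DEPLOYMENT = "<staging-deployment-name>"      # placeholder
    MGMT_CERT = "management-cert.pem"                     # management certificate (PEM with cert + key)

    url = (f"https://management.core.windows.net/{SUBSCRIPTION_ID}"
           f"/services/hostedservices/{SERVICE_NAME}")

    # The body names the current production deployment and the staging deployment
    # that should take its place after the swap.
    body = (
        '<Swap xmlns="http://schemas.microsoft.com/windowsazure">'
        f"<Production>{PROD_DEPLOYMENT}</Production>"
        f"<SourceDeployment>{STAGING_DEPLOYMENT}</SourceDeployment>"
        "</Swap>"
    )

    resp = requests.post(
        url,
        data=body,
        headers={"x-ms-version": "2014-05-01", "Content-Type": "application/xml"},
        cert=MGMT_CERT,  # client-certificate authentication for the management API
    )
    resp.raise_for_status()
    # The swap runs asynchronously; the request id can be used to poll its status.
    print("Swap requested, operation id:", resp.headers.get("x-ms-request-id"))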

    Friday, June 20, 2014 12:45 AM

All replies

  • I got a profiling report to work on one of the instances. Here's a quick screenshot.

    I guess I can't post images yet. In summary, the hot path shows [clr.dll] taking up most of the time. Following the path down, it ends up in [mscorlib.ni.dll].

    The functions doing the most individual work are from:

    [clr.dll]

    [msvcr110_clr0400.dll]

    [WindowsCodecs.dll]

    I'm not sure if this helps, but I figure the more info the better.

    Friday, June 20, 2014 12:34 AM
  • Sometimes deleting and re-deploying is the only way to resolve an issue like this. Nice work!

    -Billgiee

    Friday, June 20, 2014 12:40 PM
  • I let my staging environment run overnight, and the Azure dashboard showed 99% on the second instance the whole time. I ran the profiling report on that instance and the CPU usage dropped to less than 1%.

    Probably something for the Azure team to look into.

    Friday, June 20, 2014 3:14 PM