CentOS 6.5 Clock Drift and setting time on restore

  • Question

  • I have CentOS 6.5 on Hyper-V 2008 R2.

    I also verified the problems on CentOS 7. I assume RHEL would behave the same.

    Heartbeat is showing OK.

    It's using the Hyper-V clocksource:

    #cat /sys/devices/system/clocksource/clocksource0/current_clocksource
    hyperv_clocksource

    All hv drivers are loaded

    #lsmod | grep hv
    hv_netvsc              23702  0
    hv_utils                9149  0
    hv_storvsc             11323  2
    hv_vmbus              144850  5 hv_netvsc,hv_utils,hid_hyperv,hyperv_fb,hv_storvsc
    #

    I have 2 problems.

    1.  The hyperv_clocksource runs fast, so I use ntpd to slew it. The drift file always sits around -250 ppm, which I can live with, but since it is always fast I wonder why. Surely time sync should sort it all out.

    #cat  /var/lib/ntp/drift
    -256.755
    #
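
To put the drift figure in perspective: the ntpd drift file holds a frequency correction in parts per million (ppm). The following is a minimal Python sketch of the conversion to seconds per day. The function name is illustrative, and the reading that a negative value means ntpd is slowing down a fast clock is an assumption that matches the behaviour described above.

```python
SECONDS_PER_DAY = 86_400


def drift_ppm_to_seconds_per_day(ppm: float) -> float:
    """Convert a frequency offset in parts per million to seconds per day."""
    return ppm * 1e-6 * SECONDS_PER_DAY


# Value taken from the /var/lib/ntp/drift output quoted above.
drift_ppm = -256.755
correction = drift_ppm_to_seconds_per_day(drift_ppm)

# Magnitude of the correction ntpd applies over one day.
print(f"{correction:.2f} s/day")
```

So without ntpd, a clock with this drift would be expected to gain a little over 20 seconds per day.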

    2.  When backing up (save and restore on 2008 R2), the time is not set in the guest on restore, and the machine is 30 seconds or so slow until ntpd steps the clock.

    #cat /var/log/messages

    The backup ran just before these lines:

    Sep 10 01:40:41 aeslinux01 kernel: Clocksource tsc unstable (delta = -68719343006 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.
    Sep 10 01:41:45 aeslinux01 ntpd[1344]: 0.0.0.0 0628 08 no_sys_peer
    Sep 10 01:49:19 aeslinux01 ntpd[1344]: 0.0.0.0 0613 03 spike_detect +39.569404 s
    Sep 10 01:54:21 aeslinux01 ntpd[1344]: 0.0.0.0 061c 0c clock_step +39.559218 s
    Sep 10 01:54:21 aeslinux01 ntpd[1344]: 0.0.0.0 0615 05 clock_sync
    Sep 10 01:54:22 aeslinux01 ntpd[1344]: 0.0.0.0 c618 08 no_sys_peer
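
When hunting for these events across many restores, it can help to pull the clock_step entries out of the log automatically. This is a rough Python sketch that assumes the syslog line layout shown in the excerpt above; the regular expression and function name are illustrative, not part of ntpd.

```python
import re

# Pattern for ntpd event lines laid out like the excerpt above:
# "<Mon> <day> <HH:MM:SS> <host> ntpd[<pid>]: <peer> <status> <code> <event> [<offset> s]"
NTPD_EVENT = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+ntpd\[\d+\]:\s+"
    r"\S+\s+[0-9a-f]{4}\s+[0-9a-f]{2}\s+(?P<event>\w+)"
    r"(?:\s+(?P<offset>[+-][\d.]+)\s+s)?$"
)


def clock_steps(lines):
    """Yield (timestamp, offset_seconds) for every clock_step event."""
    for line in lines:
        match = NTPD_EVENT.match(line.strip())
        if match and match.group("event") == "clock_step":
            yield match.group("stamp"), float(match.group("offset"))


sample = [
    "Sep 10 01:41:45 aeslinux01 ntpd[1344]: 0.0.0.0 0628 08 no_sys_peer",
    "Sep 10 01:49:19 aeslinux01 ntpd[1344]: 0.0.0.0 0613 03 spike_detect +39.569404 s",
    "Sep 10 01:54:21 aeslinux01 ntpd[1344]: 0.0.0.0 061c 0c clock_step +39.559218 s",
    "Sep 10 01:54:21 aeslinux01 ntpd[1344]: 0.0.0.0 0615 05 clock_sync",
]

steps = list(clock_steps(sample))
for stamp, offset in steps:
    print(f"{stamp}: stepped by {offset:+.3f} s")
```

Any step much larger than the usual NTP jitter after a save/restore suggests the host time sample was not applied.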

    Now it is my understanding that Hyper-V should step the clock on restore, and I have seen this work in this scenario. Sometimes it never works. Sometimes it works a few times, but once it stops working it never steps again.

    Fortunately ntpd picks up the pieces after 15 minutes or so, but of course this is far from ideal.

    My questions are:

    1.  How does hyperv_clocksource work? Is it just a tick provider, i.e. no time of day is ever exchanged? If so, how is the time reset on restore?

    2.  Is there any debugging I can do to work out why the time is not being set by Hyper-V on restore?

    Thanks

    Mike


    • Edited by Mike Surcouf Wednesday, September 10, 2014 10:20 AM
    Wednesday, September 10, 2014 10:06 AM

Answers

  • We've recently become aware of a bug in the Linux Integration Services (LIS) for Hyper-V that causes the time to not get re-synced after a restore.  The Hyper-V host sends clock sync messages to the Linux guest after the restore, but due to the LIS bug, the Linux guest doesn't process the messages correctly.

    We are working on fixing the bug and testing the fix.  I'm not sure yet what distribution mechanisms we will have for the fix other than the next minor updates to the relevant Linux distros.   In the meantime, the only workaround is what you noted -- let NTP get the time back in sync.


    Michael Kelley, Lead Program Manager, Open Source Technology Center

    • Marked as answer by Mike Surcouf Wednesday, September 17, 2014 8:47 AM
    Wednesday, September 10, 2014 3:51 PM
    Moderator

All replies

  • Hi Michael

    1.  At least it's confirmed. Do you have a bug database I can track this on, or should I just keep checking lkml.org?

    If I can help with testing let me know.

    2.  Would that bug also have any effect on the timekeeping of hyperv_clocksource?

    My other problem (which is widely reported) is that my Linux VM system clock is consistently fast, on the order of tens of seconds per day. That is not wildly off, but it would eventually break Kerberos if not kept in check with ntp.

    I think hyperv_clocksource is a tick provider, but it seems it is never kept in check with the host by the timesync component. So eventually the host and guest clocks diverge, which really means timesync is not doing its job. I have to use ntp to work around this issue.

    Does the Hyper-V host occasionally send a clock sync from host to VM? If so, maybe this bug also has some bearing here.

    I have older Linux VM kernels that keep perfect time without ntp, so something changed a few years back in the way it was done.

    Thanks

    Mike

    Wednesday, September 10, 2014 4:17 PM
  • This also affects live migration, as the host is down momentarily.
    Monday, September 15, 2014 12:34 PM
  • Hi Michael

    I have another VM that seems to step back in time on restore.

    This is a more serious issue. Is this something you have also seen? Maybe it is due to a partial or corrupt receipt of the time sync message. I can't see any difference between the two VMs, except that this one has more data, perhaps causing a difference in timing.

    Is there any way I can monitor the progress of this issue?

    Thanks

    Mike

    Save/restore here at approx. 22 Sep 01:30. Note the big offset of 241464 seconds (nearly 3 days) back in time on restore.

    Logs below:

    Sep 19 09:02:35 aesrocc01 kernel: Clocksource tsc unstable (delta = -34360102863 ns).  Enable clocksource failover by adding clocksource_failover kernel parameter.
    Sep 19 09:11:28 aesrocc01 ntpd[1337]: 0.0.0.0 0613 03 spike_detect +241464.723103 s
    Sep 22 04:25:40 aesrocc01 ntpd[1337]: 0.0.0.0 061c 0c clock_step +241464.736998 s
    Sep 22 04:25:40 aesrocc01 ntpd[1337]: 0.0.0.0 0615 05 clock_sync
    Sep 22 04:25:41 aesrocc01 ntpd[1337]: 0.0.0.0 c618 08 no_sys_peer
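
As a quick sanity check on the "nearly 3 days" figure, the step reported by ntpd can be converted into days and hours. This is a small Python sketch using only the offset value from the log above.

```python
from datetime import timedelta

# clock_step offset reported by ntpd in the log above, in seconds.
offset_seconds = 241464.736998

offset = timedelta(seconds=offset_seconds)
hours = offset.seconds // 3600
minutes = (offset.seconds % 3600) // 60

print(f"{offset.days} days, {hours} hours, {minutes} minutes")
```

That is consistent with the syslog timestamps jumping from Sep 19 to Sep 22 when ntpd steps the clock.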

    Monday, September 22, 2014 5:09 PM
  • After spending many days on this, here is some concrete information for you.

    Problem 1 - Time is not set correctly on live migration or save restore

    This is fixed, as I confirmed by applying this patch. Unfortunately the patch description never mentions the above problem, so it is a silent fix, but at least we have it.

    >For pause/resume or save/restore case, the time sync IC will set the guest time using host time sample. (In this case, host will send a ICTIMESYNCFLAG_SYNC message). But there is a bug in VM Bus Channel code, that will cause time sync IC service stop running after a long time (like one day). It is fixed by the following patch:

    https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=affb1aff300ddee54df307812b38f166e8a865ef

    You may also want to reference this RHEL bug report, which states that the fix is pulled for 6.6 but not confirmed for 7.0 or 6.5:

    https://bugzilla.redhat.com/show_bug.cgi?id=1118123

    Problem 2 -  Clock drift using hyperv_clocksource

    The time sync component of Linux on hyperv does not work as many expect.

    Time is synchronised on boot from a host time sample, but from then on time is referenced only to the hyperv_clocksource, which is calibrated by Microsoft developers under various CPU loads. It is calibrated on the assumption that it will be close enough for ntp to keep up and keep the clock accurate. Therefore, at this time NTP IS REQUIRED in the guest.

    However, as all environments differ, some have found that hyperv_clocksource is not accurate enough for ntp, and they are left to try to get it closer using tickadj or adjtimex.
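
For anyone going down the tickadj/adjtimex route, the arithmetic looks roughly like this. The sketch below is Python and rests on stated assumptions: a nominal kernel tick of 10000 microseconds (USER_HZ = 100), so that one microsecond of tick is worth 100 ppm, and the example drift value from earlier in this thread. The function name is illustrative; it only splits a correction into a coarse tick part and a fine residual, and does not call any system tool.

```python
NOMINAL_TICK_US = 10_000  # assumption: USER_HZ = 100, i.e. 10000 us per tick
PPM_PER_TICK_UNIT = 1e6 / NOMINAL_TICK_US  # 1 us of tick corresponds to 100 ppm


def split_correction(ppm: float) -> tuple[int, float]:
    """Split a frequency correction into a tick value and a residual in ppm.

    The tick value absorbs the correction in 100 ppm steps; the residual is
    what remains for the fine frequency adjustment (or for ntpd) to handle.
    """
    tick_delta = round(ppm / PPM_PER_TICK_UNIT)
    residual_ppm = ppm - tick_delta * PPM_PER_TICK_UNIT
    return NOMINAL_TICK_US + tick_delta, residual_ppm


# Example correction in ppm, taken from the drift file quoted in the question.
tick, residual = split_correction(-256.755)
print(f"tick = {tick} us, residual = {residual:+.3f} ppm")
```

The point of the split is to leave ntpd a small residual to correct, well inside its 500 ppm limit, rather than having it work against a large constant offset.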

    In response to this, Microsoft is looking to improve the timesync component with this patch. However, they still recommend NTP as the preferred option at this time. Hopefully in the future this patch will provide an accurate, stable time for guests without NTP, although I think it's going to require recreating elements of NTP to get it really good.

    https://lkml.org/lkml/2014/9/26/270

    PS: Please do not post to this list unless you really know what you are doing, as the guys there are busy.

    I hope this saves someone some time working it all out, and that timesync on Hyper-V may become a no-brainer in the future.

    Regards

    Mike

    Monday, September 29, 2014 12:03 PM
  • Spoke too soon: problem 1 remains even after the patch :-(
    Tuesday, September 30, 2014 11:16 AM
  • Problem 1 - Time is not set correctly on live migration or save restore

    My testing was flawed. 

    A full kernel rebuild with the patch

    Drivers: hv: vmbus: Fix a bug in the channel callback dispatch code

    https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=affb1aff300ddee54df307812b38f166e8a865ef

    fixes the problem.

    The patch is included in RHEL 6.6 and 7.1.

    Wednesday, October 15, 2014 11:55 AM