"Random" reboot due to NTP/Timesyncd ?

NxwJfm

Member
Jul 6, 2016
9
1
23
48
Hello.

I've just had another random reboot of one of my proxmox servers after a few weeks of stability, but this time I had a centralized syslog which allowed to capture the whole thing. Here is what I got:
Code:
Jul 19 15:42:33 def92-proxmox4 systemd-timesyncd[1366]: interval/delta/delay/jitter/drift 32s/-0.872s/0.002s/0.001s/+13ppm
Jul 19 15:43:04 def92-proxmox4 systemd-timesyncd[1366]: interval/delta/delay/jitter/drift 32s/-0.872s/0.002s/0.001s/+13ppm
Jul 19 15:43:04 def92-proxmox4 systemd[1]: Time has been changed
Jul 19 15:43:36 def92-proxmox4 systemd-timesyncd[1366]: interval/delta/delay/jitter/drift 32s/-0.872s/0.002s/0.000s/+13ppm
Jul 19 15:43:36 def92-proxmox4 systemd[1]: Time has been changed
Jul 19 15:44:07 def92-proxmox4 systemd-timesyncd[1366]: interval/delta/delay/jitter/drift 32s/-0.872s/0.002s/0.000s/+13ppm
Jul 19 15:44:07 def92-proxmox4 systemd[1]: Time has been changed
Jul 19 15:44:39 def92-proxmox4 systemd[1]: Time has been changed
Jul 19 15:44:39 def92-proxmox4 systemd-timesyncd[1366]: interval/delta/delay/jitter/drift 32s/-0.872s/0.002s/0.000s/+13ppm
Jul 19 15:45:10 def92-proxmox4 systemd-timesyncd[1366]: interval/delta/delay/jitter/drift 32s/-0.872s/0.002s/0.000s/+13ppm
Jul 19 15:45:10 def92-proxmox4 systemd[1]: Time has been changed
Jul 19 15:45:41 def92-proxmox4 systemd[1]: Time has been changed
Jul 19 15:45:41 def92-proxmox4 systemd-timesyncd[1366]: interval/delta/delay/jitter/drift 32s/-0.872s/0.002s/0.000s/+13ppm
Jul 19 15:45:59 def92-proxmox4 pveproxy[24082]: Clearing outdated entries from certificate cache
Jul 19 15:46:13 def92-proxmox4 systemd[1]: Time has been changed
Jul 19 15:46:13 def92-proxmox4 systemd-timesyncd[1366]: interval/delta/delay/jitter/drift 32s/-0.872s/0.002s/0.000s/+13ppm
Jul 19 15:46:44 def92-proxmox4 systemd[1]: Time has been changed
Jul 19 15:46:44 def92-proxmox4 systemd-timesyncd[1366]: interval/delta/delay/jitter/drift 32s/-0.872s/0.002s/0.000s/+13ppm
Jul 19 15:47:15 def92-proxmox4 systemd-timesyncd[1366]: interval/delta/delay/jitter/drift 32s/-0.872s/0.002s/0.000s/+13ppm
Jul 19 15:47:15 def92-proxmox4 systemd[1]: Time has been changed
Jul 19 15:47:26 def92-proxmox4 pveproxy[24082]: worker exit
Jul 19 15:47:26 def92-proxmox4 pveproxy[18869]: worker 24082 finished
Jul 19 15:47:26 def92-proxmox4 pveproxy[18869]: starting 1 worker(s)
Jul 19 15:47:26 def92-proxmox4 pveproxy[18869]: worker 28777 started
Jul 19 15:48:17 def92-proxmox4 pvedaemon[2237]: <root@pam> successful auth for user 'root@pam'
Jul 19 15:48:17 def92-proxmox4 pveproxy[28777]: Clearing outdated entries from certificate cache
Jul 19 15:48:20 def92-proxmox4 systemd-timesyncd[1366]: Using NTP server 5.196.160.139:123 (2.debian.pool.ntp.org).
Jul 19 15:53:38 def92-proxmox4 systemd[1]: Time has been changed
Jul 19 15:53:38 def92-proxmox4 systemd-timesyncd[1366]: interval/delta/delay/jitter/drift 32s/+317.737s/0.005s/120.423s/+13ppm
Jul 19 15:53:38 def92-proxmox4 pve-ha-lrm[2320]: loop take too long (322 seconds)
Jul 19 15:53:38 def92-proxmox4 pve-ha-lrm[2320]: lost lock 'ha_agent_def92-proxmox4_lock - can't get cfs lock
Jul 19 15:53:38 def92-proxmox4 watchdog-mux[1899]: client watchdog expired - disable watchdog updates
Jul 19 15:53:38 def92-proxmox4 watchdog-mux[1899]: client watchdog expired - disable watchdog updates
Jul 19 15:53:39 def92-proxmox4 pve-ha-crm[2310]: loop take too long (327 seconds)
Jul 19 15:53:39 def92-proxmox4 pve-ha-crm[2310]: lost lock 'ha_manager_lock - can't get cfs lock
Jul 19 15:53:40 def92-proxmox4 pve-ha-lrm[2320]: successfully acquired lock 'ha_agent_def92-proxmox4_lock'
Jul 19 15:53:40 def92-proxmox4 watchdog-mux[1899]: exit watchdog-mux with active connections
Jul 19 15:53:40 def92-proxmox4 kernel: [2358541.231300] IPMI Watchdog: Unexpected close, not stopping watchdog!
Jul 19 15:53:44 def92-proxmox4 pve-ha-crm[2310]: status change master => lost_manager_lock
Jul 19 15:53:44 def92-proxmox4 pve-ha-crm[2310]: watchdog closed (disabled)
Jul 19 15:53:44 def92-proxmox4 pve-ha-crm[2310]: status change lost_manager_lock => wait_for_quorum
Jul 19 15:53:49 def92-proxmox4 pve-ha-crm[2310]: status change wait_for_quorum => slave
Jul 19 15:53:50 def92-proxmox4 pve-ha-lrm[2320]: watchdog update failed - Broken pipe
And then I get the stuff after the reboot.

The whole loop of systemd-timesyncd stuff started at around 12:44 today. Only this server got this, all 4 others were fine.
Am I right in thinking that this time adjustments lead to the pve-ha daemon thinking it could not lock its file for long enough (or whatever it's trying to do) because the time changed when it tried to do his thing?
I actually have no idea what systemd-timesyncd is trying to do as its servers configuration is commented:
#Servers=0.debian.pool.ntp.org 1.debian.pool.ntp.org 2.debian.pool.ntp.org 3.debian.pool.ntp.org
But timedatectl shows it's enabled:
Code:
      Local time: Wed 2017-07-19 17:08:23 CEST
  Universal time: Wed 2017-07-19 15:08:23 UTC
        RTC time: Wed 2017-07-19 15:08:23
       Time zone: Europe/Paris (CEST, +0200)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: yes
 Last DST change: DST began at
                  Sun 2017-03-26 01:59:59 CET
                  Sun 2017-03-26 03:00:00 CEST
 Next DST change: DST ends (the clock jumps one hour backwards) at
                  Sun 2017-10-29 02:59:59 CEST
                  Sun 2017-10-29 02:00:00 CET

I don't have ntpd installed either.

Did anyone ever get hit by something like this before?

I actually would very much like for this to be the issue as random reboots isn't exactly something I want to ever see on my hypervisors :)
 
I spent some more time looking into it. It's clearly the issue here.
systemd-timesyncd uses SNTP, so it only connects to one server at a time. If this one server goes bad, your own server will be synced to that time too. And in my case when (I guess) someone disabled that server, the daemon reconnected itself to another one which had the right time, and immediately changed the time to 320 seconds later. Leading to a nice almost immediate reboot as SNTP instantly changes time and won't do it gracefully like NTP would.
This is extremely dangerous and should be documented. SNTP is ok for a desktop computer, but is clearly not acceptable for a server. If a dumb guy like me does a standard proxmox install with as little changes as possible, there should not be a default setting that could potentially (more than potential here as you can see) to the server being unstable.
More information on this SNTP issue: https://github.com/systemd/systemd/issues/5735
I'm going to disable the NTP part of systemd-timesync and put ntpd in its place. This documentation page https://pve.proxmox.com/wiki/Configuration_of_Time_Syncronization_(NTP) should be updated saying that systemd-timesync can lead to what happened to me, at the very least.
 
  • Like
Reactions: chrone