[SOLVED] Proxmox crash with Ceph clock_skew when time synchronisation is started (chrony)

abma

Feb 20, 2021
I hit a strange, reproducible crash in a Ceph cluster with 3 nodes:

Ceph was in state HEALTH_WARN with clock_skew detected. To fix this, I installed chrony and started it manually. A few seconds after chrony started, the node instantly reset. This was reproducible on another running node; once the time was synced, it no longer crashed.

That's with pve-kernel-5.4.78-2-pve / Proxmox 6.3-3.

kern.log doesn't show anything useful, just some zero bytes.

Did anyone else have this crash?
 
I guess you have HA enabled for some guests, in which case this is normal. When chrony starts, it makes the clock jump (by a few seconds or several minutes, depending on how unsynchronised your clock was). From corosync's point of view, this means the node had no contact with the other nodes for that long, so it fences itself.
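One way to avoid being surprised by this (a sketch on my part, not an official procedure) is to measure the offset before starting chrony, so you know how big the jump would be; if it is large, migrate the HA guests away from that node first. `chronyd -Q` does a one-shot query and prints the measured offset without touching the clock. The server address here is the NTP server that appears in the syslog snippet later in this thread; substitute your own.

```shell
# Sketch: check how far off the local clock is BEFORE starting chrony.
# -Q = query only, does not set the clock, so nothing can jump here.
# 192.168.101.249 is the NTP server from the log in this thread;
# replace it with your own server or pool.
chronyd -Q 'server 192.168.101.249 iburst'
```

If the printed offset is larger than the watchdog timeout, starting chrony with HA services active on that node is what triggers the fence described above.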
 
Yes, HA is enabled for some guests!

Just to be sure it was HA and not a crash caused by a bug:

What should I search for in the logs to confirm that HA reset the node?
 
That's the snippet from /var/log/syslog from when it rebooted:
Code:
Feb 20 15:34:18 node1 systemd-timesyncd[102319]: Synchronized to time server for the first time 192.168.101.249:123 (ntp).
Feb 20 15:34:18 node1 systemd[1]: Starting Proxmox VE replication runner...
Feb 20 15:34:18 node1 pve-ha-crm[3075]: loop take too long (119 seconds)
Feb 20 15:34:18 node1 systemd[1]: pvesr.service: Succeeded.
Feb 20 15:34:18 node1 systemd[1]: Started Proxmox VE replication runner.
Feb 20 15:34:18 node1 watchdog-mux[2254]: client watchdog expired - disable watchdog updates
Feb 20 15:34:18 node1 pve-ha-lrm[3230]: loop take too long (123 seconds)
Feb 20 15:34:18 node1 pve-ha-lrm[3230]: lost lock 'ha_agent_node1_lock - can't get cfs lock
Feb 20 15:34:23 node1 pve-ha-lrm[3230]: status change active => lost_agent_lock
Feb 20 15:34:18 node1 systemd-modules-load[1888]: Inserted module 'iscsi_tcp'
Feb 20 15:34:18 node1 dmeventd[1901]: dmeventd ready for processing.
Feb 20 15:34:18 node1 systemd-modules-load[1888]: Inserted module 'ib_iser'
Feb 20 15:34:18 node1 lvm[1901]: Monitoring thin pool pve-data.
Feb 20 15:34:18 node1 systemd-modules-load[1888]: Inserted module 'vhost_net'
Feb 20 15:34:18 node1 lvm[1886]:   5 logical volume(s) in volume group "pve" monitored
Feb 20 15:34:18 node1 kernel: [    0.000000] Linux version 5.4.78-2-pve (build@pve) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.78-2 (Thu, 03 Dec 2020 14:26:17 +0100) ()
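For anyone searching later: the clearest fence indicators in the snippet above are the watchdog-mux and pve-ha-lrm lines. A grep like this (log path as on this node) pulls them out:

```shell
# Pull the HA-fence indicators out of the syslog around a reboot.
# The watchdog-mux "client watchdog expired" line is the one that
# confirms the node was fenced by the watchdog rather than crashing
# on its own; the pve-ha-lrm lines show why the watchdog expired.
grep -E 'watchdog expired|lost_agent_lock|loop take too long' /var/log/syslog
```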
 
I couldn't find clear info that HA reset the server, just several entries about VMs being fenced, but I assume you're right.

A stale clock is very bad in a lot of other places anyway; as long as this doesn't happen again now that all nodes have a correct clock, I consider this solved.

Thanks a lot for your help!