Hey there,
I'd like to collect some feedback on a recent outage we had, which was seemingly triggered by the update of pve-manager (released via the enterprise repo) on August 12th.
One of our PVE clusters, which consists of 9 nodes, had 5 of them fail over the course of 15 minutes. The cluster runs about 500 VMs, all of which have an HA state configured, and uses Ceph with 108 OSDs.
A rough timeline of the outage can be pieced together from the KNET logs of the first node:
Code:
Aug 12 06:42:48 pve-01 corosync[2916]: [KNET ] link: host: 2 link: 0 is down
Aug 12 06:42:48 pve-01 corosync[2916]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 12 06:42:48 pve-01 corosync[2916]: [KNET ] host: host: 2 has no active links
Aug 12 06:45:55 pve-01 corosync[2916]: [KNET ] link: host: 4 link: 0 is down
Aug 12 06:45:55 pve-01 corosync[2916]: [KNET ] host: host: 4 (passive) best link: 0 (pri: 1)
Aug 12 06:45:55 pve-01 corosync[2916]: [KNET ] host: host: 4 has no active links
Aug 12 06:45:57 pve-01 corosync[2916]: [KNET ] link: Resetting MTU for link 0 because host 2 joined
Aug 12 06:45:57 pve-01 corosync[2916]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 12 06:45:57 pve-01 corosync[2916]: [KNET ] pmtud: Global data MTU changed to: 8885
Aug 12 06:47:59 pve-01 corosync[2916]: [KNET ] link: host: 5 link: 0 is down
Aug 12 06:47:59 pve-01 corosync[2916]: [KNET ] host: host: 5 (passive) best link: 0 (pri: 1)
Aug 12 06:47:59 pve-01 corosync[2916]: [KNET ] host: host: 5 has no active links
Aug 12 06:50:06 pve-01 corosync[2916]: [KNET ] link: host: 7 link: 0 is down
Aug 12 06:50:06 pve-01 corosync[2916]: [KNET ] host: host: 7 (passive) best link: 0 (pri: 1)
Aug 12 06:50:06 pve-01 corosync[2916]: [KNET ] host: host: 7 has no active links
Aug 12 06:54:42 pve-01 corosync[2916]: [KNET ] link: host: 2 link: 0 is down
Aug 12 06:54:42 pve-01 corosync[2916]: [KNET ] host: host: 2 (passive) best link: 0 (pri: 1)
Aug 12 06:54:42 pve-01 corosync[2916]: [KNET ] host: host: 2 has no active links
Aug 12 06:57:59 pve-01 corosync[2916]: [KNET ] link: host: 3 link: 0 is down
Aug 12 06:57:59 pve-01 corosync[2916]: [KNET ] host: host: 3 (passive) best link: 0 (pri: 1)
Aug 12 06:57:59 pve-01 corosync[2916]: [KNET ] host: host: 3 has no active links
pve-02 failed (and briefly reappeared only to fail again), followed by pve-04, pve-05, pve-07 and pve-03. At this point only 4 of 9 nodes were left, quorum was lost, and the remaining nodes fenced as well, resulting in a full production outage.
It all seems to begin with an automated package update on pve-02:
Code:
Aug 12 06:28:05 pve-02 systemd[1]: Starting apt-daily-upgrade.service - Daily apt upgrade and clean activities...
Aug 12 06:28:05 pve-02 unattended-upgrade[485363]: Enabled logging to syslog via daemon facility
Aug 12 06:28:05 pve-02 unattended-upgrade[485363]: Starting unattended upgrades script
Aug 12 06:28:05 pve-02 unattended-upgrade[485363]: Allowed origins are: origin=Proxmox, o=Proxmox, origin=Debian,codename=bookworm,label=Debian, origin=Debian,codename=bookworm,label=Debian-Security, origin=Debian,codename=bookworm-security,label=Debian-Security, o=Debian,a=stable-updates, o=Zabbix,c=main,n=bookworm
Aug 12 06:28:05 pve-02 unattended-upgrade[485363]: Initial blacklist:
Aug 12 06:28:05 pve-02 unattended-upgrade[485363]: Initial whitelist (not strict):
Aug 12 06:28:09 pve-02 unattended-upgrade[485363]: Packages that will be upgraded: libpve-common-perl proxmox-backup-client proxmox-backup-file-restore pve-docs pve-manager
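(As an aside: one mitigation we are considering is keeping the PVE packages out of unattended-upgrades entirely, so that cluster/HA daemons only get restarted during planned maintenance. A minimal sketch, assuming the stock Debian 50unattended-upgrades layout; the package patterns below are just an example we have not validated:)
Code:
// /etc/apt/apt.conf.d/50unattended-upgrades (or a local override file)
// Keep cluster/HA-critical packages out of automatic upgrades; this list is
// only an example and would need to be reviewed for completeness.
Unattended-Upgrade::Package-Blacklist {
    "pve-manager";
    "pve-ha-manager";
    "libpve-.*";
    "corosync";
};
We'd be interested to hear whether blacklisting like this is a sensible approach, or whether unattended-upgrades simply shouldn't be used on PVE nodes at all.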
Various PVE-related services are restarted as part of the pve-manager update; at this point everything still looks fine:
Code:
Aug 12 06:29:10 pve-02 pve-ha-lrm[3041484]: watchdog closed (disabled)
Aug 12 06:29:10 pve-02 pve-ha-lrm[3041484]: server stopped
Aug 12 06:29:11 pve-02 systemd[1]: pve-ha-lrm.service: Deactivated successfully.
Aug 12 06:29:11 pve-02 systemd[1]: Stopped pve-ha-lrm.service - PVE Local HA Resource Manager Daemon.
Aug 12 06:29:11 pve-02 systemd[1]: pve-ha-lrm.service: Consumed 1d 5h 21min 30.790s CPU time.
Aug 12 06:29:11 pve-02 systemd[1]: Starting pve-ha-lrm.service - PVE Local HA Resource Manager Daemon...
Aug 12 06:29:11 pve-02 pve-ha-lrm[488117]: starting server
Aug 12 06:29:11 pve-02 pve-ha-lrm[488117]: status change startup => wait_for_agent_lock
Aug 12 06:29:11 pve-02 systemd[1]: Started pve-ha-lrm.service - PVE Local HA Resource Manager Daemon.
Aug 12 06:29:11 pve-02 systemd[1]: Stopping pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon...
Aug 12 06:29:12 pve-02 pve-ha-crm[3041633]: received signal TERM
Aug 12 06:29:12 pve-02 pve-ha-crm[3041633]: server received shutdown request
Aug 12 06:29:12 pve-02 pve-ha-crm[3041633]: server stopped
Aug 12 06:29:13 pve-02 systemd[1]: pve-ha-crm.service: Deactivated successfully.
Aug 12 06:29:13 pve-02 systemd[1]: Stopped pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon.
Aug 12 06:29:13 pve-02 systemd[1]: pve-ha-crm.service: Consumed 10min 59.217s CPU time.
Aug 12 06:29:13 pve-02 systemd[1]: Starting pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon...
Aug 12 06:29:14 pve-02 pve-ha-crm[488126]: starting server
Aug 12 06:29:14 pve-02 pve-ha-crm[488126]: status change startup => wait_for_quorum
Aug 12 06:29:14 pve-02 systemd[1]: Started pve-ha-crm.service - PVE Cluster HA Resource Manager Daemon.
Aug 12 06:29:18 pve-02 unattended-upgrade[485363]: All upgrades installed
Aug 12 06:29:19 pve-02 systemd[1]: apt-daily-upgrade.service: Deactivated successfully.
Aug 12 06:29:19 pve-02 systemd[1]: Finished apt-daily-upgrade.service - Daily apt upgrade and clean activities.
Aug 12 06:29:19 pve-02 systemd[1]: apt-daily-upgrade.service: Consumed 31.786s CPU time.
pve-02 resumes managing its VMs, but gets stuck in a bad loop when attempting to manage the HA state. There are thousands of log entries of it trying to change the state of VMs from 'freeze' to 'started' and back. During this period there is no outage; all VMs continue to run fine.
Code:
Aug 12 06:29:36 pve-02 pve-ha-lrm[488117]: status change wait_for_agent_lock => active
Aug 12 06:31:31 pve-02 pve-ha-crm[488126]: successfully acquired lock 'ha_manager_lock'
Aug 12 06:31:31 pve-02 pve-ha-crm[488126]: watchdog active
Aug 12 06:31:31 pve-02 pve-ha-crm[488126]: status change slave => master
Aug 12 06:31:31 pve-02 pve-ha-crm[488126]: using scheduler mode 'static'
Aug 12 06:31:46 pve-02 pve-ha-crm[488126]: service 'vm:1005012': state changed from 'freeze' to 'started'
(many removed repeats)
Aug 12 06:31:58 pve-02 pve-ha-crm[488126]: service 'vm:2799002': state changed from 'freeze' to 'started'
Aug 12 06:31:58 pve-02 pve-ha-crm[488126]: service 'vm:2834011': state changed from 'freeze' to 'started'
Aug 12 06:31:59 pve-02 pve-ha-crm[488126]: service 'vm:2836011': state changed from 'freeze' to 'started'
Aug 12 06:31:59 pve-02 pve-ha-crm[488126]: service 'vm:3436013': state changed from 'freeze' to 'started'
Aug 12 06:31:59 pve-02 pve-ha-crm[488126]: service 'vm:3438011': state changed from 'freeze' to 'started'
Aug 12 06:31:59 pve-02 pve-ha-crm[488126]: service 'vm:3443011': state changed from 'freeze' to 'started'
Aug 12 06:32:00 pve-02 pve-ha-crm[488126]: service 'vm:3448011': state changed from 'freeze' to 'started'
Aug 12 06:32:00 pve-02 pve-ha-crm[488126]: loop take too long (31 seconds)
Aug 12 06:32:11 pve-02 pve-ha-crm[488126]: service 'vm:2233011': state changed from 'request_stop' to 'stopped'
Aug 12 06:32:11 pve-02 pve-ha-crm[488126]: service 'vm:2238011': state changed from 'request_stop' to 'stopped'
Aug 12 06:34:21 pve-02 pve-ha-crm[488126]: service 'vm:1008010': state changed from 'started' to 'freeze'
Aug 12 06:34:21 pve-02 pve-ha-crm[488126]: service 'vm:1098011': state changed from 'started' to 'freeze'
Aug 12 06:34:22 pve-02 pve-ha-crm[488126]: service 'vm:1135011': state changed from 'started' to 'freeze'
(many repeats)
Aug 12 06:42:34 pve-02 pve-ha-crm[488126]: service 'vm:2536011': state changed from 'started' to 'freeze'
Aug 12 06:42:34 pve-02 pve-ha-crm[488126]: service 'vm:2543011': state changed from 'started' to 'freeze'
Aug 12 06:42:35 pve-02 pve-ha-crm[488126]: service 'vm:2598011': state changed from 'started' to 'freeze'
Aug 12 06:42:35 pve-02 watchdog-mux[1971]: client watchdog expired - disable watchdog updates
Aug 12 06:42:35 pve-02 pve-ha-crm[488126]: service 'vm:2640011': state changed from 'started' to 'freeze'
Aug 12 06:42:36 pve-02 pve-ha-crm[488126]: service 'vm:2642011': state changed from 'started' to 'freeze'
Aug 12 06:42:36 pve-02 watchdog-mux[1971]: exit watchdog-mux with active connections
Aug 12 06:42:36 pve-02 systemd-journald[2334617]: Received client request to sync journal.
Aug 12 06:42:36 pve-02 kernel: watchdog: watchdog0: watchdog did not stop!
Aug 12 06:42:36 pve-02 systemd[1]: watchdog-mux.service: Deactivated successfully.
Aug 12 06:42:36 pve-02 systemd[1]: watchdog-mux.service: Consumed 3min 20.188s CPU time.
Aug 12 06:42:36 pve-02 pve-ha-crm[488126]: service 'vm:2732011': state changed from 'started' to 'freeze'
Aug 12 06:42:37 pve-02 pve-ha-crm[488126]: service 'vm:2748011': state changed from 'started' to 'freeze'
Aug 12 06:42:37 pve-02 pve-ha-crm[488126]: service 'vm:2798012': state changed from 'started' to 'freeze'
Aug 12 06:42:38 pve-02 pve-ha-crm[488126]: service 'vm:2898012': state changed from 'started' to 'freeze'
Aug 12 06:42:38 pve-02 pve-ha-crm[488126]: service 'vm:2933011': state changed from 'started' to 'freeze'
Aug 12 06:42:39 pve-02 pve-ha-crm[488126]: service 'vm:2936011': state changed from 'started' to 'freeze'
Aug 12 06:42:39 pve-02 pve-ha-crm[488126]: service 'vm:2936013': state changed from 'started' to 'freeze'
Aug 12 06:42:40 pve-02 pve-ha-crm[488126]: service 'vm:2936015': state changed from 'started' to 'freeze'
Aug 12 06:42:40 pve-02 pve-ha-crm[488126]: service 'vm:2939011': state changed from 'started' to 'freeze'
Aug 12 06:42:41 pve-02 pve-ha-crm[488126]: loop take too long (67 seconds)
Aug 12 06:42:41 pve-02 pve-ha-crm[488126]: watchdog update failed - Broken pipe
-- Boot ab6d795743444738879327aaebb0567a --
After a few minutes of this, pve-02 fences: pve-ha-crm being stuck apparently means its watchdog is no longer updated, so watchdog-mux declares the client watchdog expired and the node resets.
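(In case it is useful for anyone digging into this: this is roughly how the relevant logs and watchdog details can be pulled from a fenced node once it is back up. A sketch only; the sysfs paths assume the kernel exposes the watchdog attributes there.)
Code:
# Logs from the boot in which the node fenced (i.e. the previous boot):
journalctl -b -1 -u watchdog-mux.service -u pve-ha-crm.service -u pve-ha-lrm.service
# Watchdog device details via sysfs (assuming watchdog0, as in the kernel
# message above; attribute availability depends on the kernel config):
cat /sys/class/watchdog/watchdog0/identity /sys/class/watchdog/watchdog0/timeout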
pve-04 then takes over as pve-ha-crm master. It starts spamming the same state changes, and one minute after taking over, its watchdog kills it as well:
Code:
Aug 12 06:43:50 pve-04 pmxcfs[1836186]: [status] notice: RRDC update error /var/lib/rrdcached/db/pve2-storage/pve-06/cephfs: -1
Aug 12 06:44:42 pve-04 pve-ha-crm[155556]: successfully acquired lock 'ha_manager_lock'
Aug 12 06:44:42 pve-04 pve-ha-crm[155556]: watchdog active
Aug 12 06:44:42 pve-04 pve-ha-crm[155556]: status change slave => master
Aug 12 06:44:42 pve-04 pve-ha-crm[155556]: using scheduler mode 'static'
Aug 12 06:44:42 pve-04 pve-ha-crm[155556]: node 'pve-02': state changed from 'online' => 'unknown'
Aug 12 06:45:05 pve-04 pve-ha-crm[155556]: service 'vm:1136011': state changed from 'freeze' to 'started'
Aug 12 06:45:06 pve-04 pve-ha-crm[155556]: service 'vm:1136039': state changed from 'freeze' to 'request_stop'
Aug 12 06:45:06 pve-04 pve-ha-crm[155556]: service 'vm:1136040': state changed from 'freeze' to 'request_stop'
Aug 12 06:45:07 pve-04 pve-ha-crm[155556]: service 'vm:1142011': state changed from 'freeze' to 'started'
Aug 12 06:45:07 pve-04 pve-ha-crm[155556]: service 'vm:1142013': state changed from 'freeze' to 'started'
Aug 12 06:45:07 pve-04 pve-ha-crm[155556]: service 'vm:1233012': state changed from 'freeze' to 'started'
Aug 12 06:45:08 pve-04 pve-ha-crm[155556]: service 'vm:1236039': state changed from 'freeze' to 'request_stop'
Aug 12 06:45:08 pve-04 pve-ha-crm[155556]: service 'vm:1242011': state changed from 'freeze' to 'started'
(many removed repeats)
Aug 12 06:45:40 pve-04 pve-ha-crm[155556]: service 'vm:2099002': state changed from 'freeze' to 'request_stop'
Aug 12 06:45:40 pve-04 pve-ha-crm[155556]: service 'vm:2099003': state changed from 'freeze' to 'request_stop'
Aug 12 06:45:41 pve-04 pve-ha-crm[155556]: service 'vm:2105011': state changed from 'freeze' to 'started'
Aug 12 06:45:41 pve-04 pve-ha-crm[155556]: service 'vm:2105012': state changed from 'freeze' to 'started'
Aug 12 06:45:42 pve-04 pve-ha-crm[155556]: service 'vm:2132011': state changed from 'freeze' to 'started'
Aug 12 06:45:42 pve-04 pve-ha-crm[155556]: service 'vm:2133011': state changed from 'freeze' to 'started'
Aug 12 06:45:42 pve-04 pve-ha-crm[155556]: service 'vm:2134011': state changed from 'freeze' to 'started'
Aug 12 06:45:43 pve-04 pve-ha-crm[155556]: service 'vm:2135011': state changed from 'freeze' to 'started'
Aug 12 06:45:43 pve-04 watchdog-mux[2001]: client watchdog expired - disable watchdog updates
Aug 12 06:45:43 pve-04 watchdog-mux[2001]: exit watchdog-mux with active connections
Aug 12 06:45:43 pve-04 systemd-journald[3257916]: Received client request to sync journal.
Aug 12 06:45:43 pve-04 kernel: watchdog: watchdog0: watchdog did not stop!
Aug 12 06:45:43 pve-04 systemd[1]: watchdog-mux.service: Deactivated successfully.
Aug 12 06:45:43 pve-04 systemd[1]: watchdog-mux.service: Consumed 2min 53.237s CPU time.
Aug 12 06:45:43 pve-04 pve-ha-crm[155556]: service 'vm:2136011': state changed from 'freeze' to 'started'
Aug 12 06:45:44 pve-04 pve-ha-crm[155556]: service 'vm:2136013': state changed from 'freeze' to 'started'
Aug 12 06:45:44 pve-04 pve-ha-crm[155556]: service 'vm:2136015': state changed from 'freeze' to 'started'
Aug 12 06:45:45 pve-04 pve-ha-crm[155556]: service 'vm:2137011': state changed from 'freeze' to 'started'
Aug 12 06:45:45 pve-04 pve-ha-crm[155556]: service 'vm:2138011': state changed from 'freeze' to 'started'
Aug 12 06:45:45 pve-04 pve-ha-crm[155556]: service 'vm:2139011': state changed from 'freeze' to 'started'
Aug 12 06:45:46 pve-04 pve-ha-crm[155556]: service 'vm:2140011': state changed from 'freeze' to 'started'
Aug 12 06:45:46 pve-04 pve-ha-crm[155556]: service 'vm:2141011': state changed from 'freeze' to 'started'
Aug 12 06:45:47 pve-04 pve-ha-crm[155556]: service 'vm:2142011': state changed from 'freeze' to 'started'
Aug 12 06:45:47 pve-04 pve-ha-crm[155556]: service 'vm:2143011': state changed from 'freeze' to 'started'
Aug 12 06:45:48 pve-04 pve-ha-crm[155556]: service 'vm:2148011': state changed from 'freeze' to 'started'
Aug 12 06:45:48 pve-04 pve-ha-crm[155556]: service 'vm:2198011': state changed from 'freeze' to 'started'
Aug 12 06:45:49 pve-04 pve-ha-crm[155556]: service 'vm:2198012': state changed from 'freeze' to 'started'
Aug 12 06:45:49 pve-04 pve-ha-crm[155556]: service 'vm:2199002': state changed from 'freeze' to 'started'
Aug 12 06:45:49 pve-04 pve-ha-crm[155556]: service 'vm:2199003': state changed from 'freeze' to 'started'
Aug 12 06:45:50 pve-04 pve-ha-crm[155556]: service 'vm:2205011': state changed from 'freeze' to 'request_stop'
Aug 12 06:45:50 pve-04 pve-ha-crm[155556]: service 'vm:2205012': state changed from 'freeze' to 'request_stop'
Aug 12 06:45:51 pve-04 pve-ha-crm[155556]: service 'vm:2237011': state changed from 'freeze' to 'request_stop'
Aug 12 06:45:51 pve-04 pve-ha-crm[155556]: service 'vm:2240011': state changed from 'freeze' to 'request_stop'
Aug 12 06:45:52 pve-04 pve-ha-crm[155556]: service 'vm:2242011': state changed from 'freeze' to 'request_stop'
This log was not cut off by me; the message above is the last one before the node fenced.
This exact pattern repeated on every other host that went down. One by one they took over as HA master ("status change slave => master"), attempted to manage the VM HA state, and were quickly fenced by their own watchdog, even though networking, all VMs and every service other than pve-ha-crm kept working perfectly fine until the last second.
After cold-starting the entire cluster, the issue was gone and everything worked as usual. Before this incident, the cluster had run fine through restarts, updates, etc. for about a year.
I'd appreciate thoughts on:
- What could be done to avoid such outages?
- This happened entirely unattended, but even if someone had been watching the cascade live, we are not sure what could have been done to stop it; the only idea we've come up with in hindsight is sketched below the list
- What could be possible causes of this outage?
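The only idea we have come up with in hindsight is to cleanly stop the HA daemons on the nodes that were still up: our own log above shows that a clean stop of pve-ha-lrm closes its watchdog ("watchdog closed (disabled)"), and we assume a clean stop of pve-ha-crm on the acting master does the same, so the nodes should not get fenced while the daemons are down. A rough sketch of what we mean; we have not tested this during a cascade and have no idea whether it would actually have been safe:
Code:
# On each node that is still up, stop the HA daemons cleanly. As far as we
# understand, this freezes the HA services (VMs keep running, HA just stops
# managing them), which is also what the package update itself did.
systemctl stop pve-ha-lrm.service
systemctl stop pve-ha-crm.service
# Once the root cause is understood, re-enable HA:
systemctl start pve-ha-lrm.service pve-ha-crm.service
Whether doing that on the surviving nodes mid-cascade would actually have stopped the fencing, or just made things worse, is exactly the kind of feedback we are hoping for.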
Thank you