We currently face a critical issue which occured yesterday exactly when the daily backup (all vms in the cluster) was triggered.
As soon as the backup started, one of our production VMs froze (service was not available anymore).
We found out that all I/O was blocked. As soon as a service requested I/O, it hung idefinitely.
There was no slow I/O, it just hung. Only a hard stop and start of the VM did help to recover it.
This resulted in various kernel messages on the affected VMs console:
We managed to recreate the freeze on vm-1112 by backing up vm-1102 at 20:08 this evening and found out some details:
journal of the PVE node (Svr01) when recreating the issue:
As soon as the backup started, one of our production VMs froze (service was not available anymore).
We found out that all I/O was blocked. As soon as a service requested I/O, it hung idefinitely.
There was no slow I/O, it just hung. Only a hard stop and start of the VM did help to recover it.
This resulted in various kernel messages on the affected VMs console:
Code:
CentOS Linux 7 (Core)
Kernel 3.10.0-1160.71.1.e17.x86_64 on an x86_64
localhost login: [128757.960415] INFO: task mariadb:1870 blocked for more than 120 seconds.
[128757.960643] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[128757.961155] INFO: task mariadb:1956 blocked for more than 120 seconds.
[128757.961309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[128757.961626] INFO: task mariadb:1981 blocked for more than 120 seconds.
[128757.961805] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[128757.962066] INFO: task zabbix_server:2007 blocked for more than 120 seconds.
[128757.962257] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[128757.962543] INFO: task postgres:2908 blocked for more than 120 seconds.
[128757.962708] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[128877.960722] INFO: task mariadb:1870 blocked for more than 120 seconds.
[128877.960893] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[128877.961475] INFO: task mariadb:1956 blocked for more than 120 seconds.
[128877.961697] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[128877.961994] INFO: task mariadb:1981 blocked for more than 120 seconds.
[128877.962173] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[128877.962422] INFO: task zabbix_server:2007 blocked for more than 120 seconds.
[128877.962638] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[128877.962872] INFO: task postgres:2908 blocked for more than 120 seconds.
[128877.963041] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
(There might be some typos as I did type this by hand)
We managed to recreate the freeze on vm-1112 by backing up vm-1102 at 20:08 this evening and found out some details:
- freeze occurs on backup to our pbs host of a specific VM (id 1102, see logs below)
- VM 1102 is in stopped state (before and after backup, its just turned off all the time)
- VM that freezes is on the same node in the cluster (id 1112, see logs below)
- the host node itself seems to freeze for some time, as there are unexpected messages regarding network interfaces in the journal (see below)
- both VMs use our ceph cluster as storage. storage pool (called ssd_pool, see logs below) consists of 6 OSDs of type ssd, distributed over our 3 nodes
- pbs is on a separate bare metal host. backup datastore is hosted on a zfs mirror of 2 8tb-sas-ssds
- ceph+cluster network is separated, there is no VM or backup traffic inside this network
- all servers (pve nodes, pbs host) are HPE G10 servers. Storage controller is configured in IT mode, so the hosts (i.e. zfs / ceph) can directly access the drives
- there is no high IO delay during backups
- qm suspend 1112 && qm resume 1112 does not unfreeze I/O
- there is no entry for scheduled backups in the task log for Svr01. Svr02 and 03 are shown, but the entry for Svr01 is missing despite the backups being created
- during the freeze of the VM (yesterday, not when recreating the problem), it seemed that the host/node (Svr01) did also freeze, as the switch where the nodes network cables terminate reported the ports being down for
- guest agent is enabled and running von the freezing VM. Hower backing up the VM itself in snapshot mode to pbs does not result in a freeze, despite the guest agent issuing fs-freeze and fs-thaw commands
- another freeze occured some weeks ago. As it only affected a non-productive VM we did not do any further research. It was another VM on another node, so the problem seems to affect the whole cluster
journal of the PVE node (Svr01) when recreating the issue:
Code:
Jan 18 20:08:09 Svr01 pvedaemon[2192084]: <mnih@pve> starting task UPID:Svr01:00237358:0138A183:65A97719:vzdump:1102:mnih@pve:
Jan 18 20:08:09 Svr01 pvedaemon[2323288]: INFO: starting new backup job: vzdump 1102 --notes-template '{{guestname}} test' --storage pbs-Svr07 --node Svr01 --notification-mode auto --mode snapshot --remove 0
Jan 18 20:08:09 Svr01 pvedaemon[2323288]: INFO: Starting Backup of VM 1102 (qemu)
Jan 18 20:08:10 Svr01 systemd[1]: Started 1102.scope.
Jan 18 20:08:11 Svr01 kernel: tap1102i0: entered promiscuous mode
Jan 18 20:08:11 Svr01 kernel: vmbr0v680: port 3(fwpr1102p0) entered blocking state
Jan 18 20:08:11 Svr01 kernel: vmbr0v680: port 3(fwpr1102p0) entered disabled state
Jan 18 20:08:11 Svr01 kernel: fwpr1102p0: entered allmulticast mode
Jan 18 20:08:11 Svr01 kernel: fwpr1102p0: entered promiscuous mode
Jan 18 20:08:11 Svr01 kernel: vmbr0v680: port 3(fwpr1102p0) entered blocking state
Jan 18 20:08:11 Svr01 kernel: vmbr0v680: port 3(fwpr1102p0) entered forwarding state
Jan 18 20:08:11 Svr01 kernel: fwbr1102i0: port 1(fwln1102i0) entered blocking state
Jan 18 20:08:11 Svr01 kernel: fwbr1102i0: port 1(fwln1102i0) entered disabled state
Jan 18 20:08:11 Svr01 kernel: fwln1102i0: entered allmulticast mode
Jan 18 20:08:11 Svr01 kernel: fwln1102i0: entered promiscuous mode
Jan 18 20:08:11 Svr01 kernel: fwbr1102i0: port 1(fwln1102i0) entered blocking state
Jan 18 20:08:11 Svr01 kernel: fwbr1102i0: port 1(fwln1102i0) entered forwarding state
Jan 18 20:08:11 Svr01 kernel: fwbr1102i0: port 2(tap1102i0) entered blocking state
Jan 18 20:08:11 Svr01 kernel: fwbr1102i0: port 2(tap1102i0) entered disabled state
Jan 18 20:08:11 Svr01 kernel: tap1102i0: entered allmulticast mode
Jan 18 20:08:11 Svr01 kernel: fwbr1102i0: port 2(tap1102i0) entered blocking state
Jan 18 20:08:11 Svr01 kernel: fwbr1102i0: port 2(tap1102i0) entered forwarding state
Jan 18 20:08:11 Svr01 kernel: tap1102i1: entered promiscuous mode
Jan 18 20:08:11 Svr01 kernel: vmbr0v580: port 3(fwpr1102p1) entered blocking state
Jan 18 20:08:11 Svr01 kernel: vmbr0v580: port 3(fwpr1102p1) entered disabled state
Jan 18 20:08:11 Svr01 kernel: fwpr1102p1: entered allmulticast mode
Jan 18 20:08:11 Svr01 kernel: fwpr1102p1: entered promiscuous mode
Jan 18 20:08:11 Svr01 kernel: vmbr0v580: port 3(fwpr1102p1) entered blocking state
Jan 18 20:08:11 Svr01 kernel: vmbr0v580: port 3(fwpr1102p1) entered forwarding state
Jan 18 20:08:11 Svr01 kernel: fwbr1102i1: port 1(fwln1102i1) entered blocking state
Jan 18 20:08:11 Svr01 kernel: fwbr1102i1: port 1(fwln1102i1) entered disabled state
Jan 18 20:08:11 Svr01 kernel: fwln1102i1: entered allmulticast mode
Jan 18 20:08:11 Svr01 kernel: fwln1102i1: entered promiscuous mode
Jan 18 20:08:11 Svr01 kernel: fwbr1102i1: port 1(fwln1102i1) entered blocking state
Jan 18 20:08:11 Svr01 kernel: fwbr1102i1: port 1(fwln1102i1) entered forwarding state
Jan 18 20:08:11 Svr01 kernel: fwbr1102i1: port 2(tap1102i1) entered blocking state
Jan 18 20:08:11 Svr01 kernel: fwbr1102i1: port 2(tap1102i1) entered disabled state
Jan 18 20:08:11 Svr01 kernel: tap1102i1: entered allmulticast mode
Jan 18 20:08:11 Svr01 kernel: fwbr1102i1: port 2(tap1102i1) entered blocking state
Jan 18 20:08:11 Svr01 kernel: fwbr1102i1: port 2(tap1102i1) entered forwarding state
Jan 18 20:08:12 Svr01 kernel: tap1102i2: entered promiscuous mode
Jan 18 20:08:12 Svr01 kernel: vmbr0v590: port 2(fwpr1102p2) entered blocking state
Jan 18 20:08:12 Svr01 kernel: vmbr0v590: port 2(fwpr1102p2) entered disabled state
Jan 18 20:08:12 Svr01 kernel: fwpr1102p2: entered allmulticast mode
Jan 18 20:08:12 Svr01 kernel: fwpr1102p2: entered promiscuous mode
Jan 18 20:08:12 Svr01 kernel: vmbr0v590: port 2(fwpr1102p2) entered blocking state
Jan 18 20:08:12 Svr01 kernel: vmbr0v590: port 2(fwpr1102p2) entered forwarding state
Jan 18 20:08:12 Svr01 kernel: fwbr1102i2: port 1(fwln1102i2) entered blocking state
Jan 18 20:08:12 Svr01 kernel: fwbr1102i2: port 1(fwln1102i2) entered disabled state
Jan 18 20:08:12 Svr01 kernel: fwln1102i2: entered allmulticast mode
Jan 18 20:08:12 Svr01 kernel: fwln1102i2: entered promiscuous mode
Jan 18 20:08:12 Svr01 kernel: fwbr1102i2: port 1(fwln1102i2) entered blocking state
Jan 18 20:08:12 Svr01 kernel: fwbr1102i2: port 1(fwln1102i2) entered forwarding state
Jan 18 20:08:12 Svr01 kernel: fwbr1102i2: port 2(tap1102i2) entered blocking state
Jan 18 20:08:12 Svr01 kernel: fwbr1102i2: port 2(tap1102i2) entered disabled state
Jan 18 20:08:12 Svr01 kernel: tap1102i2: entered allmulticast mode
Jan 18 20:08:12 Svr01 kernel: fwbr1102i2: port 2(tap1102i2) entered blocking state
Jan 18 20:08:12 Svr01 kernel: fwbr1102i2: port 2(tap1102i2) entered forwarding state
Jan 18 20:08:52 Svr01 pmxcfs[1879]: [status] notice: received log
Jan 18 20:09:02 Svr01 corosync[1911]: [TOTEM ] Retransmit List: 1445b9
Jan 18 20:09:29 Svr01 pmxcfs[1879]: [status] notice: received log
Jan 18 20:09:30 Svr01 sshd[2324422]: Accepted publickey for root from xxx.xxx.xxx.xxx port 44960 ssh2: RSA SHA256:Dm+KcoGNqZYkml8RCuj37dTWw5FrFRXOi/8YnN8BZ+A
Jan 18 20:09:30 Svr01 sshd[2324422]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jan 18 20:09:30 Svr01 systemd-logind[1501]: New session 110 of user root.
Jan 18 20:09:30 Svr01 systemd[1]: Started session-110.scope - Session 110 of User root.
Jan 18 20:09:30 Svr01 sshd[2324422]: pam_env(sshd:session): deprecated reading of user environment enabled
Jan 18 20:10:16 Svr01 sshd[2324422]: Received disconnect from xxx.xxx.xxx.xxx port 44960:11: disconnected by user
Jan 18 20:10:16 Svr01 sshd[2324422]: Disconnected from user root xxx.xxx.xxx.xxx port 44960
Jan 18 20:10:16 Svr01 sshd[2324422]: pam_unix(sshd:session): session closed for user root
Jan 18 20:10:16 Svr01 systemd-logind[1501]: Session 110 logged out. Waiting for processes to exit.
Jan 18 20:10:16 Svr01 systemd[1]: session-110.scope: Deactivated successfully.
Jan 18 20:10:16 Svr01 systemd-logind[1501]: Removed session 110.
Jan 18 20:10:16 Svr01 pmxcfs[1879]: [status] notice: received log
Jan 18 20:10:16 Svr01 pmxcfs[1879]: [status] notice: received log
Jan 18 20:10:16 Svr01 sshd[2324989]: Accepted publickey for root from xxx.xxx.xxx.xxx port 46386 ssh2: RSA SHA256:Dm+KcoGNqZYkml8RCuj37dTWw5FrFRXOi/8YnN8BZ+A
Jan 18 20:10:16 Svr01 sshd[2324989]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jan 18 20:10:16 Svr01 systemd-logind[1501]: New session 111 of user root.
Jan 18 20:10:16 Svr01 systemd[1]: Started session-111.scope - Session 111 of User root.
Jan 18 20:10:16 Svr01 sshd[2324989]: pam_env(sshd:session): deprecated reading of user environment enabled
Jan 18 20:10:33 Svr01 sshd[2324989]: Received disconnect from xxx.xxx.xxx.xxx port 46386:11: disconnected by user
Jan 18 20:10:33 Svr01 sshd[2324989]: Disconnected from user root xxx.xxx.xxx.xxx port 46386
Jan 18 20:10:33 Svr01 sshd[2324989]: pam_unix(sshd:session): session closed for user root
Jan 18 20:10:33 Svr01 systemd[1]: session-111.scope: Deactivated successfully.
Jan 18 20:10:33 Svr01 systemd-logind[1501]: Session 111 logged out. Waiting for processes to exit.
Jan 18 20:10:33 Svr01 systemd-logind[1501]: Removed session 111.
Jan 18 20:10:33 Svr01 pmxcfs[1879]: [status] notice: received log
Jan 18 20:10:59 Svr01 pmxcfs[1879]: [status] notice: received log
Jan 18 20:10:59 Svr01 sshd[2325444]: Accepted publickey for root from xxx.xxx.xxx.xxx port 36074 ssh2: RSA SHA256:Dm+KcoGNqZYkml8RCuj37dTWw5FrFRXOi/8YnN8BZ+A
Jan 18 20:10:59 Svr01 sshd[2325444]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Jan 18 20:10:59 Svr01 systemd-logind[1501]: New session 112 of user root.
Jan 18 20:10:59 Svr01 systemd[1]: Started session-112.scope - Session 112 of User root.
Jan 18 20:10:59 Svr01 sshd[2325444]: pam_env(sshd:session): deprecated reading of user environment enabled
Jan 18 20:11:12 Svr01 kernel: tap1102i0: left allmulticast mode
Jan 18 20:11:12 Svr01 kernel: fwbr1102i0: port 2(tap1102i0) entered disabled state
Jan 18 20:11:12 Svr01 kernel: fwbr1102i0: port 1(fwln1102i0) entered disabled state
Jan 18 20:11:12 Svr01 kernel: vmbr0v680: port 3(fwpr1102p0) entered disabled state
Jan 18 20:11:12 Svr01 kernel: fwln1102i0 (unregistering): left allmulticast mode
Jan 18 20:11:12 Svr01 kernel: fwln1102i0 (unregistering): left promiscuous mode
Jan 18 20:11:12 Svr01 kernel: fwbr1102i0: port 1(fwln1102i0) entered disabled state
Jan 18 20:11:12 Svr01 kernel: fwpr1102p0 (unregistering): left allmulticast mode
Jan 18 20:11:12 Svr01 kernel: fwpr1102p0 (unregistering): left promiscuous mode
Jan 18 20:11:12 Svr01 kernel: vmbr0v680: port 3(fwpr1102p0) entered disabled state
Jan 18 20:11:12 Svr01 kernel: tap1102i1: left allmulticast mode
Jan 18 20:11:12 Svr01 kernel: fwbr1102i1: port 2(tap1102i1) entered disabled state
Jan 18 20:11:12 Svr01 kernel: fwbr1102i1: port 1(fwln1102i1) entered disabled state
Jan 18 20:11:12 Svr01 kernel: vmbr0v580: port 3(fwpr1102p1) entered disabled state
Jan 18 20:11:12 Svr01 kernel: fwln1102i1 (unregistering): left allmulticast mode
Jan 18 20:11:12 Svr01 kernel: fwln1102i1 (unregistering): left promiscuous mode
Jan 18 20:11:12 Svr01 kernel: fwbr1102i1: port 1(fwln1102i1) entered disabled state
Jan 18 20:11:12 Svr01 kernel: fwpr1102p1 (unregistering): left allmulticast mode
Jan 18 20:11:12 Svr01 kernel: fwpr1102p1 (unregistering): left promiscuous mode
Jan 18 20:11:12 Svr01 kernel: vmbr0v580: port 3(fwpr1102p1) entered disabled state
Jan 18 20:11:13 Svr01 kernel: tap1102i2: left allmulticast mode
Jan 18 20:11:13 Svr01 kernel: fwbr1102i2: port 2(tap1102i2) entered disabled state
Jan 18 20:11:13 Svr01 kernel: fwbr1102i2: port 1(fwln1102i2) entered disabled state
Jan 18 20:11:13 Svr01 kernel: vmbr0v590: port 2(fwpr1102p2) entered disabled state
Jan 18 20:11:13 Svr01 kernel: fwln1102i2 (unregistering): left allmulticast mode
Jan 18 20:11:13 Svr01 kernel: fwln1102i2 (unregistering): left promiscuous mode
Jan 18 20:11:13 Svr01 kernel: fwbr1102i2: port 1(fwln1102i2) entered disabled state
Jan 18 20:11:13 Svr01 kernel: fwpr1102p2 (unregistering): left allmulticast mode
Jan 18 20:11:13 Svr01 kernel: fwpr1102p2 (unregistering): left promiscuous mode
Jan 18 20:11:13 Svr01 kernel: vmbr0v590: port 2(fwpr1102p2) entered disabled state
Jan 18 20:11:13 Svr01 qmeventd[1506]: read: Connection reset by peer
Jan 18 20:11:13 Svr01 systemd[1]: 1102.scope: Deactivated successfully.
Jan 18 20:11:13 Svr01 systemd[1]: 1102.scope: Consumed 5min 20.331s CPU time.
Jan 18 20:11:14 Svr01 qmeventd[2325704]: Starting cleanup for 1102
Jan 18 20:11:14 Svr01 qmeventd[2325704]: Finished cleanup for 1102
Jan 18 20:11:14 Svr01 pvedaemon[2323288]: INFO: Finished Backup of VM 1102 (00:03:05)
Jan 18 20:11:14 Svr01 pvedaemon[2323288]: INFO: Backup job finished successfully
Jan 18 20:11:14 Svr01 postfix/pickup[2312866]: 51D234E55E: uid=0 from=<root>
Jan 18 20:11:14 Svr01 pvedaemon[2192084]: <mnih@pve> end task UPID:Svr01:00237358:0138A183:65A97719:vzdump:1102:mnih@pve: OK
Jan 18 20:11:14 Svr01 postfix/cleanup[2325839]: 51D234E55E: message-id=<20240118191114.51D234E55E@Svr01.local>
Jan 18 20:11:14 Svr01 postfix/qmgr[1875]: 51D234E55E: from=<root@Svr01.local>, size=13596, nrcpt=1 (queue active)
Jan 18 20:11:15 Svr01 postfix/smtp[2325841]: 51D234E55E: to=<mnih@somecompany.com>, relay=somecompany-com.mail.protection.outlook.com[aaa.bbb.ccc.ddd]:25, delay=0.74, delays=0.01/0.01/0.14/0.58, dsn=2.6.0, status=sent (250 2.6.0 <20240118191114.51D234E55E@Svr01.local> [InternalId=35377645623599, Hostname=someserver.eurprd07.prod.outlook.com] 21908 bytes in 0.150, 142.055 KB/sec Queued mail for delivery)
Last edited: