Setup/Version
We are running multiple PVE 8.1 machines with the most recent updates.
Problem
Sometimes a virtual machine ends up in some kind of deadlock. This happens mainly during the nightly backup (in "stop" mode). In the morning, the screens of some of the VMs are black, and the syslog contains numerous messages like
error parsing vmid for <PID>: no matching qemu.slice cgroup entry
Resuming the VM brings it back up, but rebooting it afterwards deadlocks it again. Shutting the VM down helps, in a way: the shutdown itself ends in the same deadlock, which can again be resumed. Once the VM is completely powered off, however, the messages in the syslog stop and the VM can be started and rebooted as usual.
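To get a feel for how often the message repeats and which PIDs it names, the syslog can be tallied with a few lines of Python. This is only a sketch: the function name `count_vmid_errors` is made up for illustration, and the pattern assumes the message format quoted above.

```python
import re
import sys
from collections import Counter

# Pattern for the qmeventd message quoted above; the PID is captured.
ERROR_RE = re.compile(
    r"qmeventd\[\d+\]: error parsing vmid for (\d+): "
    r"no matching qemu\.slice cgroup entry"
)


def count_vmid_errors(lines):
    """Return a Counter mapping each PID named in the error to its count."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Reads syslog lines from stdin and prints one summary line per PID.
    for pid, n in count_vmid_errors(sys.stdin).most_common():
        print(f"PID {pid}: {n} messages")
```

Saved under a name of your choice (for example `count_vmid_errors.py`), it could be fed with something like `journalctl -t qmeventd | python3 count_vmid_errors.py`.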
Syslog
Here is the syslog of one of the PVEs during backup where the messages first occurred.
Code:
Apr 03 04:05:08 pve vzdump[1771898]: INFO: Starting Backup of VM 102 (qemu)
Apr 03 04:05:09 pve qm[1797037]: <root@pam> starting task UPID:pve:001B6BAE:0373073F:660CB955:qmshutdown:102:root@pam:
Apr 03 04:05:09 pve qm[1797038]: shutdown VM 102: UPID:pve:001B6BAE:0373073F:660CB955:qmshutdown:102:root@pam:
Apr 03 04:05:11 pve kernel: tap102i0: left allmulticast mode
Apr 03 04:05:11 pve kernel: fwbr102i0: port 2(tap102i0) entered disabled state
Apr 03 04:05:11 pve kernel: fwbr102i0: port 1(fwln102i0) entered disabled state
Apr 03 04:05:11 pve kernel: vmbr0: port 4(fwpr102p0) entered disabled state
Apr 03 04:05:11 pve kernel: fwln102i0 (unregistering): left allmulticast mode
Apr 03 04:05:11 pve kernel: fwln102i0 (unregistering): left promiscuous mode
Apr 03 04:05:11 pve kernel: fwbr102i0: port 1(fwln102i0) entered disabled state
Apr 03 04:05:11 pve kernel: fwpr102p0 (unregistering): left allmulticast mode
Apr 03 04:05:11 pve kernel: fwpr102p0 (unregistering): left promiscuous mode
Apr 03 04:05:11 pve kernel: vmbr0: port 4(fwpr102p0) entered disabled state
Apr 03 04:05:11 pve qmeventd[542]: read: Connection reset by peer
Apr 03 04:05:11 pve qm[1797037]: <root@pam> end task UPID:pve:001B6BAE:0373073F:660CB955:qmshutdown:102:root@pam: OK
Apr 03 04:05:11 pve systemd[1]: 102.scope: Deactivated successfully.
Apr 03 04:05:11 pve systemd[1]: Stopped 102.scope.
Apr 03 04:05:11 pve systemd[1]: 102.scope: Consumed 5h 28min 42.277s CPU time.
Apr 03 04:05:12 pve systemd[1]: Started 102.scope.
Apr 03 04:05:12 pve qmeventd[1797078]: Starting cleanup for 102
Apr 03 04:05:12 pve qmeventd[1797078]: trying to acquire lock...
Apr 03 04:05:12 pve kernel: tap102i0: entered promiscuous mode
Apr 03 04:05:12 pve kernel: vmbr0: port 4(fwpr102p0) entered blocking state
Apr 03 04:05:12 pve kernel: vmbr0: port 4(fwpr102p0) entered disabled state
Apr 03 04:05:12 pve kernel: fwpr102p0: entered allmulticast mode
Apr 03 04:05:12 pve kernel: fwpr102p0: entered promiscuous mode
Apr 03 04:05:12 pve kernel: vmbr0: port 4(fwpr102p0) entered blocking state
Apr 03 04:05:12 pve kernel: vmbr0: port 4(fwpr102p0) entered forwarding state
Apr 03 04:05:12 pve kernel: fwbr102i0: port 1(fwln102i0) entered blocking state
Apr 03 04:05:12 pve kernel: fwbr102i0: port 1(fwln102i0) entered disabled state
Apr 03 04:05:12 pve kernel: fwln102i0: entered allmulticast mode
Apr 03 04:05:12 pve kernel: fwln102i0: entered promiscuous mode
Apr 03 04:05:12 pve kernel: fwbr102i0: port 1(fwln102i0) entered blocking state
Apr 03 04:05:12 pve kernel: fwbr102i0: port 1(fwln102i0) entered forwarding state
Apr 03 04:05:12 pve kernel: fwbr102i0: port 2(tap102i0) entered blocking state
Apr 03 04:05:12 pve kernel: fwbr102i0: port 2(tap102i0) entered disabled state
Apr 03 04:05:12 pve kernel: tap102i0: entered allmulticast mode
Apr 03 04:05:12 pve kernel: fwbr102i0: port 2(tap102i0) entered blocking state
Apr 03 04:05:12 pve kernel: fwbr102i0: port 2(tap102i0) entered forwarding state
Apr 03 04:05:12 pve qmeventd[542]: error parsing vmid for 1797089: no matching qemu.slice cgroup entry
Apr 03 04:05:12 pve qmeventd[542]: could not get vmid from pid 1797089
Apr 03 04:05:12 pve systemd[1]: 102.scope: Deactivated successfully.
Apr 03 04:05:12 pve qmeventd[1797078]: OK
Apr 03 04:05:12 pve qmeventd[1797078]: vm still running
Apr 03 04:05:17 pve qmeventd[542]: error parsing vmid for 1797089: no matching qemu.slice cgroup entry
Apr 03 04:05:17 pve qmeventd[542]: could not get vmid from pid 1797089
Apr 03 04:05:22 pve qmeventd[542]: error parsing vmid for 1797089: no matching qemu.slice cgroup entry
Apr 03 04:05:22 pve qmeventd[542]: could not get vmid from pid 1797089
Apr 03 04:05:27 pve qmeventd[542]: error parsing vmid for 1797089: no matching qemu.slice cgroup entry
Apr 03 04:05:27 pve qmeventd[542]: could not get vmid from pid 1797089
Apr 03 04:05:32 pve qmeventd[542]: error parsing vmid for 1797089: no matching qemu.slice cgroup entry
Apr 03 04:05:32 pve qmeventd[542]: could not get vmid from pid 1797089
Apr 03 04:05:37 pve qmeventd[542]: error parsing vmid for 1797089: no matching qemu.slice cgroup entry
One line stands out:
Apr 03 04:05:11 pve qmeventd[542]: read: Connection reset by peer
My guess is that qmeventd crashes for some reason and is then unable to get valid information about the PID/VMID mapping.
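The wording of the error suggests that the VMID is looked up from the process's cgroup membership. Assuming that is the case, and assuming a VM's QEMU process normally sits in a cgroup path of the form `qemu.slice/<vmid>.scope` (matching the `102.scope` unit in the log), the same lookup can be repeated by hand for the PID from the messages. The helper names below are hypothetical, and this is a rough sketch rather than what qmeventd actually does internally.

```python
import re
from pathlib import Path

# Assumption: the cgroup path of a running VM contains
# "/qemu.slice/<vmid>.scope", as the "102.scope" unit in the log suggests.
SCOPE_RE = re.compile(r"/qemu\.slice/(\d+)\.scope")


def vmid_from_cgroup_text(text):
    """Return the VMID found in /proc/<pid>/cgroup content, or None."""
    for line in text.splitlines():
        match = SCOPE_RE.search(line)
        if match:
            return int(match.group(1))
    return None


def vmid_from_pid(pid):
    """Read /proc/<pid>/cgroup and extract the VMID, or None if absent."""
    try:
        text = Path(f"/proc/{pid}/cgroup").read_text()
    except OSError:
        return None  # process is gone or its cgroup file is not readable
    return vmid_from_cgroup_text(text)
```

If `vmid_from_pid(1797089)` returned `None` on the affected host while the VM process was still alive, that would hint that the process is no longer in its scope, which might fit with `102.scope: Deactivated successfully` appearing right after the restart.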