VMs randomly freezing during stop-mode backup

Setup/Version
We are running multiple PVE 8.1 machines with the most recent updates.

Problem
Sometimes the virtual machines end up in some kind of deadlock. This occurs mainly during the nightly backup (in "stop" mode). In the morning the console of some of the VMs is black, and the syslog is full of messages like "error parsing vmid for <PID>: no matching qemu.slice cgroup entry". Resuming the VM brings it back; trying to reboot it afterwards deadlocks it again.
Shutting the VM down helps, in a way: the shutdown itself hits the same deadlock again, which can likewise be resumed, but once the VM is completely powered off, the syslog messages stop and the VM can be started and rebooted as usual.
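
While a VM is stuck like this, it may be worth checking whether its QEMU process is still accounted to the expected cgroup. Below is a minimal diagnostic sketch, assuming PVE stores the QEMU PID in /var/run/qemu-server/<vmid>.pid and a cgroup-v2 layout; cgroup_of_vm is my own helper name, and VMID 102 is just the example from the log:

Code:
#!/usr/bin/env python3
# Diagnostic sketch (assumption: PVE keeps the QEMU PID of a running
# guest in /var/run/qemu-server/<vmid>.pid): print the cgroup
# membership of a VM's QEMU process.
import sys
from pathlib import Path

def cgroup_of_vm(vmid: int) -> str:
    pid = Path(f"/var/run/qemu-server/{vmid}.pid").read_text().strip()
    return Path(f"/proc/{pid}/cgroup").read_text()

if __name__ == "__main__":
    vmid = int(sys.argv[1]) if len(sys.argv) > 1 else 102
    print(cgroup_of_vm(vmid), end="")
    # On a healthy cgroup-v2 host this should print something like:
    #   0::/qemu.slice/102.scope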

Syslog
Here is the syslog of one of the PVE hosts during the backup run in which the messages first occurred.

Code:
Apr 03 04:05:08 pve vzdump[1771898]: INFO: Starting Backup of VM 102 (qemu)
Apr 03 04:05:09 pve qm[1797037]: <root@pam> starting task UPID:pve:001B6BAE:0373073F:660CB955:qmshutdown:102:root@pam:
Apr 03 04:05:09 pve qm[1797038]: shutdown VM 102: UPID:pve:001B6BAE:0373073F:660CB955:qmshutdown:102:root@pam:
Apr 03 04:05:11 pve kernel: tap102i0: left allmulticast mode
Apr 03 04:05:11 pve kernel: fwbr102i0: port 2(tap102i0) entered disabled state
Apr 03 04:05:11 pve kernel: fwbr102i0: port 1(fwln102i0) entered disabled state
Apr 03 04:05:11 pve kernel: vmbr0: port 4(fwpr102p0) entered disabled state
Apr 03 04:05:11 pve kernel: fwln102i0 (unregistering): left allmulticast mode
Apr 03 04:05:11 pve kernel: fwln102i0 (unregistering): left promiscuous mode
Apr 03 04:05:11 pve kernel: fwbr102i0: port 1(fwln102i0) entered disabled state
Apr 03 04:05:11 pve kernel: fwpr102p0 (unregistering): left allmulticast mode
Apr 03 04:05:11 pve kernel: fwpr102p0 (unregistering): left promiscuous mode
Apr 03 04:05:11 pve kernel: vmbr0: port 4(fwpr102p0) entered disabled state
Apr 03 04:05:11 pve qmeventd[542]: read: Connection reset by peer
Apr 03 04:05:11 pve qm[1797037]: <root@pam> end task UPID:pve:001B6BAE:0373073F:660CB955:qmshutdown:102:root@pam: OK
Apr 03 04:05:11 pve systemd[1]: 102.scope: Deactivated successfully.
Apr 03 04:05:11 pve systemd[1]: Stopped 102.scope.
Apr 03 04:05:11 pve systemd[1]: 102.scope: Consumed 5h 28min 42.277s CPU time.
Apr 03 04:05:12 pve systemd[1]: Started 102.scope.
Apr 03 04:05:12 pve qmeventd[1797078]: Starting cleanup for 102
Apr 03 04:05:12 pve qmeventd[1797078]: trying to acquire lock...
Apr 03 04:05:12 pve kernel: tap102i0: entered promiscuous mode
Apr 03 04:05:12 pve kernel: vmbr0: port 4(fwpr102p0) entered blocking state
Apr 03 04:05:12 pve kernel: vmbr0: port 4(fwpr102p0) entered disabled state
Apr 03 04:05:12 pve kernel: fwpr102p0: entered allmulticast mode
Apr 03 04:05:12 pve kernel: fwpr102p0: entered promiscuous mode
Apr 03 04:05:12 pve kernel: vmbr0: port 4(fwpr102p0) entered blocking state
Apr 03 04:05:12 pve kernel: vmbr0: port 4(fwpr102p0) entered forwarding state
Apr 03 04:05:12 pve kernel: fwbr102i0: port 1(fwln102i0) entered blocking state
Apr 03 04:05:12 pve kernel: fwbr102i0: port 1(fwln102i0) entered disabled state
Apr 03 04:05:12 pve kernel: fwln102i0: entered allmulticast mode
Apr 03 04:05:12 pve kernel: fwln102i0: entered promiscuous mode
Apr 03 04:05:12 pve kernel: fwbr102i0: port 1(fwln102i0) entered blocking state
Apr 03 04:05:12 pve kernel: fwbr102i0: port 1(fwln102i0) entered forwarding state
Apr 03 04:05:12 pve kernel: fwbr102i0: port 2(tap102i0) entered blocking state
Apr 03 04:05:12 pve kernel: fwbr102i0: port 2(tap102i0) entered disabled state
Apr 03 04:05:12 pve kernel: tap102i0: entered allmulticast mode
Apr 03 04:05:12 pve kernel: fwbr102i0: port 2(tap102i0) entered blocking state
Apr 03 04:05:12 pve kernel: fwbr102i0: port 2(tap102i0) entered forwarding state
Apr 03 04:05:12 pve qmeventd[542]: error parsing vmid for 1797089: no matching qemu.slice cgroup entry
Apr 03 04:05:12 pve qmeventd[542]: could not get vmid from pid 1797089
Apr 03 04:05:12 pve systemd[1]: 102.scope: Deactivated successfully.
Apr 03 04:05:12 pve qmeventd[1797078]:  OK
Apr 03 04:05:12 pve qmeventd[1797078]: vm still running
Apr 03 04:05:17 pve qmeventd[542]: error parsing vmid for 1797089: no matching qemu.slice cgroup entry
Apr 03 04:05:17 pve qmeventd[542]: could not get vmid from pid 1797089
Apr 03 04:05:22 pve qmeventd[542]: error parsing vmid for 1797089: no matching qemu.slice cgroup entry
Apr 03 04:05:22 pve qmeventd[542]: could not get vmid from pid 1797089
Apr 03 04:05:27 pve qmeventd[542]: error parsing vmid for 1797089: no matching qemu.slice cgroup entry
Apr 03 04:05:27 pve qmeventd[542]: could not get vmid from pid 1797089
Apr 03 04:05:32 pve qmeventd[542]: error parsing vmid for 1797089: no matching qemu.slice cgroup entry
Apr 03 04:05:32 pve qmeventd[542]: could not get vmid from pid 1797089
Apr 03 04:05:37 pve qmeventd[542]: error parsing vmid for 1797089: no matching qemu.slice cgroup entry

The line Apr 03 04:05:11 pve qmeventd[542]: read: Connection reset by peer stands out. My guess is that qmeventd crashes (or loses its connection) for some reason and is afterwards unable to resolve a valid VMID for the PID.
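
For context, my understanding of that lookup (inferred from the error text, not taken from the actual qmeventd source) is that it boils down to scanning /proc/<pid>/cgroup for a /qemu.slice/<vmid>.scope component; when no such entry exists, exactly this message would result. A rough sketch, with vmid_from_pid being my own name:

Code:
#!/usr/bin/env python3
# Sketch of the lookup qmeventd presumably performs: resolve a PID to
# a VMID via its /proc/<pid>/cgroup entry, which for a PVE guest
# should contain a "/qemu.slice/<vmid>.scope" component.
import re
import sys
from pathlib import Path

SCOPE_RE = re.compile(r"/qemu\.slice/(\d+)\.scope")

def vmid_from_pid(pid: int) -> int:
    cgroups = Path(f"/proc/{pid}/cgroup").read_text()
    match = SCOPE_RE.search(cgroups)
    if match is None:
        # This is the state the syslog shows for PID 1797089.
        raise LookupError(f"error parsing vmid for {pid}: "
                          "no matching qemu.slice cgroup entry")
    return int(match.group(1))

if __name__ == "__main__":
    print(vmid_from_pid(int(sys.argv[1])))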
 
