VMs hang on shutdown during backup job

NullBy7e · May 20, 2024

I run Ubuntu 22.04 on all my VMs, and some of them seem to hang after they have finished their shutdown tasks when backed up with backup option STOP.
Is there anything I can do to fix this problem or work around it?

Log error I frequently see:

Code:

qmeventd[839]: error parsing vmid for 358021: no matching qemu.slice cgroup entry
qmeventd[839]: could not get vmid from pid 358021

VMs hang at:

Code:

systemd[1]: Finished System Power Off.
systemd[1]: Reached target System Power Off.
systemd[1]: Shutting down.
systemd-shutdown[1]: Syncing filesystems and block devices.
systemd-journald[371]: Journal stopped

fiona · May 21, 2024

Hi,
please share the output of pveversion -v, qm config <ID> for an affected VM and cat /proc/cmdline.

NullBy7e · May 22, 2024

fiona said:
Hi,
please share the output of pveversion -v, qm config <ID> for an affected VM and cat /proc/cmdline.

pveversion:

Code:

proxmox-ve: 8.2.0 (running kernel: 6.8.4-2-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.4-3
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
proxmox-kernel-6.5.13-5-pve-signed: 6.5.13-5
proxmox-kernel-6.5: 6.5.13-5
amd64-microcode: 3.20230808.1.1~deb12u1
ceph-fuse: 16.2.11+ds-2
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown: residual config
ifupdown2: 3.2.0-1+pmx8
intel-microcode: 3.20231114.1~deb12u1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.1
libpve-http-server-perl: 5.1.0
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.2-1
proxmox-backup-file-restore: 3.2.2-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.6
pve-container: 5.1.10
pve-docs: 8.2.2
pve-edk2-firmware: not correctly installed
pve-firewall: 5.0.7
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0

qm config:

Code:

agent: 1
balloon: 0
boot: order=scsi0
cores: 1
cpu: host
hotplug: memory,cpu
memory: 2048
meta: creation-qemu=8.1.5,ctime=1710971638
name: storage
net0: virtio=BC:24:11:7F:4F:37,bridge=vmbr4,firewall=1,queues=1
numa: 1
onboot: 1
ostype: l26
parent: v2
scsi0: local:100402/vm-100402-disk-0.qcow2,discard=on,iothread=1,size=100G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=f6dccc4f-3870-486c-85a2-6e7e9260c4c4
sockets: 1
startup: order=1
tags: internal
vmgenid: 9be97f43-0859-4f3d-adb7-4491dbd5b9bc

cat /proc/cmdline:

Code:

BOOT_IMAGE=/vmlinuz-6.8.4-2-pve root=/dev/mapper/vg0-root ro consoleblank=0 systemd.show_status=true consoleblank=0

One more thing to note: this does not happen every time, so I can't reliably reproduce it.
This VM suffers from the issue sporadically, but so do others; they rely on this VM for the fuse automount.
Perhaps this is related? I could not find anything in any of the VMs' logs, though.

_gabriel · May 22, 2024

NullBy7e said:
backup option STOP

The VM restarts directly after the backup starts, so the VM runs during the backup job. If the destination is slow, the VM can slow down to the point of hanging or crashing.
Try "suspend" mode.

NullBy7e · May 22, 2024

_gabriel said:
The VM restarts directly after the backup starts, so the VM runs during the backup job. If the destination is slow, the VM can slow down to the point of hanging or crashing.
Try "suspend" mode.

I thought suspend mode was more likely to corrupt your VM because it isn't stopped completely prior to the backup task?

_gabriel · May 22, 2024

It's better to back up after you have manually shut down the guest. Trying suspend mode (afaik, it's a "hibernated" state) is only to check whether the hang persists.

fiona · May 27, 2024

_gabriel said:
The VM restarts directly after the backup starts, so the VM runs during the backup job. If the destination is slow, the VM can slow down to the point of hanging or crashing.
Try "suspend" mode.

(EDIT: incomplete+irrelevant: ~~The VM is only started in prelaunch mode and doesn't do any execution or IO during stop mode backup. Except if you manually resume it.~~) For VMs, suspend mode is deprecated: https://pve.proxmox.com/pve-docs/chapter-vzdump.html#_backup_modes

@NullBy7e please share the output of cat /proc/$(cat /var/run/qemu-server/XYZ.pid)/cgroup replacing XYZ with the ID of a VM that was/is affected by the issue. You need to run it before the backup while the VM is still running. Please also share a larger excerpt of the journal from around the time the issue is happening.

_gabriel · May 27, 2024

fiona said:
The VM is only started in prelaunch mode and doesn't do any execution or IO during stop mode backup.

https://pve.proxmox.com/pve-docs/chapter-vzdump.html#_backup_modes said:
Backup modes for VMs:
stop mode
This mode provides the highest consistency of the backup, at the cost of a short downtime in the VM operation.
It works by executing an orderly shutdown of the VM, and then runs a background QEMU process to backup the VM data.
After the backup is started, the VM goes to full operation mode if it was previously running.
Consistency is guaranteed by using the live backup feature.

That seems different, doesn't it?

fiona · May 28, 2024

_gabriel said:
That seems different, doesn't it?

Yes, sorry. What I wrote applies to the case where the VM was already stopped.

sar · Jun 13, 2024

Hi, I faced the same problem. The VM freezes when backing up in STOP mode. Is there any solution to fix this?

sar · Jun 13, 2024

root@pve:~# pveversion -v
proxmox-ve: 8.2.0 (running kernel: 6.8.4-3-pve)
pve-manager: 8.2.2 (running version: 8.2.2/9355359cd7afbae4)
proxmox-kernel-helper: 8.1.0
proxmox-kernel-6.8: 6.8.4-3
proxmox-kernel-6.8.4-3-pve-signed: 6.8.4-3
proxmox-kernel-6.8.4-2-pve-signed: 6.8.4-2
ceph-fuse: 17.2.7-pve3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx8
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-4
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.5.1
libproxmox-backup-qemu0: 1.4.1
libproxmox-rs-perl: 0.3.3
libpve-access-control: 8.1.4
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.6
libpve-cluster-perl: 8.0.6
libpve-common-perl: 8.2.1
libpve-guest-common-perl: 5.1.2
libpve-http-server-perl: 5.1.0
libpve-network-perl: 0.9.8
libpve-rs-perl: 0.8.8
libpve-storage-perl: 8.2.1
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.4.0-3
proxmox-backup-client: 3.2.3-1
proxmox-backup-file-restore: 3.2.3-1
proxmox-kernel-helper: 8.1.0
proxmox-mail-forward: 0.2.3
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.6
proxmox-widget-toolkit: 4.2.3
pve-cluster: 8.0.6
pve-container: 5.1.10
pve-docs: 8.2.2
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.0
pve-firewall: 5.0.7
pve-firmware: 3.11-1
pve-ha-manager: 4.0.4
pve-i18n: 3.2.2
pve-qemu-kvm: 8.1.5-6
pve-xtermjs: 5.3.0-3
qemu-server: 8.2.1
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.3-pve2

sar · Jun 13, 2024

root@pve:~# qm config 112
agent: 1
boot: order=scsi0;net0
cores: 8
cpu: x86-64-v2-AES
description:
lock: backup
memory: 8048
meta: creation-qemu=8.1.5,ctime=1717658352
name: Office2-RC2
net0: virtio=BC:24:11:C4:39:B0,bridge=vmbr2,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: Storage2_nvme:112/vm-112-disk-0.qcow2,iothread=1,size=120G
scsihw: virtio-scsi-single
smbios1: uuid=66e3076c-9524-4a22-8541-93fdabacce05
sockets: 1
vmgenid: 800c6f7d-b071-4441-8901-7695c73814b5

sar · Jun 13, 2024

root@pve:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.4-3-pve root=/dev/mapper/pve-root ro quiet

fiona · Jun 13, 2024

Hi,
@sar please share the output of cat /proc/$(cat /var/run/qemu-server/XYZ.pid)/cgroup replacing XYZ with the ID of a VM that was/is affected by the issue and also for some other VM. You need to run it before the backup while the VM is still running.

sar · Jun 13, 2024

@fiona
This is while the VM is frozen:
root@pve:~# cat /proc/$(cat /var/run/qemu-server/112.pid)/cgroup
0::/system.slice/pvescheduler.service

After restarting it, while the VM is running:
root@pve:~# cat /proc/$(cat /var/run/qemu-server/112.pid)/cgroup
0::/qemu.slice/112.scope

fiona · Jun 13, 2024

The first kind of cgroup entry cannot give us the VM ID, but I'm not able to reproduce the issue (yet). Can you please also share the full backup task log and excerpt from the journal around the time the issue happens (at least from the beginning to end of the relevant backup job)?
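
To illustrate the difference between the two cgroup entries above: the qmeventd error complains about a missing qemu.slice entry, so deriving the VM ID from a cgroup line presumably works along the lines of the sketch below. The function name vmid_from_cgroup is hypothetical, and this is not qmeventd's actual code, only an approximation of the idea under that assumption.

```shell
#!/bin/sh
# Hypothetical helper (not part of Proxmox VE): print the VM ID encoded in a
# cgroup line such as "0::/qemu.slice/112.scope". Returns non-zero when the
# line has no qemu.slice entry, e.g. "0::/system.slice/pvescheduler.service".
vmid_from_cgroup() {
    # sed prints only the captured digits; grep . turns "no output" into exit 1
    printf '%s\n' "$1" \
        | sed -n 's|^.*/qemu\.slice/\([0-9][0-9]*\)\.scope.*$|\1|p' \
        | grep .
}
```

Called with the second entry above (0::/qemu.slice/112.scope) it prints the VM ID; called with the first entry (0::/system.slice/pvescheduler.service) it prints nothing and fails, which matches the "could not get vmid from pid" message.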

sar · Jun 13, 2024

@fiona

The log file is attached.

The log shows a different VM ID, but it is the same problem.

The log I sent covers the time when the error started.

fiona · Jun 13, 2024

Please also share the full backup task log. You can find it in the node's Task History, filter by Task Type: vzdump.

Should you have another hanging VM, can you please check with cat /var/run/qemu-server/XYZ.pid if the PID matches the one from the qmeventd error messages, and then run

Code:

qm status XYZ --verbose
cat /proc/PID/cmdline
cat /proc/PID/cgroup

EDIT: does using qm resume XYZ make the VM run again?
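
The checks listed above could be gathered in one go with something like the following sketch. The function name vm_diag and its two optional path arguments are hypothetical (they exist only so the helper can be tried against a copy of the files); qm status, the pid file location and the /proc entries are the ones named above.

```shell
#!/bin/sh
# Hypothetical helper: print the diagnostics requested above for one VM ID.
# $1 = VM ID, $2 = proc root (default /proc),
# $3 = pid-file directory (default /var/run/qemu-server).
vm_diag() {
    vmid=$1
    proc_root=${2:-/proc}
    pid_dir=${3:-/var/run/qemu-server}

    # Only call qm when it is available, i.e. on a Proxmox VE node.
    if command -v qm >/dev/null 2>&1; then
        qm status "$vmid" --verbose
    fi

    pid=$(cat "$pid_dir/$vmid.pid") || return 1
    # Compare this PID with the one in the qmeventd error messages.
    echo "pid: $pid"
    # /proc/PID/cmdline is NUL-separated; turn the NULs into spaces to read it.
    echo "cmdline: $(tr '\0' ' ' < "$proc_root/$pid/cmdline")"
    echo "cgroup: $(cat "$proc_root/$pid/cgroup")"
}
```

On the node this would be run as vm_diag XYZ, replacing XYZ with the ID of the hanging VM.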

sar · Jun 13, 2024

@fiona

sar · Jun 13, 2024

@fiona

VM 132

EDIT: does using qm resume XYZ make the VM run again?

Yes, after running qm resume 132, the VM started running again.
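
Since qm resume brings the VM back, a stopgap until the cause is found might look like the sketch below. Both function names are hypothetical. The qmpstatus field is read from the qm status --verbose output; which values a hung VM actually reports (prelaunch and paused here) is an assumption and was not confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read `qm status <vmid> --verbose` output on stdin and
# print the value of its qmpstatus field (empty if the field is missing).
qmpstatus_of() {
    sed -n 's/^qmpstatus:[[:space:]]*//p'
}

# Hypothetical workaround: resume the VM only when QEMU reports it as not
# running. The list of states below is an assumption, adjust it to what
# `qm status XYZ --verbose` shows for a VM that is actually hanging.
resume_if_paused() {
    state=$(qm status "$1" --verbose | qmpstatus_of)
    case $state in
        prelaunch|paused) qm resume "$1" ;;
        *) echo "VM $1 has qmpstatus '$state', not resuming" ;;
    esac
}
```

For example, resume_if_paused 132 would only issue qm resume 132 if the VM is reported in one of the listed states.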
