VM cannot resume after upgrade to 8.0

pandada8

Code:
Resuming suspended VM
activating and using 'hp13hdd:171/vm-171-state-suspend-2023-07-01.raw' as vmstate
qemu: qemu_mutex_unlock_impl: Operation not permitted
TASK ERROR: start failed: QEMU exited with code 1


Code:
# pveversion --verbose
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.2
pve-kernel-6.1: 7.3-5
pve-kernel-5.15: 7.3-2
pve-kernel-6.2.16-3-pve: 6.2.16-3
pve-kernel-6.1.14-1-pve: 6.1.14-1
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph: 17.2.6-pve1+3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.5
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.3
libpve-rs-perl: 0.8.3
libpve-storage-perl: 8.0.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
openvswitch-switch: 3.1.0-2
proxmox-backup-client: 3.0.1-1
proxmox-backup-file-restore: 3.0.1-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.1
proxmox-widget-toolkit: 4.0.6
pve-cluster: 8.0.1
pve-container: 5.0.4
pve-docs: 8.0.4
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.5
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1
 
The pve7to8 script warns that running VMs should be migrated or stopped. I can't be sure, but I assume suspending with 7.4 and resuming with 8.0 is not supported, as the QEMU major versions are different.
EDIT: Turns out that my assumption was wrong; it should work.
 
Hi,
The pve7to8 script warns that running VMs should be migrated or stopped. I can't be sure, but I assume suspending with 7.4 and resuming with 8.0 is not supported, as the QEMU major versions are different.
No, this should of course work, and it does in a quick test here.

Code:
Resuming suspended VM
activating and using 'hp13hdd:171/vm-171-state-suspend-2023-07-01.raw' as vmstate
qemu: qemu_mutex_unlock_impl: Operation not permitted
TASK ERROR: start failed: QEMU exited with code 1
Please share the VM configuration with qm config 171 --current. Do you remember which QEMU version was running when you hibernated? If you had upgraded to Proxmox VE 7.4, it was probably QEMU 7.2. What kind of storage is hp13hdd?
 
Code:
agent: 1
boot: order=ide2;scsi0
cores: 12
cpu: host
hotplug: disk,network,usb,memory,cpu
ide2: none,media=cdrom
memory: 10240
name: arch-dev
net0: virtio=9A:54:F2:C7:A5:DA,bridge=vmbr0,firewall=1,tag=1015
numa: 1
onboot: 1
ostype: l26
runningcpu: host,+kvm_pv_eoi,+kvm_pv_unhalt
runningmachine: pc-i440fx-7.2+pve0
scsi0: hp13hdd:171/vm-171-disk-3.qcow2,discard=on,iothread=1,size=97G
scsi1: hp13hdd:171/vm-171-disk-2.qcow2,discard=on,size=100G
scsi2: hp13hdd:171/vm-171-disk-1.qcow2,discard=on,size=100G
scsihw: virtio-scsi-single
serial0: socket
smbios1: uuid=d56fa681-7a7c-4966-9bdb-432cb394f116
sockets: 1
virtio1: hp13hdd:171/vm-171-disk-0.qcow2,discard=on,size=350G
vmgenid: 5b269dc8-71aa-4e16-b7f4-430d078cd383
watchdog: model=i6300esb,action=reset

hp13hdd is an NFS share backed by ZFS.
I got mixed results from suspend/resume during the upgrade.
Most nodes were fine, but two nodes (one E5-2680v2, one EPYC 7702) gave me this `qemu_mutex_unlock_impl` error. Unfortunately, I have already upgraded all nodes, so I cannot reproduce it anymore.
 
I still wasn't able to reproduce the issue, even when matching your configuration. AFAICT, the error might have something to do with the iothread=1 on the disk (the AioContext mutexes used for those can return the error in question if unlocked from the wrong thread):
Code:
       The pthread_mutex_unlock function returns the following error code on error:
              EPERM  the calling thread does not own the mutex (``error checking'' mutexes only).
but locking/unlocking is used all throughout the block layer in QEMU, so it's impossible to tell where exactly the issue happened without a stack trace.
 
Pah, after running a minor update (incl. kernel) I'm having the exact same problem. None of my 10 VMs could be resumed.
Code:
root@hordak:~# qm resume 64025
Resuming suspended VM
activating and using 'boris64:64025/vm-64025-state-suspend-2023-07-20.raw' as vmstate
swtpm_setup: Not overwriting existing state file.
qemu: qemu_mutex_unlock_impl: Operation not permitted
stopping swtpm instance (pid 31639) due to QEMU startup error
start failed: QEMU exited with code 1

Conclusion:
If I remove 'iothread=1' from /etc/pve/qemu-server/${vm}.conf, the hibernation state can be resumed successfully. I'm pretty sure that's _not_ the correct way to do it (and if I'm right, 'iothread=1' has been _the_ default since 7.x?), but it seems to work (for now). My VMs are up and running again.
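
For anyone else hitting this, a rough sketch of the workaround described above (VMID 64025 and the backup path are just examples from this thread, adjust them to your setup; this only drops the iothread flag, it is not an official fix):

Code:
# back up the VM config, then remove the iothread flag from the affected disk line(s)
cp /etc/pve/qemu-server/64025.conf /root/64025.conf.bak
sed -i 's/,iothread=1//' /etc/pve/qemu-server/64025.conf
qm resume 64025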

Is there any other useful info I could provide to help debug this issue?
 

Attachments

  • pveversion.txt
  • qm-vm-current.txt
Hi,
so at least we know the issue is not caused by the fact that the hibernation was done with QEMU 7.2. Thank you for sharing the workaround.

What kind of storage is boris64? Also a network storage?

Do you also have a VM without TPM state that is still hibernated? In that case, we could try to get a stack trace via GDB.
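
For reference, a rough sketch of how such a trace might be collected, assuming systemd-coredump, gdb and matching QEMU debug symbols are installed (this mutex error path in QEMU calls abort(), so the failed resume should leave a core dump); not an official procedure:

Code:
apt install systemd-coredump gdb   # debug symbols for pve-qemu-kvm are needed as well
qm resume 64025                    # reproduce the failure (example VMID)
coredumpctl list                   # look for the fresh QEMU core dump
coredumpctl gdb                    # opens gdb on the newest dump
# in gdb: thread apply all backtrace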
 
Hey Fiona,

This is a simple storage (type Directory) with flat qcow2 (etc.) files in it, so nothing fancy (no network, hyperconverged or whatever). I'm afraid there's no hibernated VM left, sorry. Also, hibernation seems to work now.

PS: After doing some apt history "investigation", the QEMU update seems like a plausible culprit. I did update QEMU quite some time ago, but didn't reboot the system or restart the QEMU processes (VMs).

Anyway, thanks for your input/time!
 
PS: After doing some apt history "investigation", the QEMU update seems like a plausible culprit. I did update QEMU quite some time ago, but didn't reboot the system or restart the QEMU processes (VMs).
Hmm, but the configuration you posted shows runningmachine: pc-i440fx-8.0+pve0, so it was already running with an 8.0 QEMU binary before hibernation. Or do you suspect the minor update caused the issue? That one didn't touch any code related to hibernation though. I'd rather suspect the issue is some racy edge-case scenario, which makes it very difficult to reproduce.
 
I ran into the issue myself now while testing something else. The prerequisites are that you have a drive with iothread and take a PBS backup. When you hibernate or snapshot afterwards, it will not be possible to resume. A patch has been sent to the mailing list: https://lists.proxmox.com/pipermail/pve-devel/2023-July/058562.html
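
For reference, a rough sketch of that reproduction in terms of commands (VMID 171 and the storage name 'my-pbs' are examples, any Proxmox Backup Server storage should do; the VM needs a disk with iothread=1):

Code:
vzdump 171 --storage my-pbs --mode snapshot   # PBS backup of the running VM
qm suspend 171 --todisk                       # hibernate afterwards
qm resume 171                                 # fails with qemu_mutex_unlock_impl before the fix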

EDIT: Fix is included in pve-qemu-kvm >= 8.0.2-4
 
