VM Cannot resume after upgrade to 8.0

pandada8

Code:
Resuming suspended VM
activating and using 'hp13hdd:171/vm-171-state-suspend-2023-07-01.raw' as vmstate
qemu: qemu_mutex_unlock_impl: Operation not permitted
TASK ERROR: start failed: QEMU exited with code 1


Code:
# pveversion --verbose
proxmox-ve: 8.0.1 (running kernel: 6.2.16-3-pve)
pve-manager: 8.0.3 (running version: 8.0.3/bbf3993334bfa916)
pve-kernel-6.2: 8.0.2
pve-kernel-6.1: 7.3-5
pve-kernel-5.15: 7.3-2
pve-kernel-6.2.16-3-pve: 6.2.16-3
pve-kernel-6.1.14-1-pve: 6.1.14-1
pve-kernel-5.15.85-1-pve: 5.15.85-1
pve-kernel-5.15.30-2-pve: 5.15.30-3
ceph: 17.2.6-pve1+3
ceph-fuse: 17.2.6-pve1+3
corosync: 3.1.7-pve3
criu: 3.17.1-2
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-3
libknet1: 1.25-pve1
libproxmox-acme-perl: 1.4.6
libproxmox-backup-qemu0: 1.4.0
libproxmox-rs-perl: 0.3.0
libpve-access-control: 8.0.3
libpve-apiclient-perl: 3.3.1
libpve-common-perl: 8.0.5
libpve-guest-common-perl: 5.0.3
libpve-http-server-perl: 5.0.3
libpve-rs-perl: 0.8.3
libpve-storage-perl: 8.0.2
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 5.0.2-4
lxcfs: 5.0.3-pve3
novnc-pve: 1.4.0-2
openvswitch-switch: 3.1.0-2
proxmox-backup-client: 3.0.1-1
proxmox-backup-file-restore: 3.0.1-1
proxmox-kernel-helper: 8.0.2
proxmox-mail-forward: 0.2.0
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.1
proxmox-widget-toolkit: 4.0.6
pve-cluster: 8.0.1
pve-container: 5.0.4
pve-docs: 8.0.4
pve-edk2-firmware: 3.20230228-4
pve-firewall: 5.0.2
pve-firmware: 3.7-1
pve-ha-manager: 4.0.2
pve-i18n: 3.0.5
pve-qemu-kvm: 8.0.2-3
pve-xtermjs: 4.16.0-3
qemu-server: 8.0.6
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.1.12-pve1
 
The pve7to8 script warns to migrate or stop running VMs. I can't be sure but I assume suspending with 7.4 and resuming with 8.0 is not supported as the QEMU major versions are different.
EDIT: Turns out that my assumption was wrong; it should work.
 
Hi,
No, suspending with 7.4 and resuming with 8.0 should work, of course, and it does here in a quick test.

Please share the VM configuration with qm config 171 --current. Do you remember which QEMU version was running when you hibernated? If you had upgraded to Proxmox VE 7.4, it was probably QEMU 7.2. What kind of storage is hp13hdd?
 
Code:
agent: 1
boot: order=ide2;scsi0
cores: 12
cpu: host
hotplug: disk,network,usb,memory,cpu
ide2: none,media=cdrom
memory: 10240
name: arch-dev
net0: virtio=9A:54:F2:C7:A5:DA,bridge=vmbr0,firewall=1,tag=1015
numa: 1
onboot: 1
ostype: l26
runningcpu: host,+kvm_pv_eoi,+kvm_pv_unhalt
runningmachine: pc-i440fx-7.2+pve0
scsi0: hp13hdd:171/vm-171-disk-3.qcow2,discard=on,iothread=1,size=97G
scsi1: hp13hdd:171/vm-171-disk-2.qcow2,discard=on,size=100G
scsi2: hp13hdd:171/vm-171-disk-1.qcow2,discard=on,size=100G
scsihw: virtio-scsi-single
serial0: socket
smbios1: uuid=d56fa681-7a7c-4966-9bdb-432cb394f116
sockets: 1
virtio1: hp13hdd:171/vm-171-disk-0.qcow2,discard=on,size=350G
vmgenid: 5b269dc8-71aa-4e16-b7f4-430d078cd383
watchdog: model=i6300esb,action=reset

hp13hdd is an NFS share backed by ZFS.
I got mixed results from suspend/resume during the upgrade.
Most nodes were fine, but two nodes (one E5-2680v2, one EPYC 7702) gave me this `qemu_mutex_unlock_impl` error.
Unfortunately I have already updated all nodes, so I can no longer reproduce it.
 
I still wasn't able to reproduce the issue, matching your configuration. AFAICT, the error might have something to do with the iothread=1 on the disk (AioContext mutexes used for those can return the error in question if unlocked from the wrong thread):
Code:
       The pthread_mutex_unlock function returns the following error code on error:
              EPERM  the calling thread does not own the mutex (``error checking'' mutexes only).
but locking/unlocking is used all throughout the block layer in QEMU so it's impossible to tell where exactly the issue happened without a stack trace.
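To illustrate the EPERM case described above, here is a minimal sketch in Python rather than QEMU's C code. It is only an analogy: Python's `threading.RLock` tracks its owning thread in a similar way to an error-checking pthread mutex and refuses a release from a thread that does not hold it. The function name `unlock_from_wrong_thread` is made up for this example.

```python
import threading


def unlock_from_wrong_thread() -> str:
    """Acquire a lock in the calling thread, then try to release it from another."""
    lock = threading.RLock()  # RLock records which thread owns it
    lock.acquire()            # the calling thread is now the owner
    outcome = []

    def worker() -> None:
        try:
            lock.release()    # this thread never acquired the lock
            outcome.append("released")
        except RuntimeError as exc:
            # Python's counterpart to pthread_mutex_unlock returning EPERM
            outcome.append(f"refused: {exc}")

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    lock.release()            # the real owner can still release normally
    return outcome[0]


print(unlock_from_wrong_thread())
```

QEMU aborts on that error instead of raising an exception, which would match the "QEMU exited with code 1" in the task log.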
 
Pah, after running a minor update (incl. kernel) I'm having the exact same problem. None of my 10 VMs could be resumed.
Code:
root@hordak:~# qm resume 64025
Resuming suspended VM
activating and using 'boris64:64025/vm-64025-state-suspend-2023-07-20.raw' as vmstate
swtpm_setup: Not overwriting existing state file.
qemu: qemu_mutex_unlock_impl: Operation not permitted
stopping swtpm instance (pid 31639) due to QEMU startup error
start failed: QEMU exited with code 1

Conclusion:
If I remove 'iothread=1' from /etc/pve/qemu-server/${vm}.conf, the hibernation state can be resumed successfully. I'm pretty sure that's _not_ the correct way to do it (and if I'm correct, 'iothread=1' is _the_ default since 7.x?), but it seems to work (for now). My VMs are up and running again.
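For anyone who wants to see what that edit amounts to, here is a rough Python sketch that strips the `iothread=...` option from a single drive line. It only works on a string and never touches /etc/pve; the helper name `drop_iothread` is made up for this post, and the sample line is the scsi0 entry from the configuration posted earlier in the thread. Treat it as an illustration of the workaround, not as a recommended procedure.

```python
def drop_iothread(line: str) -> str:
    """Return a config line with any iothread=... option removed."""
    key, sep, value = line.partition(": ")
    if not sep:
        return line  # not a "key: value" line, leave it unchanged
    # Drive options are comma-separated; keep everything except iothread
    options = [opt for opt in value.split(",") if not opt.startswith("iothread=")]
    return f"{key}: {','.join(options)}"


sample = "scsi0: hp13hdd:171/vm-171-disk-3.qcow2,discard=on,iothread=1,size=97G"
print(drop_iothread(sample))
```

Lines without an iothread option come back unchanged, so the function could be mapped over every line of a config copy.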

Is there any information I could provide that would help debug this issue?
 

Attachments: pveversion.txt, qm-vm-current.txt
Hi,
so at least we know the issue is not caused by the fact that the hibernation was done with QEMU 7.2. Thank you for sharing the workaround.

What kind of storage is boris64? Also a network storage?

Do you also have a VM without TPM state that is still hibernated? In that case we could try to get a stack trace via GDB.
 
Hey Fiona,

This is a simple storage (type Directory) with flat qcow2 (etc.) files in it, so nothing fancy (no network storage, nothing hyperconverged or the like). I'm afraid there's no hibernated VM left, sorry. Also, hibernation seems to work now.

PS: After some apt history "investigation", the QEMU update seems a plausible culprit. I updated QEMU quite some time ago, but didn't reboot the system or restart the QEMU processes (VMs).

Anyway, thanks for your input/time!
 
Hmm, but the configuration you posted shows runningmachine: pc-i440fx-8.0+pve0 so it was already running with an 8.0 QEMU binary before hibernation. Or do you suspect the minor update causing the issue? That one didn't touch any code related to hibernation though. I'd rather suspect the issue is some racy edge case scenario, making it very difficult to reproduce.
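As a side note, the machine version can be pulled out of the `qm config <vmid> --current` output with a few lines of Python. This is only a sketch: the function name `running_machine_version` is made up, and the regular expression assumes the `runningmachine: pc-i440fx-8.0+pve0` format seen in this thread. The machine version is a lower bound for the QEMU binary that was running, since a VM can be pinned to an older machine type.

```python
import re
from typing import Optional, Tuple


def running_machine_version(config_text: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from a 'runningmachine:' line, or None if absent."""
    match = re.search(
        r"^runningmachine:\s*\S*?-(\d+)\.(\d+)(?:\+\S*)?\s*$",
        config_text,
        re.MULTILINE,
    )
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


config = "ostype: l26\nrunningmachine: pc-i440fx-7.2+pve0\nsockets: 1\n"
print(running_machine_version(config))
```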
 
I ran into the issue while testing something else now. The prerequisites are that you have a drive with iothread enabled and take a PBS backup. If you hibernate or take a snapshot afterwards, it will not be possible to resume. A patch has been sent to the mailing list: https://lists.proxmox.com/pipermail/pve-devel/2023-July/058562.html

EDIT: Fix is included in pve-qemu-kvm >= 8.0.2-4
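To check an installed version against that threshold, a simplified comparison could look like the Python sketch below. It is an assumption-laden shortcut: it only handles purely numeric versions such as the `8.0.2-3` reported by pveversion earlier in the thread, and it does not implement full Debian version ordering (`dpkg --compare-versions` is the authoritative tool for that). The names `parse_version` and `has_fix` are made up for this example.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Split a numeric version like '8.0.2-3' into (8, 0, 2, 3)."""
    return tuple(int(part) for part in re.split(r"[.\-]", version))


def has_fix(installed: str, fixed: str = "8.0.2-4") -> bool:
    """True if the installed pve-qemu-kvm version is at least the fixed one."""
    return parse_version(installed) >= parse_version(fixed)


for candidate in ("8.0.2-3", "8.0.2-4"):
    print(candidate, has_fix(candidate))
```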
 
