query-savevm

mrapajic

Hi guys,

I'm running the newest version of Proxmox VE (enterprise repo).
Local disks with ZFS, qemu-agent installed, guest OS is Debian 10.

After taking a snapshot I got the following message and the VM stopped:

Code:
saving VM state and RAM using storage 'local-zfs'
VM xx not running
snapshot create failed: starting cleanup
TASK ERROR: VM xx qmp command 'query-savevm' failed - client closed connection


Proxmox version

Code:
proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.18-1-pve: 5.3.18-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-3
libpve-guest-common-perl: 3.1-4
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.8-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-5
pve-cluster: 6.2-1
pve-container: 3.3-3
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-5
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1


Any ideas?
 
Could you post your journal log from when the crash occurred? It should contain a reason for the VM stopping (you can grep for "QEMU").

Also, did you potentially live-migrate that machine before taking the snapshot?
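
For anyone who wants to do that filtering on a saved journal excerpt, here is a minimal Python sketch. The function name `qemu_lines` and the regular expression are assumptions made for illustration, not part of any Proxmox tool; the sample lines are taken from the journal log posted in this thread.

```python
import re

# Matches the syslog-style prefix "Mon DD HH:MM:SS host QEMU[pid]:".
QEMU_ENTRY = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+\S+\s+QEMU(\[\d+\])?:")

def qemu_lines(journal_text):
    """Return only the journal lines that were logged by a QEMU process."""
    return [line for line in journal_text.splitlines() if QEMU_ENTRY.match(line)]

# Sample lines copied from the journal excerpt in this thread.
sample = """\
Mar 11 08:24:02 proxmox1 zed[3214]: eid=12691 class=history_event pool_guid=0x27863ED92DE8C45C
Mar 11 08:24:03 proxmox1 QEMU[27000]: **
Mar 11 08:24:09 proxmox1 pvedaemon[2929]: VM 105 not running
"""

for line in qemu_lines(sample):
    print(line)
```

On a live host, plain `grep QEMU` on the journal output does the same job; the sketch is only useful when you already have the log as text.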
 
Hi Stefan,

I did a live migration from another node the previous day (a gap of at least 20 hours). This VM is also replicated to another node every 5 minutes. Here is the journal log:

Code:
Mar 11 08:24:02 proxmox1 pvedaemon[2929]: <user@pam> snapshot VM 105: before_upgrade
Mar 11 08:24:02 proxmox1 pvedaemon[33176]: <user@pam> starting task UPID:proxmox1:00000B71:0073BEB2:6049C592:qmsnapshot:105:user@pam:
Mar 11 08:24:02 proxmox1 zed[3214]: eid=12691 class=history_event pool_guid=0x27863ED92DE8C45C
Mar 11 08:24:03 proxmox1 zed[3378]: eid=12692 class=history_event pool_guid=0x27863ED92DE8C45C
Mar 11 08:24:03 proxmox1 QEMU[27000]: **
Mar 11 08:24:03 proxmox1 QEMU[27000]: ERROR:/home/builder/source/pve-qemu-kvm-5.1.0/softmmu/cpus.c:1781:qemu_mutex_lock_iothread_impl: assertion failed: (!qemu_mutex_iothread_locked())
Mar 11 08:24:04 proxmox1 zed[3616]: eid=12693 class=history_event pool_guid=0x27863ED92DE8C45C
Mar 11 08:24:04 proxmox1 zed[3624]: eid=12694 class=history_event pool_guid=0x27863ED92DE8C45C
Mar 11 08:24:06 proxmox1 zed[4042]: eid=12695 class=history_event pool_guid=0x27863ED92DE8C45C
Mar 11 08:24:06 proxmox1 zed[4121]: eid=12696 class=history_event pool_guid=0x27863ED92DE8C45C
Mar 11 08:24:07 proxmox1 zed[4601]: eid=12697 class=history_event pool_guid=0x27863ED92DE8C45C
Mar 11 08:24:09 proxmox1 kernel: vmbr1v751: port 2(tap105i1) entered disabled state
Mar 11 08:24:09 proxmox1 kernel: vmbr1v751: port 2(tap105i1) entered disabled state
Mar 11 08:24:09 proxmox1 pvedaemon[2929]: VM 105 qmp command failed - VM 105 qmp command 'query-savevm' failed - client closed connection
Mar 11 08:24:09 proxmox1 pvedaemon[2929]: VM 105 qmp command failed - VM 105 not running
Mar 11 08:24:09 proxmox1 pvedaemon[2929]: VM 105 not running
Mar 11 08:24:09 proxmox1 pvedaemon[2929]: snapshot create failed: starting cleanup
Mar 11 08:24:09 proxmox1 kernel: vmbr1v19: port 3(tap105i0) entered disabled state
Mar 11 08:24:09 proxmox1 kernel: vmbr1v19: port 3(tap105i0) entered disabled state
Mar 11 08:24:09 proxmox1 systemd[1]: 105.scope: Succeeded.
Mar 11 08:24:09 proxmox1 zed[4978]: eid=12698 class=history_event pool_guid=0x27863ED92DE8C45C
Mar 11 08:24:09 proxmox1 systemd[1]: pvesr.service: Succeeded.
Mar 11 08:24:09 proxmox1 systemd[1]: Started Proxmox VE replication runner.
Mar 11 08:24:10 proxmox1 pvedaemon[2929]: VM 105 qmp command 'query-savevm' failed - client closed connection
Mar 11 08:24:10 proxmox1 pvedaemon[33176]: <user@pam> end task UPID:proxmox1:00000B71:0073BEB2:6049C592:qmsnapshot:105:user@pam: VM 105 qmp command 'query-savevm' failed - client closed connection
Mar 11 08:24:10 proxmox1 qmeventd[2557]: Starting cleanup for 105
Mar 11 08:24:10 proxmox1 qmeventd[2557]: Finished cleanup for 105
Mar 11 08:24:20 proxmox1 pvedaemon[6089]: start VM 105: UPID:proxmox1:000017C9:0073C5CE:6049C5A4:qmstart:105:user@pam:
Mar 11 08:24:20 proxmox1 pvedaemon[30604]: <user@pam> starting task UPID:proxmox1:000017C9:0073C5CE:6049C5A4:qmstart:105:user@pam:
 
Thanks for the info! Your logs line up with an issue we are aware of and currently investigating; I'll inform you once we figure out a fix. If you're curious how to avoid it in the meantime, the exact reproducer seems to be: start a VM, make a backup to a PBS instance, live-migrate, then try to take a snapshot with RAM state.
 
Hi Stefan,

I can confirm that all of the steps you mentioned were involved: "Start a VM, make a backup to a PBS instance, live-migrate, then try to take a snapshot with RAM state". So the workaround would be a snapshot without RAM state?
 
That should work. Anything breaking the chain of events I mentioned seems to avoid the issue for now.
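
A minimal sketch of that workaround, assuming the `qm snapshot` command and its `--vmstate` option as documented in the qm manual page. The helper name `snapshot_command` is hypothetical, and the sketch only builds the argument list; it does not run anything.

```python
def snapshot_command(vmid, name, with_ram=False):
    """Build a `qm snapshot` argument list; RAM state is saved only on request."""
    cmd = ["qm", "snapshot", str(vmid), name]
    if with_ram:
        # This is the variant that triggers the crash in the scenario above.
        cmd += ["--vmstate", "1"]
    return cmd

# Workaround: leave the RAM state out.
print(" ".join(snapshot_command(105, "before_upgrade")))
```

The resulting list could be passed to `subprocess.run` on the node that hosts the VM. In the web interface the equivalent is to leave the "Include RAM" box unchecked.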
 
Is it possible that this is also due to the fact that the VM I am trying to snapshot runs Windows Server and has 96 GB of RAM?
 
OK, in fact today I have many VMs blocked. It is probably correlated with backups, because the backup task that uses PBS now reports for each VM: "qmp command 'query-backup' failed - got timeout".
 
I suspect that the VM hangs when you ask for a snapshot (another thread confirms that snapshots WITH RAM no longer work) or for a bitmap (to do differential backups).
 
Maybe your problem is RAM on the PBS side. Try adding more RAM to the Proxmox Backup Server. I had the same problem.
 
Hi @Stefan_R,

we are currently running into the same snapshot-and-crash issue. We reproduced the problem with the steps mentioned above:
  1. Start VM
    • Test Snapshot with RAM: works
  2. Backup VM to PBS
    • Test Snapshot with RAM: works
  3. Live-Migrate VM to another PVE-Node
    • Test Snapshot with RAM: FAILS (+ VM ends up stopped)
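
The steps above can be sketched as a command plan in Python. This is only an illustration under assumptions: it uses the `qm start`, `vzdump --storage`, `qm migrate --online` and `qm snapshot --vmstate` commands from the standard Proxmox VE CLI, while the VM ID `100`, the storage name `pbs01`, the node name `pve2`, the `ramtest_` snapshot names and the function name are all hypothetical placeholders. Nothing is executed.

```python
def reproducer_plan(vmid, pbs_storage, target_node):
    """Return (label, argv) pairs mirroring the three steps listed above."""
    def snap(tag):
        # Snapshot including RAM state, with a distinct name per step.
        return ["qm", "snapshot", str(vmid), f"ramtest_{tag}", "--vmstate", "1"]

    return [
        ("1. start VM", ["qm", "start", str(vmid)]),
        ("   snapshot with RAM", snap("after_start")),
        ("2. backup to PBS", ["vzdump", str(vmid), "--storage", pbs_storage]),
        ("   snapshot with RAM", snap("after_backup")),
        ("3. live-migrate", ["qm", "migrate", str(vmid), target_node, "--online"]),
        # After the migration this last command has to run on the target node.
        ("   snapshot with RAM", snap("after_migrate")),
    ]

for label, argv in reproducer_plan(100, "pbs01", "pve2"):
    print(f"{label:22} {' '.join(argv)}")
```

Only run such a sequence against a disposable test VM, since the last snapshot leaves the VM stopped when the bug is present.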
You mention the fix discussed in related thread "All VMs locking up after latest PVE update" should also help with this snapshot issue.

According to the same thread, a rollback to pve-qemu-kvm=5.1.0-8 and libproxmox-backup-qemu0=1.0.2-1 should also help with this issue.

However, we have never upgraded past those versions (we run the Enterprise repository) and still experience the same problem. Maybe some other factor matters here, beyond what the other discussion covers?

Best regards,
Andreas

P.S.:

Snapshot error
Code:
saving VM state and RAM using storage 'rbd01_vm'
VM xx not running
snapshot create failed: starting cleanup
2021-04-09 15:13:16.525 7f8086ffd700 -1 librbd::image::PreRemoveRequest: 0x559fc1d94b60 handle_exclusive_lock: cannot obtain exclusive lock - not removing
Removing image: 0% complete...failed.
rbd: error: image still has watchers
error during cfs-locked 'storage-rbd01_vm' operation: rbd rm 'vm-xx-state-snapshot_test_after_pbs_after_migrate' error: rbd: error: image still has watchers
TASK ERROR: VM xx qmp command 'query-savevm' failed - client closed connection

pveversion -v
Code:
proxmox-ve: 6.3-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.3-6 (running version: 6.3-6/2184247e)
pve-kernel-5.4: 6.3-8
pve-kernel-helper: 6.3-8
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-4.15: 5.4-18
pve-kernel-4.15.18-29-pve: 4.15.18-57
ceph: 14.2.19-pve1
ceph-fuse: 14.2.19-pve1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.0.13-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-9
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-8
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1
 
You mention the fix discussed in related thread "All VMs locking up after latest PVE update" should also help with this snapshot issue.
I only mentioned it because we pushed the fix in the same package version (pve-qemu-kvm 5.2.0-4). This version is currently only available in the no-subscription repository. The fix itself was a different one, the issues are unrelated.
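
To check whether a node already runs a build with the fix, a small Python sketch can compare the `pve-qemu-kvm` line from `pveversion -v` against 5.2.0-4. The function names are hypothetical, and the comparison is a deliberate simplification that only compares the integer runs in the version strings; `dpkg --compare-versions` is the authoritative way to order Debian package versions.

```python
import re

FIXED = "5.2.0-4"  # pve-qemu-kvm version named above as carrying the fix

def parse_pveversion(text):
    """Map package name -> version string from `pveversion -v` style output."""
    versions = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(": ")
        if sep and rest.strip():
            versions[name.strip()] = rest.split()[0]
    return versions

def numeric_key(version):
    """Simplified ordering key: the integer runs in the version string."""
    return tuple(int(part) for part in re.findall(r"\d+", version))

def has_fix(text):
    """True if pve-qemu-kvm is listed and is at least the fixed version."""
    installed = parse_pveversion(text).get("pve-qemu-kvm")
    return installed is not None and numeric_key(installed) >= numeric_key(FIXED)

# Lines copied from the `pveversion -v` output posted earlier in this thread.
sample = """\
proxmox-ve: 6.3-1 (running kernel: 5.4.106-1-pve)
libproxmox-backup-qemu0: 1.0.2-1
pve-qemu-kvm: 5.1.0-8
"""
print(has_fix(sample))  # False
```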
 
Hi @Stefan_R

Thanks for the clarification!

Is there an estimate when the fixed version will arrive in the enterprise repository? (I didn't see it at the time of this writing.)

Alternatively, how safe is it for our production stability to manually pick that single package from no-subscription?

Best Regards,
Andreas
 
Generally no estimates from our side, but so far we've received positive feedback from the brave people running it already ;)

In terms of compatibility it is okay to pick the package; pve-qemu-kvm does not have many hard dependencies (though you might need 'libproxmox-backup-qemu0' as well). Generally speaking it is of course not recommended, since pve-enterprise should always be used for stability, but if you need this fix for your current setup it can make sense. Definitely test the version on a test system beforehand: everyone's workload is different, and you might run into issues others haven't. If you do, let us know :)
 
