query-savevm

mrapajic · Mar 11, 2021

Hi guys,

running the newest version of Proxmox VE (enterprize repo).
Local disks with ZFS, qemu-agent installed, guest OS is Debian 10.

After taking a snapshot I got the following message and the VM stopped

Code:

()
saving VM state and RAM using storage 'local-zfs'
VM xx not running
snapshot create failed: starting cleanup
TASK ERROR: VM xx qmp command 'query-savevm' failed - client closed connection

Proxmox version

Code:

proxmox-ve: 6.3-1 (running kernel: 5.4.78-2-pve)
pve-manager: 6.3-3 (running version: 6.3-3/eee5f901)
pve-kernel-5.4: 6.3-3
pve-kernel-helper: 6.3-3
pve-kernel-5.3: 6.1-6
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-5.4.73-1-pve: 5.4.73-1
pve-kernel-5.4.65-1-pve: 5.4.65-1
pve-kernel-5.4.60-1-pve: 5.4.60-2
pve-kernel-5.4.55-1-pve: 5.4.55-1
pve-kernel-5.4.44-2-pve: 5.4.44-2
pve-kernel-5.4.41-1-pve: 5.4.41-1
pve-kernel-5.4.34-1-pve: 5.4.34-2
pve-kernel-5.3.18-3-pve: 5.3.18-3
pve-kernel-5.3.18-2-pve: 5.3.18-2
pve-kernel-5.3.18-1-pve: 5.3.18-1
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.7
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-3
libpve-guest-common-perl: 3.1-4
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-6
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
proxmox-backup-client: 1.0.8-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-5
pve-cluster: 6.2-1
pve-container: 3.3-3
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.1-3
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-5
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 0.8.5-pve1

Any ideas?

Stefan_R · Mar 11, 2021

Could you post your journal log from when the crash occured? It should have written a reason for the VM stopping (you can grep for "QEMU").

Also, did you potentially live-migrate that machine before taking the snapshot?

mrapajic · Mar 11, 2021

Hi Stefan,

i did a live migrate from another node on the previous day (20h gap at least). I have replication every 5 min of this VM going on another node. Here is the journal log:

Code:

Mar 11 08:24:02 proxmox1 pvedaemon[2929]: <user@pam> snapshot VM 105: before_upgrade
Mar 11 08:24:02 proxmox1 pvedaemon[33176]: <user@pam> starting task UPID:proxmox1:00000B71:0073BEB2:6049C592:qmsnapshot:105:user@pam:
Mar 11 08:24:02 proxmox1 zed[3214]: eid=12691 class=history_event pool_guid=0x27863ED92DE8C45C
Mar 11 08:24:03 proxmox1 zed[3378]: eid=12692 class=history_event pool_guid=0x27863ED92DE8C45C
Mar 11 08:24:03 proxmox1 QEMU[27000]: **
Mar 11 08:24:03 proxmox1 QEMU[27000]: ERROR:/home/builder/source/pve-qemu-kvm-5.1.0/softmmu/cpus.c:1781:qemu_mutex_lock_iothread_impl: assertion failed: (!qemu_mutex_iothread_locked())
Mar 11 08:24:04 proxmox1 zed[3616]: eid=12693 class=history_event pool_guid=0x27863ED92DE8C45C
Mar 11 08:24:04 proxmox1 zed[3624]: eid=12694 class=history_event pool_guid=0x27863ED92DE8C45C
Mar 11 08:24:06 proxmox1 zed[4042]: eid=12695 class=history_event pool_guid=0x27863ED92DE8C45C
Mar 11 08:24:06 proxmox1 zed[4121]: eid=12696 class=history_event pool_guid=0x27863ED92DE8C45C
Mar 11 08:24:07 proxmox1 zed[4601]: eid=12697 class=history_event pool_guid=0x27863ED92DE8C45C
Mar 11 08:24:09 proxmox1 kernel: vmbr1v751: port 2(tap105i1) entered disabled state
Mar 11 08:24:09 proxmox1 kernel: vmbr1v751: port 2(tap105i1) entered disabled state
Mar 11 08:24:09 proxmox1 pvedaemon[2929]: VM 105 qmp command failed - VM 105 qmp command 'query-savevm' failed - client closed connection
Mar 11 08:24:09 proxmox1 pvedaemon[2929]: VM 105 qmp command failed - VM 105 not running
Mar 11 08:24:09 proxmox1 pvedaemon[2929]: VM 105 not running
Mar 11 08:24:09 proxmox1 pvedaemon[2929]: snapshot create failed: starting cleanup
Mar 11 08:24:09 proxmox1 kernel: vmbr1v19: port 3(tap105i0) entered disabled state
Mar 11 08:24:09 proxmox1 kernel: vmbr1v19: port 3(tap105i0) entered disabled state
Mar 11 08:24:09 proxmox1 systemd[1]: 105.scope: Succeeded.
Mar 11 08:24:09 proxmox1 zed[4978]: eid=12698 class=history_event pool_guid=0x27863ED92DE8C45C
Mar 11 08:24:09 proxmox1 systemd[1]: pvesr.service: Succeeded.
Mar 11 08:24:09 proxmox1 systemd[1]: Started Proxmox VE replication runner.
Mar 11 08:24:10 proxmox1 pvedaemon[2929]: VM 105 qmp command 'query-savevm' failed - client closed connection
Mar 11 08:24:10 proxmox1 pvedaemon[33176]: <user@pam> end task UPID:proxmox1:00000B71:0073BEB2:6049C592:qmsnapshot:105:user@pam: VM 105 qmp command 'query-savevm' failed - client closed connection
Mar 11 08:24:10 proxmox1 qmeventd[2557]: Starting cleanup for 105
Mar 11 08:24:10 proxmox1 qmeventd[2557]: Finished cleanup for 105
Mar 11 08:24:20 proxmox1 pvedaemon[6089]: start VM 105: UPID:proxmox1:000017C9:0073C5CE:6049C5A4:qmstart:105:user@pam:
Mar 11 08:24:20 proxmox1 pvedaemon[30604]: <user@pam> starting task UPID:proxmox1:000017C9:0073C5CE:6049C5A4:qmstart:105:user@pam:

Stefan_R · Mar 15, 2021

Thanks for the info! Your logs line up with an issue we are aware of and currently investigating, I'll inform you once we figure out a fix. The exact reproducer seems to be: Start a VM, make a backup to a PBS instance, live-migrate, then try to take a snapshot with RAM state, if you're curious as to how to avoid it in the meantime...

mrapajic · Mar 16, 2021

Hi Stefan,

I can confirm all of the above you mentioned were involved "Start a VM, make a backup to a PBS instance, live-migrate, then try to take a snapshot with RAM state". So the workaround would be a snapshot without RAM state?

Stefan_R · Mar 16, 2021

That should work. Anything breaking the chain of events I mentioned seems to avoid the issue for now.

mgiammarco · Mar 21, 2021

I have exactly the same problem so I am interested too.

mgiammarco · Mar 21, 2021

Is it possible that it is due also to the fact the vm I try to snapshot has windows server on it and 96gb of ram?

mrapajic · Mar 22, 2021

mgiammarco said:
Is it possible that it is due also to the fact the vm I try to snapshot has windows server on it and 96gb of ram?

Probably not. Same thing happens on VM-s with Linux guest OS. Just break the chain like mentioned above.

mgiammarco · Mar 22, 2021

Ok infact today I have many VMs blocked. It is correlated probably to backups because now I get for each VM: "qmp command 'query-backup' failed - got timeout" from backup task that uses PBS.

mgiammarco · Mar 22, 2021

I suspect that when you ask for a snapshot (it another thread is confirmed that snapshots WITH ram do not work anymore) or for a bitmap (to do differential backups) the VM hangs.

Stefan_R · Mar 22, 2021

Sorry, forgot to ping here, but a fix for this specific issue has been applied to our QEMU branch and should be available in repositories soon-ish (we'll probably wait for the fix for the total VM hang issue to push). For reference: https://lists.proxmox.com/pipermail/pve-devel/2021-March/047338.html

mrapajic · Mar 22, 2021

mgiammarco said:
Ok infact today I have many VMs blocked. It is correlated probably to backups because now I get for each VM: "qmp command 'query-backup' failed - got timeout" from backup task that uses PBS.

Maybe your problem is the RAM side of the PBS. Try to add more RAM on the Proxmox Backup SERVER. Hade the same problem.

Andreas Bochem · Apr 9, 2021

Hi @Stefan_R,

we are currently running into the same Snapshot-and-Crash issue. We reproduced the problem through the steps as mentioned above:

Start VM
- Test Snapshot with RAM: works
Backup VM to PBS
- Test Snapshot with RAM: works
Live-Migrate VM to another PVE-Node
- Test Snapshot with RAM: FAILS (+ VM ends up stopped)

You mention the fix discussed in related thread "All VMs locking up after latest PVE update" should also help with this snapshot issue.

According to the same thread, a rollback to pve-qemu-kvm=5.1.0-8 and libproxmox-backup-qemu0=1.0.2-1 should also help with this issue.

However, we have never upgrade past those versions, running on Enterprise Repository, and still experience the same problem. Maybe some other factor is of importance here in addition to the other discussion?

Best regards,
Andreas

P.S.:

Snapshot error

Code:

saving VM state and RAM using storage 'rbd01_vm'
VM xx not running
snapshot create failed: starting cleanup
2021-04-09 15:13:16.525 7f8086ffd700 -1 librbd::image::PreRemoveRequest: 0x559fc1d94b60 handle_exclusive_lock: cannot obtain exclusive lock - not removing
Removing image: 0% complete...failed.
rbd: error: image still has watchers
error during cfs-locked 'storage-rbd01_vm' operation: rbd rm 'vm-xx-state-snapshot_test_after_pbs_after_migrate' error: rbd: error: image still has watchers
TASK ERROR: VM xx qmp command 'query-savevm' failed - client closed connection

pveversion -v

Code:

proxmox-ve: 6.3-1 (running kernel: 5.4.106-1-pve)
pve-manager: 6.3-6 (running version: 6.3-6/2184247e)
pve-kernel-5.4: 6.3-8
pve-kernel-helper: 6.3-8
pve-kernel-5.4.106-1-pve: 5.4.106-1
pve-kernel-5.4.78-2-pve: 5.4.78-2
pve-kernel-4.15: 5.4-18
pve-kernel-4.15.18-29-pve: 4.15.18-57
ceph: 14.2.19-pve1
ceph-fuse: 14.2.19-pve1
corosync: 3.1.0-pve1
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.20-pve1
libproxmox-acme-perl: 1.0.8
libproxmox-backup-qemu0: 1.0.2-1
libpve-access-control: 6.1-3
libpve-apiclient-perl: 3.1-3
libpve-common-perl: 6.3-5
libpve-guest-common-perl: 3.1-5
libpve-http-server-perl: 3.1-1
libpve-storage-perl: 6.3-7
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve4
lxc-pve: 4.0.6-2
lxcfs: 4.0.6-pve1
novnc-pve: 1.1.0-1
openvswitch-switch: 2.12.3-1
proxmox-backup-client: 1.0.13-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.4-9
pve-cluster: 6.2-1
pve-container: 3.3-4
pve-docs: 6.3-1
pve-edk2-firmware: 2.20200531-1
pve-firewall: 4.1-3
pve-firmware: 3.2-2
pve-ha-manager: 3.1-1
pve-i18n: 2.2-2
pve-qemu-kvm: 5.1.0-8
pve-xtermjs: 4.7.0-3
qemu-server: 6.3-8
smartmontools: 7.2-pve2
spiceterm: 3.1-1
vncterm: 1.6-2
zfsutils-linux: 2.0.4-pve1

Stefan_R · Apr 12, 2021

Andreas Bochem said:
You mention the fix discussed in related thread "All VMs locking up after latest PVE update" should also help with this snapshot issue.

I only mentioned it because we pushed the fix in the same package version (pve-qemu-kvm 5.2.0-4). This version is currently only available in the no-subscription repository. The fix itself was a different one, the issues are unrelated.

Andreas Bochem · Apr 12, 2021

Stefan_R said:
I only mentioned it because we pushed the fix in the same package version (pve-qemu-kvm 5.2.0-4). This version is currently only available in the no-subscription repository. The fix itself was a different one, the issues are unrelated.

Hi @Stefan_R

Thanks for the clarification!

Is there an estimate when the fixed version will arrive in the enterprise repository? (I didn't see it at the time of this writing.)

Alternatively, how safe is it for our production stability to manually pick that single package from no-subscription?

Best Regards,
Andreas

Stefan_R · Apr 12, 2021

Andreas Bochem said:
Is there an estimate when the fixed version will arrive in the enterprise repository? (I didn't see it at the time of this writing.)

Generally no estimates from our side, but so far we've received positive feedback from the brave people running it already

In terms of compatibility it is okay to pick the package, pve-qemu-kvm does not have many hard dependencies (though you might need 'libproxmox-backup-qemu0' as well). Generally speaking it is not recommended of course, pve-enterprise should always be used for stability, but if you need this fix for your current setup it can make sense. Definitely test out the version on a test system beforehand, everyones workload is different, and you might run into issues others haven't. If you do, let us know

Search

Search

query-savevm

mrapajic

New Member

Stefan_R

Proxmox Retired Staff

mrapajic

New Member

Stefan_R

Proxmox Retired Staff

mrapajic

New Member

Stefan_R

Proxmox Retired Staff

mgiammarco

Renowned Member

mgiammarco

Renowned Member

mrapajic

New Member

mgiammarco

Renowned Member

mgiammarco

Renowned Member

Stefan_R

Proxmox Retired Staff

mrapajic

New Member

Andreas Bochem

Member

Stefan_R

Proxmox Retired Staff

Andreas Bochem

Member

Stefan_R

Proxmox Retired Staff