QCOW2 Snapshot Deletion Causes VM To Temporarily Lose Ethernet Connection

JBdrakaris · Oct 13, 2020

Good morning,

We have a database cluster consisting of four hosts, all running Proxmox VE 6.2-11. The hosts are paired: master-1 to slave-1, and master-2 to slave-2. They all run Debian VMs on qcow2 disk images, running PostgreSQL v12, with the masters replicating to the slaves. The large ones are around 4 TB in size. There is no Proxmox HA involved, as the replication is done via PostgreSQL.

A snapshot was taken on a few of the VMs while they were being built and was not deleted before going into production. When we realized this oversight, we tested deleting the snapshot on one of the slaves through the web GUI. All was well for a few minutes, but then the VM lost its ethernet connection and the alerts started rolling in. After a few more minutes the GUI issued a timeout error. Eventually, after more than 15 minutes, the machine came back online. Testing showed it was "OK" and the database synced with the master. We unlocked the VM by issuing "qm unlock VMID", cleaned up the config file by deleting the snapshot entries, and confirmed the snapshot no longer existed in the qcow2 via "qemu-img snapshot -l <vm>".

This leads to where we are now: we want to delete the snapshots on the other VMs, but the masters are production and cannot come down. Has this lost ethernet connection happened to anyone else? Is this a result of the multi-TB size of the image? Is there a way to prevent this ethernet loss from happening while deleting the snapshot? Should we just live with it and not delete the snapshots? We were considering scheduling a maintenance window, powering down the VM, and deleting the snapshot. Would this be any better than a "live" deletion?

Any and all help and comments sincerely appreciated.

Thanks!

Stefan_R · Oct 15, 2020

JBdrakaris said:
We unlocked the VM by issuing "qm unlock VMID", cleaned up the config file by deleting the snapshot entries, and confirmed the snapshot no longer existed in the qcow2 via "qemu-img snapshot -l <vm>".

I assume that means that the 'snapshot delete' task failed? Otherwise the lock should be removed automatically and the config updated...

You might have encountered a bug in QEMU then. Can you reproduce the behaviour (i.e. take another snapshot, then delete it)? Also, was the VM responsive during that time with just the network down, or did it stop completely? (Is it accessible via VNC?)

Also, check any and all log files, both from the host and the guest to see if anything useful is to be found.

rojoblandino · Apr 29, 2021

I have many VMs with many drives that have snapshots on them, and we need to remove those snapshots. I am in the same position as "JBdrakaris": I need to delete the snapshots while the VM is running.

I read the thread, but only the question about the lock was answered, and I would like to know about deleting snapshots on running KVM guests.

The snapshots are not shown in the GUI, but "qemu-img info" shows them. One of the disks is 2 TB, but its qcow2 file shows as 3 TB and is using too much space.

Can the snapshot be removed with "qemu-img snapshot -d" while the KVM guest is running?

Do I need to suspend the guest for that task?

Is it possible?

If it is possible, what should I not do, and what are the right steps to do it?

Stefan_R · May 3, 2021

rojoblandino said:
I have many VMs with many drives that have snapshots on them, and we need to remove those snapshots. I am in the same position as "JBdrakaris": I need to delete the snapshots while the VM is running.

I read the thread, but only the question about the lock was answered, and I would like to know about deleting snapshots on running KVM guests.

The snapshots are not shown in the GUI, but "qemu-img info" shows them. One of the disks is 2 TB, but its qcow2 file shows as 3 TB and is using too much space.

Can the snapshot be removed with "qemu-img snapshot -d" while the KVM guest is running?

Do I need to suspend the guest for that task?

Is it possible?

If it is possible, what should I not do, and what are the right steps to do it?

In the future, please open a new thread for not fully related questions instead of bumping a months-old one.

In general, using qemu-img on a disk image while the VM is running is highly unsafe. Instead, use the QMP or monitor interface of the running QEMU instance. A cursory look shows me that "snapshot_delete_blkdev_internal" exists as a command on the monitor interface, i.e. the Monitor tab in PVE for your VM.
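
For readers who want to see what such a request looks like, here is a minimal sketch in Python that only builds the messages and does not send anything to a VM. It assumes the HMP syntax "snapshot_delete_blkdev_internal device name" and the QMP command "blockdev-snapshot-delete-internal-sync". The device name "drive-scsi0" and the snapshot name "pre-prod" are hypothetical placeholders; check the real device name with "info block" in the monitor first. The function names are made up for illustration.

```python
import json


def hmp_delete_internal_snapshot(device: str, snapshot_name: str) -> dict:
    """Wrap the HMP command in QMP's human-monitor-command envelope."""
    command_line = f"snapshot_delete_blkdev_internal {device} {snapshot_name}"
    return {
        "execute": "human-monitor-command",
        "arguments": {"command-line": command_line},
    }


def qmp_delete_internal_snapshot(device: str, snapshot_name: str) -> dict:
    """Build the native QMP request for deleting an internal snapshot."""
    return {
        "execute": "blockdev-snapshot-delete-internal-sync",
        "arguments": {"device": device, "name": snapshot_name},
    }


if __name__ == "__main__":
    # "drive-scsi0" and "pre-prod" are assumed names, not taken from this thread.
    print(json.dumps(hmp_delete_internal_snapshot("drive-scsi0", "pre-prod")))
    print(json.dumps(qmp_delete_internal_snapshot("drive-scsi0", "pre-prod")))
```

In the Monitor tab only the bare HMP line is typed, i.e. the value of "command-line" above. Test it on a non-production VM before relying on it.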

rojoblandino · May 3, 2021

Stefan_R said:
In the future, please open a new thread for not fully related questions instead of bumping a months-old one.

In general, using qemu-img on a disk image while the VM is running is highly unsafe. Instead, use the QMP or monitor interface of the running QEMU instance. A cursory look shows me that "snapshot_delete_blkdev_internal" exists as a command on the monitor interface, i.e. the Monitor tab in PVE for your VM.

The author of the thread asked two things, of which only one was answered. The second, about doing it live, was not, so I asked about it.

I have now received that answer, for which I am enormously grateful. Thank you. I assume it can be run safely on a running VM.

rojoblandino · May 5, 2021

Stefan_R said:
In the future, please open a new thread for not fully related questions instead of bumping a months-old one.

In general, using qemu-img on a disk image while the VM is running is highly unsafe. Instead, use the QMP or monitor interface of the running QEMU instance. A cursory look shows me that "snapshot_delete_blkdev_internal" exists as a command on the monitor interface, i.e. the Monitor tab in PVE for your VM.

I have run the command, but I got the following message:

VM 100 qmp command 'human-monitor-command' failed - unable to connect to VM 100 qmp socket

Now I do not see the snapshot, but the image info shows the following data:

Code:

image: /var/lib/vz/images/100/vm-100-disk-1.qcow2
file format: qcow2
virtual size: 1 TiB (1099511627776 bytes)
disk size: 1.43 TiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false

Stefan_R · May 5, 2021

rojoblandino said:
VM 100 qmp command 'human-monitor-command' failed - unable to connect to VM 100 qmp socket

Is the VM you ran this on running? Anything in the logs (journalctl)? The HMP should always work on running VMs...

The image data doesn't really help in this scenario, I believe, especially if you ran qemu-img on the disk while the VM was running, as that can lead to wrong results anyway. Either shut down the VM first, or use the HMP or QMP interface as described.
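
Once the VM is shut down, the check can be made less ambiguous by reading the machine-readable form of the image info. This is a minimal sketch, assuming the JSON printed by "qemu-img info --output=json" with its "virtual-size", "actual-size" and optional "snapshots" fields. The function name and the sample values are illustrative only; the sample is loosely modelled on the sizes posted above and is not real output from that image.

```python
import json


def summarize_image(info_json: str) -> dict:
    """Summarize snapshot names and sizes from qemu-img's JSON image info."""
    info = json.loads(info_json)
    # The "snapshots" key is absent when the image has no internal snapshots.
    snapshots = [snap.get("name", "") for snap in info.get("snapshots", [])]
    virtual_bytes = info["virtual-size"]
    actual_bytes = info.get("actual-size")
    return {
        "snapshots": snapshots,
        "virtual_bytes": virtual_bytes,
        "actual_bytes": actual_bytes,
        # True when the file occupies more space than the guest-visible disk.
        "exceeds_virtual": actual_bytes is not None and actual_bytes > virtual_bytes,
    }


if __name__ == "__main__":
    # Illustrative sample: 1 TiB virtual size, roughly 1.43 TiB on disk.
    sample = json.dumps({
        "format": "qcow2",
        "virtual-size": 1099511627776,
        "actual-size": 1572301627720,
    })
    print(summarize_image(sample))
```

An empty snapshot list together with "exceeds_virtual" being true would match what the posted output suggests: no snapshot is listed, yet the file is still larger than the virtual disk.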
