QCOW2 Snapshot Deletion Causes VM To Temporarily Lose Ethernet Connection

JBdrakaris

New Member
Nov 7, 2019
Good morning,

We have a database cluster consisting of four hosts, all running Proxmox VE 6.2-11. The hosts are paired: master-1 with slave-1, and master-2 with slave-2. They all run Debian VMs on qcow2 disk images, running PostgreSQL 12, with the masters replicating to the slaves. The large ones are around 4 TB in size. There is no Proxmox HA involved, as the replication is done via PostgreSQL.

A snapshot was taken on a few of the VMs while they were being built and was not deleted before going into production. When we realized this oversight, we tested deleting the snapshot on one of the slaves through the web GUI. All was well for a few minutes, but then the VM lost its Ethernet connection and the alerts started rolling in. After a few more minutes the GUI issued a timeout error. Eventually, after more than 15 minutes, the machine came back online. Testing showed it was "OK" and the database synced with the master. We unlocked the VM by issuing "qm unlock VMID", cleaned up the config file by deleting the snapshot entries, and confirmed the snapshot no longer existed in the qcow2 image via "qemu-img snapshot -l <image>".
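For readers who want to repeat the final check described above, here is a minimal shell sketch. It compares the snapshots recorded in the VM config with the internal snapshots stored in the qcow2 file. The function name, VM ID and image path are hypothetical, and it assumes the standard `qm` and `qemu-img` tools are on the host.

```shell
#!/bin/sh
# Sketch only: the function name, VM ID and image path are hypothetical.

# Show snapshots as Proxmox VE records them (VM config) and as the qcow2
# file records them (internal snapshots), so the two can be compared.
list_snapshots() {
    vmid="$1"
    image="$2"
    echo "== snapshots in the config of VM $vmid =="
    qm listsnapshot "$vmid"
    echo "== internal snapshots in $image =="
    # -U (force share) lets qemu-img open an image that a running VM holds
    # locked; results read this way may be stale while the VM is running.
    qemu-img snapshot -U -l "$image"
}

# Example call (hypothetical values):
# list_snapshots 100 /var/lib/vz/images/100/vm-100-disk-1.qcow2
```

If a failed task left the VM locked, `qm unlock <vmid>` clears the lock before the config is edited.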

This leads to where we are now: we want to delete the snapshots on the other VMs, but the masters are in production and cannot come down. Has this loss of Ethernet connection happened to anyone else? Is it a result of the multi-TB size of the image? Is there a way to prevent it from happening while deleting the snapshot? Should we just live with it and not delete the snapshots? We were considering scheduling a maintenance window, powering down the VM, and deleting the snapshot. Would this be any better than a "live" deletion?

Any and all help and comments sincerely appreciated.

Thanks!
 

Stefan_R

Proxmox Staff Member
Jun 4, 2019
Vienna
JBdrakaris said:
We unlocked the VM by issuing "qm unlock VMID", cleaned up the config file by deleting the snapshot entries, and confirmed the snapshot no longer existed in the qcow2 image via "qemu-img snapshot -l <image>".

I assume that means that the 'snapshot delete' task failed? Otherwise the lock should be removed automatically and the config updated...

You might have encountered a bug in QEMU then. Can you reproduce the behaviour (i.e. take another snapshot, then delete it)? Also, was the VM responsive during that time with just the network down, or did it stop completely? (Was it accessible via VNC?)

Also, check any and all log files, both from the host and the guest to see if anything useful is to be found.
 

rojoblandino

New Member
Sep 11, 2019
I have many VMs with many drives that have snapshots on them, and we need to remove the snapshots. I am in the same position as JBdrakaris: I need to delete the snapshots while the VM is running.

I read the thread, but only the question about the lock was answered. I would like to know about deleting snapshots on running KVM guests.

The snapshots are not shown in the GUI, but "qemu-img info" shows them. One of the disks is 2T, but the qcow2 file shows as 3T and is using too much space.

Can the snapshot be removed with "qemu-img snapshot -d" while the VM is running?

Do I need to suspend the VM for that task?

Is it possible?

If it is possible, what should I avoid, and what are the right steps to do it?
 

Stefan_R

Proxmox Staff Member
In the future, please open a new thread for questions that are not closely related instead of bumping a months-old one.

In general, using qemu-img on a disk image while the VM is running is highly unsafe. Instead, use the QMP or monitor interface of the running QEMU instance. A cursory look shows that "snapshot_delete_blkdev_internal" exists as a command on the monitor interface, i.e. the Monitor tab in PVE for your VM.
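As a rough illustration of the monitor route, the sketch below prints the two HMP commands one would type in the Monitor tab (or at the prompt opened by `qm monitor <vmid>`). The HMP syntax is `snapshot_delete_blkdev_internal device name`. The function name, the device name `drive-scsi0` and the snapshot name are assumptions for the example, so take the real device name from the output of `info block`.

```shell
#!/bin/sh
# Sketch only: the function name, device name and snapshot name are
# placeholders. Look up the real device name with "info block".

# Print the HMP commands to type at the monitor prompt (opened with
# "qm monitor <vmid>") or in the Monitor tab of the VM.
hmp_delete_snapshot_cmds() {
    device="$1"   # block device as named by "info block"
    snapshot="$2" # name of the internal snapshot stored in the qcow2 image
    printf 'info block\n'
    printf 'snapshot_delete_blkdev_internal %s %s\n' "$device" "$snapshot"
}

# Example (hypothetical names):
hmp_delete_snapshot_cmds drive-scsi0 pre-production
```

The command acts on one block device at a time, so a VM with several disks needs one call per device that carries the snapshot.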
 

rojoblandino

New Member
The author of the thread asked two questions, of which only one was answered. The second, about doing it live, was not answered, so I asked about it.

I have now received the answer, for which I am enormously grateful. Thank you. I suppose it can be run safely on a running VM.
 

rojoblandino

New Member
I ran the command, but I got the following message:

VM 100 qmp command 'human-monitor-command' failed - unable to connect to VM 100 qmp socket

Now I no longer see the snapshot, but the image info shows the following:

Code:
image: /var/lib/vz/images/100/vm-100-disk-1.qcow2
file format: qcow2
virtual size: 1 TiB (1099511627776 bytes)
disk size: 1.43 TiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false
 

Stefan_R

Proxmox Staff Member
rojoblandino said:
VM 100 qmp command 'human-monitor-command' failed - unable to connect to VM 100 qmp socket

Is the VM you ran this on actually running? Is there anything in the logs (journalctl)? The HMP should always work on running VMs...

I don't think the image info really helps in this scenario, especially if you ran qemu-img on the disk while the VM was running, as that can produce wrong results anyway. Either shut down the VM first, or use the HMP or QMP interface as described.
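The "shut down the VM first" option can be guarded with a small wrapper like the one below. This is only a sketch: the function name and arguments are hypothetical, and it assumes `qm status <vmid>` prints a line such as "status: stopped". It runs `qemu-img snapshot -d` only when the VM is reported as stopped and refuses otherwise.

```shell
#!/bin/sh
# Sketch only: the function name, VM ID, image path and snapshot name are
# hypothetical.

# Delete an internal qcow2 snapshot with qemu-img, but only if Proxmox VE
# reports the VM as stopped. For a running VM, use the monitor instead.
delete_snapshot_if_stopped() {
    vmid="$1"
    image="$2"
    snapshot="$3"
    status="$(qm status "$vmid")" || return 1
    case "$status" in
        *stopped*)
            qemu-img snapshot -d "$snapshot" "$image"
            ;;
        *)
            echo "VM $vmid is not stopped ($status); not touching $image" >&2
            return 1
            ;;
    esac
}

# Example call (hypothetical values):
# delete_snapshot_if_stopped 100 /var/lib/vz/images/100/vm-100-disk-1.qcow2 mysnap
```

For snapshots that Proxmox VE still lists in the VM config, `qm delsnapshot <vmid> <snapname>` is the usual route, since it also updates the config.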
 
