Good morning,
We have a database cluster consisting of four hosts, all running Proxmox VE 6.2-11. The hosts are paired: master-1 with slave-1, and master-2 with slave-2. Each host runs Debian VMs on qcow2 disk images, running PostgreSQL 12, with the masters replicating to the slaves. The large ones are around 4 TB in size. There is no Proxmox HA involved, as the replication is done via PostgreSQL.
A snapshot was taken on a few of the VMs while they were being built and was not deleted before they went into production. When we realized this oversight, we tested deleting the snapshot on one of the slaves through the web GUI. All was well for a few minutes, but then the VM lost its Ethernet connection and the alerts started rolling in. After a few more minutes the GUI reported a timeout error. Eventually, after more than 15 minutes, the machine came back online. Testing showed it was "OK" and the database had synced with the master. We unlocked the VM by issuing "qm unlock VMID", cleaned up the config file by deleting the snapshot entries, and confirmed the snapshot no longer existed in the qcow2 via "qemu-img snapshot -l <vm>".
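For anyone who wants to check the same things on their own system, the cleanup and verification described above can be sketched roughly as below. The VMID (100) and the image path are placeholders, not our real values, and the path assumes the default local directory storage layout. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Placeholders (assumptions, not our real values): VMID and the image path.
VMID="${VMID:-100}"
IMAGE="${IMAGE:-/var/lib/vz/images/$VMID/vm-$VMID-disk-0.qcow2}"

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

check_snapshot_cleanup() {
    run qm unlock "$VMID"              # clear the lock left behind by the timed-out task
    run qm listsnapshot "$VMID"        # snapshots still listed in the VM config
    run qemu-img snapshot -l "$IMAGE"  # internal snapshots still stored in the qcow2 file
}

check_snapshot_cleanup
```

If the VM is running, qemu-img may refuse to open the image because of image locking; in that case adding -U (force share) to the qemu-img call may be needed for a read-only listing.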
This leads to where we are now: we want to delete the snapshots on the other VMs, but the masters are in production and cannot come down. Has this loss of Ethernet connectivity happened to anyone else? Is it a result of the multi-TB size of the image? Is there a way to prevent it from happening while the snapshot is being deleted? Should we just live with it and not delete the snapshots? We were considering scheduling a maintenance window, powering down the VM, and deleting the snapshot. Would this be any better than a "live" deletion?
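To make the maintenance-window option concrete, this is roughly the sequence we have in mind. It is only a sketch: the VMID (100), the snapshot name (pre-production), and the 600-second shutdown timeout are placeholders, and nothing here has been tested on our cluster. It prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Placeholders (assumptions): VMID, snapshot name, and shutdown timeout.
VMID="${VMID:-100}"
SNAPNAME="${SNAPNAME:-pre-production}"

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

offline_delete_snapshot() {
    # Each step runs only if the previous one succeeded, so the VM is
    # not restarted if the snapshot deletion fails.
    run qm shutdown "$VMID" --timeout 600 &&
    run qm delsnapshot "$VMID" "$SNAPNAME" &&
    run qm start "$VMID"
}

offline_delete_snapshot
```

The idea is that with the VM shut down, the guest cannot stall or drop off the network while the qcow2 image is being rewritten, at the cost of planned downtime.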
Any and all help and comments are sincerely appreciated.
Thanks!