Bug Report: Rolling back a snapshot will not remove existing cloud init drives

layer7.net

Member
Oct 5, 2021
Hi,

Because I don't know where to report this, I hope it's OK to do it here:

With pve-manager/7.2-4/ca9d43cc on a KVM machine, if you:

- create a snapshot
- add a cloud-init drive on storage A
- restore the snapshot

then the cloud-init drive will be removed, but not the cloud-init file created on storage A.

The result is that you cannot add a cloud-init drive to this VM on the same storage again: the code refuses to create it because the existing cloud-init file on storage A is blocking it.

This does not seem to be the desired behaviour.

Greetings
Oliver
 
@layer7.net I suggest you update the bug report with the exact commands and the full output needed to reproduce the issue. I am not seeing the same results as you reported:

Code:
root@pve7demo1:~# qm config 2000
boot: c
bootdisk: scsi0
ipconfig0: ip=dhcp
memory: 512
meta: creation-qemu=6.1.1,ctime=1645719830
name: vm2000
net0: e1000=66:A3:97:EE:DE:D1,bridge=vmbr0,firewall=1
onboot: 0
scsi0: blockbridge:vm-2000-disk-0,size=112M
scsihw: virtio-scsi-pci
serial0: socket
smbios1: uuid=0feb9719-55fb-4c62-9313-d85e9e966b11
sockets: 1
vga: qxl
vmgenid: 8db540a2-d0d9-406f-ad24-df87064723c5

root@pve7demo1:~# qm snapshot 2000 snap1
snapshotting 'drive-scsi0' (blockbridge:vm-2000-disk-0)

root@pve7demo1:~# qm set 2000  --citype  nocloud
update VM 2000: -citype nocloud

root@pve7demo1:~# qm set 2000 --ide2 blockbridge:vm-2000-cloudinit,media=cdrom
update VM 2000: -ide2 blockbridge1:vm-2000-cloudinit,media=cdrom
generating cloud-init ISO

root@pve7demo1:~# qm list
      VMID NAME                 STATUS     MEM(MB)    BOOTDISK(GB) PID      
       999 VM 999               stopped    512                0.00 0        
      2000 vm2000               running    512                0.11 1345663  
     
root@pve7demo1:~# qm stop 2000
root@pve7demo1:~# qm rollback 2000 snap1
root@pve7demo1:~# qm start 2000
generating cloud-init ISO
root@pve7demo1:~#



Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi,

we usually don't use the CLI, as our Proxmox automation goes through the API.

But it seems a reproducible walkthrough makes sense:

1. Create a VM through the GUI (all defaults, just click Next; select "Do not use any media" in the OS tab)

then things might look like this:

Code:
# qm config 120

boot: order=scsi0;ide2;net0
cores: 1
ide2: none,media=cdrom
memory: 2048
meta: creation-qemu=6.2.0,ctime=1657640276
name: test
net0: virtio=42:42:20:23:00:0C,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
scsi0: ceph-hdd:vm-120-disk-0,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=1208f991-992c-403c-a80b-974a3826861c
sockets: 1
vmgenid: 5a72d82b-4dd6-459f-922c-a2c5518ea1d8


2. Take a snapshot (click Take Snapshot, give it the name "test")
3. Click Hardware -> Add -> cloud-init drive (place it on any storage)

At this point, the config looks like this:

Code:
# qm config 120
boot: order=scsi0;ide2;net0
cores: 1
ide0: ceph-hdd:vm-120-cloudinit,media=cdrom
ide2: none,media=cdrom
memory: 2048
meta: creation-qemu=6.2.0,ctime=1657640276
name: test
net0: virtio=42:42:20:23:00:0C,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
parent: test
scsi0: ceph-hdd:vm-120-disk-0,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=1208f991-992c-403c-a80b-974a3826861c
sockets: 1
vmgenid: 5a72d82b-4dd6-459f-922c-a2c5518ea1d8

while

vm-120-cloudinit
vm-120-disk-0

exist on the storage. So far, so good.

Now

4. roll back the snapshot

which results in the config looking like this:

Code:
boot: order=scsi0;ide2;net0
cores: 1
ide2: none,media=cdrom
memory: 2048
meta: creation-qemu=6.2.0,ctime=1657640276
name: test
net0: virtio=42:42:20:23:00:0C,bridge=vmbr0,firewall=1
numa: 0
ostype: l26
parent: test
scsi0: ceph-hdd:vm-120-disk-0,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=1208f991-992c-403c-a80b-974a3826861c
sockets: 1
vmgenid: 56e4b48c-5f3d-414e-8c21-e05a3f58d6ee

As we can see, the cloud-init drive is gone, just as it should be.

But on the storage, we will still have:

vm-120-cloudinit
vm-120-disk-0

So the code did not tidy up the vm-120-cloudinit image.
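For anyone following along, the leftover volume can be checked with something like this (the storage ID ceph-hdd is taken from the config above; the actual Ceph pool name behind it may differ):

Code:
# Volumes PVE still tracks for VM 120 on that storage:
pvesm list ceph-hdd --vmid 120
# Or directly on the Ceph side (pool name may differ from the PVE storage ID):
rbd ls ceph-hdd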

And even worse:

5. Click Hardware -> Add -> cloud-init drive (place it on the _same_ storage as before)

will result in:

Code:
rbd create 'vm-120-cloudinit' error: rbd: create error: (17) File exists (500)

The same can be reproduced on NFS-backed storage, and I guess on all other storage types as well.

There is no start of the VM involved. It's simply create -> take snapshot -> add cloud-init drive -> roll back snapshot, which does not tidy up the cloud-init file on the storage. And the code refuses to overwrite existing ones.
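For reference, scripting the same flow via the API should look roughly like this (node name pve1 is a placeholder; this is only a sketch of the equivalent calls, not a verified reproduction):

Code:
# Take the snapshot, add the cloud-init drive, then roll back:
pvesh create /nodes/pve1/qemu/120/snapshot --snapname test
pvesh set /nodes/pve1/qemu/120/config --ide0 ceph-hdd:cloudinit,media=cdrom
pvesh create /nodes/pve1/qemu/120/snapshot/test/rollback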

And that's the bug. If this cannot be reproduced via the CLI, then it's even worse, because it means the backend handles the same function differently depending on how it is called.

Greetings
Oliver
 
You should add this to the bug report. This flow I can reproduce - the fact that you are using the GUI is critical. I agree the result is not intuitive or expected.

However, a resolution must be very targeted - deleting a disk that is not part of the snapshot on rollback can lead to a lot of grief, i.e. it could be a user disk. I've tried it with a regular second disk, and it becomes "unused", not removed. But cloud-init is a special case, so you don't see it in the GUI.

Personally, I would not call it "even worse" - the GUI always obfuscates multiple operations and hides the complexity. Here, it likely executes an "Allocate" first. Whereas in the CLI I could just run "qm set 100 --ide2 blockbridge:vm-100-cloudinit,media=cdrom" after doing the rollback in the GUI, and it happily accepted the change. You can achieve the same via the API; I am not sure why your API automation tries to allocate the cloud-init disk.
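If that guess is right, the difference between the two paths would be roughly the following (names taken from my transcript above; I have not traced the actual GUI requests, so treat this only as a sketch):

Code:
# Guess at the GUI path: allocate the cloud-init volume explicitly first,
# which is the step that fails with "File exists" if the old volume is left over:
pvesh create /nodes/pve7demo1/storage/blockbridge/content \
  --vmid 2000 --filename vm-2000-cloudinit --size 4M --format raw

# CLI path: simply reference the existing volume, no new allocation:
qm set 2000 --ide2 blockbridge:vm-2000-cloudinit,media=cdrom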

A short-term workaround for you is to manually remove the leftover cloud-init disk, or to use the CLI to re-set it.
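Concretely, with the names from the walkthrough above (VM 120, storage ceph-hdd; adjust to your setup), that would be something like:

Code:
# Either free the leftover cloud-init volume so the GUI can allocate it again...
pvesm free ceph-hdd:vm-120-cloudinit

# ...or re-attach the existing volume via the CLI instead of the GUI:
qm set 120 --ide0 ceph-hdd:vm-120-cloudinit,media=cdrom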


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox
 
Hi,

Yes, I already put the link to this forum thread into the bug report.

And we didn't reproduce this via the API; we just saw it during some testing with the GUI.

Thank you for your countercheck! And yes, I agree: automatically deleting regular disks during snapshot rollback is a bad idea, while cloud-init disks are special, as they are usually safe to destroy/overwrite.

From my perspective, the GUI should in general do what the CLI does. It's really bad if different paths (CLI/GUI/API) for identical actions lead to different outcomes. They should be in sync, as no one naturally expects the same function executed via the CLI, GUI, or API to be handled in different ways.

Greetings
Oliver