PVE 5.4 Ceph Cloud Init Issue

Donny Davis
Apr 11, 2019
I just ran the upgrade to 5.4 and now I am running into an issue with Ceph and cloud-init images. Every time I migrate or start a VM with a cloud-init image, it says the image already exists and the VM fails to start or migrate. If I go into rbd and remove the image, it works fine.
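For context, the manual cleanup looks roughly like this (the pool name 'rbd' and the image name are placeholders for your own setup):

# find the stale cloud-init image in the Ceph pool ('rbd' is a placeholder pool)
rbd -p rbd ls | grep cloudinit
# remove it so the VM can start/migrate again
rbd -p rbd rm vm-100-cloudinit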
 
Good that this is known :) as I just hit this during an upgrade from 5.3 to 5.4: after rebooting the node, the VMs moved off it and then failed to start because of this:

task started by HA resource agent
2019-04-15 08:53:40.779470 7f4755195100 -1 did not load config file, using default settings.
rbd: create error: (17) File exists
TASK ERROR: error with cfs lock 'storage-ceph-internal': rbd create vm-106-cloudinit' error: rbd: create error: (17) File exists
 
Is there any update on this bug? It was logged in Bugzilla 5 days ago, but there are no further updates or an estimated fix date.

This is a critical bug which renders Proxmox environments using cloud-init and Ceph fundamentally broken.

A bug fix (temporary or otherwise), or a rollback to a previously working version, should be provided ASAP.

Users in this configuration currently can't shut down and restart a VM without manual intervention to remove the cloud-init disk from RBD. High Availability (which is one reason many people use Ceph) is also broken.

If an HA node were to crash, its VMs would not migrate to a live node, nor would they restart on the original node.

Additionally, unless administrators have manually encountered this bug, they are likely unaware of its existence and unaware that their HA VMs are no longer highly available.
 
I spoke too soon. It works if HA is not turned on.
If you add a VM to HA, then shut down the VM and then issue a start, the error comes back.

It's almost as if this code exists somewhere else when the VM is in HA mode.

If you remove the VM from HA and then restart it, it's OK.
Any ideas?
 
I manually edited Cloudinit.pm as per the patch in https://pve.proxmox.com/pipermail/pve-devel/2019-April/036621.html, and live migration started to work fine.
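For anyone applying the same change by hand, a rough sketch (standard PVE paths; back up the file first):

# back up the module before editing it (backup location is arbitrary)
cp /usr/share/perl5/PVE/QemuServer/Cloudinit.pm /root/Cloudinit.pm.bak
# edit the file as per the pve-devel patch, then restart the daemons
# (including the HA agent) so the new code is loaded
systemctl restart pvedaemon pveproxy pve-ha-lrm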

But if the respective HA-managed VM is shut down, it is unable to boot up again. The following error showed:

task started by HA resource agent
rbd: create error: (17) File exists
TASK ERROR: error with cfs lock 'storage-mystorage': rbd create vm-300-cloudinit' error: rbd: create error: (17) File exists


Another issue: previously we didn't need to put a domain in the host "search domain" field, but now you have to, otherwise it prompts an error at the line between:

my $host_resolv_conf = PVE::INotify::read_file('resolvconf');
my $searchdomains = [
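A stopgap that avoids the uninitialized-value warning is to set an explicit search domain (and DNS server) on the VM's cloud-init config; a sketch, with the VM ID, domain, and server as placeholders:

# set an explicit search domain and DNS server for the cloud-init drive
qm set 300 --searchdomain example.com --nameserver 192.168.1.1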


Hope this can be fixed soon.
 
I can't reproduce the issue with HA here. What are the exact steps to do so? Can you provide some information about your setup?
 
Interesting. I have external Ceph, and both the cloud-init drive and the VM drive are on Ceph.

I did some more tests and still get the same issue. The VM is an Ubuntu one, but it happens for all of them.

It only happens if you do the HA start.
Live or offline migration works fine.

So if you create a VM, then add it to HA and issue a start, you get that error.

If you create a VM and start it first, then add it to HA, it starts OK.


The issue seems to be only with the HA start. It's almost as if it calls some other code path that either bypasses Cloudinit.pm or has its own, simpler code.
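For the record, a rough CLI equivalent of the failing sequence (add to HA, then start through HA); vm:106 is a placeholder resource ID:

# add the VM as an HA resource (vm:106 is a placeholder)
ha-manager add vm:106
# request a start through the HA stack; this is the step that fails
ha-manager set vm:106 --state started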
 
I used a hyper-converged cluster to test it, starting and stopping the VM repeatedly through HA. I still could not reproduce it. I even migrated it in between to see if that made a difference; none whatsoever.

What Ceph version are you running? Do you use krbd?
 
No, not krbd.
Remote Ceph using Mimic.

Let me try to run some debug code and see what is happening.
I'm away tomorrow, though, and it's late here.

As I said, the initial problem was fixed with the patch, as that error happened just with a normal start.
 
I can't reproduce the issue with HA here. What are the exact steps to do so? Can you provide some information about your setup?

Just an existing VM on the upgraded node (with the latest no-subscription repo), on an 8-node Ceph cluster with Luminous.

proxmox-ve: 5.4-1 (running kernel: 4.15.18-12-pve)
pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
pve-kernel-4.15: 5.3-3
pve-kernel-4.15.18-12-pve: 4.15.18-35
pve-kernel-4.15.18-11-pve: 4.15.18-34
pve-kernel-4.15.18-10-pve: 4.15.18-32
pve-kernel-4.15.18-9-pve: 4.15.18-30
pve-kernel-4.15.18-8-pve: 4.15.18-28
pve-kernel-4.15.18-7-pve: 4.15.18-27
pve-kernel-4.15.18-4-pve: 4.15.18-23
pve-kernel-4.15.18-2-pve: 4.15.18-21
pve-kernel-4.15.18-1-pve: 4.15.18-19
pve-kernel-4.15.17-3-pve: 4.15.17-14
pve-kernel-4.15.17-1-pve: 4.15.17-9
pve-kernel-4.13.13-6-pve: 4.13.13-42
pve-kernel-4.13.13-2-pve: 4.13.13-33
ceph: 12.2.11-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-8
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-50
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-13
libpve-storage-perl: 5.0-41
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-3
lxcfs: 3.0.3-pve1
novnc-pve: 1.0.0-3
openvswitch-switch: 2.7.0-3
proxmox-widget-toolkit: 1.0-25
pve-cluster: 5.0-36
pve-container: 2.0-37
pve-docs: 5.4-2
pve-edk2-firmware: 1.20190312-1
pve-firewall: 3.0-19
pve-firmware: 2.0-6
pve-ha-manager: 2.0-9
pve-i18n: 1.1-4
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-3
pve-xtermjs: 3.12.0-1
qemu-server: 5.0-50
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.13-pve1~bpo2

Steps to reproduce:
1) Shut down the existing HA-managed VM using the GUI.
2) Start the VM again using the GUI.

The VM with HA is unable to boot, and the following log is output via the task viewer:

Use of uninitialized value in split at /usr/share/perl5/PVE/QemuServer/Cloudinit.pm line 94.
rbd: create error: (17) File exists
TASK ERROR: error with cfs lock 'storage-2': rbd create vm-300-cloudinit' error: rbd: create error: (17) File exists


The error at line 94 can be resolved by adding a DNS search domain value, but the VM is still unable to boot due to error (17).

Tested:
Removed HA: still unable to boot; similar log output showed.
Removed the cloud-init drive, or put the cloud-init drive on NFS/local disk: reboot OK.

So I believe this only happens when the cloud-init drive is on Ceph.
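For reference, the drive-relocation test can be done from the CLI roughly like this (ide2 and the storage IDs are examples, not necessarily the ones used here):

# remove the Ceph-backed cloud-init drive (ide2 is an example slot)
qm set 300 --delete ide2
# recreate the cloud-init drive on local storage instead
qm set 300 --ide2 local-lvm:cloudinit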

tailf /var/log/daemon.log

Apr 24 16:45:01 node1.local.host systemd[1]: Started Proxmox VE replication runner.
Apr 24 16:45:37 node1.local.host pvedaemon[1890943]: start VM 300: UPID:node1.local.host:001CDA7F:010F622F:5CC02231:qmstart:300:root@pam:
Apr 24 16:45:37 node1.local.host pvedaemon[1890943]: Use of uninitialized value in split at /usr/share/perl5/PVE/QemuServer/Cloudinit.pm line 94.
Apr 24 16:45:37 node1.local.host pvedaemon[1890943]: error with cfs lock 'storage-2': rbd create vm-300-cloudinit' error: rbd: create error: (17) File exists
Apr 24 16:46:00 node1.local.host systemd[1]: Starting Proxmox VE replication runner...
Apr 24 16:46:01 node1.local.host systemd[1]: Started Proxmox VE replication runner.
 
Interesting. I did an update of the nodes from the pve-no-subscription repo, and now it works as expected with the patch.
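For anyone following along, pulling that update amounts to roughly:

# fetch and install the current packages from the configured repo
apt update && apt dist-upgrade
# verify the installed package versions afterwards
pveversion -v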
 
Yeah, I am still getting the same error. My machines are Pacemaker-managed. Additionally, I updated and rebooted every node in the cluster.
 
