PVE 5.4 Ceph Cloud Init Issue

Discussion in 'Proxmox VE: Installation and configuration' started by Donny Davis, Apr 12, 2019.

  1. Donny Davis

    Donny Davis New Member

    Joined:
    Apr 11, 2019
    Messages:
    4
    Likes Received:
    0
    I just ran the upgrade to 5.4 and now I am running into an issue with Ceph and the cloud-init images. Every time I migrate or want to start a VM with a cloud-init image, it says the image already exists and fails to start or migrate. If I go into rbd and remove the image, it seems to work fine.
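
    Until a fix is available, a minimal sketch of the manual workaround described above, assuming the Ceph pool behind the storage is named 'rbd' and the affected VM has ID 106 (both are examples, adjust to your setup):

    # list leftover cloud-init disks in the pool ('rbd' is an assumed pool name)
    rbd -p rbd ls | grep cloudinit
    # remove the stale image for the affected VM (VMID 106 is an example)
    rbd -p rbd rm vm-106-cloudinit
    # start the VM again
    qm start 106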
     
  2. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    Joined:
    Oct 1, 2014
    Messages:
    4,683
    Likes Received:
    309
    HE_Cole, shantanu and Donny Davis like this.
  3. Donny Davis

    Donny Davis New Member

    Joined:
    Apr 11, 2019
    Messages:
    4
    Likes Received:
    0
  4. Craig St George

    Joined:
    Jul 31, 2018
    Messages:
    73
    Likes Received:
    7
    Good that this is known :) as I just hit this during an upgrade from 5.3 to 5.4. I rebooted the node, the VMs were moved off it and then failed to start because:

    task started by HA resource agent
    2019-04-15 08:53:40.779470 7f4755195100 -1 did not load config file, using default settings.
    rbd: create error: (17) File exists
    TASK ERROR: error with cfs lock 'storage-ceph-internal': rbd create vm-106-cloudinit' error: rbd: create error: (17) File exists
     
  5. Alan Urquhart

    Alan Urquhart New Member

    Joined:
    Apr 17, 2019
    Messages:
    2
    Likes Received:
    0
    Is there any update on this bug? It was logged in Bugzilla 5 days ago, but there are no further updates or an estimated fix date.

    This is a critical bug which renders Proxmox environments using cloud-init and Ceph fundamentally broken.

    A bug fix (whether temporary or otherwise), or a rollback to a previously working version, should be provided ASAP.

    Users in this configuration currently can't shut down and restart a VM without manual intervention to remove the cloud-init disk from RBD. High Availability (which is one reason many people use Ceph) is also broken.

    If an HA node were to crash, VMs would not migrate to a live node, nor would they restart on the original node.

    Additionally, unless administrators have already encountered this bug, they are likely unaware of it and of the fact that their HA VMs are no longer highly available.
     
  6. tom

    tom Proxmox Staff Member
    Staff Member

    Joined:
    Aug 29, 2006
    Messages:
    13,460
    Likes Received:
    393
    HE_Cole likes this.
  7. Alan Urquhart

    Alan Urquhart New Member

    Joined:
    Apr 17, 2019
    Messages:
    2
    Likes Received:
    0
    @tom - thanks. I can confirm the patch works.
     
  8. Craig St George

    Joined:
    Jul 31, 2018
    Messages:
    73
    Likes Received:
    7
    Yes, it also works for me.
     
  9. Craig St George

    Joined:
    Jul 31, 2018
    Messages:
    73
    Likes Received:
    7
    I spoke too soon. It works if HA is not turned on.
    If you add a VM to HA, then shut down the VM and then issue a start, the error comes back.

    It's almost like this code exists somewhere else if the VM is in HA mode.

    If you remove the VM from HA and then restart it, it's OK.
    Any ideas?
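
    For anyone needing that workaround from the command line, a minimal sketch, assuming the affected VM has ID 106 (an example) and is registered as HA resource vm:106:

    # temporarily remove the VM from HA management
    ha-manager remove vm:106
    # start it the normal way, which works with the patch applied
    qm start 106
    # re-add it to HA once it is running
    ha-manager add vm:106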
     
  10. aychprox

    aychprox Member

    Joined:
    Oct 27, 2015
    Messages:
    61
    Likes Received:
    4
    I manually edited Cloudinit.pm as per the patch in https://pve.proxmox.com/pipermail/pve-devel/2019-April/036621.html. Live migration started to work fine.

    But if the respective HA-managed VM is shut down, it is unable to boot up again. The following error is shown:

    task started by HA resource agent
    rbd: create error: (17) File exists
    TASK ERROR: error with cfs lock 'storage-mystorage': rbd create vm-300-cloudinit' error: rbd: create error: (17) File exists


    Another issue is that previously we didn't need to put a domain in the host's "search domain" field, but now you have to, otherwise it throws an error at the line between:

    my $host_resolv_conf = PVE::INotify::read_file('resolvconf');
    my $searchdomains = [


    Hope this can be fixed soon.
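
    As a stop-gap for the search-domain error, besides filling in the host's "search domain" field as described above, setting the values per VM should also avoid the uninitialized value at line 94 - the per-VM variant is an assumption based on the snippet above; VMID 300, the domain and the nameserver are examples:

    # store a DNS search domain (and nameserver) in the VM's cloud-init settings
    qm set 300 --searchdomain example.local --nameserver 192.168.1.1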
     
  11. mira

    mira Proxmox Staff Member
    Staff Member

    Joined:
    Feb 25, 2019
    Messages:
    137
    Likes Received:
    10
    I can't reproduce the issue with HA here. What are the exact steps to do so? Can you provide some information about your setup?
     
  12. Craig St George

    Joined:
    Jul 31, 2018
    Messages:
    73
    Likes Received:
    7
    Interesting. I have external Ceph, and both the cloud-init disk and the VM drive are on Ceph.

    I did some more tests and still see the same issue. The VM is an Ubuntu one, but it happens for all of them.

    It only happens if you do the HA start.
    Live or offline migration works fine.

    So if you create a VM, then add it to HA and issue a start, you get that error.

    If you create a VM and start it, then add it to HA, it starts OK.


    The issue seems to be only with the HA start. It's almost as if it calls some other code that either bypasses that Cloudinit.pm or has its own, simpler code.
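
    A quick way to compare the two start paths from the CLI, assuming the test VM has ID 106 (an example) and is already added as HA resource vm:106:

    # plain start - runs the patched Cloudinit.pm code path and works
    qm start 106
    # HA start - the actual start is executed by the HA resource agent
    # (pve-ha-lrm), which is where the error still shows up
    ha-manager set vm:106 --state started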
     
  13. mira

    mira Proxmox Staff Member
    Staff Member

    Joined:
    Feb 25, 2019
    Messages:
    137
    Likes Received:
    10
    I used a hyper-converged cluster to test it, and I used HA to start and stop the VM repeatedly. Still could not reproduce it. I even migrated it in between to see if that made a difference; none whatsoever.

    What Ceph version are you running? Do you use krbd?
     
  14. Craig St George

    Joined:
    Jul 31, 2018
    Messages:
    73
    Likes Received:
    7
    No, not krbd.
    Remote Ceph running Mimic.

    Let me try to run some debug code and see if I can work out what is happening.
    I'm away tomorrow though, and it's late here.

    As I said, the initial problem was fixed with the patch, as that error happened with just a normal start.
     
  15. mgiammarco

    mgiammarco Member

    Joined:
    Feb 18, 2010
    Messages:
    73
    Likes Received:
    0
    I have tried the patch and nothing changed. Do I need to restart some services?
     
  16. aychprox

    aychprox Member

    Joined:
    Oct 27, 2015
    Messages:
    61
    Likes Received:
    4
    Just an existing VM on the upgraded node (with the latest no-subscription repo), on an 8-node Ceph cluster with Luminous.

    proxmox-ve: 5.4-1 (running kernel: 4.15.18-12-pve)
    pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
    pve-kernel-4.15: 5.3-3
    pve-kernel-4.15.18-12-pve: 4.15.18-35
    pve-kernel-4.15.18-11-pve: 4.15.18-34
    pve-kernel-4.15.18-10-pve: 4.15.18-32
    pve-kernel-4.15.18-9-pve: 4.15.18-30
    pve-kernel-4.15.18-8-pve: 4.15.18-28
    pve-kernel-4.15.18-7-pve: 4.15.18-27
    pve-kernel-4.15.18-4-pve: 4.15.18-23
    pve-kernel-4.15.18-2-pve: 4.15.18-21
    pve-kernel-4.15.18-1-pve: 4.15.18-19
    pve-kernel-4.15.17-3-pve: 4.15.17-14
    pve-kernel-4.15.17-1-pve: 4.15.17-9
    pve-kernel-4.13.13-6-pve: 4.13.13-42
    pve-kernel-4.13.13-2-pve: 4.13.13-33
    ceph: 12.2.11-pve1
    corosync: 2.4.4-pve1
    criu: 2.11.1-1~bpo90
    glusterfs-client: 3.8.8-1
    ksm-control-daemon: 1.2-2
    libjs-extjs: 6.0.1-2
    libpve-access-control: 5.1-8
    libpve-apiclient-perl: 2.0-5
    libpve-common-perl: 5.0-50
    libpve-guest-common-perl: 2.0-20
    libpve-http-server-perl: 2.0-13
    libpve-storage-perl: 5.0-41
    libqb0: 1.0.3-1~bpo9
    lvm2: 2.02.168-pve6
    lxc-pve: 3.1.0-3
    lxcfs: 3.0.3-pve1
    novnc-pve: 1.0.0-3
    openvswitch-switch: 2.7.0-3
    proxmox-widget-toolkit: 1.0-25
    pve-cluster: 5.0-36
    pve-container: 2.0-37
    pve-docs: 5.4-2
    pve-edk2-firmware: 1.20190312-1
    pve-firewall: 3.0-19
    pve-firmware: 2.0-6
    pve-ha-manager: 2.0-9
    pve-i18n: 1.1-4
    pve-libspice-server1: 0.14.1-2
    pve-qemu-kvm: 2.12.1-3
    pve-xtermjs: 3.12.0-1
    qemu-server: 5.0-50
    smartmontools: 6.5+svn4324-1
    spiceterm: 3.0-5
    vncterm: 1.5-3
    zfsutils-linux: 0.7.13-pve1~bpo2

    Steps to reproduce:
    1) Shut down the existing HA-managed VM using the GUI.
    2) Start the VM with HA using the GUI.

    The HA-managed VM is unable to boot, and the following log output appears in the task viewer:

    Use of uninitialized value in split at /usr/share/perl5/PVE/QemuServer/Cloudinit.pm line 94.
    rbd: create error: (17) File exists
    TASK ERROR: error with cfs lock 'storage-2': rbd create vm-300-cloudinit' error: rbd: create error: (17) File exists


    The error at line 94 can be resolved by adding a DNS search domain value, but the VM is still unable to boot due to error (17).

    Tested:
    Remove HA - still unable to boot, similar log output shown.
    Remove the cloud-init drive, or put the cloud-init drive on NFS/local disk, reboot - OK.

    So I believe this only happens when the cloud-init drive is on Ceph.

    tailf /var/log/daemon.log

    Apr 24 16:45:01 node1.local.host systemd[1]: Started Proxmox VE replication runner.
    Apr 24 16:45:37 node1.local.host pvedaemon[1890943]: start VM 300: UPID:node1.local.host:001CDA7F:010F622F:5CC02231:qmstart:300:root@pam:
    Apr 24 16:45:37 rk1-4u2-106C6a pvedaemon[1890943]: Use of uninitialized value in split at /usr/share/perl5/PVE/QemuServer/Cloudinit.pm line 94.
    Apr 24 16:45:37 node1.local.host pvedaemon[1890943]: error with cfs lock 'storage-2': rbd create vm-300-cloudinit' error: rbd: create error: (17) File exists
    Apr 24 16:46:00 node1.local.host systemd[1]: Starting Proxmox VE replication runner...
    Apr 24 16:46:01 node1.local.host systemd[1]: Started Proxmox VE replication runner.
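
    To confirm it is the leftover image that blocks the start, you can check the pool right after the shutdown - a sketch assuming the pool behind 'storage-2' is called 'mypool' (adjust to your storage.cfg):

    # if the cloudinit disk is still listed after shutdown, the next
    # start will fail with rbd create error (17) File exists
    rbd -p mypool ls | grep vm-300-cloudinit
    rbd -p mypool info vm-300-cloudinit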
     
    #16 aychprox, Apr 24, 2019
    Last edited: Apr 24, 2019
  17. Craig St George

    Joined:
    Jul 31, 2018
    Messages:
    73
    Likes Received:
    7
    yes
    systemctl restart pvedaemon
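
    If HA-managed starts still fail after patching, it may be because the start is executed by the HA local resource manager, which is a separate long-running daemon that can still have the old module loaded - that is an assumption, but restarting it as well is cheap:

    systemctl restart pvedaemon
    systemctl restart pve-ha-lrm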


     
  18. Craig St George

    Joined:
    Jul 31, 2018
    Messages:
    73
    Likes Received:
    7
    Interesting. I did an update of the nodes from pve-no-subscription, and now it works as expected with the patch.
     
  19. Donny Davis

    Donny Davis New Member

    Joined:
    Apr 11, 2019
    Messages:
    4
    Likes Received:
    0
    Yeah, I am still getting the same error. My machines are pacemaker-managed. Additionally, I updated and rebooted every node in the cluster.
     
  20. mlanner

    mlanner Member

    Joined:
    Apr 1, 2009
    Messages:
    184
    Likes Received:
    1
    @mira When do you expect this patch to be released in the pve-no-subscription repo for testing?
     