PVE 5.4 Ceph Cloud Init Issue

Discussion in 'Proxmox VE: Installation and configuration' started by Donny Davis, Apr 12, 2019.

  1. Donny Davis

    Donny Davis New Member

    Joined:
    Apr 11, 2019
    Messages:
    4
    Likes Received:
    0
    I just ran the upgrade to 5.4 and now I am running into an issue with Ceph and the cloud-init images. Every time I migrate or want to start a VM with a cloud-init image, it says the image already exists and fails to start or migrate. If I go into rbd and remove the image, it seems to work fine.
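
    Until a fix is available, a minimal sketch of the manual workaround described above, assuming the Ceph pool behind the storage is named 'rbd' and the affected VM has ID 106 (both are examples, adjust to your setup):

    # list leftover cloud-init disks in the pool ('rbd' is an assumed pool name)
    rbd -p rbd ls | grep cloudinit
    # remove the stale image for the affected VM (VMID 106 is an example)
    rbd -p rbd rm vm-106-cloudinit
    # start the VM again
    qm start 106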
     
  2. wolfgang

    wolfgang Proxmox Staff Member
    Staff Member

    Joined:
    Oct 1, 2014
    Messages:
    4,683
    Likes Received:
    309
    HE_Cole, shantanu and Donny Davis like this.
  3. Donny Davis

    Donny Davis New Member

    Joined:
    Apr 11, 2019
    Messages:
    4
    Likes Received:
    0
  4. Craig St George

    Joined:
    Jul 31, 2018
    Messages:
    73
    Likes Received:
    7
    Good that this is known :) as I just hit this during an upgrade from 5.3 to 5.4. I rebooted the node, the VMs were moved off it and then failed to start because:

    task started by HA resource agent
    2019-04-15 08:53:40.779470 7f4755195100 -1 did not load config file, using default settings.
    rbd: create error: (17) File exists
    TASK ERROR: error with cfs lock 'storage-ceph-internal': rbd create vm-106-cloudinit' error: rbd: create error: (17) File exists
     
  5. Alan Urquhart

    Alan Urquhart New Member

    Joined:
    Apr 17, 2019
    Messages:
    2
    Likes Received:
    0
    Is there any update on this bug? It was logged in Bugzilla 5 days ago, but there are no further updates or an estimated fix date.

    This is a critical bug which renders Proxmox environments using cloud-init and Ceph fundamentally broken.

    A bug fix (whether temporary or otherwise), or a rollback to a previously working version, should be provided ASAP.

    Users in this configuration currently can't shut down and restart a VM without manual intervention to remove the cloud-init disk from RBD. High Availability (which is one reason many people use Ceph) is also broken.

    If an HA node were to crash, VMs would not migrate to a live node, nor would they restart on the original node.

    Additionally, unless administrators have already encountered this bug, they are likely unaware of it and of the fact that their HA VMs are no longer highly available.
     
  6. tom

    tom Proxmox Staff Member
    Staff Member

    Joined:
    Aug 29, 2006
    Messages:
    13,460
    Likes Received:
    393
    HE_Cole likes this.
  7. Alan Urquhart

    Alan Urquhart New Member

    Joined:
    Apr 17, 2019
    Messages:
    2
    Likes Received:
    0
    @tom - thanks. I can confirm the patch works.
     
  8. Craig St George

    Joined:
    Jul 31, 2018
    Messages:
    73
    Likes Received:
    7
    Yes, it also works for me.
     
  9. Craig St George

    Joined:
    Jul 31, 2018
    Messages:
    73
    Likes Received:
    7
    I spoke too soon. It works if HA is not turned on.
    If you add a VM to HA, then shut down the VM and then issue a start, the error comes back.

    It's almost like this code exists somewhere else if the VM is in HA mode.

    If you remove the VM from HA and then restart it, it's OK.
    Any ideas?
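
    For anyone needing that workaround from the command line, a minimal sketch, assuming the affected VM has ID 106 (an example) and is registered as HA resource vm:106:

    # temporarily remove the VM from HA management
    ha-manager remove vm:106
    # start it the normal way, which works with the patch applied
    qm start 106
    # re-add it to HA once it is running
    ha-manager add vm:106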
     
  10. aychprox

    aychprox Member

    Joined:
    Oct 27, 2015
    Messages:
    61
    Likes Received:
    4
    I manually edited Cloudinit.pm as per the patch in https://pve.proxmox.com/pipermail/pve-devel/2019-April/036621.html. Live migration started to work fine.

    But if the respective HA-managed VM is shut down, it is unable to boot up again. The following error is shown:

    task started by HA resource agent
    rbd: create error: (17) File exists
    TASK ERROR: error with cfs lock 'storage-mystorage': rbd create vm-300-cloudinit' error: rbd: create error: (17) File exists


    Another issue is that previously we didn't need to put a domain in the host's "search domain" field, but now you have to, otherwise it throws an error at the line between:

    my $host_resolv_conf = PVE::INotify::read_file('resolvconf');
    my $searchdomains = [


    Hope this can be fixed soon.
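
    As a stop-gap for the search-domain error, besides filling in the host's "search domain" field as described above, setting the values per VM should also avoid the uninitialized value at line 94 - the per-VM variant is an assumption based on the snippet above; VMID 300, the domain and the nameserver are examples:

    # store a DNS search domain (and nameserver) in the VM's cloud-init settings
    qm set 300 --searchdomain example.local --nameserver 192.168.1.1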
     
  11. mira

    mira Proxmox Staff Member
    Staff Member

    Joined:
    Feb 25, 2019
    Messages:
    137
    Likes Received:
    10
    I can't reproduce the issue with HA here. What are the exact steps to do so? Can you provide some information about your setup?
     
  12. Craig St George

    Joined:
    Jul 31, 2018
    Messages:
    73
    Likes Received:
    7
    Interesting. I have external Ceph, and both the cloud-init disk and the VM drive are on Ceph.

    I did some more tests and still see the same issue. The VM is an Ubuntu one, but it happens for all of them.

    It only happens if you do the HA start.
    Live or offline migration works fine.

    So if you create a VM, then add it to HA and issue a start, you get that error.

    If you create a VM and start it, then add it to HA, it starts OK.


    The issue seems to be only with the HA start. It's almost as if it calls some other code that either bypasses that Cloudinit.pm or has its own, simpler code.
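
    A quick way to compare the two start paths from the CLI, assuming the test VM has ID 106 (an example) and is already added as HA resource vm:106:

    # plain start - runs the patched Cloudinit.pm code path and works
    qm start 106
    # HA start - the actual start is executed by the HA resource agent
    # (pve-ha-lrm), which is where the error still shows up
    ha-manager set vm:106 --state started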
     
  13. mira

    mira Proxmox Staff Member
    Staff Member

    Joined:
    Feb 25, 2019
    Messages:
    137
    Likes Received:
    10
    I used a hyper-converged cluster to test it, and I used HA to start and stop the VM repeatedly. Still could not reproduce it. I even migrated it in between to see if that made a difference; none whatsoever.

    What Ceph version are you running? Do you use krbd?
     
  14. Craig St George

    Joined:
    Jul 31, 2018
    Messages:
    73
    Likes Received:
    7
    No, not krbd.
    Remote Ceph running Mimic.

    Let me try to run some debug code and see if I can work out what is happening.
    I'm away tomorrow though, and it's late here.

    As I said, the initial problem was fixed with the patch, as that error happened with just a normal start.
     
  15. mgiammarco

    mgiammarco Member

    Joined:
    Feb 18, 2010
    Messages:
    73
    Likes Received:
    0
    I have tried the patch and nothing changed. Do I need to restart some services?
     
  16. aychprox

    aychprox Member

    Joined:
    Oct 27, 2015
    Messages:
    61
    Likes Received:
    4
    Just an existing VM on the upgraded node (with the latest no-subscription repo), on an 8-node Ceph cluster with Luminous.

    proxmox-ve: 5.4-1 (running kernel: 4.15.18-12-pve)
    pve-manager: 5.4-3 (running version: 5.4-3/0a6eaa62)
    pve-kernel-4.15: 5.3-3
    pve-kernel-4.15.18-12-pve: 4.15.18-35
    pve-kernel-4.15.18-11-pve: 4.15.18-34
    pve-kernel-4.15.18-10-pve: 4.15.18-32
    pve-kernel-4.15.18-9-pve: 4.15.18-30
    pve-kernel-4.15.18-8-pve: 4.15.18-28
    pve-kernel-4.15.18-7-pve: 4.15.18-27
    pve-kernel-4.15.18-4-pve: 4.15.18-23
    pve-kernel-4.15.18-2-pve: 4.15.18-21
    pve-kernel-4.15.18-1-pve: 4.15.18-19
    pve-kernel-4.15.17-3-pve: 4.15.17-14
    pve-kernel-4.15.17-1-pve: 4.15.17-9
    pve-kernel-4.13.13-6-pve: 4.13.13-42
    pve-kernel-4.13.13-2-pve: 4.13.13-33
    ceph: 12.2.11-pve1
    corosync: 2.4.4-pve1
    criu: 2.11.1-1~bpo90
    glusterfs-client: 3.8.8-1
    ksm-control-daemon: 1.2-2
    libjs-extjs: 6.0.1-2
    libpve-access-control: 5.1-8
    libpve-apiclient-perl: 2.0-5
    libpve-common-perl: 5.0-50
    libpve-guest-common-perl: 2.0-20
    libpve-http-server-perl: 2.0-13
    libpve-storage-perl: 5.0-41
    libqb0: 1.0.3-1~bpo9
    lvm2: 2.02.168-pve6
    lxc-pve: 3.1.0-3
    lxcfs: 3.0.3-pve1
    novnc-pve: 1.0.0-3
    openvswitch-switch: 2.7.0-3
    proxmox-widget-toolkit: 1.0-25
    pve-cluster: 5.0-36
    pve-container: 2.0-37
    pve-docs: 5.4-2
    pve-edk2-firmware: 1.20190312-1
    pve-firewall: 3.0-19
    pve-firmware: 2.0-6
    pve-ha-manager: 2.0-9
    pve-i18n: 1.1-4
    pve-libspice-server1: 0.14.1-2
    pve-qemu-kvm: 2.12.1-3
    pve-xtermjs: 3.12.0-1
    qemu-server: 5.0-50
    smartmontools: 6.5+svn4324-1
    spiceterm: 3.0-5
    vncterm: 1.5-3
    zfsutils-linux: 0.7.13-pve1~bpo2

    Steps to reproduce:
    1) Shut down the existing HA-managed VM using the GUI.
    2) Start the VM with HA using the GUI.

    The HA-managed VM is unable to boot, and the following log output appears in the task viewer:

    Use of uninitialized value in split at /usr/share/perl5/PVE/QemuServer/Cloudinit.pm line 94.
    rbd: create error: (17) File exists
    TASK ERROR: error with cfs lock 'storage-2': rbd create vm-300-cloudinit' error: rbd: create error: (17) File exists


    The error at line 94 can be resolved by adding a DNS search domain value, but the VM is still unable to boot due to error (17).

    Tested:
    Remove HA - still unable to boot, similar log output shown.
    Remove the cloud-init drive, or put the cloud-init drive on NFS/local disk, reboot - OK.

    So I believe this only happens when the cloud-init drive is on Ceph.

    tailf /var/log/daemon.log

    Apr 24 16:45:01 node1.local.host systemd[1]: Started Proxmox VE replication runner.
    Apr 24 16:45:37 node1.local.host pvedaemon[1890943]: start VM 300: UPID:node1.local.host:001CDA7F:010F622F:5CC02231:qmstart:300:root@pam:
    Apr 24 16:45:37 rk1-4u2-106C6a pvedaemon[1890943]: Use of uninitialized value in split at /usr/share/perl5/PVE/QemuServer/Cloudinit.pm line 94.
    Apr 24 16:45:37 node1.local.host pvedaemon[1890943]: error with cfs lock 'storage-2': rbd create vm-300-cloudinit' error: rbd: create error: (17) File exists
    Apr 24 16:46:00 node1.local.host systemd[1]: Starting Proxmox VE replication runner...
    Apr 24 16:46:01 node1.local.host systemd[1]: Started Proxmox VE replication runner.
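
    To confirm it is the leftover image that blocks the start, you can check the pool right after the shutdown - a sketch assuming the pool behind 'storage-2' is called 'mypool' (adjust to your storage.cfg):

    # if the cloudinit disk is still listed after shutdown, the next
    # start will fail with rbd create error (17) File exists
    rbd -p mypool ls | grep vm-300-cloudinit
    rbd -p mypool info vm-300-cloudinit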
     
    #16 aychprox, Apr 24, 2019
    Last edited: Apr 24, 2019
  17. Craig St George

    Joined:
    Jul 31, 2018
    Messages:
    73
    Likes Received:
    7
    yes
    systemctl restart pvedaemon
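
    If HA-managed starts still fail after patching, it may be because the start is executed by the HA local resource manager, which is a separate long-running daemon that can still have the old module loaded - that is an assumption, but restarting it as well is cheap:

    systemctl restart pvedaemon
    systemctl restart pve-ha-lrm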


     
  18. Craig St George

    Joined:
    Jul 31, 2018
    Messages:
    73
    Likes Received:
    7
    Interesting. I did an update of the nodes from pve-no-subscription, and now it works as expected with the patch.
     
  19. Donny Davis

    Donny Davis New Member

    Joined:
    Apr 11, 2019
    Messages:
    4
    Likes Received:
    0
    Yeah, I am still getting the same error. My machines are pacemaker-managed. Additionally, I updated and rebooted every node in the cluster.
     
  20. mlanner

    mlanner Member

    Joined:
    Apr 1, 2009
    Messages:
    184
    Likes Received:
    1
    @mira When do you expect this patch to be released in the pve-no-subscription repo for testing?
     