VMs not automatically starting on host reboots, when relying on CephFS for cloud-config snippet storage

Anyone else having trouble with VMs not automatically starting on host reboots, when relying on CephFS for cloud-config snippet storage?

Code:
TASK ERROR: can't open '/mnt/pve/fs/snippets/321-k8s-worker-1-ci-user-k8s-focal.yaml' - No such file or directory

This occurs with both RancherOS and Ubuntu Focal VMs. The VMs boot just fine when started manually via the GUI.

I have tried delaying boot by over two minutes, in case the filesystem mount is somehow delayed:
Code:
startup: up=130

This is on a current-version cluster
Code:
pve-manager/6.2-10/a20769ed (running kernel: 5.4.44-2-pve)

ceph: 14.2.10-pve1
ceph-fuse: 14.2.10-pve1
corosync: 3.0.4-pve1
..
pve-qemu-kvm: 5.0.0-11

with a healthy (I think) Ceph cluster

Code:
root@pve-node1:~# ceph -s
  cluster:
    id:     e93e063c-9107-464c-95b3-d9b4bebdec36
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum node0,pve-node2,pve-node1 (age 31m)
    mgr: node0(active, since 34m), standbys: pve-node1
    mds: fs:1 {0=node0=up:active} 2 up:standby
    osd: 20 osds: 20 up (since 31m), 20 in (since 10d)
 
  data:
    pools:   3 pools, 416 pgs
    objects: 68.24k objects, 260 GiB
    usage:   718 GiB used, 34 TiB / 35 TiB avail
    pgs:     416 active+clean
 
  io:
    client:   9.3 KiB/s rd, 402 KiB/s wr, 5 op/s rd, 62 op/s wr
 
What do the log files say?
 
Good question.
Syslog indicates that systemd[1]: Mounting /mnt/pve/fs... does not start mounting the CephFS filesystem until well after the PVE guests have notionally begun booting.

TIA, Piers

Full (non-snipped) log outputs
/var/log/syslog --> https://gist.github.com/piersdd/f260db6bf5e2b80afd529c9b7928b3bc
/var/log/daemon.log --> https://gist.github.com/piersdd/a46bbc903eed4980b740723c1fe10b0b


/var/log/ceph/ceph-mds.node0.log
Code:
2020-08-07 12:53:50.753 7f7ed74aa4c0  0 set uid:gid to 64045:64045 (ceph:ceph)
2020-08-07 12:53:50.753 7f7ed74aa4c0  0 ceph version 14.2.10 (aae9ebf445baf8ac7438d8dc0ed287b078820b64) nautilus (stable), process ceph-mds, pid 5513
2020-08-07 12:53:50.753 7f7ed74aa4c0  0 pidfile_write: ignore empty --pid-file
2020-08-07 12:53:50.785 7f7ed39c7700  1 mds.node0 Updating MDS map to version 997 from mon.1
2020-08-07 12:53:59.990 7f7ed39c7700  1 mds.node0 Updating MDS map to version 998 from mon.1
2020-08-07 12:53:59.990 7f7ed39c7700  1 mds.node0 Monitors have assigned me to become a standby.


snipped /var/log/daemon.log
Code:
Aug  7 12:54:03 node0 pvesh[6442]: Starting VM 332
Aug  7 12:54:03 node0 pve-guests[6465]: <root@pam> starting task UPID:node0:00001A91:00001E6E:5F2CC24B:qmstart:332:root@pam:
Aug  7 12:54:03 node0 pve-guests[6801]: start VM 332: UPID:node0:00001A91:00001E6E:5F2CC24B:qmstart:332:root@pam:
Aug  7 12:54:03 node0 pve-guests[6801]: can't open '/mnt/pve/fs/snippets/332-k8s-worker-nvidia-2-ci-user-k8s-focal-nvidia.yaml' - No such file or directory
Aug  7 12:54:03 node0 sh[3789]: Running command: /usr/sbin/ceph-volume lvm trigger 1-84e96bed-517f-422e-8ca6-2e4ab1e95c1f
Aug  7 12:54:03 node0 sh[3678]: Running command: /usr/sbin/ceph-volume lvm trigger 5-45b0d4a1-28a5-4ff7-bc2b-3b3d3d6523f4
Aug  7 12:54:03 node0 sh[3660]: Running command: /usr/sbin/ceph-volume lvm trigger 0-daca388e-3154-49f7-9744-2a5e89e8029e
Aug  7 12:54:03 node0 systemd[1]: Mounting /mnt/pve/fs...
Aug  7 12:54:03 node0 ceph-osd[5532]: 2020-08-07 12:54:03.783 7fd7af70e700 -1 osd.0 4001 set_numa_affinity unable to identify public interface 'vmbr87' numa node: (2) No such file or directory
Aug  7 12:54:03 node0 ceph-osd[5531]: 2020-08-07 12:54:03.851 7fa8843f3700 -1 osd.3 4001 set_numa_affinity unable to identify public interface 'vmbr87' numa node: (2) No such file or directory
Aug  7 12:54:03 node0 ceph-osd[5528]: 2020-08-07 12:54:03.855 7fbd678ae700 -1 osd.1 4001 set_numa_affinity unable to identify public interface 'vmbr87' numa node: (2) No such file or directory
Aug  7 12:54:03 node0 systemd[1]: Mounted /mnt/pve/fs.
Aug  7 12:54:04 node0 pve-ha-crm[6193]: status change wait_for_quorum => slave
Aug  7 12:54:04 node0 pvestatd[5884]: zfs error: cannot open 'data': no such pool#012
 
Can you see in the Ceph logs when CephFS starts to serve data (i.e. when it gets mounted)? The filesystem might take some time to become ready, even after the MDS has started.
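For example, something along these lines (the paths are the ones from this thread; adjust the MDS name and mount point to your setup):

Code:
# When did an MDS become active for the filesystem?
grep -i active /var/log/ceph/ceph-mds.node0.log

# When did this node actually mount the CephFS storage?
journalctl -b | grep -E 'Mount(ing|ed) /mnt/pve/fs'

Comparing those timestamps with the qmstart entries in daemon.log should show whether the guests are simply racing the mount.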
 
I would have thought the active MDS on another node would satisfy that requirement.

Further, I wondered why the 130-second delay was being ignored. Ignore that; I just read the manual. The delay applies to subsequent VMs:
> Startup delay: Defines the interval between this VM start and subsequent VMs starts . E.g. set it to 240 if you want to wait 240 seconds before starting other VMs
-- Automatic Start and Shutdown of Virtual Machines

So perhaps I should have a simple VM start first, with a delay that holds off any VMs that rely on CephFS snippet storage.
I guess the worry I have is how to ensure my VMs come up after a total power outage/cluster cold start.
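To make that concrete, here is an untested sketch of such a staggered startup. VM 100 is just a placeholder for a "simple" VM on local storage; 321 and 332 are the snippet-dependent guests from above:

Code:
# Placeholder VM 100 starts first and holds back later guests for 240 seconds,
# hopefully giving CephFS enough time to mount.
qm set 100 --startup order=1,up=240

# The VMs that need CephFS snippets start afterwards.
qm set 321 --startup order=2
qm set 332 --startup order=2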

Thanks for your help, Alwin.
 
Sorry to necro this thread, but in the process of migrating my infrastructure from libvirt to Proxmox I've run into this problem, and figured I'd follow up for future googlers.

Nowadays, a better workaround for this problem than the one proposed by the OP above would be to use startall-onboot-delay:

Code:
pvenode config set -startall-onboot-delay 30
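If I remember correctly, you can verify the node setting afterwards with:

Code:
pvenode config get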

But even if less ugly, blind timers are still a brittle hack. Ideally, Proxmox would more intelligently wait for CephFS to mount before starting VMs, at least those that depend on CephFS mount points.

Otherwise, using cephfs for cloud-init snippets is a footgun.
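As a stopgap, one approach might be a systemd drop-in that makes the guest-autostart service wait for the mount point. This is an untested sketch and purely my own assumption about what is safe to override here; the 300-second ceiling is arbitrary:

Code:
# /etc/systemd/system/pve-guests.service.d/wait-for-cephfs.conf
[Service]
# Wait (up to 5 minutes) for /mnt/pve/fs to be mounted before the autostart job runs.
# The leading '-' means guests still start if the wait times out.
ExecStartPre=-/usr/bin/timeout 300 /bin/sh -c 'until mountpoint -q /mnt/pve/fs; do sleep 5; done'

followed by a systemctl daemon-reload.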
 
After a bit more testing, the CephFS mount time is so variable that 30 seconds isn't consistently enough; sometimes it takes 2-3 minutes before CephFS mounts.

IMO, unless you're willing to set the startall delay to something quite safe, like 5-10 minutes, you should simply avoid using CephFS for snippets at all. This means your custom cloud-init snippets have to be written to local storage only and replicated to all nodes through other means (like Terraform). Unfortunate.
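For what it's worth, the replication itself doesn't have to be Terraform; even a cron'd rsync from one "source" node would do. Rough sketch, with placeholder node names and the default local snippets path (the 'local' storage needs the snippets content type enabled):

Code:
# Push local cloud-init snippets from this node to the others.
for node in pve2 pve3; do
    rsync -av /var/lib/vz/snippets/ "root@${node}:/var/lib/vz/snippets/"
done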
 
> IMO, unless you're willing to set the startall delay to something quite safe, like 5-10 minutes, you should simply avoid using CephFS for snippets at all. This means your custom cloud-init snippets have to be written to local storage only and replicated to all nodes through other means (like Terraform). Unfortunate.
Another, still "ugly", but arguably less so, approach is to write an external script that probes all required services for good health/startup and then initiates the VM starts. With a probe every 5 seconds or so, you won't have to rely on arbitrarily long sleeps. We don't use Ceph, but I'd imagine that if it's in place for snippets it could be in the critical path for other parts of the system, so moving snippets away from it is just moving the goalposts.
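For the CephFS case in this thread, such a probe script might look roughly like this (untested sketch; the mount path and VM IDs are taken from earlier posts, and the timeout is arbitrary):

Code:
#!/bin/sh
# Poll every 5 seconds until the CephFS snippet storage is mounted and Ceph
# reports HEALTH_OK, then start the dependent guests. Give up after 10 minutes.
# (You may want to accept HEALTH_WARN as well.)
elapsed=0
while ! mountpoint -q /mnt/pve/fs || ! ceph health | grep -q HEALTH_OK; do
    if [ "$elapsed" -ge 600 ]; then
        echo "gave up waiting for CephFS" >&2
        exit 1
    fi
    sleep 5
    elapsed=$((elapsed + 5))
done

for vmid in 321 332; do
    qm start "$vmid"
done

The guests started this way would of course need their onboot flag disabled, so the script doesn't race the normal autostart.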

Good luck


