"Start at boot" and shared storage

jdl

Hello,
Often, some VMs which have the "start at boot" flag enabled don't start at boot.
They're located on shared storage (iSCSI, NFS or Ceph).
I thought the "startup delay" parameter could be useful, but according to the documentation (https://pve.proxmox.com/wiki/Virtual_Machine_Startup_and_Shutdown_Behavior) that is not what this parameter is for.
The VMs probably don't start because networking takes too long to initialize, which in turn delays the shared storage, and so on.
The only workaround I found was to create a tiny local container, set its "start at boot" flag to on, and set its "startup delay" to (say) 30s, so that the required services have enough time to initialize before the VMs on shared storage boot. The sole purpose of this container is to act as a timer; it's not a clean solution.
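For illustration, the hack boils down to something like this (the VMIDs are made up; 100 is the dummy local CT, 101 a VM on shared storage):
Code:
# dummy local CT: starts first, then a 30s delay before anything else starts
pct set 100 --onboot 1 --startup order=1,up=30
# VM on iSCSI/NFS/Ceph: starts after the delay
qm set 101 --onboot 1 --startup order=2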
Obviously there should be a better way.
Any ideas?
Thanks
 
The VMs probably don't start because networking takes too long to initialize, which in turn delays the shared storage, and so on.

The startall at boot is triggered by pve-manager, which should only start after the network stack is fully established and configured. At least on PVE 4.X this is the case; 3.X should be too, but I'd have to check.

On my test cluster Ceph is well up at that point and my machines are able to start; I have to note that it is a Ceph-on-PVE setup.

What's your pveversion -v output? (If it's older than 4.1, please update :p)

Can you post an excerpt from the boot log, ideally including the network/storage initialization part and the startall command?
 
I'm using PVE 4.1-13; here's the output of pveversion -v:

Code:
proxmox-ve: 4.1-37 (running kernel: 4.2.8-1-pve)
pve-manager: 4.1-13 (running version: 4.1-13/cfb599fb)
pve-kernel-4.2.8-1-pve: 4.2.8-37
lvm2: 2.02.116-pve2
corosync-pve: 2.3.5-2
libqb0: 1.0-1
pve-cluster: 4.0-32
qemu-server: 4.0-55
pve-firmware: 1.1-7
libpve-common-perl: 4.0-48
libpve-access-control: 4.0-11
libpve-storage-perl: 4.0-40
pve-libspice-server1: 0.12.5-2
vncterm: 1.2-1
pve-qemu-kvm: 2.5-5
pve-container: 1.0-44
pve-firewall: 2.0-17
pve-ha-manager: 1.0-21
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u1
lxc-pve: 1.1.5-7
lxcfs: 0.13-pve3
cgmanager: 0.39-pve1
criu: 1.6.0-1
openvswitch-switch: 2.3.2-2

Is it the output of journalctl you need, or another log?
 
Attached is the filtered log (I grepped 3 lines of context around the matched keywords):

The first "startall" message appears after the eth0 ready message, but before the eth1 ready message, and the nic used for the iSCSI is eth1.
 

Hmm, OK, yeah, that would be a problem here; I'll give it a look. If possible we want to keep the current behaviour, the argument being that it is useful when a VM provides services for other VMs, so the others need to be started later on; we do not want to break compatibility here if we can avoid it. I'll look at how the systemd networking unit behaves in such cases.

Until then, keep using the small CT hack, even if it's not the nicest solution.
 
I assume the "small CT hack" means create a tiny (basically no-op) LXC (or whatever it's called in the current version) container on each host, set it to boot priority=1, and use that to determine the timings for the rest of the boot?

If so, I'm thinking I might be better off disabling auto-start on all the VMs in the cluster and writing a service that runs on a node and controls the start-up interdependencies of each VM.

My problem is similar, in that I have an 8-node PVE/Ceph cluster, but some of the systems take noticeably longer to boot than others, and two of them frequently fail to start their OSDs correctly at boot (timeouts, I think), so bringing my cluster up "cold" is a royal pain in the butt. And then I get the IOPS boot storm as four domain controllers (two each for two domains) try to boot immediately while Ceph is still trying to rebalance... it's not pretty.
 
I assume the "small CT hack" means create a tiny (basically no-op) LXC (or whatever it's called in the current version) container on each host, set it to boot priority=1, and use that to determine the timings for the rest of the boot?

Exactly; it can also be another VM that has no other boot priority (and, in your case, isn't on Ceph). Regarding the "or whatever it's called in the current version" comment: we switched CT technology once in 8 years, and the reason was that OpenVZ development was halted, so while I understand that this problem is annoying to you, there's no need to be snarky :)


If so, I'm thinking I might be better off disabling auto-start on all the VMs in the cluster and writing a service that runs on a node and controls the start-up interdependencies of each VM.

I could imagine adding a "pre-start delay" setting to the startup order schema declaration. While normally we argue that the manager should only execute the startall command once all services are up and running, that cannot always be guaranteed, since "running service" != "ready service". That would keep backwards compatibility, it's non-invasive, and it would solve this once and for all. Do you have any thoughts on that?
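
For context, the existing startup property already takes order/up/down keys, roughly like this (hypothetical VMID); the proposed "pre-start delay" would be an additional key in that same schema and does not exist yet:
Code:
# start 2nd, wait 60s before the next guest starts, allow 30s for shutdown
qm set 101 --startup order=2,up=60,down=30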

My problem is similar, in that I have an 8-node PVE/Ceph cluster, but some of the systems take noticeably longer to boot than others, and two of them frequently fail to start their OSDs correctly at boot (timeouts, I think), so bringing my cluster up "cold" is a royal pain in the butt. And then I get the IOPS boot storm as four domain controllers (two each for two domains) try to boot immediately while Ceph is still trying to rebalance... it's not pretty.


Could you please post a log from boot until the startall task from pve-manager? Maybe something is off; it would be interesting to see what happens when.
Code:
journalctl -b > log.out
If it's too big, use pastebin or similar. Thanks.
 
I would say the biggest problem is the Ceph re-re-balancing after the last couple of nodes finish booting, because that consumes nearly 100% of the IOPS each node can deliver. (Four of the nodes have strangely slow disks despite being 10k SAS drives with an SSD cache... something to do with the Linux driver for the HBA.)
If Ceph were quiescent, it could probably handle the boot-storm IOPS load.

I won't have the opportunity to do a complete cold-start anytime soon, so I'm going to let this thread die here and I'll start a new thread the next time one of my OSDs fails to start.
 
The startall at boot is triggered by pve-manager, which should only start after the network stack is fully established and configured. At least on PVE 4.X this is the case; 3.X should be too, but I'd have to check.

On my test cluster Ceph is well up at that point and my machines are able to start; I have to note that it is a Ceph-on-PVE setup.

What's your pveversion -v output? (If it's older than 4.1, please update :p)

Can you post an excerpt from the boot log, ideally including the network/storage initialization part and the startall command?

Could you guys improve the "Start All VMs" mechanism so it only starts once quorum is reached? I often find that VMs don't start at boot because quorum hasn't been reached. :(
 
Could you guys improve the "Start All VMs" mechanism so it only starts once quorum is reached? I often find that VMs don't start at boot because quorum hasn't been reached. :(

This is already checked. We wait for up to 60 seconds after the pve-manager service is started; it is started quite late in the boot cycle (on my system it's actually the last one), so quorum should be there almost instantly.
If no quorum is reached by then, or quorum is lost while starting the VMs, the startall command is aborted.

Prior to PVE 4.X we waited a little less for quorum, so if you use that and your system needs that long to start up, you may run into problems...

What's your setup, or why might quorum be achieved so late? I know boot time can be long on servers, but by the point where the startall command is triggered the heavy part has already happened, so a minute was chosen as a good upper limit here :)
 
This is already checked. We wait for up to 60 seconds after the pve-manager service is started; it is started quite late in the boot cycle (on my system it's actually the last one), so quorum should be there almost instantly.
If no quorum is reached by then, or quorum is lost while starting the VMs, the startall command is aborted.

Prior to PVE 4.X we waited a little less for quorum, so if you use that and your system needs that long to start up, you may run into problems...

What's your setup, or why might quorum be achieved so late? I know boot time can be long on servers, but by the point where the startall command is triggered the heavy part has already happened, so a minute was chosen as a good upper limit here :)

We use 4x 10GbE in LACP with VLANs, and it takes 3 minutes to reach quorum; perhaps the network bonding and VLANs take more time to become available, even though all the bonding, VLAN and bridge interfaces are already up and running.

I noticed from systemd that pve-manager.service had already been up for 2 minutes and had failed to boot the VMs marked to start at boot due to missing quorum. Perhaps it takes 3 minutes for us to reach quorum. I ended up using a cron job at reboot that waits 5 minutes and then restarts pve-manager, which did the trick. Could Proxmox restart the pve-manager service once the node reaches quorum? :D

Code:
@reboot sleep 300; systemctl restart pve-manager.service &>/dev/null
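
A slightly more targeted variant of that hack would be to poll for quorum instead of sleeping a fixed 5 minutes; a rough sketch (the retry count and interval are arbitrary choices):
Code:
#!/bin/sh
# poll corosync until the cluster is quorate, then re-run the autostart
# by restarting pve-manager; give up after ~5 minutes
for i in $(seq 1 60); do
    if corosync-quorumtool -s 2>/dev/null | grep -q 'Quorate:.*Yes'; then
        systemctl restart pve-manager.service
        exit 0
    fi
    sleep 5
done
exit 1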

We set up 3 VMs in HA spread across 3 nodes, on either GlusterFS or Ceph RBD; perhaps that is why quorum is needed to boot the VMs, since as I recall we didn't have this issue last May, before any HA VMs were configured.

We also tried to restrict a VM to a specific node via an HA group, but Proxmox kept starting the VM on other available nodes in the cluster whenever that node was out of sync with the cluster. I ended up relaxing the corosync.conf totem token timeouts to avoid false alarms and Proxmox rebooting a working node.

Code:
token: 10000
token_retransmits_before_loss_const: 10
consensus: 12000

Reference:
http://docs.openstack.org/ha-guide/controller-ha-pacemaker.html
 
And what about a "start when" conditional? That would leave the admin free to start a container/VM once something specific has happened (rather than a boot order, which cannot deal with timing events).

For example, I am having this problem because my NFS storage is not ready (and the duration of an fsck is not known). With a conditional based on a check script, autostart would begin only when the check script reports that the NFS storage is ready. The same would work for other storage types or network services. I am not sure whether this idea could be filed as a feature request.
 
And what about a "start when" conditional? That would leave the admin free to start a container/VM once something specific has happened (rather than a boot order, which cannot deal with timing events).

For example, I am having this problem because my NFS storage is not ready (and the duration of an fsck is not known). With a conditional based on a check script, autostart would begin only when the check script reports that the NFS storage is ready. The same would work for other storage types or network services. I am not sure whether this idea could be filed as a feature request.

you can already do this easily by creating your own systemd units (checking for mounts, paths, other services, ... and running arbitrary commands like "pct start" or "qm start")
 
you can already do this easily by creating your own systemd units (checking for mounts, paths, other services, ... and running arbitrary commands like "pct start" or "qm start")

Could you explain a little more how to combine systemd units with Proxmox CTs/VMs? Is there any documentation about that?
 
Could you explain a little more how to combine systemd units with Proxmox CTs/VMs? Is there any documentation about that?

no there isn't (at least not by proxmox).

systemd allows you to define arbitrary units that depend on various conditions, like mount points being mounted, time conditions, started services, ... Such units can run any command you like, including "pct start" and "qm start" to start containers or VMs. Check the man pages for systemd.* for the extensive documentation on how to configure units.
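
As a rough sketch of what such a unit could look like (the VMID 100 and the mount path /mnt/pve/nfs-storage are placeholders; adjust them to your own setup):
Code:
# /etc/systemd/system/start-vm-100.service
[Unit]
Description=Start VM 100 once the NFS storage is mounted
After=pve-cluster.service pvestatd.service

[Service]
Type=oneshot
TimeoutStartSec=600
# poll until the PVE-managed NFS mount shows up (bounded by TimeoutStartSec)
ExecStartPre=/bin/sh -c 'until mountpoint -q /mnt/pve/nfs-storage; do sleep 5; done'
ExecStart=/usr/sbin/qm start 100

[Install]
WantedBy=multi-user.target

Enable it with systemctl enable start-vm-100.service and disable the VM's own "start at boot" flag so it isn't started twice.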
 
