Hello Proxmox Community!
While rebooting one node of a production Proxmox cluster, all guests became unreachable. The situation returned to normal after the node was back up.
Steps performed for the reboot
Before the reboot, all guests running on that node were migrated to other nodes in the cluster. Yes, I am aware that this can be done automatically, but some of the services running in LXC don't handle a restart very well.
The cluster uses Ceph as storage. The following flags were set (see the commands after the list):
* nobackfill
* nodown
* noout
* norebalance
* norecover
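For completeness, these were set beforehand with the standard Ceph CLI, roughly like this (and removed again after the maintenance):

```
# Set maintenance flags before the reboot so Ceph does not start
# rebalancing or recovery while the node is down.
ceph osd set noout
ceph osd set nobackfill
ceph osd set norebalance
ceph osd set norecover
ceph osd set nodown

# After the node is back up, clear them again:
for f in noout nobackfill norebalance norecover nodown; do
    ceph osd unset "$f"
done
```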
The node was rebooted via the "Reboot" button in the GUI. It took some time before all Proxmox services were stopped and the node finally rebooted. As soon as the node was offline, the guests in the cluster became unavailable.
Additionally, all nodes in the cluster were marked with a grey question mark. As it turned out, this was because metrics are sent via HTTP to an InfluxDB instance running on the cluster itself, and pvestatd blocks if it can't reach the InfluxDB. Since this daemon provides the status updates about guests, hosts, and storage to the GUI, it seems plausible that this caused the grey question marks.
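To illustrate the dependency (the hostname and port below are placeholders, not the actual setup): the metric server is defined cluster-wide in /etc/pve/status.cfg, and a quick reachability check looks like this:

```
# The metric server definition lives in /etc/pve/status.cfg,
# e.g. an entry of type "influxdb" pointing at a host in the cluster.
cat /etc/pve/status.cfg

# Check whether the InfluxDB HTTP endpoint still answers (5 s timeout);
# influxdb.example.com:8086 is a placeholder.
curl --max-time 5 http://influxdb.example.com:8086/ping

# See whether pvestatd itself is still alive or hanging.
systemctl status pvestatd
```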
As soon as the rebooted node was up again, everything was back to normal.
Proxmox VE Setup
The Proxmox VE cluster originally started as a three-node cluster with Ceph. It was later extended by two nodes. These nodes do not have any OSDs; one of them has an additional Ceph monitor, manager, and metadata server installed.
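The monitor/manager layout can be cross-checked with the usual Ceph status commands, e.g.:

```
ceph mon stat    # monitors and current quorum
ceph mgr stat    # active and standby managers
ceph fs status   # MDS state, if CephFS is in use
```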
The pools are all 3/2 (size 3, min_size 2). The rebooted node was one of the original three. In theory, the remaining two copies should be enough for fully functional storage.
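For reference, the replication settings can be verified per pool ("mypool" is a placeholder name):

```
ceph osd pool get mypool size       # expected: size: 3
ceph osd pool get mypool min_size   # expected: min_size: 2

# Or list all pools with their settings at once:
ceph osd pool ls detail
```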
Does anybody have an idea or a hint as to what could have caused the outage?
Please let me know if I can provide any further info.