Guests became unreachable during reboot of one node

jmi

Jul 31, 2024
Hello Proxmox Community!

While rebooting one node of a production Proxmox cluster, all guests became unreachable. The situation returned to normal once the node was back up.


Steps performed for the reboot

Before the reboot, all guests running on that node were moved to other nodes in the cluster. Yes, I am aware that this can be done automatically, but some of the services running in LXC don't handle a restart very well.
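For reference, this is roughly the CLI equivalent of what was done per guest; the VMIDs (101, 202) and the target node name (pve2) are just placeholders:

Code:
# live-migrate a QEMU VM without downtime
qm migrate 101 pve2 --online
# migrate an LXC container; containers are stopped and restarted on the target
pct migrate 202 pve2 --restart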

The cluster uses Ceph as storage. The following flags were set (command sketch below the list):

* nobackfill
* nodown
* noout
* norebalance
* norecover
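
For anyone following along, these flags are set before the maintenance window and cleared again afterwards with ceph osd set / unset:

Code:
# before the reboot: pause backfill/recovery and keep OSDs from being marked out or down
for flag in nobackfill nodown noout norebalance norecover; do
    ceph osd set "$flag"
done

# after the node is back up: clear the flags again
for flag in nobackfill nodown noout norebalance norecover; do
    ceph osd unset "$flag"
done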

The node was rebooted via the GUI "Reboot" button. It took some time before all Proxmox services were stopped and the node finally rebooted. As soon as the node was offline, the guests in the cluster were unreachable.

Additionally, all nodes in the cluster were marked with a grey question mark. As it turned out, this was because metrics are sent via HTTP to an InfluxDB instance running on the cluster, and pvestatd blocks if it can't reach the InfluxDB. Since this daemon provides updates about guests, hosts, and storage to the GUI, it seems reasonable that this caused the grey question marks.
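If someone runs into the same symptom, the hang is easy to confirm on a node showing the question marks; this is plain systemd tooling, nothing assumed beyond the pvestatd unit name:

Code:
# check whether the status daemon is stuck
systemctl status pvestatd
journalctl -u pvestatd --since "1 hour ago"
# a restart normally brings the GUI status icons back once InfluxDB is reachable again
systemctl restart pvestatd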

As soon as the rebooted node was up again, everything was back to normal.


Proxmox VE Setup

The Proxmox VE cluster originally started as a three-node cluster with Ceph. It was later extended by two nodes. These nodes do not have any OSDs; one of them has an additional Ceph monitor, manager, and metadata server installed.

The pools are all size 3 / min_size 2. The rebooted node was one of the original three. In theory, the remaining two copies should be enough for fully functional storage.
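For completeness, this is how the layout and the pool settings can be double-checked on any node with a Ceph client keyring (standard Ceph commands, nothing cluster-specific assumed):

Code:
ceph -s                  # overall health, MON quorum, PG states
ceph osd tree            # OSD distribution across the three original nodes
ceph osd pool ls detail  # per-pool size/min_size (expect size 3, min_size 2)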


Does anybody have an idea or hint about what could have caused the outage?
Please let me know if I can provide any further info.
 
Proxmox clusters are sketchy.
There are a few networky things you can do to help keep them standing up.

If you used the autojoin function, you might have joined on an unintended interface/IP.
Run pvecm status and make sure those IPs are all what you expect them to be.

NOTE: Do the reboot again, and run pvecm status while the issue is ongoing. That should tell you bunches.
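Something like this, run from one of the remaining nodes while the issue is ongoing (both tools ship with Proxmox VE):

Code:
pvecm status          # quorum state and membership list with the IPs actually in use
corosync-cfgtool -s   # status of the local Corosync links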

Make sure you are NOT running Corosync on the same interface as Ceph.
At the very least, break them out with a VLAN so you can manage the traffic.
https://pve.proxmox.com/wiki/Separate_Cluster_Network
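A rough sketch of what that can look like in /etc/network/interfaces; the interface name, VLAN IDs, and subnets are made up and need to be adapted:

Code:
# dedicated Corosync VLAN
auto bond0.50
iface bond0.50 inet static
    address 10.50.0.11/24

# Ceph public/cluster VLAN
auto bond0.60
iface bond0.60 inet static
    address 10.60.0.11/24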

While you are at it, break out the migration traffic (Proxmox's equivalent of vMotion) into its own VLAN.
Look for "Migration Settings" here.
https://pve.proxmox.com/wiki/Manual:_datacenter.cfg
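In datacenter.cfg that's a one-liner; the subnet here is only an example:

Code:
# /etc/pve/datacenter.cfg
migration: secure,network=10.70.0.0/24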

Consider adding a redundant corosync interface.
https://pve.proxmox.com/wiki/Cluster_Manager#pvecm_redundancy
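For an existing cluster, that means adding a second link address to every node entry in /etc/pve/corosync.conf (plus a matching interface with linknumber 1 in the totem section, and bumping config_version). The addresses below are placeholders:

Code:
node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.50.0.11
    ring1_addr: 10.51.0.11
}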

It's quite exciting watching your whole cluster go POOF, isn't it?
 
Thanks for the replies.

@gurubert there are four Ceph MONs in the cluster.

@tcabernoch there are no surprises in the output of pvecm status. Corosync and Ceph are separated by VLANs. The IPs listed in the membership section are all from the Corosync VLAN.

My best guess is that the outage was Ceph related.
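In case it helps the next person: if it happens again, a quick way to see whether Ceph is actually the culprit is to watch the PG states from another node while the node goes down, since inactive PGs would explain guests losing access to their disks. These are generic Ceph commands, nothing specific to this cluster:

Code:
watch -n 2 ceph -s    # MON quorum and PG states while the node reboots
ceph health detail    # lists the affected PGs/OSDs if anything goes inactive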
 