Hello Proxmox Community!
While rebooting one node of a production Proxmox cluster, all guests became unreachable. The situation returned to normal after the node was back up.
Steps performed for the reboot
Before the reboot, all guests running on that node were migrated to other nodes in the cluster. Yes, I am aware that this can be done automatically, but some of the services running in LXC don't handle a restart very well.
The cluster uses Ceph as storage. The following flags were set (see the commands after the list):
* nobackfill
* nodown
* noout
* norebalance
* norecover
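For completeness, these were set beforehand with the standard Ceph CLI, roughly like this (and removed again after the maintenance):

```
# Set maintenance flags before the reboot so Ceph does not start
# rebalancing or recovery while the node is down.
ceph osd set noout
ceph osd set nobackfill
ceph osd set norebalance
ceph osd set norecover
ceph osd set nodown

# After the node is back up, clear them again:
for f in noout nobackfill norebalance norecover nodown; do
    ceph osd unset "$f"
done
```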
The node was rebooted via the "Reboot" button in the GUI. It took some time before all Proxmox services were stopped and the node finally rebooted. As soon as the node was offline, the guests in the cluster became unavailable.
Additionally, all nodes in the cluster were marked with a grey question mark. As it turned out, this was because metrics are sent via HTTP to an InfluxDB instance running on the cluster itself, and pvestatd blocks if it can't reach the InfluxDB. Since this daemon provides the status updates about guests, hosts, and storage to the GUI, it seems plausible that this caused the grey question marks.
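To illustrate the dependency (the hostname and port below are placeholders, not the actual setup): the metric server is defined cluster-wide in /etc/pve/status.cfg, and a quick reachability check looks like this:

```
# The metric server definition lives in /etc/pve/status.cfg,
# e.g. an entry of type "influxdb" pointing at a host in the cluster.
cat /etc/pve/status.cfg

# Check whether the InfluxDB HTTP endpoint still answers (5 s timeout);
# influxdb.example.com:8086 is a placeholder.
curl --max-time 5 http://influxdb.example.com:8086/ping

# See whether pvestatd itself is still alive or hanging.
systemctl status pvestatd
```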
As soon as the rebooted node was up again, everything was back to normal.
Proxmox VE Setup
The Proxmox VE cluster originally started as a three-node cluster with Ceph. It was later extended by two nodes. These nodes do not have any OSDs; one of them has an additional Ceph monitor, manager, and metadata server installed.
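The monitor/manager layout can be cross-checked with the usual Ceph status commands, e.g.:

```
ceph mon stat    # monitors and current quorum
ceph mgr stat    # active and standby managers
ceph fs status   # MDS state, if CephFS is in use
```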
The pools are all 3/2 (size 3, min_size 2). The rebooted node was one of the original three. In theory, the remaining two copies should be enough for fully functional storage.
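For reference, the replication settings can be verified per pool ("mypool" is a placeholder name):

```
ceph osd pool get mypool size       # expected: size: 3
ceph osd pool get mypool min_size   # expected: min_size: 2

# Or list all pools with their settings at once:
ceph osd pool ls detail
```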
Does anybody have an idea or a hint as to what could have caused the outage?
Please let me know if I can provide any further info.