Unexpected reboot on Proxmox cluster

pablomart81

Member
Dec 9, 2020
Hi everyone,

Since Monday we have been having unexpected reboots on all nodes of our Proxmox cluster simultaneously. It has happened 3 times: once on Monday and twice on Wednesday.

We have checked our hardware and everything seems to be OK: there is no power or network outage, and temperatures are well below warning thresholds. We've contacted HP support and they can't find any error in our hardware.

Is it possible that these reboots are caused by the Proxmox cluster itself?
We're running a 6-node cluster on Proxmox 7.4-17 using Ceph 17.2.6. It has been running for over a year without any problems, but now it has started failing without an apparent reason.


If you need more information or cluster logs, please let me know.
I'll attach the logs, because the forum doesn't allow big messages.
 

Attachments

  • Last logs before reboot.txt
    30.2 KB
Hi,
from the log you provided it seems that the node fenced itself because of loss of network communication, presumably because the issues with your Ceph cluster are choking the network. This is why a dedicated low-latency network just for corosync is required, see https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network

So set up a dedicated network for the Proxmox VE cluster and then have a closer look at what is going on with your Ceph cluster.
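
As a rough sketch (not a full procedure, and output details vary by corosync version), the current state of the cluster links can be checked from any node with the standard tools before and after the change:

# Quorum/membership overview as Proxmox VE sees it
pvecm status

# Per-node status of the knet links corosync is currently using
corosync-cfgtool -s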
 
Thanks for the reply,
We currently have two 10 Gbit links for corosync and Ceph; do I understand correctly that this is enough, or do I need more bandwidth?
 
We have had this environment running without problems for 2 years; the growth in VMs has not been large enough to generate a massive load on the disks.

We have detected on one of the nodes that part of the network interface carrying the Ceph and corosync traffic was flapping; this can cause the whole cluster to reboot.
 
This is an English-speaking forum; please post in English so we can help.

If I understood you correctly, you think that a link speed of 10G is enough for corosync and Ceph. That is however not the point: for corosync you need a low-latency network, so separating it from the storage network is a requirement. Otherwise you will run into issues like the ones you are experiencing, with nodes fencing themselves.
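
For illustration only: corosync cares about low and stable round-trip times rather than throughput, so a quick sanity check between two nodes on the corosync network could be a simple ping (the address below is just a placeholder for another node's corosync IP):

# Look at the min/avg/max/mdev round-trip times; they should stay low and stable,
# per the latency requirements in the docs linked above
ping -c 100 -q 10.0.0.32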
 
Thanks for the reply,
Sorry, I hadn't realized I should respond in English.
We currently have two 10 Gbit links for corosync and Ceph; do I understand correctly that this is enough, or do I need more bandwidth?

We have had this environment running without problems for 2 years; the growth in VMs has not been large enough to generate a massive load on the disks.

We have detected on one of the nodes that part of the network interface carrying the Ceph and corosync traffic was flapping; this may be what is causing the entire cluster to reset.
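
As a side sketch (the interface name below is only an example): link flaps usually leave traces in the kernel log and in the carrier-change counter, so something like this can help confirm what was seen:

# Kernel messages about link state changes (most NIC drivers log "Link is Up"/"Link is Down")
journalctl -k | grep -iE 'link is (up|down)'

# How many times the carrier has changed since boot for the suspect interface (example name)
cat /sys/class/net/enp4s0f0/carrier_changes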
 
Can you tell me which part indicates that connectivity has been lost in the cluster, so I can understand the log better?

Is this the line that indicates there is no connectivity in the cluster?
May 29 13:36:44 pve01-boa ceph-osd[3055]: 2024-05-29T13:36:44.217+0200 7fdf3533e700 -1 osd.3 32970 heartbeat_check: no reply from 10.0.0.3:6838 osd.11 since back 2024-05-29T13:35:32.829468+0200 front 2024-05-29T13:36:36.842610+0200 (oldest deadline 2024-05-29T13:35:53.929373+0200)

I understand that this line has nothing to do with the cluster; what is this WARNING due to?

May 30 11:27:10 pve01-boa ceph-crash[1897]: WARNING:ceph-crash:post /var/lib/ceph/crash/2024-01-22T13:32:38.684075Z_f06f7b02-2cb4-4c8b-a30e-2590a0a750c4 as client.admin failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
 
There are 2 links of 10 Gbit.

According to what you indicate, I would have to rebuild the entire cluster.
That is impossible without a major service interruption.
 
No, for corosync a low-latency network is required; bandwidth is not the limiting factor. Please have a look at the link to the docs provided above. Also, you can add a second, redundant network just for corosync without having to tear apart the cluster, please see https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
This way you can add the dedicated network without interrupting the cluster.
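
A minimal sketch of what that looks like, assuming 10.10.10.0/24 is the new corosync-only network (the addresses and node ID below are made up; the exact editing procedure, including increasing config_version, is described in the linked docs):

# /etc/pve/corosync.conf (excerpt) -- add a ring1_addr on the new network to every node
nodelist {
  node {
    name: pve01-boa
    nodeid: 1
    quorum_votes: 1
    # existing shared Ceph/corosync network (example address)
    ring0_addr: 10.0.0.31
    # new dedicated corosync-only network (example address)
    ring1_addr: 10.10.10.31
  }
  # ... repeat for the other five nodes, each with its own ring1_addr ...
}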

Can you tell me which part indicates that connectivity has been lost in the cluster, so I can understand the log better?
The log lines prefixed with corosync[2480] and pmxcfs[2218] are, in your case, the ones telling us about the cluster network issues. In particular, May 29 13:36:49 pve01-boa corosync[2480]: [TOTEM ] Token has not been received in 4200 ms just before the reboot tells us that the node could not sync up with the quorate part of the cluster.
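
For completeness, one way to pull just those cluster-stack messages out of the journal around the time of an incident (the timestamps below are only examples):

# corosync and pmxcfs (pve-cluster) messages in a narrow time window
journalctl -u corosync -u pve-cluster --since "2024-05-29 13:30" --until "2024-05-29 13:45"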
 

Thanks Chris,

I will take a look at the documentation to modify the corosync network.

One more question: are all TOTEM notices due to corosync connectivity failures?
 

I have these TOTEM messages; I understand that these are only notification messages:
May 30 10:40:40 pve01-boa corosync[2465]: [TOTEM ] Retransmit List: b6a93
May 30 10:40:40 pve01-boa corosync[2465]: [TOTEM ] Retransmit List: b6a95
May 30 10:40:41 pve01-boa corosync[2465]: [TOTEM ] Retransmit List: b6a9c
May 30 10:40:41 pve01-boa corosync[2465]: [TOTEM ] Retransmit List: b6a9d
May 30 10:40:41 pve01-boa corosync[2465]: [TOTEM ] Retransmit List: b6aa3
May 30 10:40:41 pve01-boa corosync[2465]: [TOTEM ] Retransmit List: b6aa4
May 30 10:40:41 pve01-boa corosync[2465]: [TOTEM ] Retransmit List: b6aa5
May 30 10:40:42 pve01-boa corosync[2465]: [TOTEM ] Retransmit List: b6aac
 
Regarding the modification of corosync: I have looked at the documentation you provided and it is very easy to make the change, thank you very much.
 
Can a flapping network interface used by corosync generate these retransmit messages and cause the entire cluster to reset?
 
