Unexpected reboot on Proxmox cluster

pablomart81

Member
Dec 9, 2020
Hi everyone,

Since Monday we have been having unexpected reboots on all nodes of our Proxmox cluster simultaneously. It has happened 3 times: once on Monday and twice on Wednesday.

We have checked our hardware and everything seems to be OK: there is no power or network outage, and temperatures are well below warning thresholds. We've contacted HP support and they can't find any error in our hardware.

Is it possible that these reboots are caused by the Proxmox cluster itself?
We're running a 6-node cluster on Proxmox 7.4-17 using Ceph 17.2.6. It has been running for over a year without any problems, but now it has started failing without an apparent reason.


If you need more information or cluster logs, please let me know.
I'll attach the logs, because the forum doesn't allow big messages.
 

Attachments

  • Last logs before reboot.txt
    30.2 KB
Hi,
from the log you provided it seems that the node fenced itself because of loss of network communication, presumably because the issues with your Ceph cluster are choking the network. This is why a dedicated low-latency network just for corosync is required, see https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_cluster_network

So set up a dedicated network for the Proxmox VE cluster and then have a closer look at what is going on with your Ceph cluster.
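
As a rough sketch (not a full procedure, and output details vary by corosync version), the current state of the cluster links can be checked from any node with the standard tools before and after the change:

# Quorum/membership overview as Proxmox VE sees it
pvecm status

# Per-node status of the knet links corosync is currently using
corosync-cfgtool -s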
 
Thanks for the reply,
We currently have two 10 Gbit links for corosync and Ceph; do I understand correctly that this is enough, or do I need more bandwidth?
 
We have had this environment running without problems for 2 years; the growth in VMs has not been large enough to generate a massive load on the disks.

We have detected on one of the nodes that part of the network interface carrying the Ceph and corosync traffic was flapping; this can cause the whole cluster to reboot.
 
This is an English-speaking forum; please post in English so we can help.

If I understood you correctly, you think that a link speed of 10G is enough for corosync and Ceph. That is however not the point: for corosync you need a low-latency network, so separating it from the storage network is a requirement. Otherwise you will run into issues like the ones you are experiencing, with nodes fencing themselves.
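
For illustration only: corosync cares about low and stable round-trip times rather than throughput, so a quick sanity check between two nodes on the corosync network could be a simple ping (the address below is just a placeholder for another node's corosync IP):

# Look at the min/avg/max/mdev round-trip times; they should stay low and stable,
# per the latency requirements in the docs linked above
ping -c 100 -q 10.0.0.32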
 
Thanks for the reply,
Sorry, I hadn't realized I should respond in English.
We currently have two 10 Gbit links for corosync and Ceph; do I understand correctly that this is enough, or do I need more bandwidth?

We have had this environment running without problems for 2 years; the growth in VMs has not been large enough to generate a massive load on the disks.

We have detected on one of the nodes that part of the network interface carrying the Ceph and corosync traffic was flapping; this may be what is causing the entire cluster to reset.
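
As a side sketch (the interface name below is only an example): link flaps usually leave traces in the kernel log and in the carrier-change counter, so something like this can help confirm what was seen:

# Kernel messages about link state changes (most NIC drivers log "Link is Up"/"Link is Down")
journalctl -k | grep -iE 'link is (up|down)'

# How many times the carrier has changed since boot for the suspect interface (example name)
cat /sys/class/net/enp4s0f0/carrier_changes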
 
Can you tell me which part indicates that connectivity has been lost in the cluster, so I can understand the log better?

Is this the line that indicates there is no connectivity in the cluster?
May 29 13:36:44 pve01-boa ceph-osd[3055]: 2024-05-29T13:36:44.217+0200 7fdf3533e700 -1 osd.3 32970 heartbeat_check: no reply from 10.0.0.3:6838 osd.11 since back 2024-05-29T13:35:32.829468+0200 front 2024-05-29T13:36:36.842610+0200 (oldest deadline 2024-05-29T13:35:53.929373+0200)

I understand that this line has nothing to do with the cluster; what is this WARNING due to?

May 30 11:27:10 pve01-boa ceph-crash[1897]: WARNING:ceph-crash:post /var/lib/ceph/crash/2024-01-22T13:32:38.684075Z_f06f7b02-2cb4-4c8b-a30e-2590a0a750c4 as client.admin failed: Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')
 
There are 2 links of 10 Gbit.

According to what you indicate, I would have to rebuild the entire cluster.
That is impossible without a major service interruption.
 
No, for corosync a low-latency network is required; bandwidth is not the limiting factor. Please have a look at the link to the docs provided above. Also, you can add a second, redundant network just for corosync without having to tear apart the cluster, please see https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
This way you can add the dedicated network without interrupting the cluster.
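
A minimal sketch of what that looks like, assuming 10.10.10.0/24 is the new corosync-only network (the addresses and node ID below are made up; the exact editing procedure, including increasing config_version, is described in the linked docs):

# /etc/pve/corosync.conf (excerpt) -- add a ring1_addr on the new network to every node
nodelist {
  node {
    name: pve01-boa
    nodeid: 1
    quorum_votes: 1
    # existing shared Ceph/corosync network (example address)
    ring0_addr: 10.0.0.31
    # new dedicated corosync-only network (example address)
    ring1_addr: 10.10.10.31
  }
  # ... repeat for the other five nodes, each with its own ring1_addr ...
}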

Can you tell me which part indicates that connectivity has been lost in the cluster, so I can understand the log better?
The log lines prefixed with corosync[2480] and pmxcfs[2218] are, in your case, the ones telling us about the cluster network issues. In particular, May 29 13:36:49 pve01-boa corosync[2480]: [TOTEM ] Token has not been received in 4200 ms just before the reboot tells us that the node could not sync up with the quorate part of the cluster.
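
For completeness, one way to pull just those cluster-stack messages out of the journal around the time of an incident (the timestamps below are only examples):

# corosync and pmxcfs (pve-cluster) messages in a narrow time window
journalctl -u corosync -u pve-cluster --since "2024-05-29 13:30" --until "2024-05-29 13:45"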
 

Thanks Chris,

I will take a look at the documentation to modify the corosync network.

One more question: are all TOTEM notices due to corosync connectivity failures?
 

I have these TOTEM messages; I understand that these are only notification messages:
May 30 10:40:40 pve01-boa corosync[2465]: [TOTEM ] Retransmit List: b6a93
May 30 10:40:40 pve01-boa corosync[2465]: [TOTEM ] Retransmit List: b6a95
May 30 10:40:41 pve01-boa corosync[2465]: [TOTEM ] Retransmit List: b6a9c
May 30 10:40:41 pve01-boa corosync[2465]: [TOTEM ] Retransmit List: b6a9d
May 30 10:40:41 pve01-boa corosync[2465]: [TOTEM ] Retransmit List: b6aa3
May 30 10:40:41 pve01-boa corosync[2465]: [TOTEM ] Retransmit List: b6aa4
May 30 10:40:41 pve01-boa corosync[2465]: [TOTEM ] Retransmit List: b6aa5
May 30 10:40:42 pve01-boa corosync[2465]: [TOTEM ] Retransmit List: b6aac
 
Regarding the modification of corosync: I have looked at the documentation you provided and it is very easy to make the change, thank you very much.
 
Can a flapping network interface used by corosync generate these retransmit messages and cause the entire cluster to reset?
 
