Random reboot of full Proxmox Cluster

arnaudd

New Member
Aug 4, 2017
Hello,

We have 4 servers in the same cluster that reset together at random.
They are located in 2 data centers and in 4 different racks.

We find nothing in the logs (except a few entries on 1 or 2 servers reporting lost nodes in the cluster).
The servers seem to be reset and then boot normally.
This has happened 3 times: once last week and twice today.
We upgraded the cluster in January to patch Meltdown / Spectre, and uptime was around 110 days until last week's reboot.

Every node is also a Ceph node and runs a mon.

Release: 9.3
Codename: stretch
Kernel : 4.13.13-4-pve
Proxmox : 5.1-41
Total nodes in cluster : 4
Vlans : 2 for ceph and 1 for proxmox
Connectivity : 10 Gbps between all nodes

We did not see any related fix in the changelog of the new version, but we will upgrade to 5.2 to use the latest kernel.

Thanks for your help.
 
We have 4 servers in the same cluster that reset together at random.
They are located in 2 data centers and in 4 different racks.
This introduces a penalty on the latency, as the servers are further apart.

We find nothing in the logs (except a few entries on 1 or 2 servers reporting lost nodes in the cluster).
These messages appear when corosync does not get the token through in time, or at all, so the node appears to be offline.

The servers seem to be reset and then boot normally.
This has happened 3 times: once last week and twice today.
If you have HA configured, the node fences itself.

We upgraded the cluster in January to patch Meltdown / Spectre, and uptime was around 110 days until last week's reboot.
Hyper-converged clusters are very dynamic; the resource consumption seems to have changed since then.

Every node is also a Ceph node and runs a mon.
Four nodes allow only one node to fail; if one more fails, quorum is lost.
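
To make the arithmetic explicit, here is a minimal sketch, assuming the default of one vote per node and a simple majority quorum. The function names are only for illustration and are not part of any Proxmox tool.

```python
def quorum_votes(total_nodes: int) -> int:
    """Votes needed for a majority, assuming one vote per node."""
    return total_nodes // 2 + 1


def tolerated_failures(total_nodes: int) -> int:
    """Nodes that can be lost while the remaining ones still hold a majority."""
    return total_nodes - quorum_votes(total_nodes)


for n in (3, 4, 5):
    print(f"{n} nodes: quorum={quorum_votes(n)}, can lose {tolerated_failures(n)}")
```

Under these assumptions a 4-node cluster tolerates no more failures than a 3-node one; a fifth vote would be needed to survive two failed nodes.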

Vlans : 2 for ceph and 1 for proxmox
Connectivity : 10 Gbps between all nodes
VLANs introduce more latency and only separate the traffic logically. The traffic still shares the same physical medium, so it can interfere with the cluster traffic, especially if network-demanding services such as Ceph are also running.

For stable cluster communication, the corosync traffic needs to be put onto its own physical, low-latency network. Additionally, a second corosync ring can be introduced for fault tolerance. Use omping to test the stability of your corosync network and configuration.
https://pve.proxmox.com/pve-docs/chapter-pvecm.html#_cluster_network
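
As a rough sketch of the omping test, the helper below only assembles the command line; it does not run anything. The flags follow the example in the linked documentation as far as I recall it, so please check them against the chapter above. The node addresses are hypothetical placeholders, and the helper name is made up for this post.

```python
def omping_command(hosts, count=10000, interval=0.001):
    """Build an omping invocation for a short, high-rate multicast test.

    Flags as recalled from the pvecm documentation (verify there):
    -c probe count, -i interval in seconds, -F skip rate limiting, -q quiet.
    """
    if len(hosts) < 2:
        raise ValueError("omping needs at least two hosts to be meaningful")
    return ["omping", "-c", str(count), "-i", str(interval), "-F", "-q", *hosts]


# Hypothetical addresses on the dedicated corosync network.
nodes = ["10.10.10.1", "10.10.10.2", "10.10.10.3", "10.10.10.4"]
print(" ".join(omping_command(nodes)))
```

The same command has to be started on all nodes at roughly the same time, since each omping instance waits for the others to answer.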
 
This introduces a penalty on the latency, as the servers are further apart.

If you have HA configured, the node fences itself.

Is there a way to change the fencing timeout or to disable fencing?

If we disable HA for all VMs but keep HA itself enabled, will the node no longer fence itself?
Where can I find the logs for this?

Thanks a lot !
 
All logs are under /var/log/; there is a subfolder for pve.

The reboot looks like a hard reset of the server. There is nothing in the logs. Wouldn't it be more sensible, and make debugging easier, to at least write a line to syslog or somewhere else saying that HA is forcing a restart of the node?

Thanks
 
The pve-ha-lrm and pve-ha-crm write into the syslog/journal. Depending on the filesystem, writes may not get there before the machine is reset.
 
