mons down: reboot of the physical node (Proxmox)

I would like to ask you for help because I am running out of ideas on how to solve our issue.
A couple of months ago (just after updating from Proxmox 7.2 to 7.3), we began to receive HEALTH_WARN messages from our Ceph cluster.

The messages we receive are like the following (not always the same monitor):

Code:
[WARN] MON_DOWN: 1/5 mons down, quorum vrt-05,vrt-08,vrt-02,vrt-09
mon.vrt-01 (rank 4) addr [v2:10.61.12.201:3300/0,v1:10.61.12.201:6789/0] is down (out of quorum)

Our cluster usually recovers after a few minutes; when we run "ceph -s" everything looks correct again and the Proxmox system works fine.
But sometimes, after we receive messages like those, we experience a reboot of the physical node (Proxmox). This only happens with nodes that are running the monitor service.
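For reference, these are the kind of checks that can be run when the warning appears (standard Ceph CLI commands, nothing specific to our cluster):

Code:
# overall health plus the reason for the current warning
ceph health detail
# which monitors are in quorum right now, and who is the leader
ceph quorum_status --format json-pretty
ceph mon stat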

Our hyper-converged Proxmox cluster is made up of 12 physical nodes (48 cores / 384 GB RAM each).
Each physical machine has 4 OSDs (Intel D3-S4610 3.84 TB).
Five of the nodes also run monitors.
Our Ceph cluster is currently at 62% of its capacity.
The nodes are currently at 10-20% CPU and 40-50% RAM usage.
All OSDs are currently under 10 ms of Apply/Commit latency.

Note:
Proxmox PVE: 7.3-3 (enterprise repo)
Ceph: 16.2.9
Kernel: 5.15.74-1-pve
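(These versions can be confirmed with the standard reporting commands:)

Code:
pveversion -v   # Proxmox VE and related package versions
ceph versions   # Ceph version per daemon type across the cluster
uname -r        # running kernel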

Any help will be appreciated
 
It could be that Proxmox's software watchdog (armed by the HA stack on top of corosync) is the culprit here: when a node loses its corosync connection to the other nodes, it may fence itself with a reboot. Do the corosync links share traffic with other applications?
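On the affected nodes it is also worth looking at what corosync and the watchdog logged around the time of the reboot. Something along these lines should work with the standard Proxmox tooling (the pve-ha-lrm and watchdog-mux unit names assume the default HA stack):

Code:
# knet link status for the local node, one line per configured link
corosync-cfgtool -s
# corosync, HA manager and watchdog messages from the relevant time window
journalctl -u corosync -u pve-ha-lrm -u watchdog-mux --since "2 hours ago"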
 
Thanks gurubert.

In our PVE cluster, we have two physical networks:
  • bond0 LACP (802.3ad) (2x10G), MTU 9000. VLANs: VM access networks + management
  • bond1 LACP (802.3ad) (2x40G), MTU 9000. VLANs: storage + backups + cluster
We run corosync on both the management and the cluster networks, as sketched below.
We know that best practice recommends a dedicated physical network for corosync, but the hardware doesn't allow more PCI cards per node.
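For reference, this is roughly how the two links look in /etc/pve/corosync.conf. The addresses are placeholders and the priority values are only an assumption about preferring the 40G cluster VLAN (with knet in its default passive mode, the link with the highest priority is used):

Code:
totem {
  interface {
    linknumber: 0
    knet_link_priority: 20   # cluster VLAN on bond1 (preferred)
  }
  interface {
    linknumber: 1
    knet_link_priority: 10   # management VLAN on bond0 (fallback)
  }
  ...
}
nodelist {
  node {
    name: vrt-hv01
    nodeid: 1
    ring0_addr: <ip-on-cluster-vlan>      # placeholder
    ring1_addr: <ip-on-management-vlan>   # placeholder
  }
  ...
}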

We've searched the logs and found this:

Code:
Dec 31 03:55:05 vrt-hv01 corosync[2836]:   [TOTEM ] A processor failed, forming new configuration: token timed out (9500ms), waiting 11400ms for consensus.
Dec 31 04:06:03 vrt-hv05 corosync[3617]:   [KNET  ] pmtud: Global data MTU changed to: 8885

The MTU on both the switches and the physical interfaces is set to 9000. Could it perhaps be a good idea to lower the MTU for the corosync VLAN interface, or to cap it in corosync itself with the totem netmtu setting below?

Code:
totem {
  netmtu: 1397
  ...
}
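Before pinning netmtu, it may be worth verifying that 9000-byte frames really pass end-to-end on the corosync VLAN, since the pmtud message above suggests something on the path only carries 8885 bytes. A quick check with plain ping from one node to another (the address is a placeholder):

Code:
# full-size frame with the don't-fragment bit set:
# 8972 bytes of ICMP payload + 20-byte IP header + 8-byte ICMP header = 9000
ping -M do -s 8972 -c 5 <other-node-ip-on-corosync-vlan>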
 
