mons down: reboot of the physical node (Proxmox)

I would like to ask you for help because I am running out of ideas on how to solve our issue.
A couple of months ago (just after updating from Proxmox 7.2 to 7.3), we began to receive HEALTH_WARN messages from our Ceph cluster.

The messages we receive are like the following (not always the same monitor):

Code:
[WARN] MON_DOWN: 1/5 mons down, quorum vrt-05,vrt-08,vrt-02,vrt-09
mon.vrt-01 (rank 4) addr [v2:10.61.12.201:3300/0,v1:10.61.12.201:6789/0] is down (out of quorum)

Our cluster usually recovers after a few minutes; when we run "ceph -s" everything looks correct again and the Proxmox system works fine.
But sometimes, after we receive messages like those, we experience a reboot of the physical node (Proxmox). This only happens with nodes that are running the monitor service.
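For reference, these are the kind of checks that can be run when the warning appears (standard Ceph CLI commands, nothing specific to our cluster):

Code:
# overall health plus the reason for the current warning
ceph health detail
# which monitors are in quorum right now, and who is the leader
ceph quorum_status --format json-pretty
ceph mon stat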

Our hyper-converged Proxmox cluster is made up of 12 physical nodes (48 cores / 384 GB RAM each).
Each physical machine has 4 OSDs (Intel D3-S4610 3.84 TB).
Five of the nodes also run monitors.
Our Ceph cluster is currently at 62% of its capacity.
The nodes are currently at 10-20% CPU and 40-50% RAM usage.
All OSDs are currently under 10 ms of Apply/Commit latency.

Note:
Proxmox PVE: 7.3-3 (enterprise repo)
Ceph: 16.2.9
Kernel: 5.15.74-1-pve
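(These versions can be confirmed with the standard reporting commands:)

Code:
pveversion -v   # Proxmox VE and related package versions
ceph versions   # Ceph version per daemon type across the cluster
uname -r        # running kernel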

Any help will be appreciated
 
It could be that Proxmox's software watchdog (armed by the HA stack on top of corosync) is the culprit here: when a node loses its corosync connection to the other nodes, it may fence itself with a reboot. Do the corosync links share traffic with other applications?
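On the affected nodes it is also worth looking at what corosync and the watchdog logged around the time of the reboot. Something along these lines should work with the standard Proxmox tooling (the pve-ha-lrm and watchdog-mux unit names assume the default HA stack):

Code:
# knet link status for the local node, one line per configured link
corosync-cfgtool -s
# corosync, HA manager and watchdog messages from the relevant time window
journalctl -u corosync -u pve-ha-lrm -u watchdog-mux --since "2 hours ago"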
 
Thanks gurubert.

In our PVE cluster, we have two physical networks:
  • bond0 LACP (802.3ad) (2x10G), MTU 9000. VLANs: VM access networks + management
  • bond1 LACP (802.3ad) (2x40G), MTU 9000. VLANs: storage + backups + cluster
We run corosync on both the management and the cluster networks, as sketched below.
We know that best practice recommends a dedicated physical network for corosync, but the hardware doesn't allow more PCI cards per node.
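For reference, this is roughly how the two links look in /etc/pve/corosync.conf. The addresses are placeholders and the priority values are only an assumption about preferring the 40G cluster VLAN (with knet in its default passive mode, the link with the highest priority is used):

Code:
totem {
  interface {
    linknumber: 0
    knet_link_priority: 20   # cluster VLAN on bond1 (preferred)
  }
  interface {
    linknumber: 1
    knet_link_priority: 10   # management VLAN on bond0 (fallback)
  }
  ...
}
nodelist {
  node {
    name: vrt-hv01
    nodeid: 1
    ring0_addr: <ip-on-cluster-vlan>      # placeholder
    ring1_addr: <ip-on-management-vlan>   # placeholder
  }
  ...
}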

We've searched the logs and found this:

Code:
Dec 31 03:55:05 vrt-hv01 corosync[2836]:   [TOTEM ] A processor failed, forming new configuration: token timed out (9500ms), waiting 11400ms for consensus.
Dec 31 04:06:03 vrt-hv05 corosync[3617]:   [KNET  ] pmtud: Global data MTU changed to: 8885

The MTU on both the switches and the physical interfaces is set to 9000. Could it perhaps be a good idea to lower the MTU for the corosync VLAN interface, or to cap it in corosync itself with the totem netmtu setting below?

Code:
totem {
  netmtu: 1397
  ...
}
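Before pinning netmtu, it may be worth verifying that 9000-byte frames really pass end-to-end on the corosync VLAN, since the pmtud message above suggests something on the path only carries 8885 bytes. A quick check with plain ping from one node to another (the address is a placeholder):

Code:
# full-size frame with the don't-fragment bit set:
# 8972 bytes of ICMP payload + 20-byte IP header + 8-byte ICMP header = 9000
ping -M do -s 8972 -c 5 <other-node-ip-on-corosync-vlan>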
 
