Ceph Slow OSD Heartbeats on 3-node Direct Attached cluster

jimvman

Hello again everybody,

I have a 3-node direct-attached Ceph cluster with 100Gbps Mellanox adapters, running on the latest version of Proxmox. I have been periodically receiving the alerts below; Ceph goes into a warning state and then has to recover, which seems to slow things down for several minutes. (There are more errors on more OSDs than I'm listing here.)

2025-07-29T16:50:00.000137-0500 mon.proxmox10 (mon.0) 110547 : cluster [WRN] [WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 4464.499ms)
2025-07-29T16:50:00.000146-0500 mon.proxmox10 (mon.0) 110548 : cluster [WRN] Slow OSD heartbeats on back from osd.8 [] to osd.1 [] 4464.499 msec
2025-07-29T16:50:00.000157-0500 mon.proxmox10 (mon.0) 110549 : cluster [WRN] Slow OSD heartbeats on back from osd.11 [] to osd.4 [] 2930.283 msec
2025-07-29T16:50:00.000168-0500 mon.proxmox10 (mon.0) 110550 : cluster [WRN] Slow OSD heartbeats on back from osd.2 [] to osd.9 [] 2694.922 msec
2025-07-29T16:50:00.000181-0500 mon.proxmox10 (mon.0) 110551 : cluster [WRN] Slow OSD heartbeats on back from osd.9 [] to osd.4 [] 2587.703 msec

2025-07-29T16:50:00.000301-0500 mon.proxmox10 (mon.0) 110559 : cluster [WRN] [WRN] OSD_SLOW_PING_TIME_FRONT: Slow OSD heartbeats on front (longest 4789.440ms)
2025-07-29T16:50:00.000314-0500 mon.proxmox10 (mon.0) 110560 : cluster [WRN] Slow OSD heartbeats on front from osd.8 [] to osd.1 [] 4789.440 msec
2025-07-29T16:50:00.000330-0500 mon.proxmox10 (mon.0) 110561 : cluster [WRN] Slow OSD heartbeats on front from osd.11 [] to osd.4 [] 3061.380 msec
2025-07-29T16:50:00.000344-0500 mon.proxmox10 (mon.0) 110562 : cluster [WRN] Slow OSD heartbeats on front from osd.9 [] to osd.4 [] 3048.511 msec
2025-07-29T16:50:00.000359-0500 mon.proxmox10 (mon.0) 110563 : cluster [WRN] Slow OSD heartbeats on front from osd.2 [] to osd.9 [] 2904.906 msec
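
For reference, the same heartbeat latencies can also be pulled on demand instead of waiting for the cluster log. Something like this should work (osd.8 and the 0 ms threshold are just examples; dump_osd_network goes against the OSD's admin socket on the node that hosts it):

# show the current warnings with details
ceph health detail

# dump recent heartbeat ping times for one OSD; the trailing number is a
# threshold in milliseconds, 0 shows everything
ceph daemon osd.8 dump_osd_network 0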

I have upgraded the firmware on the Mellanox Technologies MT27700 Family [ConnectX-4] adapter in each of the three servers and rebooted, but I continue to periodically get these long heartbeats on front and back, which puts Ceph in a warning state for a while.

Ceph is on a separate network from the cluster network.

Can anyone provide any guidance on what else could be the problem? Maybe I should slow the Ceph network down a bit from 100Gbps, or could it be a DAC cabling issue?
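
To rule out the cabling/link layer I'm planning to run something like the following between the nodes (the interface name and peer IP are just placeholders from my setup):

# check the 100G ports for physical-layer errors (counter names vary by driver)
ethtool -S enp65s0f0 | grep -iE 'err|drop|discard'

# raw throughput test across the Ceph network between two nodes
iperf3 -s                      # on node A
iperf3 -c 10.10.10.11 -t 30    # on node B, pointing at node A's Ceph-network IP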

Also hoping this doesn't lead to bigger problems as I add more VMs to the cluster.

Thanks a lot for any help in solving this issue!
 
This may help someone else as well. I haven't made any changes yet, but I plan to put the public and cluster networks on the same subnet.
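
Before changing anything I wanted to see exactly how the networks are split right now; something like this shows it, depending on whether the networks are set in ceph.conf or in the config database (paths are the Proxmox defaults):

grep -E 'public_network|cluster_network' /etc/pve/ceph.conf
ceph config get mon public_network
ceph config get mon cluster_network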

The cluster is complaining that there is too much latency on the front network, which is the public network.
You're not using the Mellanox 100G cards for the public network, so please make sure there isn't too much congestion on the public network.

Ceph recommends keeping the public and cluster networks together if possible and only separating them if really necessary [0].
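
If you do move both onto the 100G subnet, the end state in /etc/pve/ceph.conf would look roughly like this (10.10.10.0/24 is just a placeholder for your Ceph subnet). Keep in mind the monitors must have addresses inside the public network, so changing it involves more than editing the file, and the OSDs need a restart (e.g. systemctl restart ceph-osd.target on each node) to pick it up:

[global]
    public_network  = 10.10.10.0/24
    cluster_network = 10.10.10.0/24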



[0] https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/