Long heartbeat ping times on back interface seen

Hi!

We just upgraded our Ceph nodes to PVE 7.2 with kernel 5.15.39-4 and Ceph 16.2.9, and we are experiencing this exact issue with OSD_SLOW_PING_TIME_FRONT/BACK.

The previous version, PVE 7.1, ran kernel 5.13.19-6 and Ceph 16.2.7 and was very stable for months.

Hardware is Supermicro X11/X12 with Mellanox ConnectX5 100G and Intel PCIe HHHL NVMe + Seagate Exos HDs.

Any advice on which 5.15 kernels are stable to run? Is 5.15.53-1 a good option? Or should we reboot into 5.13.19-6 for now?
 
Yeah, in our particular case, the only 5.15 version kernel that wasn't AS problematic was 5.15.35-2-pve

We are currently running 5.13.19-6-pve, as that has proven to be the fastest so far. We haven't tested the 5.15.53-1-pve kernel yet. For our customers we MUST have the stability, so I'm not sure we will even try the 5.15.53-1-pve kernel. We'll see.
--- edit ---
our "long heartbeat ping times" first went away on 5.15.35-2-pve

just for extra clarity
 
Seems the problem is back. We've noticed it with kernel 5.15.108-1-pve.

Code:
 cluster [WRN] Health check failed: Slow OSD heartbeats on back (longest 1064.667ms) (OSD_SLOW_PING_TIME_BACK)
 cluster [WRN] Health check failed: Slow OSD heartbeats on front (longest 1244.808ms) (OSD_SLOW_PING_TIME_FRONT)
 cluster [WRN] Health check update: Slow OSD heartbeats on back (longest 2670.215ms) (OSD_SLOW_PING_TIME_BACK)
 cluster [WRN] Health check update: Slow OSD heartbeats on front (longest 1744.936ms) (OSD_SLOW_PING_TIME_FRONT)
 cluster [WRN] Health check update: Slow OSD heartbeats on back (longest 2889.051ms) (OSD_SLOW_PING_TIME_BACK)
 cluster [WRN] Health check update: Slow OSD heartbeats on front (longest 2889.027ms) (OSD_SLOW_PING_TIME_FRONT)
 cluster [WRN] Health check update: Slow OSD heartbeats on back (longest 1872.833ms) (OSD_SLOW_PING_TIME_BACK)
 cluster [WRN] Health check update: Slow OSD heartbeats on front (longest 2481.821ms) (OSD_SLOW_PING_TIME_FRONT)

It started as a VM or two were migrated from one node to another.
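
In case it helps anyone comparing kernels, here is a rough sketch of how the warnings above could be summarized per interface. It only assumes the log line format shown in this post; the function name `summarize_slow_pings` and the `SAMPLE` constant are made up for illustration and are not part of Ceph.

```python
import re
from collections import defaultdict

# Matches cluster log lines of the form quoted above, e.g.
#   cluster [WRN] Health check update: Slow OSD heartbeats on back (longest 2670.215ms) (OSD_SLOW_PING_TIME_BACK)
PATTERN = re.compile(r"Slow OSD heartbeats on (front|back) \(longest ([0-9.]+)ms\)")


def summarize_slow_pings(lines):
    """Return {interface: (warning_count, worst_ms)} for matching lines."""
    samples = defaultdict(list)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            samples[match.group(1)].append(float(match.group(2)))
    return {iface: (len(vals), max(vals)) for iface, vals in samples.items()}


# The eight warnings quoted in this post, used as sample input.
SAMPLE = """\
cluster [WRN] Health check failed: Slow OSD heartbeats on back (longest 1064.667ms) (OSD_SLOW_PING_TIME_BACK)
cluster [WRN] Health check failed: Slow OSD heartbeats on front (longest 1244.808ms) (OSD_SLOW_PING_TIME_FRONT)
cluster [WRN] Health check update: Slow OSD heartbeats on back (longest 2670.215ms) (OSD_SLOW_PING_TIME_BACK)
cluster [WRN] Health check update: Slow OSD heartbeats on front (longest 1744.936ms) (OSD_SLOW_PING_TIME_FRONT)
cluster [WRN] Health check update: Slow OSD heartbeats on back (longest 2889.051ms) (OSD_SLOW_PING_TIME_BACK)
cluster [WRN] Health check update: Slow OSD heartbeats on front (longest 2889.027ms) (OSD_SLOW_PING_TIME_FRONT)
cluster [WRN] Health check update: Slow OSD heartbeats on back (longest 1872.833ms) (OSD_SLOW_PING_TIME_BACK)
cluster [WRN] Health check update: Slow OSD heartbeats on front (longest 2481.821ms) (OSD_SLOW_PING_TIME_FRONT)
"""

if __name__ == "__main__":
    summary = summarize_slow_pings(SAMPLE.splitlines())
    for iface, (count, worst) in sorted(summary.items()):
        print(f"{iface}: {count} warnings, worst {worst:.3f} ms")
```

Running the same function over the cluster log captured under each kernel would give a like-for-like comparison, though it says nothing about the root cause.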
 
We started to see the same on a 7-node cluster after upgrading to the newest version: kernel 6.2.16-4-pve, Ceph 17.2.6. We never saw this before, and there were no other changes in the cluster. 60 SSDs via 2x10G fiber to switches.

[WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 4854.443ms)
Slow OSD heartbeats on back from osd.50 [] to osd.47 [] 4854.443 msec
[WRN] OSD_SLOW_PING_TIME_FRONT: Slow OSD heartbeats on front (longest 4579.349ms)
Slow OSD heartbeats on front from osd.50 [] to osd.47 [] 4579.349 msec

Pings through both ceph networks are under 0.1ms between nodes.
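
Since plain pings look fine, one thing that might narrow it down is checking whether the same OSDs keep showing up in the detail lines. A minimal sketch, assuming the detail format quoted above; `count_osds` is a hypothetical helper name, not a Ceph tool.

```python
import re
from collections import Counter

# Matches health detail lines of the form quoted above, e.g.
#   Slow OSD heartbeats on back from osd.50 [] to osd.47 [] 4854.443 msec
DETAIL = re.compile(
    r"Slow OSD heartbeats on (?:front|back) "
    r"from (osd\.\d+) \[[^\]]*\] to (osd\.\d+) \[[^\]]*\] [0-9.]+ msec"
)


def count_osds(lines):
    """Count how often each OSD appears as sender or receiver of a slow ping."""
    counts = Counter()
    for line in lines:
        match = DETAIL.search(line)
        if match:
            counts[match.group(1)] += 1  # sending OSD
            counts[match.group(2)] += 1  # receiving OSD
    return counts


if __name__ == "__main__":
    detail = [
        "[WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 4854.443ms)",
        "Slow OSD heartbeats on back from osd.50 [] to osd.47 [] 4854.443 msec",
        "[WRN] OSD_SLOW_PING_TIME_FRONT: Slow OSD heartbeats on front (longest 4579.349ms)",
        "Slow OSD heartbeats on front from osd.50 [] to osd.47 [] 4579.349 msec",
    ]
    for osd, hits in count_osds(detail).most_common():
        print(osd, hits)
```

If one OSD or one host dominates the counts over time, that points at a node rather than the network as a whole; an even spread would fit the kernel theory better.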

Has anybody sorted this out and found the root cause?
 
I don't believe we ever found the root cause. Remediation came via upgrading to a kernel that didn't have the issue.