Long heartbeat ping times on back interface seen

CTCcloud · Jul 5, 2022

@midsize_erp Yeah, for sure. As far as the Ceph problem with the long ping times goes, it's gone. We are still seeing some odd VM behavior on the latest kernels (5.15.x), but Ceph has been fine.

Thomas Hukkelberg · Sep 25, 2022

Hi!

We just upgraded our Ceph nodes to PVE 7.2 with kernel 5.15.39-4 and Ceph 16.2.9, and we are experiencing this exact issue with OSD_SLOW_PING_TIME_FRONT/BACK.

The previous version, PVE 7.1, ran kernel 5.13.19-6 and Ceph 16.2.7 and was very stable for months.

Hardware is Supermicro X11/X12 with Mellanox ConnectX5 100G and Intel PCIe HHHL NVMe + Seagate Exos HDs.

Any advice on which 5.15 kernels are stable to run? Is 5.15.53-1 a good option? Or should we reboot into 5.13.19-6 for now?

CTCcloud · Sep 26, 2022

Yeah, in our particular case, the only 5.15 kernel that wasn't AS problematic was 5.15.35-2-pve.

We are currently running 5.13.19-6-pve, as that has proven to be the fastest so far. We haven't tested the 5.15.53-1-pve kernel yet. For our customers we MUST have stability, so I'm not sure we will even try the 5.15.53-1-pve kernel. We'll see.
--- edit ---
Just for extra clarity: our "long heartbeat ping times" first went away on 5.15.35-2-pve.

VoIP-Ninja · Jul 27, 2023

Seems the problem is back. We've noticed it with kernel 5.15.108-1-pve.

Code:

 cluster [WRN] Health check failed: Slow OSD heartbeats on back (longest 1064.667ms) (OSD_SLOW_PING_TIME_BACK)
 cluster [WRN] Health check failed: Slow OSD heartbeats on front (longest 1244.808ms) (OSD_SLOW_PING_TIME_FRONT)
 cluster [WRN] Health check update: Slow OSD heartbeats on back (longest 2670.215ms) (OSD_SLOW_PING_TIME_BACK)
 cluster [WRN] Health check update: Slow OSD heartbeats on front (longest 1744.936ms) (OSD_SLOW_PING_TIME_FRONT)
 cluster [WRN] Health check update: Slow OSD heartbeats on back (longest 2889.051ms) (OSD_SLOW_PING_TIME_BACK)
 cluster [WRN] Health check update: Slow OSD heartbeats on front (longest 2889.027ms) (OSD_SLOW_PING_TIME_FRONT)
 cluster [WRN] Health check update: Slow OSD heartbeats on back (longest 1872.833ms) (OSD_SLOW_PING_TIME_BACK)
 cluster [WRN] Health check update: Slow OSD heartbeats on front (longest 2481.821ms) (OSD_SLOW_PING_TIME_FRONT)

It started as a VM or two were migrated from one node to another.
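
For anyone who wants to track how bad these warnings get over time, here is a minimal sketch that pulls the "longest" value per interface out of cluster log lines like the ones above. The function name `worst_heartbeats` and the regex are my own illustration, not part of any Ceph tooling, and the pattern only assumes the log format shown in this post.

```python
import re

# Matches cluster log lines such as:
#   cluster [WRN] Health check update: Slow OSD heartbeats on back (longest 2670.215ms) (OSD_SLOW_PING_TIME_BACK)
HEARTBEAT_RE = re.compile(
    r"Slow OSD heartbeats on (?P<side>front|back) \(longest (?P<ms>\d+(?:\.\d+)?)ms\)"
)


def worst_heartbeats(lines):
    """Return the largest 'longest' value seen per interface, in milliseconds."""
    worst = {}
    for line in lines:
        m = HEARTBEAT_RE.search(line)
        if m is None:
            continue  # not a slow-heartbeat line
        side = m.group("side")
        ms = float(m.group("ms"))
        worst[side] = max(ms, worst.get(side, 0.0))
    return worst


# The log lines quoted in this post.
sample = [
    "cluster [WRN] Health check failed: Slow OSD heartbeats on back (longest 1064.667ms) (OSD_SLOW_PING_TIME_BACK)",
    "cluster [WRN] Health check failed: Slow OSD heartbeats on front (longest 1244.808ms) (OSD_SLOW_PING_TIME_FRONT)",
    "cluster [WRN] Health check update: Slow OSD heartbeats on back (longest 2670.215ms) (OSD_SLOW_PING_TIME_BACK)",
    "cluster [WRN] Health check update: Slow OSD heartbeats on front (longest 1744.936ms) (OSD_SLOW_PING_TIME_FRONT)",
    "cluster [WRN] Health check update: Slow OSD heartbeats on back (longest 2889.051ms) (OSD_SLOW_PING_TIME_BACK)",
    "cluster [WRN] Health check update: Slow OSD heartbeats on front (longest 2889.027ms) (OSD_SLOW_PING_TIME_FRONT)",
    "cluster [WRN] Health check update: Slow OSD heartbeats on back (longest 1872.833ms) (OSD_SLOW_PING_TIME_BACK)",
    "cluster [WRN] Health check update: Slow OSD heartbeats on front (longest 2481.821ms) (OSD_SLOW_PING_TIME_FRONT)",
]

print(worst_heartbeats(sample))
```

Feeding it a longer stretch of the cluster log would show whether the spikes line up with events such as VM migrations.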

gk_emmo · Aug 7, 2023

We started to see the same on a 7-node cluster after upgrading to the newest version: kernel 6.2.16-4-pve, Ceph 17.2.6. We never saw this before, and there were no other changes in the cluster. 60 SSDs via 2x10G fiber to switches.

Code:

 [WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 4854.443ms)
 Slow OSD heartbeats on back from osd.50 [] to osd.47 [] 4854.443 msec
 [WRN] OSD_SLOW_PING_TIME_FRONT: Slow OSD heartbeats on front (longest 4579.349ms)
 Slow OSD heartbeats on front from osd.50 [] to osd.47 [] 4579.349 msec

Pings over both Ceph networks are under 0.1 ms between nodes.

Has anybody sorted this out and found the root cause?
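
Since the detail lines name the OSD pair involved, one way to narrow things down is to count which OSDs keep appearing in them. The following is only a rough sketch under the assumption that the detail lines always look like the ones quoted above; `slow_osd_counts` and the regex are hypothetical helpers, not Ceph commands.

```python
import re
from collections import Counter

# Matches detail lines such as:
#   Slow OSD heartbeats on back from osd.50 [] to osd.47 [] 4854.443 msec
DETAIL_RE = re.compile(
    r"Slow OSD heartbeats on (?P<side>front|back) "
    r"from (?P<src>osd\.\d+) \[[^\]]*\] to (?P<dst>osd\.\d+) \[[^\]]*\] "
    r"(?P<ms>\d+(?:\.\d+)?) msec"
)


def slow_osd_counts(lines):
    """Count how often each OSD appears as either end of a slow heartbeat."""
    counts = Counter()
    for line in lines:
        m = DETAIL_RE.search(line)
        if m:
            counts[m.group("src")] += 1
            counts[m.group("dst")] += 1
    return counts


# The health output quoted in this post; summary lines are skipped by the regex.
detail = [
    "[WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 4854.443ms)",
    "Slow OSD heartbeats on back from osd.50 [] to osd.47 [] 4854.443 msec",
    "[WRN] OSD_SLOW_PING_TIME_FRONT: Slow OSD heartbeats on front (longest 4579.349ms)",
    "Slow OSD heartbeats on front from osd.50 [] to osd.47 [] 4579.349 msec",
]

for osd, n in slow_osd_counts(detail).most_common():
    print(osd, n)
```

If the same one or two OSDs dominate the counts across many warnings, that might point at a specific node or link rather than a cluster-wide problem, though it proves nothing on its own.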

midsize_erp · Aug 8, 2023

I don't believe we ever got root cause. Remediation came via upgrading to a kernel that didn't have the issue.

gk_emmo · Aug 8, 2023

midsize_erp said:
I don't believe we ever got root cause. Remediation came via upgrading to a kernel that didn't have the issue.

The weird thing is that the issue went away and sorted itself out. There was no change in config or kernel.
