I have a 5-node cluster running Proxmox 8.0.3. It has been running fine for several months without any problems.
After rebooting 3 of the nodes last week, I've started to have problems with Ceph. Every night at around 02:05 local time (around 00:00 UTC) I see loads of error messages on all OSDs, and VMs lose disk access and become unresponsive. This goes on for around 30 minutes, and then the system is back to normal again.
I'm unable to figure out if this is an external (network) problem, or if there is something running on the Proxmox servers that is triggering this behaviour. Any suggestions are appreciated.
The error messages we see are typically a continuous stream of "heartbeat_check / no reply" and "get_health_metrics / slow ops" for all OSDs on each cluster node. Out of the 3 nodes with OSDs, I only see "no reply" from 2 of them.
Examples:
2023-10-10T02:04:08.152198+02:00 pve-p3-oa68 ceph-osd[2965264]: 2023-10-10T02:04:08.145+0200 7fb39115c6c0 -1 osd.5 13130 heartbeat_check: no reply from 10.250.0.82:6844 osd.4 since back 2023-10-10T02:03:41.944892+0200 front 2023-10-10T02:03:41.944860+0200 (oldest deadline 2023-10-10T02:04:07.844962+0200)
2023-10-10T02:04:09.122869+02:00 pve-p3-oa68 ceph-osd[2965264]: 2023-10-10T02:04:09.117+0200 7fb39115c6c0 -1 osd.5 13130 heartbeat_check: no reply from 10.250.0.82:6844 osd.4 since back 2023-10-10T02:03:41.944892+0200 front 2023-10-10T02:03:41.944860+0200 (oldest deadline 2023-10-10T02:04:07.844962+0200)
2023-10-10T02:04:34.143328+02:00 pve-p3-oa68 ceph-osd[2967036]: 2023-10-10T02:04:34.137+0200 7f3d6d9cf6c0 -1 osd.4 13130 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.43154600.0:944725 2.48 2:13168fca:::rbd_data.5cef5210d8ae99.000000000000097e:head [write 3940352~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e13130)
2023-10-10T02:04:34.196768+02:00 pve-p3-oa68 ceph-osd[2965264]: 2023-10-10T02:04:34.193+0200 7fb39115c6c0 -1 osd.5 13130 heartbeat_check: no reply from 10.250.0.82:6844 osd.4 since back 2023-10-10T02:03:41.944892+0200 front 2023-10-10T02:03:41.944860+0200 (oldest deadline 2023-10-10T02:04:07.844962+0200)
2023-10-10T02:04:42.164846+02:00 pve-p3-oa68 ceph-osd[2967036]: 2023-10-10T02:04:42.161+0200 7f3d6d9cf6c0 -1 osd.4 13130 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.43154600.0:944725 2.48 2:13168fca:::rbd_data.5cef5210d8ae99.000000000000097e:head [write 3940352~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e13130)
2023-10-10T02:04:43.205803+02:00 pve-p3-oa68 ceph-osd[2967036]: 2023-10-10T02:04:43.201+0200 7f3d6d9cf6c0 -1 osd.4 13130 get_health_metrics reporting 4 slow ops, oldest is osd_op(client.43154600.0:944725 2.48 2:13168fca:::rbd_data.5cef5210d8ae99.000000000000097e:head [write 3940352~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e13130)
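
To confirm that the outage window really lines up with 00:00 UTC every night (rather than drifting, which would point away from a fixed scheduled job), I'm thinking of binning the "no reply" lines per minute with a small script along these lines. This is just a rough sketch; the log path and the exact message format are assumptions based on the excerpts above:

#!/usr/bin/env python3
# Rough sketch: count Ceph OSD heartbeat failures per minute (in UTC)
# to see whether the nightly window starts exactly at midnight UTC.
import re
from collections import Counter
from datetime import datetime, timezone

LOG = "/var/log/syslog"   # assumption: adjust to wherever these OSD lines end up
PATTERN = re.compile(r"^(\S+) \S+ ceph-osd\[\d+\]:.*heartbeat_check: no reply")

buckets = Counter()
with open(LOG, errors="replace") as f:
    for line in f:
        m = PATTERN.match(line)
        if not m:
            continue
        # Timestamps look like 2023-10-10T02:04:08.152198+02:00 (local time with offset)
        ts = datetime.fromisoformat(m.group(1)).astimezone(timezone.utc)
        buckets[ts.strftime("%Y-%m-%d %H:%M UTC")] += 1

for minute, count in sorted(buckets.items()):
    print(f"{minute}  {count} 'no reply' events")

If the counts start at exactly the same minute every night, I'll focus on anything scheduled around midnight UTC (cron jobs, systemd timers, backup or replication jobs) on the nodes and on the network gear; if the start time wanders, that would point more towards an external/network cause.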