CEPH - osd problems every night

andersh

I have a 5 node cluster running Proxmox 8.0.3. It has been running fine for several months without any problems.

After rebooting 3 of the nodes last week, I've started to have problems with Ceph. Every night at around 02:05 (00:00 UTC) I see loads of error messages on all OSDs, and VMs lose disk access and become unresponsive. This goes on for around 30 minutes, and then the system is back to normal again.

I'm unable to figure out if this is an external (network) problem, or if there is something running on the Proxmox servers that is triggering this behaviour. Any suggestions are appreciated.

The error messages we see are typically a continuous stream of "heartbeat checks / no reply" and "get_health_metrics / slow ops" for all OSDs on each cluster node. Out of the 3 nodes with OSDs, I only see "no reply" from 2 of them.

Examples:

2023-10-10T02:04:08.152198+02:00 pve-p3-oa68 ceph-osd[2965264]: 2023-10-10T02:04:08.145+0200 7fb39115c6c0 -1 osd.5 13130 heartbeat_check: no reply from 10.250.0.82:6844 osd.4 since back 2023-10-10T02:03:41.944892+0200 front 2023-10-10T02:03:41.944860+0200 (oldest deadline 2023-10-10T02:04:07.844962+0200)
2023-10-10T02:04:09.122869+02:00 pve-p3-oa68 ceph-osd[2965264]: 2023-10-10T02:04:09.117+0200 7fb39115c6c0 -1 osd.5 13130 heartbeat_check: no reply from 10.250.0.82:6844 osd.4 since back 2023-10-10T02:03:41.944892+0200 front 2023-10-10T02:03:41.944860+0200 (oldest deadline 2023-10-10T02:04:07.844962+0200)

2023-10-10T02:04:34.143328+02:00 pve-p3-oa68 ceph-osd[2967036]: 2023-10-10T02:04:34.137+0200 7f3d6d9cf6c0 -1 osd.4 13130 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.43154600.0:944725 2.48 2:13168fca:::rbd_data.5cef5210d8ae99.000000000000097e:head [write 3940352~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e13130)
2023-10-10T02:04:34.196768+02:00 pve-p3-oa68 ceph-osd[2965264]: 2023-10-10T02:04:34.193+0200 7fb39115c6c0 -1 osd.5 13130 heartbeat_check: no reply from 10.250.0.82:6844 osd.4 since back 2023-10-10T02:03:41.944892+0200 front 2023-10-10T02:03:41.944860+0200 (oldest deadline 2023-10-10T02:04:07.844962+0200)

2023-10-10T02:04:42.164846+02:00 pve-p3-oa68 ceph-osd[2967036]: 2023-10-10T02:04:42.161+0200 7f3d6d9cf6c0 -1 osd.4 13130 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.43154600.0:944725 2.48 2:13168fca:::rbd_data.5cef5210d8ae99.000000000000097e:head [write 3940352~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e13130)
2023-10-10T02:04:43.205803+02:00 pve-p3-oa68 ceph-osd[2967036]: 2023-10-10T02:04:43.201+0200 7f3d6d9cf6c0 -1 osd.4 13130 get_health_metrics reporting 4 slow ops, oldest is osd_op(client.43154600.0:944725 2.48 2:13168fca:::rbd_data.5cef5210d8ae99.000000000000097e:head [write 3940352~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e13130)
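
For reference, here is a rough way to tally which peer the "no reply" lines point at, in case a single OSD or host turns out to be the common factor (just a sketch; it assumes the messages end up in /var/log/syslog as in the excerpts above):

```python
#!/usr/bin/env python3
"""Tally which peer OSDs the 'heartbeat_check: no reply' messages point at.

Sketch only: assumes the OSD messages land in /var/log/syslog on the node,
as in the excerpts above.
"""
import re
from collections import Counter

LOG = "/var/log/syslog"  # assumption: adjust to wherever ceph-osd logs on your node

pattern = re.compile(r"heartbeat_check: no reply from (\S+) (osd\.\d+)")
counts = Counter()

with open(LOG, errors="replace") as fh:
    for line in fh:
        m = pattern.search(line)
        if m:
            addr, osd = m.groups()
            counts[(osd, addr)] += 1

for (osd, addr), n in counts.most_common():
    print(f"{osd:8s} {addr:22s} {n}")
```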
 
Bumping this. Any pointers as to why this is happening at exactly the same time (appx. 02:00) every night would be appreciated.
 
Hello,

What kind of latency is there between nodes during the day? What about at the time they go down?

Could you please share with us both the contents of `/etc/pve/corosync.conf` and the output of `pvecm status`?
 
What kind of latency is there between nodes during the day? What about at the time they go down?

Could you please share with us both the contents of `/etc/pve/corosync.conf` and the output of `pvecm status`?
Normal latency on all interfaces is around 0.1 ms; I have not yet stayed up at night to see what it might be when the problem starts.

Also, the nodes do not go down. It is only Ceph that experiences these problems.

VMs with disks on Ceph behave badly while the problem is occurring; moving their disks to local storage makes them behave nicely even during these periods.

Corosync is running on two separate interfaces connected to two separate switches (different from the ones used by Ceph); I have not seen any messages indicating that we are not quorate.
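
Rather than staying up, my plan is to leave a small latency logger running overnight on one of the nodes, roughly like this (a sketch only; the peer addresses below are assumptions based on the Ceph-network IPs visible in the logs):

```python
#!/usr/bin/env python3
"""Log round-trip latency to the other Ceph nodes once a minute, so the night-time
behaviour can be compared against the ~0.1 ms daytime baseline.

Sketch only: the peer IPs are assumptions; adjust them and leave it running
overnight (e.g. under nohup or a systemd unit).
"""
import subprocess, time, datetime

PEERS = ["10.250.0.80", "10.250.0.82", "10.250.0.83"]  # assumed Ceph-network addresses

def ping_ms(host: str) -> str:
    # one ICMP echo, 2 s timeout; returns the rtt or "timeout"
    try:
        out = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                             capture_output=True, text=True, timeout=5)
        for line in out.stdout.splitlines():
            if "time=" in line:
                return line.split("time=")[1].split()[0] + " ms"
    except subprocess.TimeoutExpired:
        pass
    return "timeout"

while True:
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    print(stamp, " ".join(f"{p}={ping_ms(p)}" for p in PEERS), flush=True)
    time.sleep(60)
```

If the round-trip times stay at ~0.1 ms while the heartbeat errors appear, that would point away from the network.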

pvecm status (I have currently shut down one node; the problem is the same regardless):

Cluster information
-------------------
Name: Production
Config Version: 9
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Wed Oct 18 12:24:33 2023
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000003
Ring ID: 2.2d6
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 4
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 10.254.0.80
0x00000003 1 10.254.0.82 (local)
0x00000004 1 10.254.0.83
0x00000005 1 10.254.0.84




corosync.conf attached.
 

Attachments

  • corosync.conf.txt
Softly bumping this again. We're still seeing this problem, and any information as to what could cause ceph/osd problems at regular intervals every night would be appreciated.
 
Do you create backups around that time? Do VMs create application backups around that time? Are you monitoring your disk/network activity? [1]

There's a lot of missing information here, but I bet that some drive in your Ceph cluster is failing/misbehaving (maybe osd.4 on host 10.250.0.82; check the logs from other days to see if it's always the same drive) and some access pattern makes it become too slow to cope with the load. This usually happens during backups.


[1] https://pve.proxmox.com/wiki/External_Metric_Server and https://pve.proxmox.com//pve-docs/chapter-sysadmin.html#external_metric_server
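
If setting up a full external metric server takes a while, even a rough per-disk sampler on the OSD nodes can show whether a particular drive stalls when the errors start. A minimal sketch, assuming the standard /proc/diskstats layout and sd* device naming:

```python
#!/usr/bin/env python3
"""Rough per-disk 'await' sampler built on /proc/diskstats. Prints the average I/O
wait per device every 10 seconds; running it on the OSD nodes overnight should show
which drives stall when the errors start.

Sketch only: field offsets follow the standard /proc/diskstats layout; the sd*
name filter is an assumption, adjust for NVMe etc.
"""
import time

def snapshot():
    stats = {}
    with open("/proc/diskstats") as fh:
        for line in fh:
            f = line.split()
            name = f[2]
            if not name.startswith("sd") or name[-1].isdigit():
                continue  # whole sd* disks only, skip partitions
            reads, read_ms = int(f[3]), int(f[6])
            writes, write_ms = int(f[7]), int(f[10])
            stats[name] = (reads + writes, read_ms + write_ms)
    return stats

prev = snapshot()
while True:
    time.sleep(10)
    cur = snapshot()
    out = []
    for dev, (ios, ms) in cur.items():
        d_ios = ios - prev.get(dev, (0, 0))[0]
        d_ms = ms - prev.get(dev, (0, 0))[1]
        await_ms = d_ms / d_ios if d_ios else 0.0
        out.append(f"{dev}:{await_ms:.1f}ms")
    print(time.strftime("%H:%M:%S"), " ".join(out), flush=True)
    prev = cur
```

Anything that jumps from a few milliseconds to hundreds of milliseconds at 02:05 is a good candidate.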
 
This is not related to backups; we are backing up to PBS 4 times a day, and the last backup finishes approximately 90 minutes before the problems start.

As far as I know, there is nothing in particular happening at this time, but it occurs, regular as clockwork, a few minutes past 2 AM every night.
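
To double-check that nothing scheduled locally lines up with that time, I'm dumping the systemd timers and cron entries on each node with a small sketch like this (it assumes the usual Debian/Proxmox cron locations):

```python
#!/usr/bin/env python3
"""Dump everything that is scheduled to run on this node, to rule out a local job
firing around 00:00 UTC. Sketch only: assumes a standard Debian/Proxmox layout for
the cron directories.
"""
import glob, subprocess

print("=== systemd timers ===")
subprocess.run(["systemctl", "list-timers", "--all"])

print("\n=== cron entries ===")
for path in ["/etc/crontab"] + sorted(glob.glob("/etc/cron.d/*")):
    try:
        with open(path) as fh:
            print(f"--- {path}")
            print(fh.read())
    except OSError:
        pass
```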

I will try and enable external metrics to see if anything stands out.

As I wrote, there are identical error messages for all OSDs, not only a single OSD. But I will try to stop osd.4 and see if that changes anything.
 
Hello,
Have you solved it? I've encountered a similar problem and don't know how to deal with it.
 
Hello,
Have you solved it? I've encountered a similar problem and don't know how to deal with it.
Not really. I've rebuilt the cluster from scratch, reinstalling Proxmox, and still see the same behaviour. A little after midnight (UTC) the problem appears.

The only progress I've made is that this is related to some of my SSDs, but not all of them. I've got approximately 20 Sandisk Cloudspeed Eco Gen II (SDLF1CRR-019T-1HA1) 2 TB drives. I've isolated 3-4 disks that are running just fine; when I add others to the cluster the problem starts again. Running SMART tests on them, they all seem to be fine.

I've also created a ZFS file system with some of the problematic disks, and that is running just fine, so it would seem that it is only Ceph that does not like these disks.
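
Since a plain SMART self-test passing doesn't say much about latency spikes, I'm also pulling a few SMART attributes from every disk so the "good" and "bad" drives can be compared side by side. A sketch (it assumes smartmontools is installed; attribute names vary by firmware, so the filter below is only a guess):

```python
#!/usr/bin/env python3
"""Pull a few SMART attributes from every sd* disk so the 'good' and 'bad' Sandisk
drives can be compared side by side.

Sketch only: requires smartmontools; the attribute-name substrings below are
common ones and may need adjusting for the CloudSpeed firmware.
"""
import glob, subprocess

WATCH = ("Reallocated", "Wear", "CRC", "Erase_Fail", "Temperature")

for dev in sorted(glob.glob("/dev/sd?")):
    out = subprocess.run(["smartctl", "-A", dev], capture_output=True, text=True)
    picked = [l.strip() for l in out.stdout.splitlines() if any(w in l for w in WATCH)]
    print(f"== {dev}")
    print("\n".join(picked) if picked else "  (no matching attributes)")
```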
 
The SSDs don't sound so bad, then.
However, my experience with SanDisk is that the SMART values can shift over time and therefore no longer give a clear picture. Take a look at the SMART values and consider whether they are plausible or not.

But since you say it is always around midnight, what happens at that time? Don't just think about direct causes but also indirect ones; this can certainly be triggered by other systems, especially in shared environments.

What kind of hardware do you use in general? Can you make sure that the CPU is not overloaded at that time, for example, or that the link is not saturated?
 
Not really. I've rebuilt the cluster from scratch, reinstalling Proxmox, and still see the same behaviour. A little after midnight (UTC) the problem appears.

The only progress I've made is that this is related to some of my SSDs, but not all of them. I've got approximately 20 Sandisk Cloudspeed Eco Gen II (SDLF1CRR-019T-1HA1) 2 TB drives. I've isolated 3-4 disks that are running just fine; when I add others to the cluster the problem starts again. Running SMART tests on them, they all seem to be fine.

I've also created a ZFS file system with some of the problematic disks, and that is running just fine, so it would seem that it is only Ceph that does not like these disks.

Found this on Google and thought I'd test it since I use a similar RAID card.

https://patchwork.kernel.org/projec...it-send-email-newtongao@tencent.com/#23399701
 
The SSDs don't sound so bad, then.
However, my experience with SanDisk is that the SMART values can shift over time and therefore no longer give a clear picture. Take a look at the SMART values and consider whether they are plausible or not.

But since you say it is always around midnight, what happens at that time? Don't just think about direct causes but also indirect ones; this can certainly be triggered by other systems, especially in shared environments.

What kind of hardware do you use in general? Can you make sure that the CPU is not overloaded at that time, for example, or that the link is not saturated?
Thanks for the suggestions. I've looked for things happening at this time, but can't really find anything.
Ceph is running in its own VLAN, but I have other systems (including another, functional, Ceph cluster) running on other separate VLANs on the same switches. The only thing I see having problems is Ceph, and it always starts between 00:10 and 00:15 and seems to last until all affected disks/OSDs have calmed down again (anything from 3 to 30 minutes, depending on the number of drives in the cluster).
These are Dell R730 servers with ample CPU and memory.
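
Given the fixed 00:10-00:15 window (and the RAID-card patch linked above), my next step is to pull the kernel log for exactly that window on each node and grep for controller or disk resets. A sketch, with the time window and keywords as assumptions:

```python
#!/usr/bin/env python3
"""Pull kernel messages for last night's problem window and grep for controller/disk
resets, since the timing points at something below Ceph.

Sketch only: the window and keyword list are assumptions; adjust the times for
local time vs UTC on your nodes.
"""
import subprocess, datetime

today = datetime.date.today().isoformat()
since = f"{today} 00:05:00"   # assumed window start
until = f"{today} 00:45:00"   # assumed window end

out = subprocess.run(["journalctl", "-k", "--since", since, "--until", until],
                     capture_output=True, text=True)
KEYWORDS = ("reset", "megaraid", "mpt3sas", "timeout", "I/O error", "task abort")
for line in out.stdout.splitlines():
    if any(k.lower() in line.lower() for k in KEYWORDS):
        print(line)
```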
 
