CEPH - osd problems every night

andersh

I have a 5 node cluster running Proxmox 8.0.3. It has been running fine for several months without any problems.

After rebooting 3 of the nodes last week, I've started to have problems with Ceph. Every night at around 02:05 (00:00 UTC) I see loads of error messages on all OSDs, and VMs lose disk access and become unresponsive. This goes on for around 30 minutes, and then the system is back to normal again.

I'm unable to figure out if this is an external (network) problem, or if there is something running on the Proxmox servers that is triggering this behaviour. Any suggestions are appreciated.
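One thing worth ruling out first is scheduled jobs on the nodes themselves; a minimal check with standard systemd/cron tooling (nothing Proxmox-specific assumed) could look like this:

Code:
# List all systemd timers and when they will fire next
systemctl list-timers --all

# Check root's crontab and the system-wide cron directories
crontab -l -u root
ls -l /etc/cron.d /etc/cron.daily /etc/cron.hourly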

The error messages we see are typically a continuous stream of "heartbeat_check / no reply" and "get_health_metrics / slow ops" for all OSDs on each cluster node. Out of the 3 nodes with OSDs, I only see "no reply" from 2 of them.

Examples:

2023-10-10T02:04:08.152198+02:00 pve-p3-oa68 ceph-osd[2965264]: 2023-10-10T02:04:08.145+0200 7fb39115c6c0 -1 osd.5 13130 heartbeat_check: no reply from 10.250.0.82:6844 osd.4 since back 2023-10-10T02:03:41.944892+0200 front 2023-10-10T02:03:41.944860+0200 (oldest deadline 2023-10-10T02:04:07.844962+0200)
2023-10-10T02:04:09.122869+02:00 pve-p3-oa68 ceph-osd[2965264]: 2023-10-10T02:04:09.117+0200 7fb39115c6c0 -1 osd.5 13130 heartbeat_check: no reply from 10.250.0.82:6844 osd.4 since back 2023-10-10T02:03:41.944892+0200 front 2023-10-10T02:03:41.944860+0200 (oldest deadline 2023-10-10T02:04:07.844962+0200)

2023-10-10T02:04:34.143328+02:00 pve-p3-oa68 ceph-osd[2967036]: 2023-10-10T02:04:34.137+0200 7f3d6d9cf6c0 -1 osd.4 13130 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.43154600.0:944725 2.48 2:13168fca:::rbd_data.5cef5210d8ae99.000000000000097e:head [write 3940352~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e13130)
2023-10-10T02:04:34.196768+02:00 pve-p3-oa68 ceph-osd[2965264]: 2023-10-10T02:04:34.193+0200 7fb39115c6c0 -1 osd.5 13130 heartbeat_check: no reply from 10.250.0.82:6844 osd.4 since back 2023-10-10T02:03:41.944892+0200 front 2023-10-10T02:03:41.944860+0200 (oldest deadline 2023-10-10T02:04:07.844962+0200)

2023-10-10T02:04:42.164846+02:00 pve-p3-oa68 ceph-osd[2967036]: 2023-10-10T02:04:42.161+0200 7f3d6d9cf6c0 -1 osd.4 13130 get_health_metrics reporting 3 slow ops, oldest is osd_op(client.43154600.0:944725 2.48 2:13168fca:::rbd_data.5cef5210d8ae99.000000000000097e:head [write 3940352~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e13130)
2023-10-10T02:04:43.205803+02:00 pve-p3-oa68 ceph-osd[2967036]: 2023-10-10T02:04:43.201+0200 7f3d6d9cf6c0 -1 osd.4 13130 get_health_metrics reporting 4 slow ops, oldest is osd_op(client.43154600.0:944725 2.48 2:13168fca:::rbd_data.5cef5210d8ae99.000000000000097e:head [write 3940352~4096 in=4096b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e13130)
 
Bumping this. Any pointers as to why this is happening at exactly the same time (appx. 02:00) every night would be appreciated.
 
Hello,

What kind of latency is there between nodes during the day? What about at the time they go down?

Could you please share us both the contents of `/etc/pve/corosync.conf` and the output of `pvecm status`?
 
What kind of latency is there between nodes during the day? What about at the time they go down?

Could you please share us both the contents of `/etc/pve/corosync.conf` and the output of `pvecm status`?
Normal latency on all interfaces is around 0.1 ms; I have not yet stayed up at night to see what it might be when the problem starts.
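For reference, rather than staying up, the latency around the problem window could be captured with a simple overnight logger; a rough sketch, assuming the Ceph peer address 10.250.0.82 seen in the OSD logs (adjust to your own addresses):

Code:
#!/usr/bin/env bash
# Ping a Ceph peer once per second and log each reply with a Unix
# timestamp (-D) so the 02:00-02:30 window can be inspected afterwards.
ping -D -i 1 10.250.0.82 >> /var/log/ceph-peer-latency.log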

Also, the nodes do not go down. It is only Ceph that experiences these problems.

VMs with disks on Ceph behave badly during these problem periods; moving their disks to local storage makes them behave nicely even when the problems occur.

Corosync is running on two separate interfaces connected to two separate switches (different from the Ceph switches); I have not seen any messages indicating that we are not quorate.

pvecm status (I have currently shut down one node; the problem is the same regardless):

Cluster information
-------------------
Name: Production
Config Version: 9
Transport: knet
Secure auth: on

Quorum information
------------------
Date: Wed Oct 18 12:24:33 2023
Quorum provider: corosync_votequorum
Nodes: 4
Node ID: 0x00000003
Ring ID: 2.2d6
Quorate: Yes

Votequorum information
----------------------
Expected votes: 5
Highest expected: 5
Total votes: 4
Quorum: 3
Flags: Quorate

Membership information
----------------------
Nodeid Votes Name
0x00000002 1 10.254.0.80
0x00000003 1 10.254.0.82 (local)
0x00000004 1 10.254.0.83
0x00000005 1 10.254.0.84




corosync.conf attached.
 

Softly bumping this again. We're still seeing this problem, and any information as to what could cause ceph/osd problems at regular intervals every night would be appreciated.
 
Do you create backups around that time? Do VMs create application backups around that time? Are you monitoring your disk/network activity? [1]

There's a lot of missing information here, but I bet that some drive in your Ceph cluster is failing/misbehaving (maybe osd.4 on host 10.250.0.82; check the logs of other days to see if it's always the same drive) and some access pattern makes it too slow to cope with the load. This usually happens during backups.


[1] https://pve.proxmox.com/wiki/External_Metric_Server and https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#external_metric_server
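If setting up a full external metric server is too much for a first pass, short-interval sysstat sampling around the problem window already shows whether a single drive is saturating; a minimal sketch, assuming the sysstat package is installed:

Code:
# Sample extended disk statistics every 10 seconds for one hour,
# started (e.g. from cron) shortly before the problems usually begin
iostat -x 10 360 > /var/log/iostat-night.log

# Afterwards, look for devices at ~100% util or with very high await
less /var/log/iostat-night.log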
 
This is not related to backups; we are backing up to PBS 4 times a day, and the last backup finishes appx. 90 minutes before the problems start.

As far as I know, there is nothing in particular happening at this time, but it occurs, regular as clockwork, a few minutes past 2 AM every night.

I will try and enable external metrics to see if anything stands out.

As I wrote, there are identical error messages for all OSDs, not only a single OSD. But I will try to stop osd.4 and see if that changes anything.
 
Hello,
Have you solved it? I've encountered a similar problem and don't know how to deal with it.
 
Hello,
Have you solved it? I've encountered a similar problem and don't know how to deal with it.
Not really. I've rebuilt the cluster from scratch, reinstalling Proxmox, and still see the same behaviour. A little after midnight (UTC) the problem appears.

The only progress I've made is that this is related to some of my SSDs, but not all of them. I've got appx. 20 SanDisk CloudSpeed Eco Gen II (SDLF1CRR-019T-1HA1) 2TB drives. I've isolated 3-4 disks that are running just fine; when I add others to the cluster the problem starts again. Running SMART tests on them, they all seem to be fine.

I've also created a ZFS file system with some of the problematic disks, and that is running just fine, so it would seem that it is only Ceph that does not like these disks.
 
The SSDs don't seem to be that bad after all, then.
However, my experience with SanDisk is that the SMART values can fluctuate quite a bit and therefore no longer give a clear picture. Take a look at the SMART values and consider whether they are plausible.

But if you say that it is always around midnight, what happens at that time? Don't just think about direct causes, but also indirect ones. This can certainly be triggered by other systems, especially in shared environments.

What kind of hardware do you generally use? Can you make sure that the CPU is not overloaded at that time, for example, or that the link is not saturated?
 
Not really. I've rebuilt the cluster from scratch, reinstalling Proxmox, and still see the same behaviour. A little after midnight (UTC) the problem appears.

The only progress I've made is that this is related to some of my SSDs, but not all of them. I've got appx. 20 SanDisk CloudSpeed Eco Gen II (SDLF1CRR-019T-1HA1) 2TB drives. I've isolated 3-4 disks that are running just fine; when I add others to the cluster the problem starts again. Running SMART tests on them, they all seem to be fine.

I've also created a ZFS file system with some of the problematic disks, and that is running just fine, so it would seem that it is only Ceph that does not like these disks.

Found this on Google and thought I'd test it since I use a similar RAID card.

https://patchwork.kernel.org/projec...it-send-email-newtongao@tencent.com/#23399701
 
The SSDs don't seem to be that bad after all, then.
However, my experience with SanDisk is that the SMART values can fluctuate quite a bit and therefore no longer give a clear picture. Take a look at the SMART values and consider whether they are plausible.

But if you say that it is always around midnight, what happens at that time? Don't just think about direct causes, but also indirect ones. This can certainly be triggered by other systems, especially in shared environments.

What kind of hardware do you generally use? Can you make sure that the CPU is not overloaded at that time, for example, or that the link is not saturated?
Thanks for the suggestions. I've looked for things happening at this time, but can't really find anything.
Ceph is running in its own VLAN, but I have other systems (including another Ceph cluster that works fine) running on separate VLANs on the same switches. The only thing I see having problems is this Ceph cluster. It always starts between 00:10 and 00:15 and seems to last until all affected disks/OSDs have calmed down again (anything from 3 to 30 minutes, depending on the number of drives in the cluster).
These are Dell R730 servers with ample CPU and memory.
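For reference, since only Ceph is affected, the kernel log of the OSD nodes during that window may also be worth a look; a minimal check with plain journalctl (no assumptions beyond a systemd journal):

Code:
# Kernel messages between 00:00 and 01:00 local time on an OSD node
journalctl -k --since "00:00" --until "01:00" | grep -iE 'reset|timeout|abort|megaraid|sd '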
 
Hello everyone.
A little update for anyone who had the same issue and came here from Google.
We had the same issue with a new 3 node (30 OSD) cluster.
After 3 weeks of running without any issues, some OSDs go down at around 00:00 UTC every night, with the following message in dmesg:
Code:
sd 0:0:23:0: Power-on or device reset occurred

The reason was the Ceph device SMART scraping, which is enabled by default and runs every 24 hours.
This issue can easily be reproduced by issuing SMART scraping in bulk:
Code:
#!/usr/bin/env bash

/usr/sbin/smartctl -x --json=o /dev/sda
/usr/sbin/smartctl -x --json=o /dev/sdb
/usr/sbin/smartctl -x --json=o /dev/sdc
/usr/sbin/smartctl -x --json=o /dev/sdd
/usr/sbin/smartctl -x --json=o /dev/sde
/usr/sbin/smartctl -x --json=o /dev/sdf
/usr/sbin/smartctl -x --json=o /dev/sdj
/usr/sbin/smartctl -x --json=o /dev/sdh
/usr/sbin/smartctl -x --json=o /dev/sdi
/usr/sbin/smartctl -x --json=o /dev/sdj

Sometimes that leads to an I/O pause on several disks and device resets.
In our case, the problem was observed with the following hardware:
Code:
$ lspci | grep -i sas
1c:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)

$ smartctl -i /dev/sda
<...>
=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     SAMSUNG MZ7LH1T9HMLT-00005


You can check the current scrape config with:
Code:
$ ceph config get mgr mgr/devicehealth/enable_monitoring
true

$ ceph config get mgr mgr/devicehealth/scrape_frequency
86400

To disable scraping, simply run:
Code:
$ ceph device monitoring off

$ ceph config get mgr mgr/devicehealth/enable_monitoring
false
Since then, the cluster has been running stably without any problems.
 
To disable scraping, simply run:
Code:
$ ceph device monitoring off

$ ceph config get mgr mgr/devicehealth/enable_monitoring
false
Since then, the cluster has been running stably without any problems.

It's great that you could resolve your issues. I'm curious though: Isn't it bad to disable monitoring because then you won't notice potential issues with the disk? Or am I missing something?
 
It's great that you could resolve your issues. I'm curious though: Isn't it bad to disable monitoring because then you won't notice potential issues with the disk? Or am I missing something?
We use centralized monitoring for that, including Ceph state, SMART, etc.
So, in our scenario, disabling built-in scraping is acceptable.
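For anyone who would rather keep the built-in device health monitoring, an alternative (not tested in this thread) would be to space the scrapes out instead of disabling them entirely, using the same config option shown above. Note that this only makes the disruption less frequent; it does not address the underlying controller/firmware behaviour:

Code:
# Scrape SMART data once a week instead of every 24 hours
$ ceph config set mgr mgr/devicehealth/scrape_frequency 604800

$ ceph config get mgr mgr/devicehealth/scrape_frequency
604800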
 
Hello everyone.
A little update for anyone who had the same issue and came here from Google.
We had the same issue with a new 3 node (30 OSD) cluster.
After 3 weeks of running without any issues, some OSDs go down at around 00:00 UTC every night, with the following message in dmesg:
Code:
sd 0:0:23:0: Power-on or device reset occurred
Thanks for the detailed explanation. I can confirm that this matches my initial problem.

We observed this on two Dell R730s. The backplane and RAID controller were already on the latest firmware, but after upgrading the BIOS to the latest version the problem seems to have gone away.
 
Code:
$ lspci | grep -i sas
1c:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)

@dllex By the way, is the RAID controller set to HBA mode, or are you mapping single disks out of the controller?
 
@andersh
It's in HBA mode of course.
I see that this problem occurs in either mode, but it is possible to work around it - at least on our Dell servers.

What I see is that SMART is working fine initially after adding a new disk.
After I create an OSD on the disk, SMART will hang.
If I then reboot the server into the Dell Lifecycle Controller and just exit and reboot again, SMART works fine on the disk again.
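For reference, a quick way to check whether smartctl hangs on a given disk without blocking the shell indefinitely (coreutils timeout; /dev/sdX is a placeholder):

Code:
# Give smartctl 30 seconds; exit code 124 means it was killed by timeout
timeout 30 /usr/sbin/smartctl -x --json=o /dev/sdX ; echo "exit code: $?"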
 
