Write Errors in ZFS, but not in Ceph

bqq100

Member
Jun 8, 2021
A drive on one of my nodes is constantly throwing ZFS write errors on my rpool (it's a triple mirror, so I'm not currently worried about data loss). Based on the SMART self-test, the age of the drive, and the fact that the errors seem to have started when I was moving some cabling around, I'm fairly confident the issue is a bad SATA cable.

However, this drive also has a partition used as a Ceph OSD, and on that side I can't find any sign of a problem. I've searched the ceph-osd.#.log and ceph.log files and haven't found any errors. I'm pretty sure that if ZFS is seeing write errors, then Ceph is seeing them too. I trust that Ceph is handling things in the background, and I'm not worried about data loss, but if this happened on a drive that isn't in a ZFS pool, I would like to see the issue and act on it before it becomes a larger one.

Does anyone know of specific log entries or other places I can look to see whether there are write errors? Has anyone else run into Ceph being so good at self-repairing in the background that you aren't aware of minor issues until they become major ones?
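
For reference, this is roughly where I've been looking so far; the OSD id and device name below are just placeholders for my setup:

```bash
# Placeholders: adjust the OSD id and device to the node in question
OSD_ID=3
DEV=/dev/sdb

# What ZFS itself reports (read/write/checksum error counters per vdev)
zpool status -v rpool

# Kernel-level SATA/block errors often show up here even when Ceph stays quiet
dmesg -T | grep -iE 'ata|i/o error' | tail -n 50

# SMART health and error log for the suspect drive
smartctl -a "$DEV"

# Anything error-like in the OSD and cluster logs
grep -iE 'error|fail' /var/log/ceph/ceph-osd.${OSD_ID}.log | tail -n 50
grep -iE 'error|fail' /var/log/ceph/ceph.log | tail -n 50
```
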

Thanks!
 
What experiences has everyone else had with failed drives in Ceph? Does Ceph keep the OSD running until a complete failure, at which point the drive is marked as down/out? Are there any other places where failures might crop up before a complete failure?
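
In the meantime, these are the commands I've been assuming would surface a struggling OSD before it fails outright (osd.3 is just a placeholder for the OSD on the flaky drive):

```bash
# Placeholder OSD id
OSD_ID=3

# Overall cluster and OSD state (down/out OSDs, scrub errors, inconsistent PGs)
ceph health detail
ceph osd tree

# Per-OSD commit/apply latency; a struggling drive often stands out here
ceph osd perf

# Internal counters from the OSD daemon itself (run on the node hosting it)
ceph daemon osd.${OSD_ID} perf dump | less
```
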

Thanks!
 
Don't do this. Just don't. You will have dismal performance, and if a drive is having issues you will be impacted on the Ceph pool; you just haven't hit it yet.

How would a drive with a hardware issue behave much differently with or without another partition on it? You will still have throughput contention if both partitions are in use, but that is the case whether or not the drive is bad (I've already considered this, and for my use case the benefits of sharing the drive outweigh the performance hit).

In any case, I would gladly take dismal performance, or any other indication that Ceph knows there is a problem with the drive. Right now ZFS is showing an intermittent problem writing to the drive, while Ceph is reporting that everything is fine. Based on my experience with ZFS, I completely trust that something is actually wrong with the drive.

I am new to Ceph, so I don't yet have the confidence that it is just silently retrying these intermittent write failures until they succeed, and that it will kick the OSD out if this becomes more than a sporadic issue. That is why I was hoping to hear what long-time Ceph users have seen when drives do fail.
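
For anyone following along, this is the sort of thing I've been checking to see whether the cluster has ever flagged the OSD; the OSD id is a placeholder for my node:

```bash
# Placeholder OSD id
OSD_ID=3

# Current up/in state and weight for this OSD
ceph osd dump | grep "osd.${OSD_ID}"

# The cluster log records when the monitors mark an OSD down or out
grep -i "osd.${OSD_ID}" /var/log/ceph/ceph.log | grep -iE 'down|out|fail' | tail -n 20

# Deep scrubs are what surface latent read/checksum problems; list the PGs on
# this OSD and trigger a deep scrub on one if you want Ceph to actively verify the data
ceph pg ls-by-osd osd.${OSD_ID}
ceph pg deep-scrub <pgid>   # replace <pgid> with one of the PGs listed above
```
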
 
