Ceph OSD failures on good drives

sdutremble

I have a three-node, up-to-date (no-subscription) Proxmox cluster home lab with AMD 8-core CPUs, a RAID card (8x SATA, in JBOD mode) and 1 SSD for the system. The LAN is 1 Gbps on unmanaged switches. Each node is also a 4-OSD Ceph node with an isolated, redundant Ceph link made of 2 x 1 Gbps NICs bonded in round-robin mode, running over separate switches with STP active. Bandwidth on the bond is 1.7 Gbps and connectivity to all three Ceph nodes is fine. Each OSD is a WD Red 1TB 2.5" SATA drive, but my first node also has 3 other drives used for testing BTRFS (separate from Ceph) and 2 unused connected drives.
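
For context, the Ceph link is a plain Debian/Proxmox bond; the sketch below is roughly what such a balance-rr bond looks like in /etc/network/interfaces (interface names and the address are illustrative placeholders, not my exact values):

Code:
# illustrative sketch of the round-robin Ceph bond, not my exact config
auto bond0
iface bond0 inet static
        address 10.10.10.11
        netmask 255.255.255.0
        bond-slaves eth1 eth2
        bond-mode balance-rr
        bond-miimon 100
        # isolated Ceph network over separate switches with STP enabled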

I have 3 Ceph pools configured: data (size/min_size 2/2) is 25% full, pve (3/2) is 10% full and metadata (3/2) is minimal. I use the data pool for CephFS testing and the pve pool for RBD. PG and PGP counts are set equally to 512 for all pools. When all drives are Up/In and Ceph shows as healthy, access is acceptable: I can easily serve mp4 or mkv movies from CephFS in HD quality over the LAN, and the KVM VM functionality is fine.
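
If it helps, the pool settings can be confirmed from the CLI on any node, for example (shown for the data pool; the same applies to pve and metadata):

Code:
ceph osd pool get data size       # replica count (2 for the data pool)
ceph osd pool get data min_size   # minimum replicas needed to serve I/O
ceph osd pool get data pg_num     # 512 on all pools here
ceph df                           # per-pool usage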

To get Ceph stable again, I had to replace one OSD on node 1 with another unused drive connected to the same controller, and I had to try two drives before Ceph stabilized.

This is what happened:

  • One of the original Ceph OSDs failed after a few hours of running with a healthy Ceph system. The first spare drive I tried (a Fujitsu 1 TB 3.5") also failed. The second unused drive (also a Fujitsu 1 TB 3.5") is working.

This is what I tried:

  • The two failing OSD drives pass all error checks (SMART, gparted, gdisk) and nothing shows up in any logs (the kind of checks I mean are sketched after this list).
  • Every time I attempt to add them as OSDs (through the Proxmox web interface), creation is successful. They come Up/In and the entire Ceph system starts to re-balance.
  • After a few hours the re-balance is almost complete, but it stalls at less than 1% remaining and the OSD goes Down/Out. Any attempt to start it fails with an "error 1" dialog.
  • I have attempted to re-configure these two drives as OSDs at least 3 times, zapping the drives between attempts, with no change (the zap/re-create cycle is also sketched below).
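
These are the kinds of checks mentioned above; /dev/sdX is a placeholder for the failing drive:

Code:
smartctl -t long /dev/sdX        # long self-test, then check the results with:
smartctl -a /dev/sdX             # attributes, error log, self-test results
sgdisk --verify /dev/sdX         # sanity-check the GPT
dmesg | grep -iE 'sd[a-z]|ata[0-9]|i/o error'   # any kernel-side complaints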
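
And this is roughly the zap/re-create cycle, plus where I look when the OSD refuses to start. The OSD id (5) and the device are placeholders, and the exact tool names depend on the Ceph/Proxmox version:

Code:
tail -n 200 /var/log/ceph/ceph-osd.5.log   # first hint when an OSD goes down or won't start

ceph-disk zap /dev/sdX             # wipe the disk between attempts
                                   # (newer releases: ceph-volume lvm zap /dev/sdX --destroy)
pveceph createosd /dev/sdX         # re-create the OSD from the CLI
                                   # (newer PVE: pveceph osd create /dev/sdX)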

I cannot understand why these two drives do not work as OSDs. The behavior is repeatable, and the re-balance usually starts at between 20 and 30%. The Ceph logs show slow progress with many PGs unhealthy.
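
The re-balance and the unhealthy PGs can be watched with, for example:

Code:
ceph -s                        # overall health and recovery progress
ceph health detail             # which PGs are unhealthy and why
ceph pg dump_stuck unclean     # PGs stuck during the re-balance
ceph osd tree                  # which OSDs are up/in vs down/out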

Can anyone give me a hint on where to look?

Serge
 
Hey Serge,

I can't say I know what your issue is, but I would suspect a SATA controller driver problem. Check your /var/log/messages and see whether the kernel is logging a lot of failed commands or errors related to the SCSI/SATA driver in use. I have seen certain chipsets where the driver was buggy and I was getting very inconsistent results with healthy drives; changing the controller (and therefore using a different driver) solved it for me pretty much every time. This is just a hunch, as I said, so I can't be sure. I'm running Ceph on a 5-node cluster with WD Reds and SSDs for journaling, and I have had no issues and quite wonderful performance.
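
Something along these lines should show whether the kernel is complaining about the controller or the drives (the log file name and patterns are just a starting point, adjust them to your setup):

Code:
grep -iE 'ata[0-9]+|scsi|i/o error|link reset' /var/log/messages
dmesg | grep -iE 'failed command|exception emask|hard resetting link'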

My guess is that the firmware of your failing drives uses a particular command that is buggy in the open-source driver, hence the crashes.

Hope this helps.

Paul
 
Thanks Paul.

I would be surprised if it is a driver issue:


  • I have 4 of the WD Red 1TB 2.5" and only one fails as a Ceph OSD.
  • I have 2 of the Fujitsu 1 TB 3.5" and only one fails as a Ceph OSD.

I will look again in /var/log/syslog for anything specific to the kernel or the SCSI driver, but so far I have not noticed any such error there.
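
One more thing I can do is confirm which controller and driver each drive actually sits behind, for example:

Code:
lsblk -o NAME,MODEL,SERIAL,SIZE              # map device names to the physical drives
readlink -f /sys/block/sd?                   # the PCI/ATA path each disk hangs off
lspci -k | grep -A 3 -iE 'sata|sas|raid'     # controller model and kernel driver in use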

I would appreciate additional suggestions on how I could pursue this troubleshooting.

Serge.
 
Hi,
To see whether it's an HDD issue, simply use the failed disk as an OSD in another node.
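
Roughly (the OSD id and the device are placeholders, and the exact stop command depends on your Ceph version):

Code:
# on the node that currently holds the failed OSD, e.g. osd.5
ceph osd out 5
service ceph stop osd.5        # or: systemctl stop ceph-osd@5
ceph osd crush remove osd.5
ceph auth del osd.5
ceph osd rm 5
# then move the disk, zap it, and re-create the OSD on the other node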

Udo
 
