Ceph OSD failures on good drives

sdutremble

I have a three-node, up-to-date (no-subscription) Proxmox cluster home lab with AMD 8-core CPUs, an 8-port SATA RAID card in JBOD mode and one SSD for the system in each node. The front-end network is 1 Gbps on unmanaged switches. Each node also acts as a 4-OSD Ceph node, with an isolated Ceph network made of 2 x 1 Gbps NICs bonded in round-robin across separate switches with STP active. Measured bandwidth on the bond is about 1.7 Gbps and connectivity to all three Ceph nodes is fine. Each OSD is a WD Red 1TB 2.5" SATA drive, but my first node also has three other drives used for testing BTRFS (separate from Ceph) and two unused drives connected.
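For reference, this is roughly how I check the bond and the Ceph network (the interface name bond0 and the 10.10.10.x address are just examples, not necessarily my real setup):

  # Bonding mode, slave state and link speed of the Ceph bond
  cat /proc/net/bonding/bond0

  # Quick throughput test between two Ceph nodes (iperf installed on both)
  iperf -s                # on the first node
  iperf -c 10.10.10.11    # on the second node, against the first node's Ceph IP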

I have three Ceph pools configured: data (size/min_size 2/2) is 25% full, pve (3/2) is 10% full and metadata (3/2) is minimal. I use the data pool for CephFS testing and the pve pool for RBD. pg_num and pgp_num are set to 512 for all pools. When all drives are Up/In and Ceph reports HEALTH_OK, access is acceptable: I can easily serve mp4 or mkv movies in HD quality from CephFS over the LAN, and the KVM VM functionality is fine.
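To double-check the pool settings I use the standard Ceph CLI, roughly like this:

  # Replication and PG settings per pool (repeat for pve and metadata)
  ceph osd pool get data size
  ceph osd pool get data min_size
  ceph osd pool get data pg_num

  # Per-pool usage and overall cluster health
  ceph df
  ceph -s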

To get Ceph stable again, I had to replace one OSD on node 1 with one of the unused drives connected to the same controller, and I had to try two spare drives before the cluster settled.

This is what happened:

  • One of the original Ceph OSDs failed after a few hours of a healthy cluster. The first spare drive I tried (a Fujitsu 1 TB 3.5") also failed; the second unused drive (also a Fujitsu 1 TB 3.5") is working.

This is what I tried:

  • The two failing OSD drives pass all error checks (SMART, gparted, gdisk) and nothing shows up in any logs.
  • Every time I add them as OSDs (through the Proxmox web interface), creation is successful. They come Up/In and the whole Ceph cluster starts to rebalance.
  • After a few hours, the rebalance is almost complete but stalls at less than 1% remaining and the OSD goes Down/Out. Any attempt to start it fails with an "error 1" dialog.
  • I have re-created these two drives as OSDs at least three times, zapping them between attempts, with no change (the zap-and-recreate steps are sketched below).
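The zap-and-recreate steps look roughly like this on the console (/dev/sdX is a placeholder for the drive being re-added; as far as I understand, the web interface does much the same under the hood):

  # Wipe the partition table and old Ceph metadata from the drive
  ceph-disk zap /dev/sdX

  # Re-create the OSD on that drive via the Proxmox tooling
  pveceph createosd /dev/sdX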

I cannot understand why these two drives do not work as OSDs. The behaviour is repeatable: the rebalance usually starts with 20 to 30% of objects degraded, and the Ceph logs show slow progress with many PGs unhealthy.
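When it gets stuck, this is roughly what I look at (osd.7 is just an example id, not my actual OSD number):

  # Which PGs are unhealthy and why
  ceph health detail
  ceph pg dump_stuck unclean

  # State of all OSDs in the CRUSH tree
  ceph osd tree

  # Last lines of the failed OSD's own log, on the node that hosts it
  tail -n 200 /var/log/ceph/ceph-osd.7.log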

Can anyone give me a hint on where to look?

Serge
 
Hey Serge,

I can't say that I know what your issue is, but I suspect a SATA controller driver problem. Check your /var/log/messages and see whether the kernel is logging a bunch of commands related to the SCSI driver in use. I have seen chipsets where the driver was buggy and I was getting very inconsistent results with healthy drives; changing the controller (and therefore the driver) solved it for me pretty much every time. This is just a hunch, as I said, so I can't be sure. I'm running Ceph on a 5-node cluster with WD Reds and SSDs for journaling and I have had no issues and quite wonderful performance.
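Something along these lines should show whether the kernel is complaining about the controller or the drives (adjust the log path to whatever your distribution uses):

  # Kernel-side SATA/SCSI errors, link resets and I/O errors
  dmesg | grep -iE 'ata[0-9]|i/o error|reset'
  grep -iE 'ata[0-9].*(error|reset|failed)|i/o error' /var/log/messages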

I'm guessing the firmware of your failing drives issues a command that is handled badly by the open-source driver, which then crashes it.
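You could also compare the firmware revisions of the good and bad drives, for example with smartctl (replace /dev/sdX with the real device):

  # Prints model, serial number and firmware version
  smartctl -i /dev/sdX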

Hope this helps.

Paul
 
Thanks Paul.

I would be surprised if it is a driver issue:


  • I have four of the WD Red 1TB 2.5" drives and only one fails as a Ceph OSD.
  • I have two of the Fujitsu 1 TB 3.5" drives and only one fails as a Ceph OSD.

I will look again in /var/log/syslog for anything specific to the kernel or the SCSI driver, but so far I have not noticed any such error there.

I would appreciate additional suggestions on how I could pursue this troubleshooting.

Serge.
 
Hi,
to see whether it's an HDD issue, simply use the failed disk as an OSD in another node.
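Roughly like this (osd.7 and /dev/sdX below are placeholders, and the stop command depends on your init setup):

  # Remove the failed OSD from the cluster
  ceph osd out osd.7
  service ceph stop osd.7       # run on the node currently hosting it
  ceph osd crush remove osd.7
  ceph auth del osd.7
  ceph osd rm osd.7

  # On the other node: wipe the disk and create a new OSD on it
  ceph-disk zap /dev/sdX
  pveceph createosd /dev/sdX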

Udo