OSD struggles

wassupluke

New Member
Jan 17, 2025
I've been struggling to keep my OSDs from crashing. Maybe someone smarter than me can help figure things out from my journalctl -xeu ceph-osd@8 output: https://pastebin.com/n178KLn5.

Ceph currently reports HEALTH_ERR: 6 scrub errors; Possible data damage: 4 pgs inconsistent; Degraded data redundancy: 14474/1269036 objects degraded (1.141%), 49 pgs degraded, 49 pgs undersized; 14 daemons have recently crashed. I'm trying not to screw over my data by monkeying with stuff too much without good insight.

$ ceph crash ls
Code:
2025-03-25T06:31:08.931893Z_0b7c785c-08e5-497f-9292-f39885a5e8a2  osd.8      *   
2025-03-25T06:33:03.516386Z_6ad361cb-e1e7-41cb-b73f-12dceff33fc0  osd.8      *   
2025-03-25T06:34:06.954658Z_0fe5c550-6ae5-42ca-8c54-c514a18ef0ba  osd.8      *   
2025-03-25T06:36:16.549778Z_34b043f1-4431-4220-8c9f-ff9666ebcef0  osd.8      *   
2025-03-25T11:34:27.490978Z_dfaaf2ac-091d-47d5-987a-8620a80a6d0a  osd.8      *   
2025-03-25T11:34:51.981244Z_9728e42c-9af4-4712-b020-c358fe15a98c  osd.8      *   
2025-03-25T11:35:04.646370Z_e2604221-ac8d-4972-9a15-868b683916f5  osd.8      *   
2025-03-25T11:38:32.769314Z_4c491d05-7c91-4e9b-a232-a2e19910c4e1  osd.8      *   
2025-03-25T11:38:45.915009Z_a11aaf41-1955-48f3-b0a9-f722c8e543c8  osd.8      *   
2025-03-25T11:38:58.380070Z_7bb3c68c-0c87-4f75-bcd5-a7fbb85cbeb7  osd.8      *   
2025-03-25T11:39:11.045116Z_444e7e27-a9db-47d3-a68e-e6d8b9d8ade7  osd.8      *   
2025-03-25T11:39:23.718586Z_3de56e57-42ac-4018-806c-e17a0c2f01dd  osd.8      *   
2025-03-25T11:40:20.778090Z_0a2eac4a-3e84-48d6-98df-54a5e134b7d5  osd.8      *   
2025-03-25T11:40:34.035661Z_1776651d-84e5-4f49-928d-aca4ac757da4  osd.8      *
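The crash list is easier to read once the timestamps are grouped. Below is a minimal Python sketch (not part of the original output) that takes the UTC timestamp from the front of each crash ID and groups crashes that are less than 10 minutes apart. The 10-minute gap and the helper names `crash_time` and `group_bursts` are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

# Crash IDs copied from the `ceph crash ls` output above (format: timestamp_uuid).
CRASH_IDS = [
    "2025-03-25T06:31:08.931893Z_0b7c785c-08e5-497f-9292-f39885a5e8a2",
    "2025-03-25T06:33:03.516386Z_6ad361cb-e1e7-41cb-b73f-12dceff33fc0",
    "2025-03-25T06:34:06.954658Z_0fe5c550-6ae5-42ca-8c54-c514a18ef0ba",
    "2025-03-25T06:36:16.549778Z_34b043f1-4431-4220-8c9f-ff9666ebcef0",
    "2025-03-25T11:34:27.490978Z_dfaaf2ac-091d-47d5-987a-8620a80a6d0a",
    "2025-03-25T11:34:51.981244Z_9728e42c-9af4-4712-b020-c358fe15a98c",
    "2025-03-25T11:35:04.646370Z_e2604221-ac8d-4972-9a15-868b683916f5",
    "2025-03-25T11:38:32.769314Z_4c491d05-7c91-4e9b-a232-a2e19910c4e1",
    "2025-03-25T11:38:45.915009Z_a11aaf41-1955-48f3-b0a9-f722c8e543c8",
    "2025-03-25T11:38:58.380070Z_7bb3c68c-0c87-4f75-bcd5-a7fbb85cbeb7",
    "2025-03-25T11:39:11.045116Z_444e7e27-a9db-47d3-a68e-e6d8b9d8ade7",
    "2025-03-25T11:39:23.718586Z_3de56e57-42ac-4018-806c-e17a0c2f01dd",
    "2025-03-25T11:40:20.778090Z_0a2eac4a-3e84-48d6-98df-54a5e134b7d5",
    "2025-03-25T11:40:34.035661Z_1776651d-84e5-4f49-928d-aca4ac757da4",
]


def crash_time(crash_id: str) -> datetime:
    """Extract the UTC timestamp that prefixes a crash ID."""
    stamp = crash_id.split("_", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")


def group_bursts(times, max_gap=timedelta(minutes=10)):
    """Group timestamps into bursts; a gap larger than max_gap starts a new burst."""
    bursts = []
    for t in sorted(times):
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


bursts = group_bursts([crash_time(c) for c in CRASH_IDS])
for burst in bursts:
    print(f"{burst[0]:%H:%M:%S} to {burst[-1]:%H:%M:%S} UTC: {len(burst)} crashes")
```

With the IDs above, this gives two bursts: 4 crashes between 06:31 and 06:36 UTC, then 10 between 11:34 and 11:40 UTC, some only about 13 seconds apart. That pattern looks like the OSD being restarted and falling over again almost immediately. `ceph crash info <id>` prints the backtrace for a single entry.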

$ pveversion --verbose
Code:
proxmox-ve: 8.3.0 (running kernel: 6.8.12-8-pve)
pve-manager: 8.3.5 (running version: 8.3.5/dac3aa88bac3f300)
proxmox-kernel-helper: 8.1.1
proxmox-kernel-6.8: 6.8.12-8
proxmox-kernel-6.8.12-8-pve-signed: 6.8.12-8
proxmox-kernel-6.8.12-4-pve-signed: 6.8.12-4
ceph: 19.2.0-pve2
ceph-fuse: 19.2.0-pve2
corosync: 3.1.7-pve3
criu: 3.17.1-2+deb12u1
glusterfs-client: 10.3-5
ifupdown2: 3.2.0-1+pmx11
intel-microcode: 3.20250211.1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libknet1: 1.28-pve1
libproxmox-acme-perl: 1.6.0
libproxmox-backup-qemu0: 1.5.1
libproxmox-rs-perl: 0.3.5
libpve-access-control: 8.2.0
libpve-apiclient-perl: 3.3.2
libpve-cluster-api-perl: 8.0.10
libpve-cluster-perl: 8.0.10
libpve-common-perl: 8.2.9
libpve-guest-common-perl: 5.1.6
libpve-http-server-perl: 5.2.0
libpve-network-perl: 0.10.1
libpve-rs-perl: 0.9.2
libpve-storage-perl: 8.3.3
libspice-server1: 0.15.1-1
lvm2: 2.03.16-2
lxc-pve: 6.0.0-1
lxcfs: 6.0.0-pve2
novnc-pve: 1.5.0-1
proxmox-backup-client: 3.3.3-1
proxmox-backup-file-restore: 3.3.3-1
proxmox-firewall: 0.6.0
proxmox-kernel-helper: 8.1.1
proxmox-mail-forward: 0.3.1
proxmox-mini-journalreader: 1.4.0
proxmox-offline-mirror-helper: 0.6.7
proxmox-widget-toolkit: 4.3.6
pve-cluster: 8.0.10
pve-container: 5.2.4
pve-docs: 8.3.1
pve-edk2-firmware: 4.2023.08-4
pve-esxi-import-tools: 0.7.2
pve-firewall: 5.1.0
pve-firmware: 3.14-3
pve-ha-manager: 4.0.6
pve-i18n: 3.4.0
pve-qemu-kvm: 9.2.0-2
pve-xtermjs: 5.3.0-3
qemu-server: 8.3.8
smartmontools: 7.3-pve1
spiceterm: 3.3.0
swtpm: 0.8.0+pve1
vncterm: 1.8.0
zfsutils-linux: 2.2.7-pve1
 
$ lsblk
Code:
NAME                                                                               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                                                                                  8:0    0 111.8G  0 disk
├─sda1                                                                               8:1    0  1007K  0 part
├─sda2                                                                               8:2    0     1G  0 part /boot/efi
└─sda3                                                                               8:3    0 110.8G  0 part
  ├─pve-swap                                                                       252:2    0     8G  0 lvm  [SWAP]
  └─pve-root                                                                       252:3    0 102.8G  0 lvm  /
sdc                                                                                  8:32   0 931.5G  0 disk
└─ceph--2d4a209b--2d77--4204--871c--ceea1859489e-osd--block--0e6815a1--c56c--4525--871a--3b005f2eefc2
                                                                                   252:0    0 931.5G  0 lvm
sdd                                                                                  8:48   0 931.5G  0 disk
└─ceph--180f5e66--6089--4ef5--a4de--4bc8055035d4-osd--block--9b27f285--569f--41c0--aeec--8f2c76ff8344
                                                                                   252:1    0 931.5G  0 lvm
sde                                                                                  8:64   0 447.1G  0 disk
└─ceph--a8724084--08a0--4fd0--ae34--82b6cc66917a-osd--block--635764cd--0626--497a--9ed3--1c7b3e4cf3c4
                                                                                   252:5    0 447.1G  0 lvm
sdf                                                                                  8:80   0   7.3T  0 disk

The affected disk is connected to the SATA port that SHOULD be sdb. It looks like it started out as sdb, failed, and later got picked up as sdf.

$ dmesg | grep -i "sdb"
Code:
[    1.052108] sd 1:0:0:0: [sdb] 15628053168 512-byte logical blocks: (8.00 TB/7.28 TiB)
[    1.052111] sd 1:0:0:0: [sdb] 4096-byte physical blocks
[    1.052122] sd 1:0:0:0: [sdb] Write Protect is off
[    1.052124] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[    1.052136] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.052160] sd 1:0:0:0: [sdb] Preferred minimum I/O size 4096 bytes
[    1.068561] sd 1:0:0:0: [sdb] Attached SCSI disk
[   57.493169] sd 1:0:0:0: [sdb] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[   57.493171] sd 1:0:0:0: [sdb] tag#2 Sense Key : Illegal Request [current]
[   57.493173] sd 1:0:0:0: [sdb] tag#2 Add. Sense: Unaligned write command
[   57.493175] sd 1:0:0:0: [sdb] tag#2 CDB: Read(16) 88 00 00 00 00 03 a3 81 25 b0 00 00 00 08 00 00
[   57.493177] I/O error, dev sdb, sector 15628051888 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[   74.124373] sd 1:0:0:0: [sdb] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[   74.124375] sd 1:0:0:0: [sdb] tag#2 Sense Key : Illegal Request [current]
[   74.124377] sd 1:0:0:0: [sdb] tag#2 Add. Sense: Unaligned write command
[   74.124378] sd 1:0:0:0: [sdb] tag#2 CDB: Read(16) 88 00 00 00 00 00 70 55 ff 18 00 00 03 88 00 00
[   74.124379] I/O error, dev sdb, sector 1884684056 op 0x0:(READ) flags 0x80700 phys_seg 17 prio class 0
[   74.124391] sd 1:0:0:0: [sdb] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[   74.124393] sd 1:0:0:0: [sdb] tag#6 Sense Key : Illegal Request [current]
[   74.124394] sd 1:0:0:0: [sdb] tag#6 Add. Sense: Unaligned write command
[   74.124395] sd 1:0:0:0: [sdb] tag#6 CDB: Read(16) 88 00 00 00 00 00 2d 08 55 30 00 00 03 e8 00 00
[   74.124396] I/O error, dev sdb, sector 755520816 op 0x0:(READ) flags 0x80700 phys_seg 12 prio class 0
[   74.124402] sd 1:0:0:0: [sdb] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[   74.124403] sd 1:0:0:0: [sdb] tag#7 Sense Key : Illegal Request [current]
[   74.124404] sd 1:0:0:0: [sdb] tag#7 Add. Sense: Unaligned write command
[   74.124405] sd 1:0:0:0: [sdb] tag#7 CDB: Read(16) 88 00 00 00 00 00 7c af fc 40 00 00 00 48 00 00
[   74.124406] I/O error, dev sdb, sector 2091908160 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[   74.124411] sd 1:0:0:0: [sdb] tag#13 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[   74.124412] sd 1:0:0:0: [sdb] tag#13 Sense Key : Illegal Request [current]
[   74.124413] sd 1:0:0:0: [sdb] tag#13 Add. Sense: Unaligned write command
[   74.124414] sd 1:0:0:0: [sdb] tag#13 CDB: Read(16) 88 00 00 00 00 00 00 10 b5 e0 00 00 00 98 00 00
[   74.124414] I/O error, dev sdb, sector 1095136 op 0x0:(READ) flags 0x80700 phys_seg 5 prio class 0
[   74.124419] sd 1:0:0:0: [sdb] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[   74.124420] sd 1:0:0:0: [sdb] tag#17 Sense Key : Illegal Request [current]
[   74.124421] sd 1:0:0:0: [sdb] tag#17 Add. Sense: Unaligned write command
[   74.124422] sd 1:0:0:0: [sdb] tag#17 CDB: Read(16) 88 00 00 00 00 00 0e c9 dc 38 00 00 03 88 00 00
[   74.124422] I/O error, dev sdb, sector 248110136 op 0x0:(READ) flags 0x80700 phys_seg 12 prio class 0
[   74.124431] sd 1:0:0:0: [sdb] tag#22 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[   74.124432] sd 1:0:0:0: [sdb] tag#22 Sense Key : Illegal Request [current]
[   74.124433] sd 1:0:0:0: [sdb] tag#22 Add. Sense: Unaligned write command
[   74.124434] sd 1:0:0:0: [sdb] tag#22 CDB: Read(16) 88 00 00 00 00 00 59 e2 e6 20 00 00 00 08 00 00
[   74.124434] I/O error, dev sdb, sector 1508042272 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  139.382741] sd 1:0:0:0: [sdb] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[  139.382744] sd 1:0:0:0: [sdb] tag#16 Sense Key : Illegal Request [current]
[  139.382746] sd 1:0:0:0: [sdb] tag#16 Add. Sense: Unaligned write command
[  139.382747] sd 1:0:0:0: [sdb] tag#16 CDB: Read(16) 88 00 00 00 00 00 00 0f da 90 00 00 00 08 00 00
[  139.382748] I/O error, dev sdb, sector 1038992 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  139.382758] sd 1:0:0:0: [sdb] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[  139.382760] sd 1:0:0:0: [sdb] tag#18 Sense Key : Illegal Request [current]
[  139.382762] sd 1:0:0:0: [sdb] tag#18 Add. Sense: Unaligned write command
[  139.382764] sd 1:0:0:0: [sdb] tag#18 CDB: Read(16) 88 00 00 00 00 00 07 b8 00 b0 00 00 00 a0 00 00
[  139.382765] I/O error, dev sdb, sector 129499312 op 0x0:(READ) flags 0x80700 phys_seg 20 prio class 0
[  139.382775] sd 1:0:0:0: [sdb] tag#20 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[  139.382776] sd 1:0:0:0: [sdb] tag#20 Sense Key : Illegal Request [current]
[  139.382778] sd 1:0:0:0: [sdb] tag#20 Add. Sense: Unaligned write command
[  139.382780] sd 1:0:0:0: [sdb] tag#20 CDB: Read(16) 88 00 00 00 00 00 43 45 0e 80 00 00 03 e0 00 00
[  139.382781] I/O error, dev sdb, sector 1128599168 op 0x0:(READ) flags 0x80700 phys_seg 124 prio class 0
[  139.382792] sd 1:0:0:0: [sdb] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[  139.382793] sd 1:0:0:0: [sdb] tag#21 Sense Key : Illegal Request [current]
[  139.382795] sd 1:0:0:0: [sdb] tag#21 Add. Sense: Unaligned write command
[  139.382795] sd 1:0:0:0: [sdb] tag#21 CDB: Read(16) 88 00 00 00 00 00 59 ca 10 e8 00 00 03 88 00 00
[  139.382796] I/O error, dev sdb, sector 1506414824 op 0x0:(READ) flags 0x80700 phys_seg 34 prio class 0
[  139.382805] sd 1:0:0:0: [sdb] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[  139.382807] sd 1:0:0:0: [sdb] tag#23 Sense Key : Illegal Request [current]
[  139.382809] sd 1:0:0:0: [sdb] tag#23 Add. Sense: Unaligned write command
[  139.382811] sd 1:0:0:0: [sdb] tag#23 CDB: Read(16) 88 00 00 00 00 00 2c e5 b3 a0 00 00 00 60 00 00
[  139.382812] I/O error, dev sdb, sector 753251232 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 0
[  155.570005] sd 1:0:0:0: [sdb] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[  155.570008] sd 1:0:0:0: [sdb] tag#19 Sense Key : Illegal Request [current]
[  155.570009] sd 1:0:0:0: [sdb] tag#19 Add. Sense: Unaligned write command
[  155.570011] sd 1:0:0:0: [sdb] tag#19 CDB: Read(16) 88 00 00 00 00 00 2c e5 b0 18 00 00 03 88 00 00
[  155.570012] I/O error, dev sdb, sector 753250328 op 0x0:(READ) flags 0x80700 phys_seg 25 prio class 0
[  171.734370] sd 1:0:0:0: [sdb] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[  171.734372] sd 1:0:0:0: [sdb] tag#21 Sense Key : Illegal Request [current]
[  171.734374] sd 1:0:0:0: [sdb] tag#21 Add. Sense: Unaligned write command
[  171.734376] sd 1:0:0:0: [sdb] tag#21 CDB: Read(16) 88 00 00 00 00 00 00 0f d9 f0 00 00 00 a0 00 00
[  171.734377] I/O error, dev sdb, sector 1038832 op 0x0:(READ) flags 0x80700 phys_seg 20 prio class 0
[  293.410650] sd 1:0:0:0: [sdb] tag#25 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[  293.410653] sd 1:0:0:0: [sdb] tag#25 Sense Key : Illegal Request [current]
[  293.410655] sd 1:0:0:0: [sdb] tag#25 Add. Sense: Unaligned write command
[  293.410657] sd 1:0:0:0: [sdb] tag#25 CDB: Read(16) 88 00 00 00 00 00 00 00 09 08 00 00 00 d8 00 00
[  293.410658] I/O error, dev sdb, sector 2312 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 0
[  293.410668] sd 1:0:0:0: [sdb] tag#26 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[  293.410669] sd 1:0:0:0: [sdb] tag#26 Sense Key : Illegal Request [current]
[  293.410670] sd 1:0:0:0: [sdb] tag#26 Add. Sense: Unaligned write command
[  293.410671] sd 1:0:0:0: [sdb] tag#26 CDB: Read(16) 88 00 00 00 00 00 00 00 09 e0 00 00 00 20 00 00
[  293.410672] I/O error, dev sdb, sector 2528 op 0x0:(READ) flags 0x80700 phys_seg 2 prio class 0
[  293.410675] sd 1:0:0:0: [sdb] tag#27 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[  293.410676] sd 1:0:0:0: [sdb] tag#27 Sense Key : Illegal Request [current]
[  293.410677] sd 1:0:0:0: [sdb] tag#27 Add. Sense: Unaligned write command
[  293.410678] sd 1:0:0:0: [sdb] tag#27 CDB: Read(16) 88 00 00 00 00 00 00 00 0a 08 00 00 00 f8 00 00
[  293.410679] I/O error, dev sdb, sector 2568 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 0
[  349.222215] sd 1:0:0:0: [sdb] tag#14 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[  349.222218] sd 1:0:0:0: [sdb] tag#14 Sense Key : Illegal Request [current]
[  349.222220] sd 1:0:0:0: [sdb] tag#14 Add. Sense: Unaligned write command
[  349.222221] sd 1:0:0:0: [sdb] tag#14 CDB: Read(16) 88 00 00 00 00 00 00 00 08 00 00 00 00 20 00 00
[  349.222222] I/O error, dev sdb, sector 2048 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 0
[  365.366726] sd 1:0:0:0: [sdb] tag#15 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[  365.366728] sd 1:0:0:0: [sdb] tag#15 Sense Key : Illegal Request [current]
[  365.366730] sd 1:0:0:0: [sdb] tag#15 Add. Sense: Unaligned write command
[  365.366732] sd 1:0:0:0: [sdb] tag#15 CDB: Read(16) 88 00 00 00 00 00 00 00 08 00 00 00 00 20 00 00
[  365.366733] I/O error, dev sdb, sector 2048 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 0
[  382.450480] sd 1:0:0:0: [sdb] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=16s
[  382.450482] sd 1:0:0:0: [sdb] tag#23 Sense Key : Illegal Request [current]
[  382.450484] sd 1:0:0:0: [sdb] tag#23 Add. Sense: Unaligned write command
[  382.450486] sd 1:0:0:0: [sdb] tag#23 CDB: Read(16) 88 00 00 00 00 03 a3 81 26 f8 00 00 00 08 00 00
[  382.450487] I/O error, dev sdb, sector 15628052216 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  946.463668] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
[  946.463687] sd 1:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
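The kernel reports "Unaligned write command" even though every failed command is a Read(16), so it is worth checking whether the requests really are misaligned. The sketch below decodes a READ(16) CDB (opcode 0x88, 8-byte big-endian LBA in bytes 2 to 9, 4-byte transfer length in bytes 10 to 13) and tests alignment against the 4096-byte physical sector size reported for this drive. The function name `decode_read16` and the choice of three sample CDBs are illustrative, not from the original output.

```python
LOGICAL_SECTOR = 512      # bytes, from "512-byte logical blocks" in dmesg
PHYSICAL_SECTOR = 4096    # bytes, from "4096-byte physical blocks" in dmesg
SECTORS_PER_PHYSICAL = PHYSICAL_SECTOR // LOGICAL_SECTOR  # 8

# Three CDBs copied from the dmesg output above.
CDBS = [
    "88 00 00 00 00 03 a3 81 25 b0 00 00 00 08 00 00",
    "88 00 00 00 00 00 70 55 ff 18 00 00 03 88 00 00",
    "88 00 00 00 00 00 00 00 08 00 00 00 00 20 00 00",
]


def decode_read16(cdb_hex: str) -> tuple[int, int]:
    """Return (lba, transfer_length_in_sectors) for a READ(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")
    length = int.from_bytes(cdb[10:14], "big")
    return lba, length


for cdb_hex in CDBS:
    lba, length = decode_read16(cdb_hex)
    aligned = lba % SECTORS_PER_PHYSICAL == 0 and length % SECTORS_PER_PHYSICAL == 0
    print(f"LBA {lba}, {length} sectors, 4 KiB aligned: {aligned}")
```

The decoded LBAs match the `sector` values in the corresponding `I/O error` lines (15628051888, 1884684056 and 2048), and all three requests start and end on a 4 KiB boundary. Whatever triggered the sense code, it does not appear to be a genuinely misaligned request, at least for these samples.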

$ dmesg | grep -i "sdf"
Code:
[  997.933765] sd 1:0:0:0: [sdf] 15628053168 512-byte logical blocks: (8.00 TB/7.28 TiB)
[  997.933768] sd 1:0:0:0: [sdf] 4096-byte physical blocks
[  997.933774] sd 1:0:0:0: [sdf] Write Protect is off
[  997.933776] sd 1:0:0:0: [sdf] Mode Sense: 00 3a 00 00
[  997.933782] sd 1:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  997.933802] sd 1:0:0:0: [sdf] Preferred minimum I/O size 4096 bytes
[  997.949464] sd 1:0:0:0: [sdf] Attached SCSI disk
[ 1040.772494] sd 1:0:0:0: [sdf] Synchronizing SCSI cache
[ 1040.772925] sd 1:0:0:0: [sdf] 15628053168 512-byte logical blocks: (8.00 TB/7.28 TiB)
[ 1040.772927] sd 1:0:0:0: [sdf] 4096-byte physical blocks
[ 1040.772947] sd 1:0:0:0: [sdf] Write Protect is off
[ 1040.772949] sd 1:0:0:0: [sdf] Mode Sense: 00 3a 00 00
[ 1040.772986] sd 1:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 1040.773050] sd 1:0:0:0: [sdf] Preferred minimum I/O size 4096 bytes
[ 1040.787478] sd 1:0:0:0: [sdf] Attached SCSI disk
[ 1085.112232] sd 1:0:0:0: [sdf] Synchronizing SCSI cache
[ 1085.112620] sd 1:0:0:0: [sdf] 15628053168 512-byte logical blocks: (8.00 TB/7.28 TiB)
[ 1085.112624] sd 1:0:0:0: [sdf] 4096-byte physical blocks
[ 1085.112632] sd 1:0:0:0: [sdf] Write Protect is off
[ 1085.112634] sd 1:0:0:0: [sdf] Mode Sense: 00 3a 00 00
[ 1085.112646] sd 1:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 1085.112670] sd 1:0:0:0: [sdf] Preferred minimum I/O size 4096 bytes
[ 1101.327778] sd 1:0:0:0: [sdf] Attached SCSI disk
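To double-check that sdb and sdf are the same physical port, the "Attached SCSI disk" lines can be grouped by SCSI address. This is a small sketch over a handful of lines copied from the two dmesg excerpts above; the regular expression and variable names are assumptions for illustration.

```python
import re
from collections import defaultdict

# Lines copied from the sdb and sdf dmesg excerpts above.
DMESG = """\
[    1.068561] sd 1:0:0:0: [sdb] Attached SCSI disk
[  946.463668] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
[  997.949464] sd 1:0:0:0: [sdf] Attached SCSI disk
[ 1040.787478] sd 1:0:0:0: [sdf] Attached SCSI disk
[ 1101.327778] sd 1:0:0:0: [sdf] Attached SCSI disk
"""

ATTACH = re.compile(r"\[\s*(\d+\.\d+)\] sd (\S+): \[(\w+)\] Attached SCSI disk")

# (seconds since boot, SCSI address, kernel device name) for every attach event.
events = [(float(t), addr, name) for t, addr, name in ATTACH.findall(DMESG)]

names_by_address = defaultdict(list)
for seconds, addr, name in events:
    names_by_address[addr].append((name, seconds))

for addr, attaches in names_by_address.items():
    print(addr, attaches)
```

All four attach events share the SCSI address 1:0:0:0, first as sdb at boot and then three times as sdf within about two minutes, which fits the drive dropping off the bus and being rediscovered on the same port.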

$ ceph-volume lvm list
(on the affected host machine)
Code:
====== osd.3 =======

  [block]       /dev/ceph-2d4a209b-2d77-4204-871c-ceea1859489e/osd-block-0e6815a1-c56c-4525-871a-3b005f2eefc2

      block device              /dev/ceph-2d4a209b-2d77-4204-871c-ceea1859489e/osd-block-0e6815a1-c56c-4525-871a-3b005f2eefc2
      block uuid                YKexRR-AIfO-ILI1-6Fwx-yjqF-Bt8O-pcpsr0
      cephx lockbox secret
      cluster fsid              481b1e83-f3e1-43d3-a523-d7d51bc4324f
      cluster name              ceph
      crush device class
      encrypted                 0
      osd fsid                  0e6815a1-c56c-4525-871a-3b005f2eefc2
      osd id                    3
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/sdc

====== osd.5 =======

  [block]       /dev/ceph-180f5e66-6089-4ef5-a4de-4bc8055035d4/osd-block-9b27f285-569f-41c0-aeec-8f2c76ff8344

      block device              /dev/ceph-180f5e66-6089-4ef5-a4de-4bc8055035d4/osd-block-9b27f285-569f-41c0-aeec-8f2c76ff8344
      block uuid                gnUqAP-fPXS-1LWZ-aGDi-TmWA-xf03-RKXGZj
      cephx lockbox secret
      cluster fsid              481b1e83-f3e1-43d3-a523-d7d51bc4324f
      cluster name              ceph
      crush device class        ssd
      encrypted                 0
      osd fsid                  9b27f285-569f-41c0-aeec-8f2c76ff8344
      osd id                    5
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/sdd

====== osd.8 =======

  [block]       /dev/ceph-9e5ecf2f-7c26-4ed7-863c-e73a28875835/osd-block-35889c97-4620-4414-9e21-5d5bfb577522

      block device              /dev/ceph-9e5ecf2f-7c26-4ed7-863c-e73a28875835/osd-block-35889c97-4620-4414-9e21-5d5bfb577522
      block uuid                jiTXik-MfBZ-G9aE-DCOj-QQHC-k0EG-6FVxNT
      cephx lockbox secret
      cluster fsid              481b1e83-f3e1-43d3-a523-d7d51bc4324f
      cluster name              ceph
      crush device class
      encrypted                 0
      osd fsid                  35889c97-4620-4414-9e21-5d5bfb577522
      osd id                    8
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/sdf
 
If it helps the diagnosis at all, I recently started giving enterprise drives from Server Part Deals a go. This is the second set of drives I've tried from them. I had this issue with the Toshiba drives, thought it was just a Toshiba thing, and decided to return them and go with Seagate Exos, but I have the same issue on the same host machine: my old 4790K on an H81i-plus with non-ECC memory. The system didn't have any power outages, so I wouldn't figure that as the cause of the disk failure. The smartctl overall result is fine; I just haven't yet learned enough to interpret the rest of the information.

$ smartctl -a /dev/sdf

Code:
=== START OF INFORMATION SECTION ===
Device Model:     ST8000NM0105
Serial Number:    ZA17S52Z
LU WWN Device Id: 5 000c50 0a31accc0
Firmware Version: G00B
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5319
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Tue Mar 25 21:00:49 2025 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (  567) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 779) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x50bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   044    Pre-fail  Always       -       194453102
  3 Spin_Up_Time            0x0003   097   089   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1955
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   095   060   045    Pre-fail  Always       -       2919301304
  9 Power_On_Hours          0x0032   034   034   000    Old_age   Always       -       57950 (233 52 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always       -       1898
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   094   094   000    Old_age   Always       -       6
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   063   043   040    Old_age   Always       -       37 (Min/Max 37/37)
191 G-Sense_Error_Rate      0x0032   001   001   000    Old_age   Always       -       443042
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1871
193 Load_Cycle_Count        0x0032   098   098   000    Old_age   Always       -       4404
194 Temperature_Celsius     0x0022   037   057   000    Old_age   Always       -       37 (0 15 0 0 0)
195 Hardware_ECC_Recovered  0x001a   006   002   000    Old_age   Always       -       194453102
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       57639 (125 133 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       5979448219611
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       6477005088170

SMART Error Log Version: 1
ATA Error Count: 6 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 6 occurred at disk power-on lifetime: 49757 hours (2073 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff 4f 00  29d+13:40:02.350  READ DMA EXT
  25 00 98 ff ff ff 4f 00  29d+13:40:02.339  READ DMA EXT
  25 00 f0 ff ff ff 4f 00  29d+13:40:02.325  READ DMA EXT
  25 00 10 ff ff ff 4f 00  29d+13:40:02.324  READ DMA EXT
  35 00 08 ff ff ff 4f 00  29d+13:40:02.322  WRITE DMA EXT

Error 5 occurred at disk power-on lifetime: 49757 hours (2073 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff 4f 00  29d+13:39:58.828  READ DMA EXT
  25 00 00 ff ff ff 4f 00  29d+13:39:58.814  READ DMA EXT
  25 00 00 ff ff ff 4f 00  29d+13:39:58.802  READ DMA EXT
  25 00 08 ff ff ff 4f 00  29d+13:39:58.753  READ DMA EXT
  25 00 08 ff ff ff 4f 00  29d+13:39:58.731  READ DMA EXT

Error 4 occurred at disk power-on lifetime: 21753 hours (906 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff 4f 00  29d+20:03:19.251  READ DMA EXT
  25 00 00 ff ff ff 4f 00  29d+20:03:16.837  READ DMA EXT
  35 00 28 20 91 10 40 00  29d+20:03:16.837  WRITE DMA EXT
  25 00 10 ff ff ff 4f 00  29d+20:03:16.823  READ DMA EXT
  47 00 01 e0 00 00 40 00  29d+20:03:16.803  READ LOG DMA EXT

Error 3 occurred at disk power-on lifetime: 21753 hours (906 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff 4f 00  29d+20:03:16.837  READ DMA EXT
  35 00 28 20 91 10 40 00  29d+20:03:16.837  WRITE DMA EXT
  25 00 10 ff ff ff 4f 00  29d+20:03:16.823  READ DMA EXT
  47 00 01 e0 00 00 40 00  29d+20:03:16.803  READ LOG DMA EXT
  25 00 00 ff ff ff 4f 00  29d+20:03:13.963  READ DMA EXT

Error 2 occurred at disk power-on lifetime: 21753 hours (906 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 00 ff ff ff 4f 00  29d+20:03:13.963  READ DMA EXT
  25 00 00 ff ff ff 4f 00  29d+20:03:11.124  READ DMA EXT
  25 00 88 ff ff ff 4f 00  29d+20:03:11.118  READ DMA EXT
  25 00 08 ff ff ff 4f 00  29d+20:03:11.112  READ DMA EXT
  25 00 00 ff ff ff 4f 00  29d+20:03:11.111  READ DMA EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      00%     57940         -
# 2  Extended offline    Interrupted (host reset)      00%     57940         -
# 3  Short offline       Completed without error       00%     57920         -
# 4  Short offline       Completed without error       00%     57918         -
# 5  Short offline       Completed without error       00%     57917         -
# 6  Short offline       Completed without error       00%     57895         -
# 7  Short offline       Completed without error       00%     50118         -
# 8  Short offline       Completed without error       00%     50010         -
# 9  Short offline       Completed without error       00%     50006         -
#10  Short offline       Completed without error       00%     49999         -
#11  Short offline       Completed without error       00%     22326         -
#12  Short offline       Completed without error       00%         7         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
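For anyone else trying to read a report like this, here is a rough Python sketch that pulls a few of the more telling values out of the smartctl output above. Two things in it are assumptions rather than facts from the output: the split of Seagate raw values into a high 16-bit error count and a low 32-bit operation count is a commonly described convention, not something smartctl states, and the helper name `split_seagate_raw` is made up for this example.

```python
import re

# Selected lines copied from the smartctl output above.
SMARTCTL = """\
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 1.5 Gb/s)
  1 Raw_Read_Error_Rate     0x000f   083   064   044    Pre-fail  Always       -       194453102
  7 Seek_Error_Rate         0x000f   095   060   045    Pre-fail  Always       -       2919301304
  9 Power_On_Hours          0x0032   034   034   000    Old_age   Always       -       57950 (233 52 0)
187 Reported_Uncorrect      0x0032   094   094   000    Old_age   Always       -       6
191 G-Sense_Error_Rate      0x0032   001   001   000    Old_age   Always       -       443042
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""
# Power-on hours of the newest entry in the SMART error log ("Error 6").
LAST_LOGGED_ERROR_HOURS = 49757

# Columns: ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
ATTR = re.compile(
    r"^\s*\d+\s+(\S+)\s+0x[0-9a-f]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)",
    re.M,
)
attrs = {
    name: {"value": int(v), "worst": int(w), "thresh": int(t), "raw": int(raw)}
    for name, v, w, t, raw in ATTR.findall(SMARTCTL)
}


def split_seagate_raw(raw: int) -> tuple[int, int]:
    """ASSUMPTION: 48-bit Seagate raw value = (error count << 32) | operation count."""
    return raw >> 32, raw & 0xFFFFFFFF


# Maximum supported vs. currently negotiated SATA link speed.
link = re.search(r"([\d.]+) Gb/s \(current: ([\d.]+) Gb/s\)", SMARTCTL)
max_gbps, cur_gbps = (float(g) for g in link.groups())

hours = attrs["Power_On_Hours"]["raw"]
print(f"link speed: {cur_gbps} of {max_gbps} Gb/s")
print(f"power-on time: {hours / 8760:.1f} years")
print(f"hours since last logged UNC error: {hours - LAST_LOGGED_ERROR_HOURS}")
for name in ("Raw_Read_Error_Rate", "Seek_Error_Rate"):
    errors, ops = split_seagate_raw(attrs[name]["raw"])
    print(f"{name}: {errors} errors in {ops} operations (assumed encoding)")
for name, a in attrs.items():
    # A threshold of 000 means the attribute never trips, so skip those.
    if a["thresh"] and a["value"] <= a["thresh"]:
        print(f"{name}: normalized value at or below threshold")
```

Read this way, none of the selected attributes is at or below its threshold, and the UNC errors in the log were recorded about 8193 power-on hours before this report, so they are not from the current crashes. What does stand out is the link running at 1.5 Gb/s on a 6.0 Gb/s drive. Combined with the kernel I/O errors and the same symptoms on two different drive models in this host, that may point toward the port, cable, or power rather than the platters, but that is a hypothesis and not a diagnosis.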
 
I've recently come to realize that my data pools should not combine SSDs and HDDs. Is that correct?
(Source: first sentence of the penultimate paragraph in this thread response.) I had just replaced some small Kingston SSDs with these HDDs and wanted to let things rebalance before changing my pool/replication rules to use just HDD or just SSD, but this osd.8 issue hit before the cluster finished balancing.

$ ceph osd tree
Code:
ID   CLASS  WEIGHT    TYPE NAME      STATUS  REWEIGHT  PRI-AFF
 -1         19.78520  root default
 -3          2.27394      host pve0
  2    ssd   0.90958          osd.2      up   1.00000  1.00000
  6    ssd   0.90958          osd.6      up   1.00000  1.00000
  9    ssd   0.45479          osd.9      up   1.00000  1.00000
 -8          9.09679      host pve2
  8    hdd   7.27739          osd.8    down         0  1.00000
  3    ssd   0.90970          osd.3      up   1.00000  1.00000
  5    ssd   0.90970          osd.5      up   1.00000  1.00000
-13          8.41447      host pve3
  1    hdd   0.68230          osd.1      up   1.00000  1.00000
  4    hdd   0.45479          osd.4      up   1.00000  1.00000
  7    hdd   7.27739          osd.7      up   1.00000  1.00000
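Before switching pools to one device class, it helps to see how much weight each host has per class. The following sketch sums the CRUSH weights from the tree above by host and class. The function name `weights_by_host_and_class` is hypothetical, and the parsing assumes the plain-text column layout shown here.

```python
from collections import defaultdict

# Copied from the `ceph osd tree` output above.
OSD_TREE = """\
ID   CLASS  WEIGHT    TYPE NAME      STATUS  REWEIGHT  PRI-AFF
 -1         19.78520  root default
 -3          2.27394      host pve0
  2    ssd   0.90958          osd.2      up   1.00000  1.00000
  6    ssd   0.90958          osd.6      up   1.00000  1.00000
  9    ssd   0.45479          osd.9      up   1.00000  1.00000
 -8          9.09679      host pve2
  8    hdd   7.27739          osd.8    down         0  1.00000
  3    ssd   0.90970          osd.3      up   1.00000  1.00000
  5    ssd   0.90970          osd.5      up   1.00000  1.00000
-13          8.41447      host pve3
  1    hdd   0.68230          osd.1      up   1.00000  1.00000
  4    hdd   0.45479          osd.4      up   1.00000  1.00000
  7    hdd   7.27739          osd.7      up   1.00000  1.00000
"""


def weights_by_host_and_class(tree: str) -> dict:
    """Sum OSD CRUSH weights per (host, device class) from `ceph osd tree` text."""
    totals = defaultdict(float)
    host = None
    for line in tree.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[2] == "host":
            host = parts[3]                      # bucket line: ID WEIGHT host NAME
        elif len(parts) >= 4 and parts[3].startswith("osd."):
            totals[(host, parts[1])] += float(parts[2])  # OSD line: ID CLASS WEIGHT NAME
    return dict(totals)


totals = weights_by_host_and_class(OSD_TREE)
for (host, device_class), weight in sorted(totals.items()):
    print(f"{host} {device_class}: {weight:.5f}")
```

In this tree, pve0 has only SSDs, pve3 has only HDDs, and pve2 has both, so a class-restricted rule can only place replicas on the hosts that actually carry that class. Per-class rules are typically created with `ceph osd crush rule create-replicated <name> <root> <failure-domain> <class>`; the right replica count and failure domain for this cluster are not stated in the post.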
 
Ditched the drive, subbed in a couple smaller HDDs to scrape by until I get a replacement drive, and everything eventually balanced back out beautifully. Pools now use only one type of drive instead of a mix. Thank you.
 