Create OSD failed (error after remove/replace failed OSD)

tscibilia

Member
Feb 25, 2020
I'm pretty new to Ceph, so bear with me. I set up a 3-node hyperconverged cluster on Proxmox 6.1 with four 900GB 10k disks on each node, communicating over a 10Gb mesh network in broadcast mode. I recently noticed a ton of errors on a specific disk (osd.2) and Ceph automatically marked it down and out. So I clicked destroy in the GUI and replaced it with a new disk. I tried using the GUI to create a new OSD and it keeps failing. I'm not sure what's wrong or what to do next.

I also noticed this in my status page: it still shows 12 disks and makes reference to osd.2 even though I destroyed it.
[screenshot attached: Annotation 2020-02-24 225519.jpg]
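
Would something like this (a sketch going by the Ceph docs for manual OSD removal, I haven't run it yet) be the right way to clear out the stale osd.2 reference?
Code:
# check whether osd.2 is still listed
ceph osd tree
# if so, remove the leftovers from the CRUSH map, auth database and OSD map
ceph osd crush remove osd.2
ceph auth del osd.2
ceph osd rm 2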

here's the error I'm getting...
Code:
create OSD on /dev/sde (bluestore)
wipe disk/partition: /dev/sde
/bin/dd: fdatasync failed for '/dev/sde': Input/output error
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 3.15714 s, 66.4 MB/s
command '/bin/dd 'if=/dev/zero' 'bs=1M' 'conv=fdatasync' 'count=200' 'of=/dev/sde'' failed: exit code 1
Running command: /bin/ceph-authtool --gen-print-key
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 77126751-3c89-4057-abc7-dcc9e7db2fce
Running command: /usr/sbin/vgcreate -s 1G --force --yes ceph-9f7f3565-994e-493a-8ac3-3df60cbb55fd /dev/sde
 stderr: Error writing device /dev/sde at 4096 length 4096.
 stderr: bcache_invalidate: block (5, 0) still dirty
  Failed to wipe new metadata area on /dev/sde at 4096 len 4096
  Failed to add metadata area for new physical volume /dev/sde
  Failed to setup physical volume "/dev/sde".
--> Was unable to complete a new OSD, will rollback changes
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.12 --yes-i-really-mean-it
 stderr: 2020-02-24 22:01:09.260 7fb65483f700 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.bootstrap-osd.keyring: (2) No such file or directory
2020-02-24 22:01:09.260 7fb65483f700 -1 AuthRegistry(0x7fb650080e78) no keyring found at /etc/pve/priv/ceph.client.bootstrap-osd.keyring, disabling cephx
 stderr: purged osd.12
-->  RuntimeError: command returned non-zero exit status: 5
TASK ERROR: command 'ceph-volume lvm create --cluster-fsid 31cf4c55-e5ad-42b3-b122-f5af2b6f4d1d --data /dev/sde' failed: exit code 1

here's my pveversion -v...
root@pxmx1:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.13-2-pve)
pve-manager: 6.1-5 (running version: 6.1-5/9bf06119)
pve-kernel-5.3: 6.1-2
pve-kernel-helper: 6.1-2
pve-kernel-5.3.13-2-pve: 5.3.13-2
pve-kernel-5.3.10-1-pve: 5.3.10-1
ceph: 14.2.6-pve1
ceph-fuse: 14.2.6-pve1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: residual config
ifupdown2: 2.0.1-1+pve2
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-5
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-10
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.1-3
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-2
pve-cluster: 6.1-3
pve-container: 3.0-18
pve-docs: 6.1-3
pve-edk2-firmware: 2.20191002-1
pve-firewall: 4.0-9
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-3
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 4.3.0-1
qemu-server: 6.1-4
smartmontools: 7.1-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.3-pve1
 
So I clicked destroy in the GUI and replaced it with a new disk.
This might not have worked.

/bin/dd: fdatasync failed for '/dev/sde': Input/output error
I suppose this is the new drive?

Since the screenshot shows osd.2 as outdated and the (new) disk throws an input/output error, the issue might be somewhere else.
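
A quick check of the drive itself and the kernel log might already tell you more (a sketch, assuming the drive really is /dev/sde; smartmontools is listed in your pveversion output):
Code:
# SMART / health data for the suspect drive (add -d scsi if it is not detected automatically)
smartctl -a /dev/sde
# recent kernel messages for that device
dmesg | grep -i sde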
 
This might not have worked.

Ok, so I tried going off the documentation and used the command line...
Code:
root@pxmx1:~# pveceph osd destroy 2
destroy OSD osd.2
Remove osd.2 from the CRUSH map
Remove the osd.2 authentication key.
Remove OSD osd.2
Unmount OSD osd.2 from  /var/lib/ceph/osd/ceph-2
umount: /var/lib/ceph/osd/ceph-2: not mounted.
command '/bin/umount /var/lib/ceph/osd/ceph-2' failed: exit code 32

I suppose this is the new drive?

Yeah, I'm trying to use a new drive to replace the old one. I forgot to mention that these drives are all connected via SAS to an HBA card (Dell H710 in IT mode). I also ran my own dd on the disk and it didn't give me an error: dd if=/dev/zero of=/dev/sde conv=fdatasync bs=1M status=progress
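
Although I guess a buffered dd can sometimes look successful even on a failing device, since the kernel may only log the failed writeback instead of returning the error to dd. A run that bypasses the page cache might be a more telling test (a sketch; it destroys whatever is on /dev/sde):
Code:
# write straight to the device (O_DIRECT) so any write error is reported immediately
dd if=/dev/zero of=/dev/sde bs=1M count=200 oflag=direct status=progress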

Since the screenshot shows osd.2 as outdated and the (new) disk throws an input/output error, the issue might be somewhere else.

Well, after I ran the pveceph osd destroy 2 command, the GUI now shows 11 disks (none are 'out' or 'outdated'). So I used the command line to issue the pveceph osd create command, and it still fails (although this time it refers to rolling back the creation of osd.2 instead of osd.12). Should I try a reboot? A different disk? Is there anything else I could do to get this OSD up?

Code:
root@pxmx1:~# pveceph osd create /dev/sde
create OSD on /dev/sde (bluestore)
wipe disk/partition: /dev/sde
/bin/dd: fdatasync failed for '/dev/sde': Input/output error
200+0 records in
200+0 records out
209715200 bytes (210 MB, 200 MiB) copied, 3.15264 s, 66.5 MB/s
command '/bin/dd 'if=/dev/zero' 'bs=1M' 'conv=fdatasync' 'count=200' 'of=/dev/sde'' failed: exit code 1
Running command: /bin/ceph-authtool --gen-print-key
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 3e253a68-2208-43ec-b7f9-de655ce99248
Running command: /usr/sbin/vgcreate -s 1G --force --yes ceph-9c77dc73-1258-4ecd-9aea-959c36594829 /dev/sde
stderr: Error writing device /dev/sde at 4096 length 4096.
stderr: bcache_invalidate: block (5, 0) still dirty
  Failed to wipe new metadata area on /dev/sde at 4096 len 4096
  Failed to add metadata area for new physical volume /dev/sde
  Failed to setup physical volume "/dev/sde".
--> Was unable to complete a new OSD, will rollback changes
Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.2 --yes-i-really-mean-it
stderr: 2020-02-25 09:55:11.092 7f447c758700 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.bootstrap-osd.keyring: (2) No such file or directory
2020-02-25 09:55:11.092 7f447c758700 -1 AuthRegistry(0x7f4474080e78) no keyring found at /etc/pve/priv/ceph.client.bootstrap-osd.keyring, disabling cephx
stderr: purged osd.2
-->  RuntimeError: command returned non-zero exit status: 5
TASK ERROR: command 'ceph-volume lvm create --cluster-fsid 31cf4c55-e5ad-42b3-b122-f5af2b6f4d1d --data /dev/sde' failed: exit code 1
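
I was also wondering whether I should fully wipe the disk before retrying the create. Going by the ceph-volume and wipefs man pages, something like this is what I had in mind (not run yet):
Code:
# clear any leftover LVM / partition metadata from the disk
ceph-volume lvm zap /dev/sde --destroy
# or, lower level (sgdisk comes from the gdisk package):
wipefs -a /dev/sde
sgdisk --zap-all /dev/sde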
 
That for sure will re-initialize all block devices.
I can try that later when I'm physically with the server.

Was there something on the disk already? And what does dmesg show?
See below. I thought the drive was new, but these errors don't look good to me. Can you make sense of this?

Code:
[148417.236429] buffer_io_error: 51190 callbacks suppressed
[148417.236434] Buffer I/O error on dev sde, logical block 0, lost async page write
[148417.237731] Buffer I/O error on dev sde, logical block 1, lost async page write
[148417.238986] Buffer I/O error on dev sde, logical block 2, lost async page write
[148417.240242] Buffer I/O error on dev sde, logical block 3, lost async page write
[148417.241500] Buffer I/O error on dev sde, logical block 4, lost async page write
[148417.242745] Buffer I/O error on dev sde, logical block 5, lost async page write
[148417.243998] Buffer I/O error on dev sde, logical block 6, lost async page write
[148417.245243] Buffer I/O error on dev sde, logical block 7, lost async page write
[148417.246447] Buffer I/O error on dev sde, logical block 8, lost async page write
[148417.247610] Buffer I/O error on dev sde, logical block 9, lost async page write
[148417.249165] sd 0:0:6:0: [sde] tag#4868 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[148417.249169] sd 0:0:6:0: [sde] tag#4868 Sense Key : Hardware Error [current]
[148417.249174] sd 0:0:6:0: [sde] tag#4868 <<vendor>>ASC=0x81 ASCQ=0x0
[148417.249178] sd 0:0:6:0: [sde] tag#4868 CDB: Write(10) 2a 00 00 00 04 00 00 04 00 00
[148417.249183] blk_update_request: critical target error, dev sde, sector 1024 op 0x1:(WRITE) flags 0x4800 phys_seg 128 prio class 0
[148417.333754] sd 0:0:6:0: [sde] tag#4869 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[148417.333778] sd 0:0:6:0: [sde] tag#4869 Sense Key : Hardware Error [current]
[148417.333794] sd 0:0:6:0: [sde] tag#4869 <<vendor>>ASC=0x81 ASCQ=0x0
[148417.333807] sd 0:0:6:0: [sde] tag#4869 CDB: Write(10) 2a 00 00 00 08 00 00 04 00 00
[148417.333816] blk_update_request: critical target error, dev sde, sector 2048 op 0x1:(WRITE) flags 0x4800 phys_seg 128 prio class 0
[148417.338086] sd 0:0:6:0: [sde] tag#4870 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[148417.338103] sd 0:0:6:0: [sde] tag#4870 Sense Key : Hardware Error [current]
[148417.338112] sd 0:0:6:0: [sde] tag#4870 <<vendor>>ASC=0x81 ASCQ=0x0
[148417.338119] sd 0:0:6:0: [sde] tag#4870 CDB: Write(10) 2a 00 00 00 0c 00 00 04 00 00
[148417.338126] blk_update_request: critical target error, dev sde, sector 3072 op 0x1:(WRITE) flags 0x4800 phys_seg 128 prio class 0
[148417.578879] sd 0:0:6:0: [sde] tag#4870 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[148417.578899] sd 0:0:6:0: [sde] tag#4870 Sense Key : Hardware Error [current]
[148417.578909] sd 0:0:6:0: [sde] tag#4870 <<vendor>>ASC=0x81 ASCQ=0x0
[148417.578917] sd 0:0:6:0: [sde] tag#4870 CDB: Write(10) 2a 00 00 01 08 00 00 04 00 00
[148417.578924] blk_update_request: critical target error, dev sde, sector 67584 op 0x1:(WRITE) flags 0x4800 phys_seg 128 prio class 0
[148417.582503] sd 0:0:6:0: [sde] tag#4868 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[148417.582524] sd 0:0:6:0: [sde] tag#4868 Sense Key : Hardware Error [current]
[148417.582532] sd 0:0:6:0: [sde] tag#4868 <<vendor>>ASC=0x81 ASCQ=0x0
[148417.582539] sd 0:0:6:0: [sde] tag#4868 CDB: Write(10) 2a 00 00 01 00 00 00 04 00 00
[148417.582546] blk_update_request: critical target error, dev sde, sector 65536 op 0x1:(WRITE) flags 0x4800 phys_seg 128 prio class 0
[148417.680629] sd 0:0:6:0: [sde] tag#4866 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[148417.680646] sd 0:0:6:0: [sde] tag#4866 Sense Key : Hardware Error [current]
[148417.680656] sd 0:0:6:0: [sde] tag#4866 <<vendor>>ASC=0x81 ASCQ=0x0
[148417.680663] sd 0:0:6:0: [sde] tag#4866 CDB: Write(10) 2a 00 00 00 f8 00 00 04 00 00
[148417.680669] blk_update_request: critical target error, dev sde, sector 63488 op 0x1:(WRITE) flags 0x4800 phys_seg 128 prio class 0
[148417.684223] sd 0:0:6:0: [sde] tag#4864 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[148417.684240] sd 0:0:6:0: [sde] tag#4864 Sense Key : Hardware Error [current]
[148417.684249] sd 0:0:6:0: [sde] tag#4864 <<vendor>>ASC=0x81 ASCQ=0x0
[148417.684256] sd 0:0:6:0: [sde] tag#4864 CDB: Write(10) 2a 00 00 00 f0 00 00 04 00 00
[148417.684263] blk_update_request: critical target error, dev sde, sector 61440 op 0x1:(WRITE) flags 0x4800 phys_seg 128 prio class 0
[148417.925939] sd 0:0:6:0: [sde] tag#4871 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[148417.925957] sd 0:0:6:0: [sde] tag#4871 Sense Key : Hardware Error [current]
[148417.925966] sd 0:0:6:0: [sde] tag#4871 <<vendor>>ASC=0x81 ASCQ=0x0
[148417.925972] sd 0:0:6:0: [sde] tag#4871 CDB: Write(10) 2a 00 00 00 10 00 00 04 00 00
[148417.925979] blk_update_request: critical target error, dev sde, sector 4096 op 0x1:(WRITE) flags 0x4800 phys_seg 128 prio class 0
[148417.930324] sd 0:0:6:0: [sde] tag#4872 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[148417.930342] sd 0:0:6:0: [sde] tag#4872 Sense Key : Hardware Error [current]
[148417.930350] sd 0:0:6:0: [sde] tag#4872 <<vendor>>ASC=0x81 ASCQ=0x0
[148417.930357] sd 0:0:6:0: [sde] tag#4872 CDB: Write(10) 2a 00 00 00 14 00 00 04 00 00
[148417.930363] blk_update_request: critical target error, dev sde, sector 5120 op 0x1:(WRITE) flags 0x4800 phys_seg 128 prio class 0
[148422.386574] scsi_io_completion_action: 390 callbacks suppressed
[148422.386597] sd 0:0:6:0: [sde] tag#9694 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[148422.386601] sd 0:0:6:0: [sde] tag#9694 Sense Key : Hardware Error [current]
[148422.386604] sd 0:0:6:0: [sde] tag#9694 <<vendor>>ASC=0x81 ASCQ=0x0
[148422.386608] sd 0:0:6:0: [sde] tag#9694 CDB: Write(10) 2a 00 00 00 00 00 00 00 10 00
[148422.386609] print_req_error: 390 callbacks suppressed
[148422.386611] blk_update_request: critical target error, dev sde, sector 0 op 0x1:(WRITE) flags 0x8800 phys_seg 2 prio class 0
[148422.483870] sd 0:0:6:0: [sde] tag#4926 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[148422.483880] sd 0:0:6:0: [sde] tag#4926 Sense Key : Hardware Error [current]
[148422.483884] sd 0:0:6:0: [sde] tag#4926 <<vendor>>ASC=0x81 ASCQ=0x0
[148422.483887] sd 0:0:6:0: [sde] tag#4926 CDB: Write(10) 2a 00 00 00 00 00 00 01 00 00
[148422.483889] blk_update_request: critical target error, dev sde, sector 0 op 0x1:(WRITE) flags 0x8800 phys_seg 32 prio class 0
[164924.204880] hrtimer: interrupt took 29014 ns
 
Either the disk itself or some connected part (cable, controller, ...) is broken.
 
Either the disk itself or some connected part (cable, controller, ...) is broken.

Any other ideas? I tried rebooting and a new drive, and I'm getting the same errors.
The other drives on the system are connected via the backplane and SAS cables (so I doubt that's the issue). Not to mention I did have a drive working in this configuration at one point.

I may be savvy, but I'm terribly new to Ceph, so anything else I could try would be welcome.
 
I may be savvy, but I'm terribly new to Ceph, so anything else I could try would be welcome.
This is not directly related to Ceph; the hardware error is reported by the kernel directly for the disk sde.

The other drives on the system are connected via the backplane and SAS cables (so I doubt that's the issue). Not to mention I did have a drive working in this configuration at one point.
They might not exhibit this problem simply because they haven't failed yet. Either the backplane, cable, or controller is damaged or has a firmware issue. The new disk itself might also need a firmware update.
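
To compare the replacement drive against the ones that are working, reading out model and firmware revision might help (a sketch; add -d scsi if the HBA doesn't pass the drive through automatically):
Code:
# vendor, model and firmware revision of the suspect drive
smartctl -i /dev/sde
# compare with one of the known-good drives on the same backplane, e.g. /dev/sdb
smartctl -i /dev/sdb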
 
