Ceph OSD using wrong device identifier

naed3r · Nov 12, 2024

Hello!
I have been experimenting with Ceph to see whether it can augment my NAS for a small subset of tasks, but I noticed that if a disk is removed and put back in, the Ceph cluster doesn't detect it until a reboot. This is because the disk is defined using the /dev/sdX format instead of the /dev/disk/by-{uuid|id} format, which is (to my knowledge) more stable. Is it designed like that on purpose? I understand that you shouldn't be pulling disks like that, but it's still a problem in my opinion.
Thanks!
Nate
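
To make the naming question concrete, here is a minimal Python sketch that lists which /dev/disk/by-id symlinks currently point at a given kernel name such as sdj. The function name `stable_names` and its parameters are made up for illustration; the only thing it relies on is that udev keeps symlinks to the kernel device nodes in /dev/disk/by-id.

```python
import os


def stable_names(kernel_name, by_id_dir="/dev/disk/by-id", dev_dir="/dev"):
    """Return the by-id symlink names that currently resolve to <dev_dir>/<kernel_name>.

    Hypothetical helper for illustration; it only reads symlinks and changes nothing.
    """
    target = os.path.realpath(os.path.join(dev_dir, kernel_name))
    if not os.path.isdir(by_id_dir):
        return []
    matches = []
    for entry in sorted(os.listdir(by_id_dir)):
        # Each entry is a symlink such as ata-<model>_<serial> -> ../../sdj
        if os.path.realpath(os.path.join(by_id_dir, entry)) == target:
            matches.append(entry)
    return matches
```

Calling `stable_names("sdc")` before pulling the disk and `stable_names("sdj")` after re-inserting it should return the same by-id names if it really is the same drive, which is what makes those names more stable than sdX.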

Nemesiz · Nov 12, 2024

1. Do you use LUKS?
2. What does dmesg report in this situation? Does the drive get the same name (sdX) or a different one (sdY)?
3. What LVM messages do you see in dmesg?

naed3r · Nov 12, 2024

Hi!
I don't use LUKS.
dmesg reports:

Code:

[4293251.501646] mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221106000000)
[4293251.501660] mpt2sas_cm0: removing handle(0x000b), sas_addr(0x4433221106000000)
[4293251.501669] mpt2sas_cm0: enclosure logical id(0x500605b123456777), slot(6)
[4293263.475809] mpt2sas_cm0: handle(0xb) sas_address(0x4433221106000000) port_type(0x1)
[4293263.987013] scsi 0:0:12:0: Direct-Access     ATA      Crucial_CT1050MX R021 PQ: 0 ANSI: 6
[4293263.987053] scsi 0:0:12:0: SATA: handle(0x000b), sas_addr(0x4433221106000000), phy(6), device_name(0x0000000000000000)
[4293263.987060] scsi 0:0:12:0: enclosure logical id (0x500605b123456777), slot(6)
[4293263.987163] scsi 0:0:12:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[4293263.987172] scsi 0:0:12:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[4293264.122092] sd 0:0:12:0: Attached scsi generic sg2 type 0
[4293264.122298]  end_device-0:12: add: handle(0x000b), sas_addr(0x4433221106000000)
[4293264.122321] sd 0:0:12:0: Power-on or device reset occurred
[4293264.133781] sd 0:0:12:0: [sdj] 2051200368 512-byte logical blocks: (1.05 TB/978 GiB)
[4293264.206453] sd 0:0:12:0: [sdj] Write Protect is off
[4293264.206468] sd 0:0:12:0: [sdj] Mode Sense: 7f 00 10 08
[4293264.218619] sd 0:0:12:0: [sdj] Write cache: enabled, read cache: enabled, supports DPO and FUA
[4293269.141441] sd 0:0:12:0: [sdj] Attached SCSI disk

The drive was originally sdc, but I took it out, cleaned the dust off it, and put it back in, and it came online as sdj.
There are no LVM messages in dmesg.
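
For anyone who wants to pull the new name out of a log like the one above without reading it by eye, here is a small Python sketch. It assumes the "Attached SCSI disk" line format shown in the paste; the names `ATTACH_RE` and `attached_disks` are made up for illustration.

```python
import re

# Matches lines such as: "sd 0:0:12:0: [sdj] Attached SCSI disk"
ATTACH_RE = re.compile(r"sd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\] Attached SCSI disk")


def attached_disks(dmesg_text):
    """Map each SCSI address (host:channel:target:lun) to the kernel name it attached as."""
    return {m.group(1): m.group(2) for m in ATTACH_RE.finditer(dmesg_text)}
```

Feeding it the output of `dmesg` as a string gives a dictionary keyed by SCSI address, so the re-inserted drive at 0:0:12:0 shows up with its new name.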

Nemesiz · Nov 13, 2024

If you want to remove a disk temporarily, you have to:

1. shut down the OSD
2. unmount all related mount points (like /var/lib/ceph/osd/ceph-X)
3. release whatever is holding sdc (encryption / LVM)
4. unplug the disk

This way the disk can get the same name as before, an LVM scan can import it, and `systemctl start ceph-volume@ID` can handle the startup.

If the disk is removed without a proper shutdown, then:

1. the OSD should die on its next write; if it doesn't, stop it
2. close / rescan LVM to make sure LVM is no longer holding the ID (you can look at `ls -la /var/lib/ceph/osd/ceph-X/block` to find the LVM/LUKS ID)
3. unmount /var/lib/ceph/osd/ceph-X
4. plug in the disk and rescan LVM: pvscan, then vgchange -ay
5. run `systemctl start ceph-volume@ID`
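
The second procedure above can be sketched as a dry-run plan in Python. This only builds the command strings so you can review them; it does not execute anything. The function name `recovery_plan` and the volume group argument are hypothetical, the mount point assumes the default /var/lib/ceph/osd/ceph-<id> layout, `vgchange -an` is my guess at what "close LVM" means here, and the last step swaps in `ceph-volume lvm activate --all` instead of the systemd unit, so treat it as a sketch rather than a tested procedure.

```python
def recovery_plan(osd_id, vg_name):
    """Build (but do not run) the commands for bringing an OSD back after a hot-pull.

    osd_id:  numeric OSD id (assumption: default cluster name "ceph")
    vg_name: LVM volume group backing the OSD (look it up via the block symlink)
    """
    mount_point = f"/var/lib/ceph/osd/ceph-{osd_id}"
    return [
        f"systemctl stop ceph-osd@{osd_id}",  # step 1: make sure the OSD is stopped
        f"vgchange -an {vg_name}",            # step 2: release the stale LVM mapping
        f"umount {mount_point}",              # step 3: unmount the OSD directory
        "# re-insert the disk now",           # step 4: physical step, then rescan
        "pvscan",
        f"vgchange -ay {vg_name}",
        "ceph-volume lvm activate --all",     # step 5: swapped-in alternative to the unit
    ]
```

Printing the result with `print("\n".join(recovery_plan(3, "my-vg")))` gives a checklist to run by hand, one command at a time, checking each result before moving on.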
