Ceph OSD using wrong device identifier

naed3r

Nov 12, 2024
Hello!
I have been experimenting with Ceph to see whether it can properly augment my NAS for a small subset of tasks, but I noticed that if a disk is removed and put back in, the Ceph cluster doesn't detect it until a reboot. This is because the disk is defined using the /dev/sdX format instead of the /dev/disk/by-{uuid|id} format, which is (to my knowledge) more stable. Is it designed like that on purpose? I understand that you shouldn't be messing with disks like that, but it's still a problem in my opinion.
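For anyone who wants to compare the two naming schemes on their own box, here is a minimal sketch that lists the persistent names pointing at a given kernel device. The function name `stable_names_for` is made up for illustration, and it assumes a Linux system where udev maintains symlinks under /dev/disk/by-id and where `readlink -f` is available.

```shell
#!/bin/sh
# Print every symlink in a by-id style directory that resolves to the same
# device node as the given kernel name. The function name and both
# parameters are illustrative, not part of Ceph or Proxmox.
stable_names_for() {
    dev="$1"                       # kernel device node, e.g. /dev/sdc
    dir="${2:-/dev/disk/by-id}"    # directory of udev-managed symlinks
    target="$(readlink -f "$dev")" || return 1
    for link in "$dir"/*; do
        # Skip anything that is not a symlink (also covers an empty directory).
        [ -L "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$target" ]; then
            printf '%s\n' "$link"
        fi
    done
}
```

Calling `stable_names_for /dev/sdc` should print the by-id entries for that disk. Those names stay the same when the kernel hands out a different sdX letter after a re-plug.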
Thanks!
Nate
 
1. Do you use LUKS?
2. What does dmesg report in this situation? Does the drive come back with the same name (sdX) or a different one (sdY)?
3. What LVM-related messages do you see in dmesg?
 
Hi!
I don't use LUKS.
dmesg reports:
Code:
[4293251.501646] mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221106000000)
[4293251.501660] mpt2sas_cm0: removing handle(0x000b), sas_addr(0x4433221106000000)
[4293251.501669] mpt2sas_cm0: enclosure logical id(0x500605b123456777), slot(6)
[4293263.475809] mpt2sas_cm0: handle(0xb) sas_address(0x4433221106000000) port_type(0x1)
[4293263.987013] scsi 0:0:12:0: Direct-Access     ATA      Crucial_CT1050MX R021 PQ: 0 ANSI: 6
[4293263.987053] scsi 0:0:12:0: SATA: handle(0x000b), sas_addr(0x4433221106000000), phy(6), device_name(0x0000000000000000)
[4293263.987060] scsi 0:0:12:0: enclosure logical id (0x500605b123456777), slot(6)
[4293263.987163] scsi 0:0:12:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[4293263.987172] scsi 0:0:12:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[4293264.122092] sd 0:0:12:0: Attached scsi generic sg2 type 0
[4293264.122298]  end_device-0:12: add: handle(0x000b), sas_addr(0x4433221106000000)
[4293264.122321] sd 0:0:12:0: Power-on or device reset occurred
[4293264.133781] sd 0:0:12:0: [sdj] 2051200368 512-byte logical blocks: (1.05 TB/978 GiB)
[4293264.206453] sd 0:0:12:0: [sdj] Write Protect is off
[4293264.206468] sd 0:0:12:0: [sdj] Mode Sense: 7f 00 10 08
[4293264.218619] sd 0:0:12:0: [sdj] Write cache: enabled, read cache: enabled, supports DPO and FUA
[4293269.141441] sd 0:0:12:0: [sdj] Attached SCSI disk
The drive was originally sdc, but I took it out, cleaned the dust off it, and put it back in, and it came online as sdj.
There are no LVM messages in dmesg.
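To pull just the newly assigned kernel name out of output like the above, a small filter can help. This is only a sketch: the function name `attached_disks` is hypothetical, and it assumes the sd driver prints the "Attached SCSI disk" line in the same form as in the log above.

```shell
#!/bin/sh
# Read dmesg-style text on standard input and print the kernel name of each
# disk that the sd driver reports as attached (for example "sdj").
# Lines that do not match the pattern are dropped.
attached_disks() {
    sed -n 's/.*\[\(sd[a-z]*\)\] Attached SCSI disk.*/\1/p'
}
```

Typical use would be `dmesg | attached_disks`, which lists one name per attach event in log order, so the last line is the most recently attached disk.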
 
If you want to remove a disk temporarily, you have to:

1. Shut down the OSD.
2. Unmount all related mount points (such as /var/lib/ceph/osd/ceph-X).
3. Release whatever is holding sdc (encryption / LVM).
4. Unplug the disk.

This way the disk can get the same name as before, an LVM scan can import it, and `systemctl start ceph-volume@ID` can handle the startup.
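For the "who is holding the disk" step, the kernel exposes the current holders of a block device in sysfs. The sketch below lists them. The function name `holders_of` is hypothetical, and the second parameter exists only so the function can be pointed at a different directory tree; on a real system the default /sys/block is what you want.

```shell
#!/bin/sh
# List the devices (typically dm-N entries from LVM or dm-crypt) that
# currently hold the given whole disk open, as reported by sysfs.
holders_of() {
    dev="$1"                     # kernel name without /dev/, e.g. sdc
    sysroot="${2:-/sys/block}"   # overridable root, default is the real sysfs
    dir="$sysroot/$dev/holders"
    if [ ! -d "$dir" ]; then
        echo "no such device: $dev" >&2
        return 1
    fi
    for h in "$dir"/*; do
        # An empty holders directory leaves the glob unexpanded; skip it.
        [ -e "$h" ] || [ -L "$h" ] || continue
        basename "$h"
    done
}
```

If `holders_of sdc` prints nothing, nothing is holding the whole disk and it should be safe to pull. Any dm-N name it prints still has to be closed first. Note that this only checks the whole disk, not partitions on it.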

If the disk is removed without a proper shutdown, then:

1. The OSD should die on its next write; if it does not, stop it.
2. Close / rescan LVM to make sure LVM is no longer holding the ID (you can look at `ls -la /var/lib/ceph/osd/ceph-X/block` to find the LVM/LUKS ID).
3. Unmount /var/lib/ceph/osd/ceph-X.
4. Plug in the disk and rescan LVM: pvscan, then vgchange -ay.
5. Run `systemctl start ceph-volume@ID`.
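The recovery steps above can be sketched as a dry run that only prints the commands, so you can review them before running anything. Both function names are made up for illustration. The sketch assumes the default cluster name (so the OSD directory is /var/lib/ceph/osd/ceph-<id>), a package-based install where the OSD runs as the ceph-osd@<id> systemd unit, and that you already know the ceph-volume unit ID to pass as the second argument.

```shell
#!/bin/sh
# Print where an OSD's "block" symlink points, which identifies the LVM or
# LUKS device behind it. Does not resolve the link any further.
osd_block_target() {
    osd_dir="$1"   # e.g. /var/lib/ceph/osd/ceph-3 (default layout assumed)
    readlink "$osd_dir/block"
}

# Print (do not execute) the recovery sequence for a disk that was pulled
# without a clean shutdown. Nothing here touches the system.
plan_recovery() {
    osd_id="$1"    # numeric OSD id, e.g. 3
    unit_id="$2"   # instance name for the ceph-volume@ unit (caller supplies it)
    printf 'systemctl stop ceph-osd@%s\n' "$osd_id"
    printf 'umount /var/lib/ceph/osd/ceph-%s\n' "$osd_id"
    printf '# plug the disk back in, then rescan LVM\n'
    printf 'pvscan\n'
    printf 'vgchange -ay\n'
    printf 'systemctl start ceph-volume@%s\n' "$unit_id"
}
```

Run `osd_block_target /var/lib/ceph/osd/ceph-3` before unmounting to note the device ID, then `plan_recovery 3 <unit-id>` to print the sequence. Printing instead of executing is a deliberate choice here, since the right LVM cleanup depends on what is actually still holding the old device.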
 