Ceph OSD woes after NVMe hotplug

jtru
Sep 11, 2025
We're in the process of validating a PVE cluster setup that will be deployed to prod some time in 2026, and for that purpose we've spun up an MVC (Minimum Viable Cluster) that mimics, except in node count, what we're planning to have by then. As a result, we have three modern Dell boxen with 16x NVMe and four 25Gbps interfaces set up with OVS as two 2x25Gbps LACP bonds (one for VM workload traffic, one for Ceph storage and migration networking), and used PVE to build a Ceph cluster to experiment with and see what problems arise during various simulated (near-)catastrophes. We've tested quite a lot of failure scenarios so far, and are indeed very happy with the resilience of the setup and the results achieved :) However, one seemingly minor problem, detailed below, makes me question why the workaround we quickly found should be necessary in the first place, and what is going on with LVM after re-plugging the NVMe module in general.

One of the tests involves simulating NVMe storage failure by pulling a U.3 module out of its bay while the cluster and some minor write workloads are chugging along. This usually has no adverse effects other than the node kernel and Ceph complaining about the obvious problem we had just induced (pcieport 0000:3e:01.2: pciehp: Slot(167): Link Down, pcieport 0000:3e:01.2: pciehp: Slot(167): Card not present, etc.). Then, after a few minutes, we re-plug the removed card, and it is surprisingly difficult to re-integrate it as a proper OSD in Ceph as before (certainly harder and more involved than re-integrating a whole node!).

After plugging the U.3 card in again, it becomes available as a new /dev/nvme<n> generic NVMe device node and as the previously registered /dev/nvme<m>n<x> NVMe namespace block device. LVM picks up the PV, VG and LV on that namespace, and lvdisplay will report the LV as read-write and available. However, if we actually try to read data from the device, every read(2) on the fd returned by a successful open(2) of the device node fails (I don't have the strace log handy, but I can reproduce the problem on Monday if the specific error returned by read() would help). What we then need to do to make the LV actually yield data again is to:
  1. Deactivate the whole namespace's VG (vgchange -an ...)
  2. and then reactivate it (vgchange -ay ...).
  3. Only then, after this weird little dance, will dd, xxd et al. be able to read() from the device again.
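For reference, the whole dance condensed into commands (the VG name below is made up; substitute the ceph-... VG that ceph-volume created on the namespace, as shown by vgs):

```shell
# Hypothetical VG name -- find yours with: vgs -o vg_name
VG=ceph-1f2e3d4c

# 1. Deactivate all LVs in the VG (tears down the stale dm mapping)
vgchange -an "$VG"

# 2. Reactivate it (rebuilds the dm table against the re-plugged namespace)
vgchange -ay "$VG"

# 3. Verify the LV is actually readable again
lvs -o lv_path --noheadings "$VG" | xargs -I{} dd if={} of=/dev/null bs=4k count=1
```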
That's the part that I cannot explain, and on which I would be very open to reading your theories and musings. What is going wrong here, and why?

The other half of the failure is that after fixing the LV as described above, a /dev/dm-<x> device node (with the appropriate symlinks pointing at it) will have been created. Trouble is, that device node has different ownership from all the other working Ceph LVs: while LVs that weren't subject to our tests are owned by ceph:ceph with 0660 perms, the new, now-readable LV pops out of /dev/ owned by root:root with 0660, which makes the associated ceph-osd@.service fail, as it insists on the device node being read-writable by the "ceph" account via its ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph.
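A udev rule along these lines could pin the ownership back down. This is only a sketch: the match keys assume the DM_VG_NAME/DM_LV_NAME environment variables that the stock LVM udev rules export, and the ceph-*/osd-block-* naming that ceph-volume uses by default; the file path and name are made up.

```
# /etc/udev/rules.d/99-ceph-osd-perms.rules (hypothetical path/name)
# Re-apply Ceph's ownership whenever an LVM-backed OSD LV (re)appears
ACTION=="add|change", SUBSYSTEM=="block", ENV{DM_VG_NAME}=="ceph-*", \
  ENV{DM_LV_NAME}=="osd-block-*", OWNER="ceph", GROUP="ceph", MODE="0660"
```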

We worked around that with a quick chown on the shell, but there are several more proper ways to make sure this can't happen and OSD service units don't fail for this unfortunate reason (an ExecStartPre= that "fixes" permissions/ownership, udev rules that set them, ceph-osd dropping to non-root IDs only after open()ing its device nodes, ...). I wonder if this particular problem is known, and whether any of the proposed fixes would be considered for inclusion?
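For anyone hitting the same thing, the ExecStartPre variant can be sketched as a systemd drop-in (the drop-in file name is made up; the "+" prefix and the /var/lib/ceph/osd/ceph-%i/block path are standard systemd and Ceph conventions, and chown dereferences the symlink to the LV by default):

```ini
# /etc/systemd/system/ceph-osd@.service.d/fix-perms.conf (hypothetical drop-in)
[Service]
# Runs before ceph-osd starts; %i is the OSD id.
# The "+" prefix forces this command to run as root.
ExecStartPre=+/bin/chown ceph:ceph /var/lib/ceph/osd/ceph-%i/block
```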

If there's any more data you need collected/provided to help answer my questions, please let me know :)
Before re-inserting the NVMe you should remove the remains of the LV, which will still be known to the kernel after you pulled it. Only after that will the LV be re-attached cleanly.
This is clearly an edge case, as normally you would pull a defective drive and replace it with a new, empty one on which a new OSD will be created.
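In case it helps others, that cleanup could look roughly like this (both names below are hypothetical; list the stale mappings on your node with dmsetup ls or lvs first):

```shell
# After pulling the drive, the dm mapping for its LV is still present.
# Deactivate the pulled drive's VG before re-inserting the module:
vgchange -an ceph-1f2e3d4c
# If LVM refuses because the PV is gone, drop the stale mapping directly:
dmsetup remove ceph--1f2e3d4c-osd--block--0a1b2c
```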