NVMe/U.2/U.3/etc hot-swap?

BloodyIron

Renowned Member
Jan 14, 2013
I can't find current insights into how PVE handles hot-swapping NVMe and related PCIe storage devices while a system is running. I found some older threads mentioning IOMMU and other similar problems when removing or installing an NVMe device while a PVE node is online, but I don't know whether that's still the case, let alone whether newer PVE versions offer better ways to handle it.

To me it seems antithetical not to be able to hot-swap NVMe/U.2/U.3/similar storage devices, or even to install new ones while PVE is running. I don't know whether this is truly how things stand in PVE now, so I'm hoping PVE devs or others can share some insights.

Consider for a moment that this class of storage devices is now quite affordable (putting aside the RAMpocalypse going on right now) and commonplace in new server builds. Replacing failing NVMe/U.2/U.3/etc storage devices _without turning the PVE server off_ is a commonplace expectation, the same as with SAS/SATA. The same goes for installing new ones into unused bays/slots without turning the PVE server off.

So, what exactly is the current state of this in Proxmox VE? If it's in an undesirable state, what's preventing it from being a rock-solid capability at this point?

I don't see the need for this ever going away; it will only grow.
 
The hot-swappability of NVMe disks is not directly controlled by PVE itself. It is primarily a function of the underlying hardware (CPU, motherboard, chassis, BIOS/firmware) and software (kernel), as well as the ability of the storage stack on top (ZFS, MD RAID, and similar) to properly handle newly added or unexpectedly removed devices.
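On the ZFS side, whether a pulled or failed drive was actually noticed shows up in `zpool status`. Below is a minimal Python sketch that lists config-table entries in a non-healthy state, assuming the usual human-readable layout (a `NAME STATE READ WRITE CKSUM` header followed by rows until a blank line). The function names are hypothetical, and since that output is meant for humans, check it against your ZFS version before relying on it.

```python
import subprocess

# vdev states that indicate a problem; ONLINE (and AVAIL for spares) are healthy.
PROBLEM_STATES = {"DEGRADED", "FAULTED", "OFFLINE", "REMOVED", "UNAVAIL"}


def problem_vdevs(status_text: str) -> list[tuple[str, str]]:
    """Return (name, state) pairs from `zpool status` config tables whose
    STATE column is unhealthy. Pool and parent vdev rows are reported too."""
    problems = []
    in_config = False
    for line in status_text.splitlines():
        tokens = line.split()
        if tokens[:2] == ["NAME", "STATE"]:
            # Header row of a pool's config table; data rows follow it.
            in_config = True
            continue
        if in_config and not tokens:
            # A blank line ends the config table.
            in_config = False
            continue
        if in_config and len(tokens) >= 2 and tokens[1] in PROBLEM_STATES:
            problems.append((tokens[0], tokens[1]))
    return problems


def read_zpool_status() -> str:
    """Run `zpool status` and return its text output (needs the ZFS tools)."""
    result = subprocess.run(["zpool", "status"], capture_output=True, text=True, check=True)
    return result.stdout
```

Typical use would be `problem_vdevs(read_zpool_status())` after a swap, to confirm the old device shows as REMOVED/UNAVAIL before running `zpool replace` on the new one.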

The kernel version and flavor used by PVE should be fully capable of handling surprise NVMe removal and insertion events. The remaining factors are largely outside of PVE’s control and depend on your hardware platform and storage stack configuration.
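Before pulling a drive, it helps to confirm which physical slot its controller sits in and whether the kernel registered that slot with a hot-plug driver. Here is a minimal Python sketch, assuming the standard Linux sysfs layout (`/sys/class/nvme/nvmeX/address` holding the PCI address, `/sys/bus/pci/slots/*/address` holding `domain:bus:device`). The function names are hypothetical. Slot numbering comes from firmware, and the `power` attribute is only an indicator that a hot-plug driver claimed the slot, so treat the result as a hint rather than ground truth.

```python
from pathlib import Path


def read_attr(path: Path) -> str:
    """Read one sysfs attribute, returning '' if it is missing or unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return ""


def nvme_slot_map(sysfs: Path = Path("/sys")) -> dict[str, dict[str, str]]:
    """Map each NVMe controller to its PCI address, slot and hot-plug hint."""
    slots_dir = sysfs / "bus/pci/slots"
    # Slot 'address' files look like 0000:41:00 (domain:bus:device, no function).
    slots = {}
    for slot_dir in sorted(slots_dir.glob("*")):
        addr = read_attr(slot_dir / "address")
        if addr:
            slots[addr] = slot_dir.name

    result = {}
    for ctrl in sorted((sysfs / "class/nvme").glob("nvme*")):
        pci_addr = read_attr(ctrl / "address")  # e.g. 0000:41:00.0 for PCIe NVMe
        slot = slots.get(pci_addr.rsplit(".", 1)[0], "")  # drop the function number
        # Slots registered through the PCI hot-plug core usually expose 'power'.
        hotplug = bool(slot) and (slots_dir / slot / "power").exists()
        result[ctrl.name] = {
            "model": read_attr(ctrl / "model"),
            "state": read_attr(ctrl / "state"),
            "pci_address": pci_addr,
            "slot": slot,
            "hotplug_slot": "yes" if hotplug else "no",
        }
    return result
```

Calling `nvme_slot_map()` with no arguments reads the live `/sys`; passing another root makes it easy to test against a fake tree.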

Most current enterprise servers/hardware include proper PCIe hot-plug support at the backplane and firmware level, so as long as you are not using consumer hardware you should generally be good!

Cheers


Blockbridge : Ultra low latency all-NVME shared storage for Proxmox - https://www.blockbridge.com/proxmox