Trying (and failing) to add new NVMe drive to Proxmox Server

herbcso

Member
Feb 22, 2021
Problem: I installed a new NVMe drive (a Sabrent Rocket 4TB) and the boot process starts, but then hangs after a while. I'm trying to figure out how to get a successful boot.

Setup: I have a running PVE instance that has its boot drive installed in NVMe slot 2 of my ASUS Prime X570-Pro MoBo. I think I picked slot 2 mainly because it came with a heat sink (it's one of those large ones that cover part of the motherboard). I'm now trying to add another NVMe drive in slot 1 (which I think is actually the faster one), but when it's installed, PVE fails to boot - it just seems to hang after some time, and there is no further output on the console. I suspect it stops outputting anything because I have disabled the graphics drivers from loading, since I pass the card through to one of the VMs wholesale - as in, the whole PCI device gets passed to the VM.

I tried booting in recovery mode, but that only gets to this:

Code:
[  OK  ] Finished modprobe@configfs.service - Load Kernel Module configfs.
[    3.503716] systemd[1]: modprobe@dm_mod.service: Deactivated successfully.

and then I see nothing past that point. Networking also never comes up in either case.

I suspect this might be because the NVMe ID of the original boot drive changes from nvme0n1 to nvme1n1 when the new one is installed in slot 1 - I was able to verify that change by booting into a Debian recovery system. However, it doesn't completely make sense to me that this would be the cause of the problem, because the UEFI BIOS is still able to identify the boot drive and start the boot process. And the partitions are all on LVM, so I don't think the NVMe ID matters - or does it?

I've tried finding someplace where maybe nvme0n1 is used as an identifier in a config file, but haven't been able to find anything.
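For reference, this is roughly the kind of search I did from the Debian recovery system - just the obvious config locations, so I may well have missed a spot:

Bash:
# Look for any literal nvme0n1 references in the usual config locations
# (these are just the obvious candidates, not an exhaustive list):
grep -sR "nvme0n1" /etc/fstab /etc/crypttab /etc/default/grub /etc/kernel/cmdline /etc/pve

# Confirm how things are actually referenced - LVM tracks PVs by UUID,
# so the kernel device name shouldn't matter:
blkid                              # UUIDs for each partition
pvs -o pv_name,pv_uuid,vg_name     # which PV backs which VG
ls -l /dev/disk/by-id/             # stable names that survive slot changes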

I have not yet tried swapping the drives around in their slots - partially because it's kind of a hassle and I'm lazy ;] and partially because I believe the slot where I'm trying to add the new drive is actually the faster one, so I'd like to keep the new, faster drive in that slot.

Question: What could I do to make the boot drive switch? I'd like to use the new drive in slot 1 since I believe it is actually the faster of the two, and I'd like to use it for VM image storage. But I don't know what I need to change (or even where to look) to get the system to boot all the way.
 
When you add/remove PCI(e) devices, IOMMU groups can change, and perhaps you're now passing something through that you shouldn't. Using Resource Mapping is supposed to save you here, I think. NIC names also tend to change, so if you blacklisted your only GPU you might have trouble fixing anything without visuals/serial/IPMI.
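If you want to compare the groups with and without the new drive installed, a plain sysfs walk like this (nothing Proxmox-specific) lists every group and what's in it, so you can diff the two runs:

Bash:
# List every IOMMU group and the PCI devices it contains:
for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${g##*/}:"
    for d in "$g"/devices/*; do
        echo -e "\t$(lspci -nns "${d##*/}")"
    done
done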
In this case you might have to use the installer ISO's CLI (CTRL+ALT+F3) as a live system and chroot in, similar to this, to edit your config files. Lots of fun.
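A rough sketch of that, assuming the default pve/root LVM layout - adjust VG, LV and partition names to whatever you actually have:

Bash:
# From the live shell: activate LVM and chroot into the installed system.
vgscan
vgchange -ay
mount /dev/pve/root /mnt
mount /dev/nvme1n1p2 /mnt/boot/efi   # only if your setup mounts an ESP; name is an example
for fs in proc sys dev run; do mount --rbind "/$fs" "/mnt/$fs"; done
chroot /mnt /bin/bash
# ...edit configs, run update-grub or proxmox-boot-tool refresh if needed...
exit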
You don't usually rely on names like nvmeX or sdX so that name change shouldn't be a problem.
Maybe you could try to boot with the debug kernel arg and share what you see during boot? Also see here and here.
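For a one-off test you don't have to change anything on disk; a rough sketch (GRUB-booted system assumed - on systemd-boot you'd add it to /etc/kernel/cmdline and run proxmox-boot-tool refresh instead):

Bash:
# One-off: at the GRUB menu press 'e', append "debug" to the line starting
# with "linux", then boot with Ctrl+X. Nothing is written to disk.
#
# To make it stick on a GRUB-booted system (example only - back up the file first):
sed -i 's/GRUB_CMDLINE_LINUX_DEFAULT="/&debug /' /etc/default/grub
update-grub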

TLDR: Try to boot with virtualization disabled in the UEFI.
 
Those all sound like really great tips, thank you! At this point I have temporarily removed the new NVMe again so I could boot my server. I may not be able to follow up until next weekend (this is my homelab), so it'll be a while before I can report back, but I will definitely follow up on this, thank you so much!
 
FWIW I do use PCI passthrough and have now set up resource mapping for the devices I pass through (and a Zigbee USB stick for Home Assistant). I don't _think_ that was a blocking issue for me, since AFAIK I never even got to the point where it was trying to start the VMs that use these, but it is a very nice thing to have, so thank you for that tip! I didn't even realize that was a thing - it was kind of below the fold in the Datacenter UI and I never really explored it much. ;]

The NIC names changing didn't occur to me, so I didn't even check on that. I'll have to look into that when I get a chance to fiddle with this again.
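I'm guessing the check is basically "what does the kernel call the NICs now vs. what does the bridge config expect" - something like this, assuming the standard /etc/network/interfaces setup:

Bash:
# What the kernel currently calls the NICs:
ip -br link

# What the Proxmox bridge config expects (look for iface / bridge-ports lines):
grep -E "iface|bridge-ports" /etc/network/interfaces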

I was able to boot a Debian Installer ISO and mount the root (LVM, not ZFS, silly me not knowing better when I first set this up ;] ) and boot partitions with it. For some reason I was having trouble with the Proxmox live ISO not wanting to boot - but that was also with Ventoy and I didn't bother fighting it when I had Debian on the same USB stick and that worked. Told ya I'm lazy! ;]

Those GitHub gists look like AWESOME info, thank you so much for those! It's a boatload of really useful-looking material, so I'll spend some quality time with them. I can see you've put a lot of time into putting that together, so thank you for the resource.
 
OK, this is totally irrelevant to this thread, sorry, but I just had to get this off my chest! You list a command
Bash:
grep -sR "hostpci" /etc/pve
in your gist and I'm all like "`-s` - what does `-s` do!?" and come to find out it suppresses all those silly errors about unreadable files that grep just loves to spit out! Oh... my... GAWD! I've been using Linux off and on for like 35 years now, how in the HECK did I miss this incredibly useful option!? :eek: I've been all `2> /dev/null`-ing this crap when this whole time I coulda just been using `-s`! Wow... Years of my life lost like tears in the rain, typing all that... ;]