[SOLVED] SSD screws up Proxmox boot

Without knowing any specifics, it could be a number of things.
  • The BIOS may have that SSD higher in the boot order than your Proxmox drive, and it is trying to boot a bad OS.
  • The SSD could be referenced in your /etc/fstab and causing errors.
  • Your 2 TB SSD still has old filesystem signatures, partitions, or metadata (from the troubleshooting days) that Proxmox/Debian automatically scans, mounts, or tries to import (via udev/udisks, blkid, ZFS, LVM, etc.). That scanning causes the host to "crap out" as soon as the drive is detected.
If you don't need what is on the SSD, you may want to put it in another computer and wipe it.
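Wiping those leftover signatures can be done with wipefs from util-linux. A minimal sketch, demonstrated on a throwaway image file rather than a real device (on real hardware you would point it at the SSD's device node, e.g. /dev/nvme1n1, which is an assumption here; confirm the right node with lsblk first):

```shell
# Demonstrated on an image file; substitute your real device node with care.
truncate -s 16M demo.img
mkfs.ext4 -F -q demo.img     # simulate an old filesystem signature
wipefs demo.img              # lists the ext4 signature it found
wipefs --all demo.img        # erase every signature (destructive!)
wipefs demo.img              # prints nothing: the drive now looks blank
```

After the wipe, udev/blkid/ZFS/LVM have nothing left to latch onto when the drive is detected.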
 
I checked those suggestions:
  • I made sure that the drive was not in the BIOS boot order at all.
  • I did not see anything in /etc/fstab that specifically referenced this device, but I don't know what I am doing.
  • The first time I had issues I tried wiping the SSD.
fstab
Code:
/dev/pve/root / ext4 errors=remount-ro 0 1
/dev/pve/swap none swap sw 0 0
proc /proc proc defaults 0 0
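For what it's worth, a drive can be referenced in fstab by UUID or a /dev/disk/by-id path rather than its raw device name, so a quick grep helps rule it out. A sketch, run here against a sample copy of the fstab above (on the real host you would grep /etc/fstab itself; "nvme" is an assumed name pattern):

```shell
cat > /tmp/fstab.sample <<'EOF'
/dev/pve/root / ext4 errors=remount-ro 0 1
/dev/pve/swap none swap sw 0 0
proc /proc proc defaults 0 0
EOF
# Look for the SSD by kernel name, UUID, or stable-id reference:
grep -E 'nvme|by-id|by-uuid|UUID=' /tmp/fstab.sample || echo "no reference to the SSD"
```

For this fstab it prints "no reference to the SSD", which matches what the poster saw.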
 
Is the drive an NVMe, and are you using PCI(e) passthrough for other devices? Adding (or removing) PCI(e) devices can change the PCI IDs of the other devices. This can break an existing setup when you add the drive. Also, make sure the drive is in a separate IOMMU group.
Is the drive connected via SATA and did you do a PCI(e) passthrough of a SATA controller? Make sure no other drives are connected to that same controller and make sure there are no other devices in the same IOMMU group.
Or are you trying to do disk passthrough? Make sure to use /dev/disk/by-id or something instead of /dev/sdb as those are not stable.
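For reference, whole-disk passthrough with a stable path looks roughly like this (the VM ID 100 and the serial in the by-id path are made-up placeholders; list your actual entries with ls -l /dev/disk/by-id):

```shell
# Attach the disk to VM 100 by its stable by-id path (placeholder serial):
qm set 100 -scsi1 /dev/disk/by-id/nvme-ExampleVendor_2TB_PLACEHOLDERSERIAL
```

This records the by-id path in the VM config, so the mapping survives even if the kernel enumerates the disk as /dev/sdc instead of /dev/sdb on the next boot.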
 
The drive is in fact an NVMe drive.

"Adding (or removing) PCI(e) devices can change the PCI ID of the other devices. This can break an existing setup when you add the drive. Also, make sure the drive is in a separate IOMMU group."

Not sure how I can fix any of that.
 
Don't (auto) start any VMs that use PCI(e) passthrough. Look at the new PCI IDs and in which IOMMU groups they now are. Adjust all the passthrough mappings for all existing VMs accordingly.
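The "look at the new PCI IDs and IOMMU groups" step can be scripted by walking /sys/kernel/iommu_groups. A sketch (the directory is empty when IOMMU is disabled; pipe each printed address through lspci -nns for a human-readable device name):

```shell
# List each PCI device address together with its IOMMU group number.
list_iommu_groups() {
  local base=${1:-/sys/kernel/iommu_groups}
  local d grp
  for d in "$base"/*/devices/*; do
    [ -e "$d" ] || continue            # nothing to list if IOMMU is disabled
    grp=${d#"$base"/}                  # strip the base path...
    grp=${grp%%/*}                     # ...leaving just the group number
    printf 'IOMMU group %s: %s\n' "$grp" "${d##*/}"
  done
}
list_iommu_groups
```

Compare this output before and after installing the drive to see which addresses moved, then update the VM passthrough entries to match.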
 
That will indeed prevent VMs from starting when PCI IDs or IOMMU grouping changes. And you have one place to fix the mappings when you add/enable or remove/disable PCI(e) devices (on-board or add-in).
 
I wonder if the NIC is being affected when the SSD is installed on the motherboard, which could explain the "crapping out" issue, as the network connectivity may cease to function. If that is the case, with the SSD removed from your Proxmox setup, run:
Bash:
pve-network-interface-pinning generate
And that should fix that problem.
 
Fixed! :)

What was confusing me was that I was dealing with two problems. When I first installed the SSD it still had a cloned image of the primary drive, which very much confused the OS. I also had no clue that adding hardware would throw off my passthrough devices.

I started up the machine with the Windows VM disabled and then reordered the hardware in the correct PCI slots. Now I have a VM with both a dedicated GPU and SSD.
 
Yea, PCI device "locations" (BDF addresses such as 0000:01:00.0) can change when you add new PCIe hardware because they are not fixed physical slot numbers. They are dynamically assigned during PCI enumeration by the motherboard's BIOS/UEFI and the Linux kernel. Adding or removing any PCIe device, whether a GPU, an NVMe card, or an onboard device enabled through the BIOS, can alter the overall topology.

This is why Proxmox networking would frequently break after a PCI hardware change: a network interface previously identified as enp1s0 might be reassigned to enp3s0, causing the /etc/network/interfaces configuration to no longer match. Similarly, virtual machines that have a device's BDF address hardcoded may attempt to attach to the wrong device after such a change, potentially freezing the entire system.
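That rename is mechanical to fix once you know the old and new names. A sketch using the hypothetical enp1s0 -> enp3s0 rename from above, run against a sample copy (on a real host you would edit /etc/network/interfaces itself and apply the change with ifreload -a or a reboot):

```shell
cat > /tmp/interfaces.sample <<'EOF'
auto vmbr0
iface vmbr0 inet static
    bridge-ports enp1s0
EOF
# Replace every reference to the old interface name:
sed -i 's/enp1s0/enp3s0/g' /tmp/interfaces.sample
grep bridge-ports /tmp/interfaces.sample   # the bridge now references enp3s0
```

The bridge definition stays intact; only the physical port name changes, which is why networking comes back once the name matches reality again.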

Also use Resource Mappings.
This doesn't change the problem, as a mapping still resolves to a BDF. It does help if that device is used in more than one VM, because you can change it once in the Resource Mappings and it will apply to all the VMs. Resource Mappings really shine when you have more than one of the same device, or SR-IOV virtual functions, since Proxmox will pick the first available one.
 