[SOLVED] NVMe Passthrough not working through M.2 adapter card

drunk.bass

New Member
Jul 7, 2024
Hey folks,

I have a system running on an ASRock X399 Fatal1ty + Threadripper 1950X. It’s been working totally fine and I have no issues with it, knock on wood.

However, a couple of days ago I purchased an ASUS Hyper M.2 card to give some flash to a TrueNAS Scale VM on this host, to create a pool of 4 flash drives in RAIDZ1 (drives are Samsung 990 EVO Plus, 2 TB).

I configured everything in the BIOS: PCIe bifurcation for that first x16 slot set to x4/x4/x4/x4, NVMe RAID set to Off… and I think that’s it. The drives show up correctly both in the BIOS and on the host.
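For anyone following along, a quick way to confirm on the host that each drive landed in its own IOMMU group before passing it through is something like the loop below (just a sketch; bus addresses will differ per system):
Code:
# Print the IOMMU group for every NVMe controller on the host.
# Each drive should sit in its own group for clean passthrough.
for dev in /sys/kernel/iommu_groups/*/devices/*; do
    group=$(basename "$(dirname "$(dirname "$dev")")")
    addr=$(basename "$dev")
    lspci -nns "$addr" | grep -qi "non-volatile" && echo "IOMMU group $group: $(lspci -nns "$addr")"
done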

Now this is the point where I started having issues. I passed through all 4 drives, and the first time I plugged everything in and booted TrueNAS, only 3 of the 4 drives were shown as available to be used.
Thinking this was a sporadic issue, I rebooted the entire host and, to my surprise, now only 1 of the drives showed up. :\

I started tinkering a bit with it (upgrading all packages, trying the 6.11 kernel) and everything looks like it should connect correctly, but sadly it does not. I am now at the point where none of the drives are recognized by the TrueNAS UI, nor by the OS itself. It seems the nvme driver is just not picking up the drives.

lspci -knn in TrueNAS gives the below:
Code:
03:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM9C1a [144d:a80d]
        Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller PM9C1a [144d:a801]
        Kernel modules: nvme
04:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM9C1a [144d:a80d]
        Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller PM9C1a [144d:a801]
        Kernel modules: nvme

lspci -knn on the host gives:
Code:
45:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM9C1a (DRAM-less) [144d:a80d]
        Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller PM9C1a (DRAM-less) [144d:a801]
        Kernel driver in use: vfio-pci
        Kernel modules: nvme
46:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM9C1a (DRAM-less) [144d:a80d]
        Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller PM9C1a (DRAM-less) [144d:a801]
        Kernel driver in use: vfio-pci
        Kernel modules: nvme

I only posted the output for 2 drives, but all 4 are there.
It's as if the kernel in the VM is not able to pick up those drives and bind the nvme driver to them.
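For anyone debugging the same thing, a quick way to see inside the VM whether the controllers are visible at all and whether the nvme driver actually attaches is roughly:
Code:
# Inside the VM: check that the controllers are visible and which driver (if any) bound to them
lspci -nnk | grep -A3 -i "non-volatile"
# Kernel messages usually show why initialisation failed (timeouts, resets, ...)
dmesg | grep -i nvme
# Block devices only appear once the nvme driver has attached successfully
ls -l /dev/nvme*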

Ah, to rule out TrueNAS as the culprit, I should mention I also tested with a clean Ubuntu 24 VM and the results are the same: the drives never make it to lsblk.

While I was playing with blacklisting, though, I removed the drives from vfio-pci and got the host to use them again; all 4 were correctly mapped on the host and I could write to them without issue.
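For reference, one way to hand a single drive back to the host for that kind of test, without a full blacklist/reboot cycle, is roughly the following (the address is just an example taken from the lspci output above):
Code:
# Detach one controller from vfio-pci and let the host's nvme driver probe it again
# (0000:45:00.0 is an example address; use the one from lspci on your host)
echo 0000:45:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind
echo nvme > /sys/bus/pci/devices/0000:45:00.0/driver_override
echo 0000:45:00.0 > /sys/bus/pci/drivers_probe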

Any help with this would be appreciated. I am really clueless about where to go next, as this seems to be a passthrough / misconfiguration issue.
 
Just to give back to the community, in the hope that someone who encounters the same issue finds this useful later: I managed to find and fix the root cause. Skimming journalctl on the Proxmox host, I saw 4 lines similar to this:
Code:
kernel: vfio-pci 0000:09:00.0: Unable to change power state from D3cold to D0, device inaccessible
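If someone wants to check for the same symptom, grepping the current boot's kernel messages for that error is enough, e.g.:
Code:
# Search the kernel log of the current boot for the power-state error
journalctl -k -b | grep -i "unable to change power state"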

Adding a modprobe option that stops vfio-pci from putting idle devices into the D3 power state finally fixed the issue for me, and all 4 drives are now instantly picked up by TN.
Code:
root@lab-02:~# cat /etc/modprobe.d/vfio.conf

options vfio-pci ids=2646:5017,144d:a80d disable_idle_d3=1

^ The last option, disable_idle_d3=1, is the important one.
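Since vfio-pci is usually loaded from the initramfs on Proxmox, the option likely needs an initramfs refresh (update-initramfs -u) plus a reboot before it applies; afterwards it can be checked with something like:
Code:
# Confirm the module parameter is active after the reboot
cat /sys/module/vfio_pci/parameters/disable_idle_d3   # should print Y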

I'm also not sure why this works; my guess is that the drives don't handle that low-power state properly. I would love it if someone could confirm that this actually was the root cause.
For future reference, drives are Samsung 990 EVO Plus, 2TB.

Have a nice weekend everyone!
 
Just wanted to say a big thank you! I had exactly the same issue.
I have a PCIe expansion card for 4 NVMe drives (from AliExpress) and recently bought the same drives — Samsung 990 EVO Plus 4TB. Everything looked perfect in Proxmox, but when passed through to TrueNAS Scale via PCIe Passthrough, the drives simply disappeared.

What helped me was the following set of steps:
1. First, I listed all my NVMe devices using:
Bash:
root@proxmox:~# lspci -nn | grep -i nvme
01:00.0 Non-Volatile memory controller [0108]: Kingston Technology Company, Inc. KC3000/FURY Renegade NVMe SSD [E18] [2646:5013] (rev 01)
02:00.0 Non-Volatile memory controller [0108]: Phison Electronics Corporation E16 PCIe4 NVMe Controller [1987:5016] (rev 01)
81:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM9C1a (DRAM-less) [144d:a80d]
82:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM9C1a (DRAM-less) [144d:a80d]
c1:00.0 Non-Volatile memory controller [0108]: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller] [8086:0a54]

2. Then I edited the VFIO config file:
Bash:
root@proxmox:~# nano /etc/modprobe.d/vfio.conf

And added this line (for all my drives, just to be safe):
Bash:
options vfio-pci ids=2646:5013,1987:5016,144d:a80d,8086:0a54 disable_idle_d3=1

3. After that, I ran
Bash:
root@proxmox:~# update-initramfs -u
and rebooted.
All drives are now immediately visible in TrueNAS Scale and working perfectly.
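In case it helps anyone verifying the result, the quickest check inside the guest is simply to see the drives appear in lsblk, e.g.:
Bash:
# Inside TrueNAS Scale (or any Linux guest): the NVMe drives should now be listed
lsblk -d -o NAME,MODEL,SIZE | grep -i nvme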
Thanks again — this saved me a ton of time!
 