PCIe-Passthrough no Longer Working on PVE 9.0.3 with kernel 6.14 - VM Hangs on Start

seiichiro0185

Today I updated my homelab cluster to PVE 9. Everything seems to work fine, with one major exception: I have a VM that runs my NAS and has PCIe passthrough configured for the SATA controller and the Intel iGPU (via https://github.com/strongtz/i915-sriov-dkms).

Because of i915-sriov-dkms I had my kernel pinned to 6.8.12-13-pve, and with that kernel everything still worked fine after the update. Once I realized the old kernel was still pinned, I unpinned it and rebooted into the "real" PVE 9 kernel, 6.14.8-2-pve.
Booted with 6.14, my NAS VM no longer starts as long as either of the PCIe devices (SATA controller or iGPU) is attached. The start command seems to work, but then the VM just hangs: the console doesn't work (timeout), any monitor command also runs into a timeout, and the VM never becomes reachable over the network. If I remove all PCIe passthrough devices from the VM, it boots again.

I don't see any obvious errors in the logs of the node, other than the timeouts like this when I try to do anything with the "starting" VM:
Code:
VM 200 qmp command 'human-monitor-command' failed - unable to connect to VM 200 qmp socket - timeout after 242 retries

I then re-pinned the kernel to 6.8.12, and after a reboot of the node the VM worked again, so that's a workaround for now. But I would still like to find out what the problem is on 6.14.

At first I suspected the i915-sriov drivers, but it also doesn't work with just the SATA-Controller passed through. Also according to dmesg / journal the i915-sriov driver loads just fine on 6.14 as well.

Attached are the journalctl logs for both kernels; maybe someone sees something in there.

Hardware is an Odroid H4 Ultra with 64 GB RAM and a dual NVMe adapter with Samsung 980 1TB SSDs.

Here is the pveversion -v output of the node:
Code:
proxmox-ve: 9.0.0 (running kernel: 6.8.12-13-pve)
pve-manager: 9.0.3 (running version: 9.0.3/025864202ebb6109)
proxmox-kernel-helper: 9.0.3
proxmox-kernel-6.14.8-2-pve-signed: 6.14.8-2
proxmox-kernel-6.14: 6.14.8-2
proxmox-kernel-6.8.12-13-pve-signed: 6.8.12-13
proxmox-kernel-6.8: 6.8.12-13
proxmox-kernel-6.8.12-9-pve-signed: 6.8.12-9
ceph-fuse: 19.2.3-pve1
corosync: 3.1.9-pve2
criu: 4.1.1-1
frr-pythontools: 10.3.1-1+pve4
ifupdown2: 3.3.0-1+pmx9
intel-microcode: 3.20250512.1
ksm-control-daemon: 1.5-1
libjs-extjs: 7.0.0-5
libproxmox-acme-perl: 1.7.0
libproxmox-backup-qemu0: 2.0.1
libproxmox-rs-perl: 0.4.1
libpve-access-control: 9.0.3
libpve-apiclient-perl: 3.4.0
libpve-cluster-api-perl: 9.0.6
libpve-cluster-perl: 9.0.6
libpve-common-perl: 9.0.9
libpve-guest-common-perl: 6.0.2
libpve-http-server-perl: 6.0.4
libpve-network-perl: 1.1.6
libpve-rs-perl: 0.10.7
libpve-storage-perl: 9.0.13
libspice-server1: 0.15.2-1+b1
lvm2: 2.03.31-2
lxc-pve: 6.0.4-2
lxcfs: 6.0.4-pve1
novnc-pve: 1.6.0-3
proxmox-backup-client: 4.0.11-1
proxmox-backup-file-restore: 4.0.11-1
proxmox-backup-restore-image: 1.0.0
proxmox-firewall: 1.1.1
proxmox-kernel-helper: 9.0.3
proxmox-mail-forward: 1.0.2
proxmox-mini-journalreader: 1.6
proxmox-offline-mirror-helper: 0.7.0
proxmox-widget-toolkit: 5.0.5
pve-cluster: 9.0.6
pve-container: 6.0.9
pve-docs: 9.0.8
pve-edk2-firmware: 4.2025.02-4
pve-esxi-import-tools: 1.0.1
pve-firewall: 6.0.3
pve-firmware: 3.16-3
pve-ha-manager: 5.0.4
pve-i18n: 3.5.2
pve-qemu-kvm: 10.0.2-4
pve-xtermjs: 5.5.0-2
qemu-server: 9.0.16
smartmontools: 7.4-pve1
spiceterm: 3.4.0
swtpm: 0.8.0+pve2
vncterm: 1.9.0
zfsutils-linux: 2.3.3-pve1

Also the config of the affected VM:
Code:
agent: 1
balloon: 16384
bios: ovmf
boot: order=scsi0;ide2;net0
cores: 6
cpu: host
efidisk0: vmdisks:vm-200-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: mapping=sata-controller,pcie=1
hostpci1: mapping=intel-igpu-1,pcie=1
ide2: none,media=cdrom
machine: q35
memory: 24576
meta: creation-qemu=9.2.0,ctime=1752392553
name: nas
net0: virtio=BC:24:11:XX:XX:XX,bridge=lan
net1: virtio=BC:24:11:XX:XX:XX,bridge=vmbr0,tag=14
numa: 0
onboot: 1
ostype: l26
protection: 1
scsi0: vmdisks:vm-200-disk-1,discard=on,iothread=1,size=32G,ssd=1
scsi1: vmdisks:vm-200-disk-2,discard=on,iothread=1,size=64G,ssd=1
scsihw: virtio-scsi-single
serial0: socket
smbios1: uuid=07df3dbf-d245-4c58-8a24-3a9d087acb51
sockets: 1
startup: order=1
usb0: mapping=tvcard
usb1: mapping=audiocard
vga: std
vmgenid: c9e5fb0f-369c-4ed1-9a0d-35cce3ed800a

And corresponding device mappings:
Code:
pvesh get /cluster/mapping/pci
┌─────────────┬─────────────────┬──────────────────────────────────────────────────────────────────────────────────────┬────────┐
│ description │ id              │ map                                                                                  │ checks │
╞═════════════╪═════════════════╪══════════════════════════════════════════════════════════════════════════════════════╪════════╡
│             │ intel-igpu-6    │ ["id=8086:46d0,iommugroup=23,node=pve01-c,path=0000:00:02.6,subsystem-id=8086:2212"] │        │
├─────────────┼─────────────────┼──────────────────────────────────────────────────────────────────────────────────────┼────────┤
│             │ intel-igpu-4    │ ["id=8086:46d0,iommugroup=21,node=pve01-c,path=0000:00:02.4,subsystem-id=8086:2212"] │        │
├─────────────┼─────────────────┼──────────────────────────────────────────────────────────────────────────────────────┼────────┤
│             │ intel-igpu-2    │ ["id=8086:46d0,iommugroup=19,node=pve01-c,path=0000:00:02.2,subsystem-id=8086:2212"] │        │
├─────────────┼─────────────────┼──────────────────────────────────────────────────────────────────────────────────────┼────────┤
│             │ intel-igpu-5    │ ["id=8086:46d0,iommugroup=22,node=pve01-c,path=0000:00:02.5,subsystem-id=8086:2212"] │        │
├─────────────┼─────────────────┼──────────────────────────────────────────────────────────────────────────────────────┼────────┤
│             │ intel-igpu-7    │ ["id=8086:46d0,iommugroup=24,node=pve01-c,path=0000:00:02.7,subsystem-id=8086:2212"] │        │
├─────────────┼─────────────────┼──────────────────────────────────────────────────────────────────────────────────────┼────────┤
│             │ intel-igpu-1    │ ["id=8086:46d0,iommugroup=18,node=pve01-c,path=0000:00:02.1,subsystem-id=8086:2212"] │        │
├─────────────┼─────────────────┼──────────────────────────────────────────────────────────────────────────────────────┼────────┤
│             │ sata-controller │ ["id=1b21:1064,iommugroup=15,node=pve01-c,path=0000:03:00.0,subsystem-id=1b21:2116"] │        │
├─────────────┼─────────────────┼──────────────────────────────────────────────────────────────────────────────────────┼────────┤
│             │ intel-igpu-3    │ ["id=8086:46d0,iommugroup=20,node=pve01-c,path=0000:00:02.3,subsystem-id=8086:2212"] │        │
└─────────────┴─────────────────┴──────────────────────────────────────────────────────────────────────────────────────┴────────┘

If anyone has an idea what the problem might be, I'd appreciate any hints. Also, if you need additional logs or info, or know of ways to get more debug output from the VM start, let me know.
 


Hello,

Just wanted to let you know you are not alone. In my case it was exactly the same: SATA controller passthrough was hanging the node, while iGPU passthrough was working just fine.

I just reverted back to 8.4 on all my upgraded nodes. Following for a potential solution.

EDIT: I've just read in another thread that someone has a problem with PVE 9 and the N100 iGPU, and someone said it is not supported for now. You might have two different issues here if that board has an N305. I'm on an i5 10400.


 
Thanks for your answer. It seems to be a problem similar to the one described in the second thread you mentioned. As I fortunately have access to a second Odroid H4, I created a test install to experiment without impacting my "production" NAS. I installed the latest PVE 8.4 with kernel 6.8 and recreated my VM with SATA controller and iGPU (with i915-sriov modules) passthrough. As expected, everything worked fine.

I then installed the opt-in 6.14 kernel on the PVE 8.4 install and the problem reappeared. The VM doesn't start, and any interaction with it after the first start attempt hangs or runs into a timeout. Stopping seems to work at first glance (although it resorts to SIGKILL), but some remains of the VM are still there, so it can't be started again until the PVE host has been fully rebooted.

Code:
root@pve02-a:~# qm monitor 100
Entering QEMU Monitor for VM 100 - type 'help' for help
qm> help
ERROR: VM 100 qmp command 'human-monitor-command' failed - unable to connect to VM 100 qmp socket - timeout after 301 retries
qm>
root@pve02-a:~# qm stop 100
VM quit/powerdown failed - terminating now with SIGTERM
VM still running - terminating now with SIGKILL
root@pve02-a:~# qm start 100
timeout waiting on systemd

After a reboot of the PVE host I removed the SATA controller (ASMedia ASM1064 in my case) from the passthrough but left the i915 VF passthrough in place -> the VM works again (I thought I had tested that initially, but it seems I hadn't).

So the current state seems to be like this:

PVE 8.4 + Kernel 6.8 -> Passthrough of the ASM1064 works
PVE 8.4 + Kernel 6.14 -> Passthrough of the ASM1064 doesn't work
PVE 9.0 + Kernel 6.8 -> Passthrough of the ASM1064 works
PVE 9.0 + Kernel 6.14 -> Passthrough of the ASM1064 doesn't work

iGPU passthrough with the i915-sriov modules works fine in all combinations
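
The four combinations can be read as a small two-factor table. A minimal Python sketch of that reasoning, with the data copied from the list above (the helper name `deciding_factors` is made up for illustration):

```python
# Results copied from the list above: (PVE release, kernel series) -> passthrough works?
RESULTS = {
    ("8.4", "6.8"): True,
    ("8.4", "6.14"): False,
    ("9.0", "6.8"): True,
    ("9.0", "6.14"): False,
}

FACTORS = ("pve", "kernel")


def deciding_factors(results):
    """Return the factors whose value alone predicts the outcome."""
    deciding = []
    for index, name in enumerate(FACTORS):
        outcomes_by_value = {}
        for combo, works in results.items():
            outcomes_by_value.setdefault(combo[index], set()).add(works)
        # Every value of the factor must map to exactly one outcome ...
        consistent = all(len(seen) == 1 for seen in outcomes_by_value.values())
        # ... and the outcome must actually differ between values.
        varies = len(set().union(*outcomes_by_value.values())) > 1
        if consistent and varies:
            deciding.append(name)
    return deciding


print(deciding_factors(RESULTS))  # prints ['kernel']
```

With only four data points this is just a consistency check, but it shows that the PVE release by itself does not predict the result while the kernel series does.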

I also checked whether anything changed in the IOMMU groups etc., but they stay the same between kernel 6.8 and 6.14. If more tests would help, feel free to ask, as I now have a non-production system to experiment with.
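
For anyone who wants to repeat the IOMMU group comparison, here is a rough Python sketch. It assumes the usual sysfs layout `/sys/kernel/iommu_groups/<group>/devices/<PCI address>`; the function names are only illustrative and not part of any Proxmox tooling.

```python
from pathlib import Path


def read_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each PCI address to its IOMMU group number, as exposed by sysfs."""
    groups = {}
    for device in Path(root).glob("*/devices/*"):
        # Layout: <root>/<group number>/devices/<PCI address>
        groups[device.name] = int(device.parent.parent.name)
    return groups


def changed_groups(before, after):
    """Return {address: (old_group, new_group)} for devices whose group differs."""
    return {
        addr: (before.get(addr), after.get(addr))
        for addr in sorted(set(before) | set(after))
        if before.get(addr) != after.get(addr)
    }
```

The idea is to call `read_iommu_groups()` once per kernel, keep both results (for example as JSON files), and pass them to `changed_groups()`. An empty dict means the grouping is identical on both kernels.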
 
I had the same issue. I was updating my PVE 8.4 to 9.0.3, and on my Lenovo M715Q the system froze after a few minutes. I had already prepared a USB stick to install 8.4 again when I tried pinning the system to kernel 6.8.12-13, which works fine without any freeze. I'm not sure what is causing the issue, but to me it seems to be kernel 6.14.
 
TL;DR: 6.8.12-13-pve Passthrough ASM1061 works

I can confirm the same issue. After upgrading from 8.4 to 9.0.3, my VM with an ASMedia ASM1061 SATA controller using PCI passthrough would hard crash the host on start, causing an immediate reboot. Oddly enough, pinning 6.8.12-13-pve did not fix it at first. I did a clean reinstall of PVE 9 and continued to have the issue.

However, coming across this post after some searching led me to try pinning again. Luckily I still have a node on 8.4, so I copied the proxmox-kernel deb files from that node's /var/cache/apt/archives to the 9.0 node and installed them via dpkg. After pinning this kernel and rebooting, PCI passthrough works again!

Not sure what went wrong the first time I pinned.

I didn't see an official Bugzilla report for this issue. I'm hopeful that someone who can afford to do more testing and interaction will open one. I'd do it myself, but I only have this issue in a production environment :'(

Kernel by date, latest first.

EDIT | Updated 2025/09/02
6.8.12-13-pve : works
6.14.8-2-pve : fails
6.14.11-1-pve : fails
 
I am also having this issue (ASM1166). Let me know if I can help with logs or testing.
 
I don't see any obvious errors in the logs of the node, other than the timeouts like this when I try to do anything with the "starting" VM:
The log just states that it cannot connect to the PBS server, is it offline?

Code:
pve01-c pvestatd[3595]: VM 200 qmp command failed - VM 200 qmp command 'query-proxmox-support' failed - unable to connect to VM 200 qmp socket - timeout after 50 retries
pve01-c pvestatd[3595]: PBS: error fetching datastores - 500 Can't connect to 10.45.14.112:8007 (No route to host)

Either way, did anything change in the output of lspci -nnkvvv between the two kernel versions?
 
The log just states that it cannot connect to the PBS server, is it offline?

Code:
pve01-c pvestatd[3595]: VM 200 qmp command failed - VM 200 qmp command 'query-proxmox-support' failed - unable to connect to VM 200 qmp socket - timeout after 50 retries
pve01-c pvestatd[3595]: PBS: error fetching datastores - 500 Can't connect to 10.45.14.112:8007 (No route to host)
That is expected behavior actually, since the PBS is running on the non-starting VM ;)

Either way, did anything change in the output of lspci -nnkvvv between the two kernel versions?
No significant changes as far as I can see, other than a few IRQ and address differences. The drivers in use etc. stayed the same. Both lspci outputs from my test machine are attached for reference.

Just FYI, there is another user reporting problems with an ASM1166 between these kernel versions in [0], but AFAICT the errors are not the same.

[0] https://forum.proxmox.com/threads/asm1166-issues-with-pve-9-kernel-6-14-11-1-pve.170905/
I tried the kernel parameter from this thread on my test machine for good measure, but the error/behavior stays the same. So yes, it seems to be a different error.
EDIT: I spoke too soon, apparently. The VM seems to start now with this kernel parameter added. I'll have to test a few more things to be sure, though.

EDIT2: No, apparently that was some kind of fluke, or I accidentally booted the wrong kernel after adding the parameter. Adding the libata.force=nolpm parameter does not change the behavior on kernel 6.14.
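
To rule out the "wrong kernel booted" case when repeating this kind of test, the running kernel and the active command line can be checked together. A small Python sketch, assuming the command line is read from `/proc/cmdline` and the release string comes from `platform.release()`; the function name and defaults are only illustrative:

```python
def boot_state(cmdline, release, wanted_param="libata.force=nolpm", wanted_series="6.14"):
    """Report whether the parameter is active and the expected kernel series is running."""
    return {
        # Kernel parameters are whitespace-separated tokens on the command line
        "param_active": wanted_param in cmdline.split(),
        # A release such as '6.14.8-2-pve' belongs to the '6.14' series
        "kernel_matches": release.startswith(wanted_series + "."),
    }
```

On the host this could be called as `boot_state(open("/proc/cmdline").read(), platform.release())` (after `import platform`) right before starting the VM. A test run only counts if both values are `True`.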
 

I also tried libata.force=nolpm, and it did not work for me; I am running an ASM1166. I also saw no differences in lspci. I'm only using this machine as a backup, so it only has 1 VM and 1 LXC on it... it doesn't get more bare-bones.
 
Hi,
Just to say, as already reported, that here too, with the Intel i7 6700T HD 530 iGPU, passthrough no longer works. Windows 11 sees the iGPU but still gives error 43, as other users with other CPUs also report.
Hopefully there's a fix...
 
I'm also not getting exactly the same logs as seiichiro0185. This is my journal output:
Code:
Sep 03 13:48:15 pve-backup pvedaemon[1549]: VM 700 started with PID 1596.
Sep 03 13:48:15 pve-backup kernel: vfio-pci 0000:05:00.0: reset done
Sep 03 13:48:15 pve-backup kernel: vfio-pci 0000:05:00.0: resetting
Sep 03 13:48:15 pve-backup kernel: vfio-pci 0000:05:00.0: reset done
Sep 03 13:48:15 pve-backup kernel: vfio-pci 0000:05:00.0: resetting
Sep 03 13:48:12 pve-backup kernel: kauditd_printk_skb: 115 callbacks suppressed
Sep 03 13:48:12 pve-backup systemd[1]: Started 700.scope.
Sep 03 13:48:12 pve-backup systemd[1]: Created slice qemu.slice - Slice /qemu.
Sep 03 13:48:12 pve-backup kernel: vfio-pci 0000:05:00.0: reset done
Sep 03 13:48:12 pve-backup kernel: vfio-pci 0000:05:00.0: resetting
Sep 03 13:48:11 pve-backup pvedaemon[1549]: start VM 700: UPID:pve-backup:0000060D:000011EA:68B87F5B:qmstart:700:root@pam:

That being said, the result is the same... nothing in dmesg. It just hangs and goes into limbo... I verified the firmware that was pointed out in another post, and mine (firmware 211108-0000-00) matches what should work. I've also verified that if I remove the card from my VM it starts as it should, so it's absolutely tied to the card. I can also confirm that it works on 6.8 but not on 6.14. As I stated earlier, I tried libata.force=nolpm, but I didn't understand why it would be needed, as the firmware update was supposed to correct the LPM functions and the card shows Power Management as enabled:
Code:
05:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1166 Serial ATA Controller [1b21:1166] (rev 02) (prog-if 01 [AHCI 1.0])
        Subsystem: ZyDAS Technology Corp. Device [2116:2116]
        Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 82
        IOMMU group: 16
        Region 0: Memory at fcf82000 (32-bit, non-prefetchable) [size=8K]
        Region 5: Memory at fcf80000 (32-bit, non-prefetchable) [size=8K]
        Expansion ROM at fcf00000 [size=512K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [80] Express (v2) Endpoint, IntMsgNum 0
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75W TEE-IO-
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x2, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x2
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                         AtomicOpsCtl: ReqEn-
                         IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
                         10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis+
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked+ DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
                        ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr- HeaderOF-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr+ HeaderOF+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [130 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Kernel driver in use: vfio-pci
        Kernel modules: ahci
BTW, that output is from 6.14.11-1.
This one is from 6.8.12-13:
Code:
05:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1166 Serial ATA Controller [1b21:1166] (rev 02) (prog-if 01 [AHCI 1.0])
        Subsystem: ZyDAS Technology Corp. Device [2116:2116]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 255
        IOMMU group: 15
        Region 0: Memory at fcf82000 (32-bit, non-prefetchable) [size=8K]
        Region 5: Memory at fcf80000 (32-bit, non-prefetchable) [size=8K]
        Expansion ROM at fcf00000 [disabled] [size=512K]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D3 NoSoftRst- PME-Enable+ DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [80] Express (v2) Endpoint, IntMsgNum 0
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75W TEE-IO-
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x2, ASPM L0s L1, Exit Latency L0s <4us, L1 <64us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x2
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis- NROPrPrP- LTR-
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                         AtomicOpsCtl: ReqEn-
                         IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
                         10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis+
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked+ DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
                        ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr- HeaderOF-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr+ HeaderOF+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [130 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Kernel driver in use: vfio-pci
        Kernel modules: ahci

So in my case both the IOMMU group and the IRQ change, but the device ID stays the same. Again, I have tried removing and re-adding the device; that has no effect.
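
Comparing two long lspci dumps by eye is error-prone, so a mechanical diff may help. Below is a minimal Python sketch using the standard `difflib` module; the two sample strings are short excerpts of the outputs above, and the function name `lspci_diff` is just an illustration.

```python
import difflib

# Short excerpts from the two dumps above (6.8.12-13 and 6.14.11-1)
OLD_6_8 = """
        Interrupt: pin A routed to IRQ 255
        IOMMU group: 15
        Region 0: Memory at fcf82000 (32-bit, non-prefetchable) [size=8K]
                Status: D3 NoSoftRst- PME-Enable+ DSel=0 DScale=0 PME-
"""
NEW_6_14 = """
        Interrupt: pin A routed to IRQ 82
        IOMMU group: 16
        Region 0: Memory at fcf82000 (32-bit, non-prefetchable) [size=8K]
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
"""


def lspci_diff(old, new):
    """Return only the lines that differ between two lspci dumps."""
    old_lines = [line.strip() for line in old.strip().splitlines()]
    new_lines = [line.strip() for line in new.strip().splitlines()]
    diff = difflib.unified_diff(old_lines, new_lines, "6.8", "6.14", n=0, lineterm="")
    # Keep removed/added lines, drop the '---', '+++' and '@@' bookkeeping lines
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("---", "+++"))
    ]


for line in lspci_diff(OLD_6_8, NEW_6_14):
    print(line)
```

Lines prefixed with `-` come from the 6.8 dump and lines prefixed with `+` from the 6.14 dump. Note that some differences (such as the power state) may simply reflect whether the VM was running when the dump was taken, so they are not necessarily related to the hang.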

Let me add that this is my syslog under 6.8:
Code:
Sep 03 14:23:47 pve-backup pvedaemon[2925]: start VM 700: UPID:pve-backup:00000B6D:0000CA34:68B887B3:qmstart:700:root@pam:
Sep 03 14:23:47 pve-backup pvedaemon[1326]: <root@pam> starting task UPID:pve-backup:00000B6D:0000CA34:68B887B3:qmstart:700:root@pam:
Sep 03 14:23:47 pve-backup systemd[1]: Created slice qemu.slice - Slice /qemu.
Sep 03 14:23:47 pve-backup systemd[1]: Started 700.scope.
Sep 03 14:23:51 pve-backup pvedaemon[2925]: VM 700 started with PID 2973.

So on 6.14 I'm seeing a bunch of controller resets that seem to leave the controller in limbo, which I don't see on 6.8.
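
To put a number on those resets, the kernel messages can be counted per device. A rough Python sketch, assuming the journal text has been saved to a string (for example from `journalctl -k -b`) and that the messages look like the vfio-pci lines above; the names are illustrative only:

```python
import re
from collections import Counter

# Matches lines such as: "... kernel: vfio-pci 0000:05:00.0: reset done"
RESET_LINE = re.compile(
    r"kernel: vfio-pci (?P<addr>[0-9a-f:.]+): (?P<event>resetting|reset done)$"
)


def count_resets(journal_text):
    """Count 'resetting' and 'reset done' messages per PCI address."""
    counts = Counter()
    for line in journal_text.splitlines():
        match = RESET_LINE.search(line.strip())
        if match:
            counts[(match["addr"], match["event"])] += 1
    return counts
```

Running this on the journal of each kernel gives a quick side-by-side count. A `resetting` count that is higher than the `reset done` count would suggest a reset that never completed.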
 
I had a very similar issue when I upgraded to v8.2. In that case the whole Proxmox host would not boot, and it turned out to be an auto-starting VM with GPU passthrough.

I 'fixed' it by pinning the kernel version (I forget which version at the time), but some weeks later I came back to the problem to see if I could find a solution.
Someone on the forums suggested I remove the GPU passthrough to see if that was the issue (which it was).
I then deleted the entry via the GUI and re-added it, and the issue went away...
I was then able to unpin the kernel, and it's been fine since.
 
I didn't try the libata.force=nolpm kernel option, as I didn't have a big enough window for downtime this morning. However, I was able to capture the journals for both the working and non-working kernels, as well as the lspci output of both. I also snagged the console crash screen (`journalctl -b -1` doesn't show much).

EDIT2: I was able to test with libata.force=nolpm this morning (2025/09/05) and I get identical results. This option doesn't have any effect.

lspci -nnkvvv [6.14.11-1-pve]
Code:
04:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1061/ASM1062 Serial ATA Controller [1b21:0612] (rev 02) (prog-if 01 [AHCI 1.0])
    Subsystem: ASMedia Technology Inc. Device [1b21:1060]
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 33
    IOMMU group: 15
    Region 0: I/O ports at ecd0 [size=8]
    Region 1: I/O ports at ecc8 [size=4]
    Region 2: I/O ports at ecd8 [size=8]
    Region 3: I/O ports at eccc [size=4]
    Region 4: I/O ports at ece0 [size=32]
    Region 5: Memory at df1ff000 (32-bit, non-prefetchable) [size=512]
    Expansion ROM at df100000 [disabled] [size=64K]
    Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit-
        Address: fee11000  Data: 0021
    Capabilities: [78] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [80] Express (v2) Legacy Endpoint, IntMsgNum 0
        DevCap:    MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 <8us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- TEE-IO-
        DevCtl:    CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta:    CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap:    Port #1, Speed 5GT/s, Width x1, ASPM not supported
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
        LnkCtl:    ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta:    Speed 5GT/s, Width x1
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR-
             10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS-
             AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-
             AtomicOpsCtl: ReqEn-
             IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
             10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
             EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [100 v1] Virtual Channel
        Caps:    LPEVC=0 RefClk=100ns PATEntryBits=1
        Arb:    Fixed- WRR32- WRR64- WRR128-
        Ctrl:    ArbSelect=Fixed
        Status:    InProgress-
        VC0:    Caps:    PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
            Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
            Ctrl:    Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
            Status:    NegoPending- InProgress-
    Kernel driver in use: ahci
    Kernel modules: ahci

lspci -nnkvvv [6.8.12-13-pve]
Code:
04:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1061/ASM1062 Serial ATA Controller [1b21:0612] (rev 02) (prog-if 01 [AHCI 1.0])
    Subsystem: ASMedia Technology Inc. Device [1b21:1060]
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 33
    IOMMU group: 15
    Region 0: I/O ports at ecd0 [size=8]
    Region 1: I/O ports at ecc8 [size=4]
    Region 2: I/O ports at ecd8 [size=8]
    Region 3: I/O ports at eccc [size=4]
    Region 4: I/O ports at ece0 [size=32]
    Region 5: Memory at df1ff000 (32-bit, non-prefetchable) [size=512]
    Expansion ROM at df100000 [disabled] [size=64K]
    Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit-
        Address: fee11000  Data: 0021
    Capabilities: [78] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [80] Express (v2) Legacy Endpoint, IntMsgNum 0
        DevCap:    MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 <8us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- TEE-IO-
        DevCtl:    CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta:    CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
        LnkCap:    Port #1, Speed 5GT/s, Width x1, ASPM not supported
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
        LnkCtl:    ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta:    Speed 5GT/s, Width x1
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range ABC, TimeoutDis+ NROPrPrP- LTR-
             10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
             EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
             FRS-
             AtomicOpsCap: 32bit- 64bit- 128bitCAS-
        DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-
             AtomicOpsCtl: ReqEn-
             IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
             10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
             EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
             Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [100 v1] Virtual Channel
        Caps:    LPEVC=0 RefClk=100ns PATEntryBits=1
        Arb:    Fixed- WRR32- WRR64- WRR128-
        Ctrl:    ArbSelect=Fixed
        Status:    InProgress-
        VC0:    Caps:    PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
            Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
            Ctrl:    Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
            Status:    NegoPending- InProgress-
    Kernel driver in use: vfio-pci
    Kernel modules: ahci

diff -Nur lspci-6.8.12-13-pve.txt lspci-6.14.11-1-pve.txt (mostly bridge device changes; the last hunk, at line 1068, shows the driver change for the ASMedia controller)
EDIT 1: Actually, this makes sense: on 6.14 the device has not (yet) been successfully bound to vfio-pci for my VM, whereas on 6.8 it was bound successfully and the VM is running normally.
Code:
diff -Nur lspci-6.8.12-13-pve.txt lspci-6.14.11-1-pve.txt
--- lspci-6.8.12-13-pve.txt    2025-09-04 06:46:41.063290463 -0400
+++ lspci-6.14.11-1-pve.txt    2025-09-04 06:27:07.702843167 -0400
@@ -76,7 +76,7 @@
     I/O behind bridge: f000-0fff [disabled] [16-bit]
     Memory behind bridge: d6000000-d9ffffff [size=64M] [32-bit]
     Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff [disabled] [64-bit]
-    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
+    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
     BridgeCtl: Parity+ SERR+ NoISA+ VGA- VGA16- MAbort- >Reset- FastB2B-
         PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
     Capabilities: [40] Subsystem: Dell Device [1028:0236]
@@ -93,7 +93,7 @@
         LnkCap:    Port #1, Speed 5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
             ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp-
         LnkCtl:    ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
-            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
+            ExtSynch- ClockPM- AutWidDis- BWInt+ AutBWInt+
         LnkSta:    Speed 2.5GT/s, Width x4
             TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
         RootCap: CRSVisible+
@@ -153,7 +153,7 @@
     I/O behind bridge: f000-0fff [disabled] [16-bit]
     Memory behind bridge: da000000-ddffffff [size=64M] [32-bit]
     Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff [disabled] [64-bit]
-    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
+    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
     BridgeCtl: Parity+ SERR+ NoISA+ VGA- VGA16- MAbort- >Reset- FastB2B-
         PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
     Capabilities: [40] Subsystem: Dell Device [1028:0236]
@@ -170,7 +170,7 @@
         LnkCap:    Port #3, Speed 5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
             ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp-
         LnkCtl:    ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
-            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
+            ExtSynch- ClockPM- AutWidDis- BWInt+ AutBWInt+
         LnkSta:    Speed 2.5GT/s, Width x4
             TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
         RootCap: CRSVisible+
@@ -247,9 +247,9 @@
         LnkCap:    Port #7, Speed 5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
             ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp-
         LnkCtl:    ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk+
-            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
+            ExtSynch- ClockPM- AutWidDis- BWInt+ AutBWInt+
         LnkSta:    Speed 5GT/s, Width x1
-            TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt+
+            TrErr- Train- SlotClk+ DLActive+ BWMgmt- ABWMgmt-
         SltCap:    AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
             Slot #1, PowerLimit 25W; Interlock- NoCompl-
         SltCtl:    Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
@@ -330,7 +330,7 @@
         LnkCap:    Port #9, Speed 5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
             ClockPM- Surprise+ LLActRep+ BwNot+ ASPMOptComp-
         LnkCtl:    ASPM Disabled; RCB 64 bytes, LnkDisable+ CommClk-
-            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
+            ExtSynch- ClockPM- AutWidDis- BWInt+ AutBWInt+
         LnkSta:    Speed 2.5GT/s, Width x0
             TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
         SltCap:    AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
@@ -1068,7 +1068,7 @@
             Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
             Ctrl:    Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
             Status:    NegoPending- InProgress-
-    Kernel driver in use: vfio-pci
+    Kernel driver in use: ahci
     Kernel modules: ahci
 
 06:03.0 VGA compatible controller [0300]: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 [102b:0532] (rev 0a) (prog-if 00 [VGA controller])
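
To isolate the driver-binding difference (the last hunk above) from the bridge noise, the two dumps could be reduced to address/driver pairs before diffing. This is only a minimal sketch: the function name `lspci_drivers` is hypothetical, and it assumes the dumps are `lspci -k` style text in which each device starts on an unindented line.

```shell
# Print "<pci-address> <driver>" for every device in an lspci -k style dump.
# Hypothetical helper; reads the dump from stdin.
lspci_drivers() {
    awk '
        # An unindented line starting with a hex digit begins a new device block.
        /^[0-9a-f]/ { dev = $1 }
        # The last field of this line is the driver currently bound to the device.
        /Kernel driver in use:/ { print dev, $NF }
    '
}
```

For example, `lspci_drivers < lspci-6.8.12-13-pve.txt` would print one line per device that has a driver bound, such as `04:00.0 vfio-pci` for the ASMedia controller. Saving that output for both kernels and diffing the two results should leave only the binding changes.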

Console output of the actual hang/crash: crash.jpg (attached image)
 

Thanks for your comment. I just tried this on my test system, and I can confirm that the VM at least boots again if I disable the rombar option, and the controller is visible in the VM. I'll have to do a few more in-depth tests to see whether it runs stably under real workloads, but at first glance it looks promising.
 
I have now done a longer run with some I/O-intensive workloads on my test setup, which didn't show any problems. So yesterday I also set the rombar=0 option for the SATA controller passthrough on my production machine and re-enabled the 6.14 kernel there, and it has been working flawlessly since then.

So this seems to be the solution in my case. I don't know if there are any negative effects from having rombar=0 set, but so far all is working as expected.
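
For anyone looking for the setting: in the VM's configuration file the option is appended to the passthrough entry. The line below is only an illustration. The VM ID in the path, the `hostpci0` index and the PCI address are placeholders, not values taken from my machine.

```
# /etc/pve/qemu-server/<vmid>.conf  (placeholder VM ID)
# Placeholder PCI address; rombar=0 hides the device's option ROM from the guest.
hostpci0: 0000:04:00.0,rombar=0
```

The same value can also be applied with `qm set <vmid> -hostpci0 <address>,rombar=0`. According to the Proxmox documentation, `rombar` controls whether the device's ROM is visible in the guest's memory map, so it would presumably only matter for devices whose option ROM the guest actually needs.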