[SOLVED] Passthrough of onboard SATA controller locks up system

Aluveitie

Member
Sep 21, 2022
I am trying to pass through the two onboard SATA controllers to VMs. Proxmox itself runs on an NVMe SSD and does not use those controllers.
Each controller is in its own IOMMU group:
Code:
IOMMU Group 28 83:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
IOMMU Group 29 84:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)

What I did so far
Code:
root@server:~# more /etc/modules
amd_iommu_v2

vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd
Code:
root@server:~# more /etc/modprobe.d/vfio.conf
options vfio-pci ids=1022:7901
softdep ahci pre: vfio-pci
Code:
root@server:~# more /etc/modprobe.d/blacklist.conf
blacklist ahci
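
For the early-binding configuration in /etc/modprobe.d to take effect at the next boot, the initramfs typically needs to be refreshed afterwards (a step not shown here; on Proxmox/Debian this is usually):
Code:
root@server:~# update-initramfs -u -k all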

Which results in
Code:
root@server:~# lspci -k -s 83:00
83:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
        Subsystem: Gigabyte Technology Co., Ltd FCH SATA Controller [AHCI mode]
        Kernel driver in use: vfio-pci
        Kernel modules: ahci
root@server:~# lspci -k -s 84:00
84:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
        Subsystem: Gigabyte Technology Co., Ltd FCH SATA Controller [AHCI mode]
        Kernel driver in use: vfio-pci
        Kernel modules: ahci

I have one SATA disk currently attached, and Proxmox no longer lists it under Disks.
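For reference, the same can be seen from the shell (a hypothetical check, not in the original post; once vfio-pci claims the controllers, ahci never attaches and no block device is created):
Code:
root@server:~# lsblk -d -o NAME,TRAN,MODEL
# Any SATA disk behind 83:00.0 or 84:00.0 should no longer be listed here.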

The VM:
Code:
root@server:~# more /etc/pve/qemu-server/100.conf
bios: ovmf
boot: order=scsi0
cores: 8
cpu: EPYC-Rome,flags=+aes
efidisk0: local-lvm:vm-100-disk-0,efitype=4m,size=4M
hostpci0: 0000:83:00.0,rombar=0
hostpci1: 0000:84:00.0,rombar=0
machine: q35
memory: 12288
meta: creation-qemu=6.2.0,ctime=1662755808
name: truenas
net0: virtio=5A:A2:90:3C:C7:0B,bridge=vmbr0
net1: virtio=7A:B4:AD:7C:C8:F7,bridge=vmbr1,firewall=1
numa: 0
ostype: l26
scsi0: local-lvm:vm-100-disk-1,size=32G
scsi1: local-lvm:vm-100-disk-2,size=4G
scsi2: local-lvm:vm-100-disk-3,size=12G
scsi21: local-lvm:vm-100-disk-4,size=6G
scsi22: local-lvm:vm-100-disk-5,size=6G
scsi23: local-lvm:vm-100-disk-6,size=6G
scsi24: local-lvm:vm-100-disk-9,size=6G
scsi25: local-lvm:vm-100-disk-7,size=4G
scsihw: virtio-scsi-pci
smbios1: uuid=4585d1cc-0603-496b-9e7e-803418c40743
sockets: 1

Starting the VM hangs, and I can find these logs:
Code:
[  651.761767] vfio-pci 0000:83:00.0: not ready 1023ms after FLR; waiting
[  653.809887] vfio-pci 0000:83:00.0: not ready 2047ms after FLR; waiting
[  657.105863] vfio-pci 0000:83:00.0: not ready 4095ms after FLR; waiting

Soon after, Proxmox itself becomes unresponsive and I have to hard-reset the server...
Code:
Message from syslogd@server at Sep 22 14:52:18 ...
 kernel:[  752.435321] watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [task UPID:serve:3913]
 
Check whether the controller is alone in its IOMMU group. Devices in the same group cannot be shared between VMs, or between a VM and the Proxmox host. Maybe the Proxmox host loses its drive controller and can no longer write its logs, and maybe also its network device.
This is a common problem for Ryzen motherboards (except X570). What is the make and model of your motherboard?

EDIT: What is the output of for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nns "${d##*/}"; done?

EDIT2: I'm stupid and did not read the first couple of lines of the first post, sorry!
 
@leesteken They each have their own group.
Code:
root@server:~# for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nns "${d##*/}"; done | grep -e "28\|29"
IOMMU group 28 83:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
IOMMU group 29 84:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)

For some context: it is an EPYC 7232P on a Gigabyte MZ31-AR0 rev 2.0.
 
Starting the VM hangs, and I can find these logs:
Code:
[  651.761767] vfio-pci 0000:83:00.0: not ready 1023ms after FLR; waiting
[  653.809887] vfio-pci 0000:83:00.0: not ready 2047ms after FLR; waiting
[  657.105863] vfio-pci 0000:83:00.0: not ready 4095ms after FLR; waiting
Looks like the SATA controller is not resetting properly.

The work-around to use it once for a VM (stopping the VM and starting it again will always run into this) is to early-bind the numeric [vendor:device] ID (as shown by lspci -nns 83:00.0) to vfio-pci. Unfortunately, there are probably other SATA controllers with the same ID that you do not want to take away from the Proxmox host.

Since kernel 5.15, it is possible to choose the reset mechanism for each PCIe device. What is the output of cat '/sys/bus/pci/devices/0000:83:00.0/reset_method'? Maybe it has more than one reset method. You can try writing one of the other possible values to reset_method and see if that helps (it does not persist after a reboot).

Or figure out how to reset that particular piece of hardware and write a quirk for the driver in the Linux kernel. Or buy a different SATA controller PCIe card that is known to work with passthrough (as reported on this forum or elsewhere).

EDIT: You do not appear to be alone with this issue.
 
@leesteken I have those two SATA controllers onboard, both of which I plan to pass through (to different VMs if possible). I don't need any SATA controller for Proxmox.

Passing through the controller would make things easier, but I would just pass through single disks if necessary.

Here is the output:
Code:
root@server:~# cat '/sys/bus/pci/devices/0000:83:00.0/reset_method'
flr bus
 
@leesteken I have those two SATA controllers onboard, both of which I plan to pass through (to different VMs if possible). I don't need any SATA controller for Proxmox.
Still, it would not allow for a restart of the VM.
Here is the output:
Code:
root@server:~# cat '/sys/bus/pci/devices/0000:83:00.0/reset_method'
flr bus
Maybe doing an echo bus >'/sys/bus/pci/devices/0000:83:00.0/reset_method' before starting the VM would fix this for you?
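Spelled out as a minimal sketch (assuming both controllers need the same treatment, and using the VM ID 100 from the config above):
Code:
# Select the "bus" reset method for both onboard SATA controllers,
# then start the VM they are passed through to.
echo bus > /sys/bus/pci/devices/0000:83:00.0/reset_method
echo bus > /sys/bus/pci/devices/0000:84:00.0/reset_method
qm start 100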
 
Still, it would not allow for a restart of the VM.
Since it is for my personal use I could live with that; I'd just like Proxmox to boot and then start the VM automatically.
Maybe doing an echo bus >'/sys/bus/pci/devices/0000:83:00.0/reset_method' before starting the VM would fix this for you?
Genius, that did work!!
What would be the best way to apply that across reboots of Proxmox?
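
One possible way to apply it across reboots (a sketch, not from the thread: it uses the Proxmox hookscript mechanism, with a hypothetical script name and the VM ID 100 from the config above):
Code:
#!/bin/bash
# /var/lib/vz/snippets/sata-reset-method.sh (hypothetical name; make it executable)
# Proxmox invokes hookscripts as: <script> <vmid> <phase>
vmid="$1"
phase="$2"
if [ "$phase" = "pre-start" ]; then
    # Force the "bus" reset method on both onboard SATA controllers
    # before the VM starts, mirroring the manual echo above.
    echo bus > /sys/bus/pci/devices/0000:83:00.0/reset_method
    echo bus > /sys/bus/pci/devices/0000:84:00.0/reset_method
fi
exit 0
Attach it with qm set 100 --hookscript local:snippets/sata-reset-method.sh (the storage must have the "snippets" content type enabled). Combined with onboot: 1 in the VM config, Proxmox would then boot and start the VM automatically.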
 
