System crash (mpt2sas) when starting a VM incl. hardware passthrough.

easyronny

New Member
Jul 30, 2021
Netherlands
Dear Proxmox forum members,

I hope that someone on this forum can help me, because I cannot find what I configured wrong or what exactly is wrong in this case.
I am fairly new to Proxmox and this is my third post on this forum (the first two were deleted, for reasons unknown to me), but I have been working on this issue for several weeks now. I have read a lot of forum posts and I keep running into the same error message.

I get this error on Proxmox release 7.01 and also on release 6.4; both give the same error (so I think it must be a mistake on my side).
After installing Proxmox I configured the first virtual machine (ID 100), which includes a hardware PCI device, with Windows or Linux as the OS.
Whether I use OVMF (UEFI) or SeaBIOS makes no difference; it gives me the same result in all situations.

The following text is shown on the console screen at the moment I start a virtual machine that includes hardware passthrough.
In this case it is a SAS-to-SATA controller; I also tried it with a Radeon RX480 and later an NVIDIA GT 710.

After I press Start in the web console to start a virtual machine, the entire console stops responding and I can only hard-reset the physical machine.

The error on the console screen (see also the attachment):

mpt2sas_cm0 sending message unit reset !!
mpt2sas_cm0 sending message reset : SUCCESS
xhci_hcd 0000:01:00.0: Remove state 4
usb usb2: USB disconnect, device number 1
usb 1-6 USB Disconnect device number 2
usb 1-6.4 USB Disconnect device number 4
usb 1-6.4.3 USB Disconnect device number 6
usb 1-7 USB Disconnect device number 3
usb 1-10 USB Disconnect device number 5
usb 1-10.4 USB Disconnect device number 7
xhci_hcd 0000:01:00.0 USB bus 1 deregistered
R8169 0000:0a:00.0 eno1: Link is Down
vmbr0: port 1(eno1) entered disabled state
device eno1 left promiscuous mode
vmbr0: port 1(eno1) entered disabled state
ata3.00: disabled
sd 3:0:0:0: [sda] Synchronizing SCSI cache
sd 3:0:0:0: [sda] Synchronizing cache(10) failed: Result : hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 3:0:0:0: [sda] Stopping disk
sd 3:0:0:0: [sda] Start/Stop Unit failed: Result : hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
ata4.00: disabled
sd 4:0:0:0: [sda] Synchronizing SCSI cache
sd 4:0:0:0: [sda] Synchronizing cache(10) failed: Result : hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 4:0:0:0: [sda] Stopping disk
sd 4:0:0:0: [sda] Start/Stop Unit failed: Result : hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
ata5.00: disabled
sd 5:0:0:0: [sda] Synchronizing SCSI cache
sd 5:0:0:0: [sda] Synchronizing cache(10) failed: Result : hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
sd 5:0:0:0: [sda] Stopping disk
sd 5:0:0:0: [sda] Start/Stop Unit failed: Result : hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
ata6.00: disabled
zio pool=rpool vdev=/dev/disk/by-id/ata-CT1000BX500SSD1_2105E4F27C14-part3 error=5 type=1 offset=8607170560 size=4096 flags=180880
zio pool=rpool vdev=/dev/disk/by-id/ata-CT1000BX500SSD1_2105E4F27C14-part3 error=5 type=1 offset=270336 size=8192 flags=bc08c1
zio pool=rpool vdev=/dev/disk/by-id/ata-CT1000BX500SSD1_2105E4F27C14-part3 error=5 type=1 offset=999666229248 size=8192 flags=b08c1
zio pool=rpool vdev=/dev/disk/by-id/ata-CT1000BX500SSD1_2105E4F27C14-part3 error=5 type=1 offset=999666491392 size=8192 flags=b08c1
WARNING: Pool 'rpool' has encountered an uncorrectable I/O failure and has been suspended.
zio pool=rpool vdev=/dev/disk/by-id/ata-CT1000BX500SSD1_2105E4F27EF1-part3 error=5 type=2 offset=1811103239168 size=4096 flags=184880
zio pool=rpool vdev=/dev/disk/by-id/ata-CT1000BX500SSD1_2105E4F27EF1-part3 error=5 type=2 offset=163228299264 size=4096 flags=184880
zio pool=rpool vdev=/dev/disk/by-id/ata-CT1000BX500SSD1_2105E4F27EF1-part3 error=5 type=2 offset=188982915072 size=4096 flags=184880
zio pool=rpool vdev=/dev/disk/by-id/ata-CT1000BX500SSD1_2105E4F27EF1-part3 error=5 type=2 offset=197587263488 size=8192 flags=40080c80
zio pool=rpool vdev=/dev/disk/by-id/ata-CT1000BX500SSD1_2105E4F27EF1-part3 error=5 type=2 offset=206163980288 size=8192 flags=40080c80

----- more of the same zio pool error messages as shown above; only the offset=, size= and flags= values change ----

The last two lines are:
WARNING: Pool 'rpool' has encountered an uncorrectable I/O failure and has been suspended.
WARNING: Pool 'rpool' has encountered an uncorrectable I/O failure and has been suspended.


My Proxmox hardware config:
AMD Ryzen 3700X
Gigabyte B550 Aorus Pro V2 (1st PCIe slot is 16x-16x 2nd PCI 16x-8x 3rd PCI 16x-8x)
64GB DDR4 Crucial Memory (4x16GB)
2x 1TB Crucial BX500 SSD
Realtek 1GB Quad NIC (last PCI1x)
Asus Strix RX480
Dell Perc H200 (LSI SAS9211-8I) (Flashed in IT Mode) for a future virtual Xpenology config.

My Proxmox software config:
ZFS RAID 0, with LZ4 compression (2x Crucial BX500 1TB SSD)
A Linux network bond of the 4 Realtek NICs in an 802.3ad config.
2 CIFS connections, one to my NAS and the other to a domain controller.

I tested the hardware configurations below; each one either works or gives me the error message above:

Config 1 (first preference):
1st PCI 16x-16x Asus Strix RX480
2nd PCI 16x-8x Dell Perc H200
3rd PCI 16x-8x (empty)
SATA0 Crucial BX500 SSD
SATA1 Crucial BX500 SSD
Passthrough Only PCI 16x-8x Dell Perc H200 (03:00.0)
Result: error message above

Config 2:
1st PCI 16x-16x Dell Perc H200
2nd PCI 16x-8x Asus Strix RX480
3rd PCI 16x-8x (empty)
SATA0 Crucial BX500 SSD
SATA1 Crucial BX500 SSD
Passthrough Only PCI 16x-8x Dell Perc H200 (0b:00.0)
Result: working (but without VGA passthrough)

Config 3:
1st PCI 16x-16x Dell Perc H200
2nd PCI 16x-8x Asus Strix RX480
3rd PCI 16x-8x (empty)
SATA0 Crucial BX500 SSD
SATA1 Crucial BX500 SSD
Passthrough 1st PCI 16x-8x Dell Perc H200 (0b:00.0)
Passthrough 2nd PCI 16x-8x Asus Strix RX480 (03:00.0)
Result: error message above

Config 4:
1st PCI 16x-16x Dell Perc H200
2nd PCI 16x-8x Asus Strix RX480
3rd PCI 16x-8x (empty)
SATA2 Crucial BX500 SSD
SATA3 Crucial BX500 SSD
Passthrough 1st PCI 16x-8x Dell Perc H200 (0b:00.0)
Passthrough 2nd PCI 16x-8x Asus Strix RX480 (03:00.0)
Result: error message above

Config 5:
1st PCI 16x-16x Dell Perc H200
2nd PCI 16x-8x (empty)
3rd PCI 16x-8x Asus Strix RX480
SATA0 Crucial BX500 SSD
SATA1 Crucial BX500 SSD
Passthrough 1st PCI 16x-8x Dell Perc H200 (0b:00.0)
Result: working concept (but without VGA passthrough)

Config 6 (second preference):
1st PCI 16x-16x Asus Strix RX480
2nd PCI 16x-8x (empty)
3rd PCI 16x-8x Dell Perc H200
SATA0 Crucial BX500 SSD
SATA1 Crucial BX500 SSD
Result: Dell Perc H200 is not detected!


BIOS settings changed from default:
Enabled : IOMMU
Disabled: CSM Support (same errors whether it is enabled or disabled)
SATA Mode : AHCI

Changes to Proxmox config files:
Grub (Changes)
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on"

Modules (added)
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

IOMMU interrupt remapping (I do not know whether it is needed)
echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > /etc/modprobe.d/iommu_unsafe_interrupts.conf
echo "options kvm ignore_msrs=1" > /etc/modprobe.d/kvm.conf

Steps such as blacklisting the (VGA) drivers and adding the GPU to VFIO have also been tried, but gave the same result.
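
To rule out that the kernel options were simply never applied, it may help to check the running kernel's command line after a reboot. The sketch below is only an illustration, and `cmdline_has_param` is a hypothetical helper name. Two assumptions worth stating: changes in `/etc/default/grub` only take effect after `update-grub`, and according to the Proxmox documentation, UEFI installs with ZFS as root use systemd-boot, where the options live in `/etc/kernel/cmdline` and are applied with `proxmox-boot-tool refresh`. Whether that applies to this machine is not confirmed here.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the kernel command line in $1
# contains the parameter $2 as a whole word.
cmdline_has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the option from the GRUB change above against the running kernel.
running=$(cat /proc/cmdline 2>/dev/null)
if cmdline_has_param "$running" "amd_iommu=on"; then
    echo "active:  amd_iommu=on"
else
    echo "missing: amd_iommu=on (was the bootloader config regenerated?)"
fi
```

If the option shows as missing even though the config file contains it, the file that was edited is probably not the one the bootloader actually reads.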

Conclusion for now: passthrough only works via the first PCIe 16x-16x slot. I hope that it can also work with the second 16x-8x slot, which will hold the Dell PERC H200 card (LSI SAS9211-8I). According to the Gigabyte manual, the third PCI 16x-8x slot is shared with other onboard devices (SATA ports 5 and 6 and the M.2 connectors).

If someone has a fix to get this working, with the Dell Perc H200 and the AMD RX480 each passed through to separate virtual machines, I would be very grateful, because I am at my wits' end. Sorry for the long post, but it reflects everything I have tried to fix this by myself.

Many thanks for your time and help,
Ronny V
 

Attachments

  • 2021_08_01_23_39_23_Photos.jpg
Could you try to get a full log (netconsole/serial console)?
 
@fabian and others

A serial console is not possible for me because I don't have a COM port on my mainboard.
How does a netconsole work and what do I need to configure? This is new and unknown to me, sorry.
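
For reference, netconsole sends kernel messages over UDP to another machine, so it can capture output that never reaches the local disk. The sketch below only builds the module parameter in the `src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac` format from the kernel's netconsole documentation. All addresses, ports and the helper name `netconsole_param` are placeholders, not values from this setup. One caveat from the crash log: the host NIC `eno1` goes down during the crash, so messages after that point may not arrive.

```shell
#!/bin/sh
# Build a netconsole parameter of the form
#   src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
# All values passed in below are placeholders for illustration.
netconsole_param() {
    src_port=$1; src_ip=$2; dev=$3
    tgt_port=$4; tgt_ip=$5; tgt_mac=$6
    printf '%s@%s/%s,%s@%s/%s\n' \
        "$src_port" "$src_ip" "$dev" "$tgt_port" "$tgt_ip" "$tgt_mac"
}

param=$(netconsole_param 6665 192.168.1.10 eno1 6666 192.168.1.20 aa:bb:cc:dd:ee:ff)

# Print the command instead of running it, so nothing changes by accident.
echo "modprobe netconsole netconsole=$param"

# On the receiving machine, a UDP listener such as netcat can log the output:
#   nc -u -l -p 6666     (some netcat variants use: nc -u -l 6666)
```

The printed `modprobe` line would be run as root on the Proxmox host just before starting the VM, with the listener already running on the second machine.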

There is only one thing that I think could be the cause.
How can I check if the ACS (override) patch is applied to the Proxmox kernel?
I added the line below to GRUB, but that did not change anything.
pcie_acs_override=downstream,multifunction

And how can I get all my devices into separate groups?
Because the devices in iommu_groups/13 conflict with each other, as far as I can see now.

Code:
root@RVProxmox:~# find /sys/kernel/iommu_groups/ -type l
/sys/kernel/iommu_groups/17/devices/0000:0d:00.1
/sys/kernel/iommu_groups/7/devices/0000:00:07.0
/sys/kernel/iommu_groups/15/devices/0000:0c:00.0
/sys/kernel/iommu_groups/5/devices/0000:00:04.0
/sys/kernel/iommu_groups/13/devices/0000:03:00.0
/sys/kernel/iommu_groups/13/devices/0000:09:00.0
/sys/kernel/iommu_groups/13/devices/0000:02:00.0
/sys/kernel/iommu_groups/13/devices/0000:05:05.0
/sys/kernel/iommu_groups/13/devices/0000:08:00.0
/sys/kernel/iommu_groups/13/devices/0000:01:00.2
/sys/kernel/iommu_groups/13/devices/0000:01:00.0
/sys/kernel/iommu_groups/13/devices/0000:0a:00.0
/sys/kernel/iommu_groups/13/devices/0000:02:06.0
/sys/kernel/iommu_groups/13/devices/0000:07:00.0
/sys/kernel/iommu_groups/13/devices/0000:05:01.0
/sys/kernel/iommu_groups/13/devices/0000:06:00.0
/sys/kernel/iommu_groups/13/devices/0000:05:07.0
/sys/kernel/iommu_groups/13/devices/0000:02:08.0
/sys/kernel/iommu_groups/13/devices/0000:01:00.1
/sys/kernel/iommu_groups/13/devices/0000:05:03.0
/sys/kernel/iommu_groups/13/devices/0000:04:00.0
/sys/kernel/iommu_groups/3/devices/0000:00:03.0
/sys/kernel/iommu_groups/11/devices/0000:00:14.3
/sys/kernel/iommu_groups/11/devices/0000:00:14.0
/sys/kernel/iommu_groups/1/devices/0000:00:01.2
/sys/kernel/iommu_groups/18/devices/0000:0d:00.3
/sys/kernel/iommu_groups/8/devices/0000:00:07.1
/sys/kernel/iommu_groups/16/devices/0000:0d:00.0
/sys/kernel/iommu_groups/6/devices/0000:00:05.0
/sys/kernel/iommu_groups/14/devices/0000:0b:00.0
/sys/kernel/iommu_groups/14/devices/0000:0b:00.1
/sys/kernel/iommu_groups/4/devices/0000:00:03.1
/sys/kernel/iommu_groups/12/devices/0000:00:18.3
/sys/kernel/iommu_groups/12/devices/0000:00:18.1
/sys/kernel/iommu_groups/12/devices/0000:00:18.6
/sys/kernel/iommu_groups/12/devices/0000:00:18.4
/sys/kernel/iommu_groups/12/devices/0000:00:18.2
/sys/kernel/iommu_groups/12/devices/0000:00:18.0
/sys/kernel/iommu_groups/12/devices/0000:00:18.7
/sys/kernel/iommu_groups/12/devices/0000:00:18.5
/sys/kernel/iommu_groups/2/devices/0000:00:02.0
/sys/kernel/iommu_groups/10/devices/0000:00:08.1
/sys/kernel/iommu_groups/0/devices/0000:00:01.0
/sys/kernel/iommu_groups/19/devices/0000:0d:00.4
/sys/kernel/iommu_groups/9/devices/0000:00:08.0
root@RVProxmox:~#
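
The listing above is easier to read when it is summarized per group. The following sketch is only an illustration and not something from the thread; it counts the devices in each IOMMU group from the same `find` output, and `count_iommu_groups` is a made-up helper name. For the listing above it would show group 13 as by far the largest group. According to the `lspci` output, that group holds the SAS controller at 03:00.0 together with the USB and SATA controllers at 01:00.x and the onboard NIC at 0a:00.0, which appears to match the devices that disappear in the crash log.

```shell
#!/bin/sh
# Read paths like /sys/kernel/iommu_groups/13/devices/0000:03:00.0 on stdin
# and print "group <n>: <count> device(s)", sorted by group number.
count_iommu_groups() {
    awk -F/ '
        # Field 5 is the group number, field 7 the PCI address.
        NF >= 7 && $4 == "iommu_groups" { count[$5]++ }
        END { for (g in count) printf "%d %d\n", g, count[g] }
    ' | sort -n | while read -r group n; do
        echo "group $group: $n device(s)"
    done
}

# On the Proxmox host (skipped if the directory does not exist):
if [ -d /sys/kernel/iommu_groups ]; then
    find /sys/kernel/iommu_groups/ -type l | count_iommu_groups
fi
```

A device can only be passed through safely when everything else in its group is either passed through with it or not needed by the host, so a group that also contains the host's boot disks and network card is a likely source of the behaviour described here.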

Code:
root@RVProxmox:~# lspci
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:05.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 61)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 7
01:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Device 43ee
01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] Device 43eb
01:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43e9
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43ea
02:06.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43ea
02:08.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43ea
03:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
04:00.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
05:01.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
05:03.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
05:05.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
05:07.0 PCI bridge: ASMedia Technology Inc. ASM1184e PCIe Switch Port
06:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 07)
07:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 07)
08:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 07)
09:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 07)
0a:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 05)
0b:00.0 VGA compatible controller: NVIDIA Corporation GK208B [GeForce GT 710] (rev a1)
0b:00.1 Audio device: NVIDIA Corporation GK208 HDMI/DP Audio Controller (rev a1)
0c:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function
0d:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
0d:00.1 Encryption controller: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP
0d:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
0d:00.4 Audio device: Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller
 
Hi @fabian and others,

It looks like the passthrough is working now, only I am not sure what the solution was.
I carried out the following steps:

1. I did a clean install on a btrfs volume (instead of ZFS).
2. I updated the kernel to version 5.12.2-acso (link).
A. Tested it; passthrough did not work.

3. Changed the config files grub and modules as described in link.
4. Installed the new drivers provided by @fabian in link.

B. Tested it again, and now I am able to pass through:
1. the Dell PERC H200 to a virtual Xpenology system, connected to a 16x8 PCIe slot.
2. the Asus RX480 to a virtual Windows 10 system, connected to a 16x16 PCIe slot.

Many, many thanks @fabian; I think the kernel update or the new drivers you provided solved the issue for me.

For now I will start 3 virtual machines (Windows 10, Windows Server 2019 and Xpeno DSM) and check whether everything is working (no memory leaks).

I will post an update later this week.


Ronny