Disks get shut down if PCIe device for VM is configured

kedala

New Member
Oct 28, 2018
Hello everyone,

I have an issue that is driving me nuts right now and that does not occur on other hardware.

Machine with the issue:
- HPE MicroServer Gen10 (part number: 873830-421)
- Additional network cards plugged in via PCIe:
-- Intel Corporation 82571EB
-- Mellanox Technologies MT26448
- 4 x 3TB WD Red

Machine without the issue:
- Custom-built X399 computer with:
-- Gigabyte Aorus X399
-- Threadripper 1950X
-- AMD RX 580
-- 4 x 6TB WD Red


Now to the actual issue:

If I configure a PCIe device for passthrough (in my case the 82571EB for my pfSense VM) and try to start the VM,
the disks of the pool that holds the VM's hard disk (data) get their caches flushed and shut down.
The machine crashes and stays unresponsive until a hard reset.
If no PCI or PCIe device is configured, everything works as normal, just like on any other machine I run with Proxmox. I've blacklisted the e1000e driver and checked whether the module was loaded for the card, which it wasn't.
I triple-checked that I configured the network card and not, accidentally, the SATA controller.
I also allowed unsafe interrupts (as I did on my other AMD machines, and because dmesg told me I needed to).
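
For reference, here is a minimal sketch of how the driver bound to a PCI device can be read from sysfs, as a cross-check next to lsmod. The function name `bound_driver` and its optional second argument (an alternative sysfs root, only useful for testing) are made up for illustration.

```shell
#!/bin/bash
# Print the kernel driver currently bound to a PCI device, or "none".
# $1: full PCI address including the domain, e.g. 0000:04:00.0
# $2: sysfs root (hypothetical parameter for testing; defaults to /sys)
bound_driver() {
    local dev=$1 root=${2:-/sys}
    local link="$root/bus/pci/devices/$dev/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/e1000e
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}
```

Called as `bound_driver 0000:04:00.0`, it should print `none` while no driver has claimed the card. A device that has been handed over for passthrough would typically show `vfio-pci` instead.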

I've attached the configs as well as a screenshot of the console showing the log from the point of starting the VM.
I also switched from a BTRFS pool to a ZFS pool, just to be sure that the issue has nothing
to do with the filesystem.

lspci | grep -i ethernet
Code:
04:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
04:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
05:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
05:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) (rev 06)
06:00.0 Ethernet controller: Broadcom Limited NetXtreme BCM5720 Gigabit Ethernet PCIe
06:00.1 Ethernet controller: Broadcom Limited NetXtreme BCM5720 Gigabit Ethernet PCIe
07:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)

cat /etc/pve/qemu-server/100.conf
Code:
boot: dcn
bootdisk: scsi0
cores: 1
ide2: local:iso/pfSense-CE-2.4.4-RELEASE-amd64.iso,media=cdrom
machine: q35
hostpci0: 04:00.0,pcie=1
memory: 1024
name: PfSense
net0: virtio=AA:6B:02:82:CC:A1,bridge=vmbr0
numa: 0
ostype: l26
scsi0: data:vm-100-disk-0,size=32G
scsihw: virtio-scsi-pci
smbios1: uuid=ad8cbeac-e32c-4de9-82a9-2dad0482d1a9
sockets: 1

zpool status
Code:
  pool: Storage
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        Storage     ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0

Screenshot from crash:
proxmox_pci.jpg


I really don't know what to do at this point.
If anyone knows how to fix this it would be greatly appreciated!

Thanks in advance
kedala
 
Can you post your IOMMU groups?
 
Hi dcsapak,

sure.


find /sys/kernel/iommu_groups/ -type l
Code:
/sys/kernel/iommu_groups/7/devices/0000:00:12.0
/sys/kernel/iommu_groups/5/devices/0000:00:10.0
/sys/kernel/iommu_groups/3/devices/0000:00:08.0
/sys/kernel/iommu_groups/1/devices/0000:00:02.5
/sys/kernel/iommu_groups/1/devices/0000:05:00.1
/sys/kernel/iommu_groups/1/devices/0000:02:00.0
/sys/kernel/iommu_groups/1/devices/0000:04:00.1
/sys/kernel/iommu_groups/1/devices/0000:01:00.0
/sys/kernel/iommu_groups/1/devices/0000:03:02.0
/sys/kernel/iommu_groups/1/devices/0000:06:00.0
/sys/kernel/iommu_groups/1/devices/0000:00:02.4
/sys/kernel/iommu_groups/1/devices/0000:05:00.0
/sys/kernel/iommu_groups/1/devices/0000:00:02.2
/sys/kernel/iommu_groups/1/devices/0000:03:04.0
/sys/kernel/iommu_groups/1/devices/0000:04:00.0
/sys/kernel/iommu_groups/1/devices/0000:00:02.0
/sys/kernel/iommu_groups/1/devices/0000:06:00.1
/sys/kernel/iommu_groups/8/devices/0000:00:14.3
/sys/kernel/iommu_groups/8/devices/0000:00:14.0
/sys/kernel/iommu_groups/6/devices/0000:00:11.0
/sys/kernel/iommu_groups/4/devices/0000:00:09.0
/sys/kernel/iommu_groups/2/devices/0000:00:03.1
/sys/kernel/iommu_groups/2/devices/0000:07:00.0
/sys/kernel/iommu_groups/2/devices/0000:00:03.0
/sys/kernel/iommu_groups/0/devices/0000:00:01.0
/sys/kernel/iommu_groups/9/devices/0000:00:18.3
/sys/kernel/iommu_groups/9/devices/0000:00:18.1
/sys/kernel/iommu_groups/9/devices/0000:00:18.4
/sys/kernel/iommu_groups/9/devices/0000:00:18.2
/sys/kernel/iommu_groups/9/devices/0000:00:18.0
/sys/kernel/iommu_groups/9/devices/0000:00:18.5

Here they are again in a more readable format, extracted with:
Code:
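#!/bin/bash
# Print every PCI device together with its IOMMU group number.
shopt -s nullglob   # an unmatched glob expands to nothing
for d in /sys/kernel/iommu_groups/*/devices/*; do
    n=${d#*/iommu_groups/*}; n=${n%%/*}   # extract the group number from the path
    printf 'IOMMU Group %s ' "$n"
    lspci -nns "${d##*/}"                 # describe the device at this PCI address
done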

Code:
IOMMU Group 0 00:01.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Carrizo [1002:9874] (rev 87)
IOMMU Group 1 00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:157b]
IOMMU Group 1 00:02.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:157c]
IOMMU Group 1 00:02.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:157c]
IOMMU Group 1 00:02.5 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:157c]
IOMMU Group 1 01:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9230 PCIe SATA 6Gb/s Controller [1b4b:9230] (rev 11)
IOMMU Group 1 02:00.0 PCI bridge [0604]: Integrated Device Technology, Inc. [IDT] PES12N3A PCI Express Switch [111d:8018] (rev 0e)
IOMMU Group 1 03:02.0 PCI bridge [0604]: Integrated Device Technology, Inc. [IDT] PES12N3A PCI Express Switch [111d:8018] (rev 0e)
IOMMU Group 1 03:04.0 PCI bridge [0604]: Integrated Device Technology, Inc. [IDT] PES12N3A PCI Express Switch [111d:8018] (rev 0e)
IOMMU Group 1 04:00.0 Ethernet controller [0200]: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) [8086:10bc] (rev 06)
IOMMU Group 1 04:00.1 Ethernet controller [0200]: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) [8086:10bc] (rev 06)
IOMMU Group 1 05:00.0 Ethernet controller [0200]: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) [8086:10bc] (rev 06)
IOMMU Group 1 05:00.1 Ethernet controller [0200]: Intel Corporation 82571EB Gigabit Ethernet Controller (Copper) [8086:10bc] (rev 06)
IOMMU Group 1 06:00.0 Ethernet controller [0200]: Broadcom Limited NetXtreme BCM5720 Gigabit Ethernet PCIe [14e4:165f]
IOMMU Group 1 06:00.1 Ethernet controller [0200]: Broadcom Limited NetXtreme BCM5720 Gigabit Ethernet PCIe [14e4:165f]
IOMMU Group 2 00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:157b]
IOMMU Group 2 00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:157c]
IOMMU Group 2 07:00.0 Ethernet controller [0200]: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] [15b3:6750] (rev b0)
IOMMU Group 3 00:08.0 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Device [1022:1578]
IOMMU Group 4 00:09.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:157d]
IOMMU Group 5 00:10.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] FCH USB XHCI Controller [1022:7914] (rev 20)
IOMMU Group 6 00:11.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 49)
IOMMU Group 7 00:12.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] FCH USB EHCI Controller [1022:7908] (rev 49)
IOMMU Group 8 00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 4a)
IOMMU Group 8 00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 11)
IOMMU Group 9 00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1570]
IOMMU Group 9 00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1571]
IOMMU Group 9 00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1572]
IOMMU Group 9 00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1573]
IOMMU Group 9 00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1574]
IOMMU Group 9 00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Device [1022:1575]

EDIT:

I think I get it?
The SATA controller, along with many other devices, seems to be in the same IOMMU group as the network card.
So if I try to pass the card through, it does something with memory management and just crashes everything in the IOMMU group?

On my other server the GPU, together with its audio device, is in IOMMU group 28, and nothing else is in there.
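
To make that kind of check quicker, here is a minimal sketch that lists everything sharing an IOMMU group with a given device. The function name `iommu_peers` and its optional sysfs-root argument are made up for illustration. It relies only on the /sys/kernel/iommu_groups layout shown in the listing above.

```shell
#!/bin/bash
# List every PCI device that shares an IOMMU group with the given device.
# $1: full PCI address including the domain, e.g. 0000:04:00.0
# $2: sysfs root (hypothetical parameter for testing; defaults to /sys)
iommu_peers() {
    local dev=$1 root=${2:-/sys}
    local d peer
    for d in "$root"/kernel/iommu_groups/*/devices/"$dev"; do
        [ -e "$d" ] || continue   # glob did not match: device has no group
        # The containing "devices" directory holds one entry per group member
        for peer in "${d%/*}"/*; do
            [ "${peer##*/}" = "$dev" ] || echo "${peer##*/}"
        done
    done
}
```

Going by the group listing above, `iommu_peers 0000:04:00.0` on this machine would be expected to include 0000:01:00.0, the Marvell SATA controller. Empty output would mean the device is alone in its group.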


What I should probably mention is that I needed to set a kernel parameter, "nointremap", for the SATA controller to work properly.
It works around a kernel bug that I found described in an e-mail thread just yesterday.
Code:
GRUB_CMDLINE_LINUX_DEFAULT="amd_iommo=on iommu=pt nointremap"
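
After a reboot it can be worth confirming that the running kernel actually received each parameter. A minimal sketch, assuming the usual /proc/cmdline interface. The function name `has_kernel_param` and its optional second argument (a command line string, only useful for testing) are made up for illustration.

```shell
#!/bin/bash
# Succeed if the kernel command line contains exactly the parameter $1.
# $2: command line string (hypothetical parameter for testing;
#     defaults to the contents of /proc/cmdline)
has_kernel_param() {
    local want=$1
    local cmdline="${2-$(cat /proc/cmdline)}"
    local -a params
    local p
    # Split on whitespace into an array without triggering glob expansion
    read -r -a params <<< "$cmdline"
    for p in "${params[@]}"; do
        [ "$p" = "$want" ] && return 0
    done
    return 1
}
```

For example, `has_kernel_param nointremap && echo active` prints `active` only if the parameter reached the kernel. The comparison is exact, so a misspelled parameter name does not match and the check doubles as a typo check.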
 
I'm gonna try setting "pcie_acs_override=downstream" when I'm home.
I'll get back here as soon as I have results.
 
I'm gonna try setting "pcie_acs_override=downstream" when I'm home.
That could work, but please be aware of the caveats (potentially the guest could access the host via the PCI interface).
 
