AMD Epyc PCIe-Passthrough FLR error.

youpko

New Member
Mar 7, 2022
Greetings all,

I recently acquired some server hardware and was planning on running Proxmox on it. While learning how to use Proxmox I ran into some trouble when trying to pass the AMD Epyc internal SATA controller through to a VM.

I have an AMD Epyc 7282 on a Gigabyte MZ01-CE1; this configuration has 2 SATA controllers. But when I start the VM the system locks up with an FLR timeout. I discovered that this unfortunately is a known problem with AMD systems; I found a thread on servethehome.com that describes the same problem.

After searching through the Proxmox forum I found a thread that goes into detail on how to compile the Proxmox kernel with a patch that disables the FLR function. I followed it successfully and I can now start the VM without crashing the system.

But as this is my first time working with this stuff, I would like to know what the implications of running a custom-compiled kernel with FLR disabled are.
 
Thanks for the reply, I will take a look at that.

I am quite a novice at Linux, and not having to compile the kernel myself would be very nice.
 
Hi youpko, did you ever find a solution for this problem?
My PC is an Epyc 7282 + H11SSL-i.
I got the error "softreset failed" when trying to pass the SATA controller through to my TrueNAS VM.
 
With a custom-compiled kernel it worked fine as far as I tested it, but I have no way to tell how stable or reliable it is.

After that I did not have time for anything else, but I just recently started working on this again.

I did one test according to the information leesteken mentioned, but without success yet.

  1. Updated the Proxmox kernel to version 5.15.39
  2. Identified the controller I want to pass through
    Code:
    root@pve:# lspci | grep SATA
    84:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
    85:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
  3. Checked which reset_methods are available for the PCI device
    Code:
    root@pve:# cat /sys/bus/pci/devices/0000\:85\:00.0/reset_method
    flr bus
  4. Disabled the flr reset method
    Code:
    root@pve:# echo bus > /sys/bus/pci/devices/0000\:85\:00.0/reset_method
  5. Repeated step 3 to verify that flr was disabled
    Code:
    root@pve:# cat /sys/bus/pci/devices/0000\:85\:00.0/reset_method
    bus
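Steps 2 to 5 above can be rolled into a small helper. The following is only a sketch of the idea, not a tested tool: it assumes a kernel (5.15+) that exposes the writable reset_method attribute, and the function takes the sysfs device directory as an argument so it can be pointed at any device.

```shell
#!/bin/sh
# Sketch: drop "flr" from a PCI device's reset_method list, keeping the
# remaining methods (e.g. "bus"). Usage:
#   disable_flr /sys/bus/pci/devices/0000:85:00.0
disable_flr() {
    dev="$1"
    methods=$(cat "$dev/reset_method") || return 1
    # Filter out the "flr" token and re-join the rest with spaces
    remaining=$(printf '%s' "$methods" | tr ' ' '\n' | grep -vx flr | tr '\n' ' ')
    remaining=${remaining% }
    if [ -z "$remaining" ]; then
        echo "no reset method besides flr available for $dev" >&2
        return 1
    fi
    printf '%s\n' "$remaining" > "$dev/reset_method"
}
```

Note that writing to reset_method is a runtime setting, so a script like this still has to be hooked into boot to take effect after every restart.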
All of that was easy to configure, but after creating a VM and passing the second SATA controller (PCI device 85:00.0) through to it, everything went wrong.
Upon starting the VM, the first SATA controller (PCI device 84:00.0) lost its disks and I got a bunch of read errors, because my VM disk images are on the first SATA controller.
I have not figured out why the first controller throws issues when I use the second one as pass-through.

But the above steps did prevent the system from crashing with the flr reset timeout that I had before this. So this is at least a step in the right direction.

I did remember that the Proxmox PCIe passthrough wiki recommends blacklisting the device so the host won't use it. But both SATA controllers have the same device ID, so I can't blacklist one of them, only both. There is probably a workaround for this but I have not found it yet.
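One possible workaround for the identical-IDs problem (a sketch only, not something tested in this thread) is to bind the one function you want to pass through to vfio-pci by PCI address rather than by ID, using the sysfs driver_override attribute. The 1022:7901 ID and the 85:00.0 address below come from the lspci output earlier in the thread.

```shell
#!/bin/sh
# Sketch: instead of blacklisting by vendor:device ID (impossible here,
# since both controllers are 1022:7901), bind one function to vfio-pci by
# its PCI address via the sysfs driver_override attribute. Run as root.
claim_for_vfio() {
    dev="$1"                # e.g. /sys/bus/pci/devices/0000:85:00.0
    addr="${dev##*/}"       # bare address, e.g. 0000:85:00.0
    echo vfio-pci > "$dev/driver_override"
    # Detach from the currently bound driver (ahci), if any
    if [ -e "$dev/driver" ]; then
        echo "$addr" > "$dev/driver/unbind"
    fi
    # Reprobe so vfio-pci claims the device
    if [ -w /sys/bus/pci/drivers_probe ]; then
        echo "$addr" > /sys/bus/pci/drivers_probe || echo "reprobe failed for $addr" >&2
    fi
}
```

Like the reset_method write, this is not persistent across reboots on its own; it would have to run from a boot-time hook.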
 
Thank you very much.

I think a shell script (reordering the reset_method) is better than a custom-compiled kernel, which makes 'apt upgrade' difficult.

I tried setting the 'bus' reset method yesterday, and it worked fine.

The H11SSL has an onboard NVMe port, so I can pass both SATA controllers through to VMs.

I will test one for the VM and one for the host later.

Good luck!
 
I have not figured out why the first controller throws issues when I use the second as pass-through.
hi, youpko

I tried passing my second controller through to the VM, leaving the first to the host, and both worked fine.

Sorry, I have no idea about your issue.
Good luck!
 
Something very much like this happens when devices are in the same IOMMU group, which cannot be shared between VMs, or between VMs and the host, because devices in a group are not properly isolated from each other. Are you using pcie_acs_override (which lies to the kernel about the groups)? If not, please show the IOMMU groups using this command:
Code:
for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nns "${d##*/}"; done
EDIT: The command got mangled by copy-paste from search.
 
I should have said that the two controllers are in different groups.

The output of the command you sent (I don't see IOMMU groups in the list):
Code:
84:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
85:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)

I found this script via Google that gives a nice overview of the PCIe devices (the Proxmox web UI shows the same):
Code:
Group 28:       [1022:7901] [R] 84:00.0  SATA controller                          FCH SATA Controller [AHCI mode]
Group 29:       [1022:7901] [R] 85:00.0  SATA controller                          FCH SATA Controller [AHCI mode]

So IOMMU groups shouldn't be the problem, I think.
 
Sorry, I did not notice that the command got mangled by copy and paste from search. And you are sure you are not using pcie_acs_override? Sorry for asking twice, but this is essential for determining the actual groups.
 
Sorry, I missed the part about `pcie_acs_override`. I did not set that.

This is my grub file:
Code:
root@pve:~# cat /etc/default/grub
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on"
GRUB_CMDLINE_LINUX=""
 
Thanks for confirming that the group information is valid. Note that you don't need amd_iommu=on because it is on by default. You can also check the current kernel parameters with cat /proc/cmdline. Sorry, but I'm out of ideas. Maybe the two controllers are not as isolated as they are presented to be?
 
This looks fine as well, right?
Code:
root@pve:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.15.39-1-pve root=/dev/mapper/pve-root ro quiet amd_iommu=on

When I was in the process of compiling the custom kernel I read around the internet that quite a few people had weird issues with the Epyc onboard SATA controllers.
But I have only one server, which is not really production but is in use, so testing time is a bit limited.
 
Are you using the latest BIOS version? Often enough BIOS updates can improve compatibility with PCI Passthrough.
 
Yes I have the latest version of BIOS and BMC firmware.
BIOS version is R34, and according to the Gigabyte page that is the latest.

Tomorrow I will test it again and trace some logs to see if they reveal anything useful.
 
One thing I forgot to ask: Is writing to the /sys/bus/pci/devices/.../reset_method file persistent across reboots?
 
I wrote a shell script like this:
Bash:
root@pve00:/usr/local/bin# vim pve-pre-hook.sh
#!/bin/bash

# SATA controller reset_method
#echo "bus" > /sys/bus/pci/devices/0000:44:00/reset_method
lspci -n | grep 1022:7901 | awk '{print $1}' | while read _id; do
    echo "set reset_method for PCI ${_id}"
    echo "bus" > /sys/bus/pci/devices/0000:${_id}/reset_method
done

and inserted a line in /usr/lib/systemd/system/pve-guests.service, so the script runs before the guests start:
Bash:
root@pve00:/usr/local/bin# vim /usr/lib/systemd/system/pve-guests.service
...
ExecStartPre=-/usr/local/bin/pve-pre-hook.sh
...

I do not know if there is a more elegant solution, but it worked fine.

Hope it helps.
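On the question of elegance: editing the unit file under /usr/lib/systemd/system works, but a package upgrade of pve-manager can overwrite that file and silently drop the hook. A systemd drop-in achieves the same thing and survives upgrades. This is an untested sketch reusing the pve-pre-hook.sh path from the post above:

```shell
# Create a drop-in instead of editing the packaged unit file;
# files in /etc/systemd/system override /usr/lib and survive upgrades.
mkdir -p /etc/systemd/system/pve-guests.service.d
cat > /etc/systemd/system/pve-guests.service.d/pre-hook.conf <<'EOF'
[Service]
ExecStartPre=-/usr/local/bin/pve-pre-hook.sh
EOF
systemctl daemon-reload
```

`systemctl cat pve-guests.service` should then show the drop-in appended after the original unit.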
 
I thought: let's try a manual reset of the SATA controller and see what happens.
First I checked which disk belongs to which controller:
Code:
root@pve:~# ls -al /sys/block/sd*
lrwxrwxrwx 1 root root 0 Jul  6 13:23 /sys/block/sda -> ../devices/pci0000:80/0000:80:08.2/0000:84:00.0/ata1/host0/target0:0:0/0:0:0:0/block/sda
lrwxrwxrwx 1 root root 0 Jul  6 13:23 /sys/block/sdb -> ../devices/pci0000:80/0000:80:08.2/0000:84:00.0/ata2/host1/target1:0:0/1:0:0:0/block/sdb
lrwxrwxrwx 1 root root 0 Jul  6 13:23 /sys/block/sdc -> ../devices/pci0000:80/0000:80:08.3/0000:85:00.0/ata9/host8/target8:0:0/8:0:0:0/block/sdc
lrwxrwxrwx 1 root root 0 Jul  6 13:23 /sys/block/sdd -> ../devices/pci0000:80/0000:80:08.3/0000:85:00.0/ata10/host9/target9:0:0/9:0:0:0/block/sdd
lrwxrwxrwx 1 root root 0 Jul  6 13:23 /sys/block/sde -> ../devices/pci0000:80/0000:80:08.3/0000:85:00.0/ata11/host10/target10:0:0/10:0:0:0/block/sde

Then I ran a PCI device reset:
Code:
root@pve:~# echo 1 > /sys/bus/pci/devices/0000\:85\:00.0/reset

And then after a minute or so I got the following from dmesg -w:
Code:
[765672.687949] ata1.00: exception Emask 0x0 SAct 0x4000c0 SErr 0xd0000 action 0x6 frozen
[765672.688005] ata1: SError: { PHYRdyChg CommWake 10B8B }
[765672.688036] ata1.00: failed command: WRITE FPDMA QUEUED
[765672.688066] ata1.00: cmd 61/40:30:e0:9b:0f/00:00:1a:00:00/40 tag 6 ncq dma 32768 out
                         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[765672.688147] ata1.00: status: { DRDY }
[765672.688169] ata1.00: failed command: WRITE FPDMA QUEUED
[765672.688952] ata1.00: cmd 61/08:38:10:6c:db/00:00:13:00:00/40 tag 7 ncq dma 4096 out
                         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[765672.689828] ata1.00: status: { DRDY }
[765672.690261] ata1.00: failed command: WRITE FPDMA QUEUED
[765672.690817] ata1.00: cmd 61/08:b0:18:6c:db/00:00:13:00:00/40 tag 22 ncq dma 4096 out
                         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[765672.691766] ata1.00: status: { DRDY }
[765672.692187] ata1: hard resetting link
[765672.692198] ata2.00: exception Emask 0x0 SAct 0x20003 SErr 0xd0000 action 0x6 frozen
[765672.692814] ata2: SError: { PHYRdyChg CommWake 10B8B }
[765672.693334] ata2.00: failed command: WRITE FPDMA QUEUED
[765672.693812] ata2.00: cmd 61/08:00:10:6c:db/00:00:13:00:00/40 tag 0 ncq dma 4096 out
                         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[765672.694752] ata2.00: status: { DRDY }
[765672.695198] ata2.00: failed command: WRITE FPDMA QUEUED
[765672.695679] ata2.00: cmd 61/08:08:28:6c:db/00:00:13:00:00/40 tag 1 ncq dma 4096 out
                         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[765672.696681] ata2.00: status: { DRDY }
[765672.697220] ata2.00: failed command: WRITE FPDMA QUEUED
[765672.697766] ata2.00: cmd 61/40:88:e0:9b:0f/00:00:1a:00:00/40 tag 17 ncq dma 32768 out
                         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[765672.698901] ata2.00: status: { DRDY }
[765672.699470] ata2: hard resetting link
[765678.547743] ata1: failed to reset engine (errno=-5)
[765678.863958] ata2: failed to reset engine (errno=-5)
[765683.192279] ata1: softreset failed (1st FIS failed)
[765683.192868] ata1: hard resetting link
[765683.196459] ata2: softreset failed (1st FIS failed)
[765683.197247] ata2: hard resetting link
[765689.552462] ata1: failed to reset engine (errno=-5)
[765689.555634] ata2: failed to reset engine (errno=-5)
[765693.691804] ata1: softreset failed (1st FIS failed)
[765693.692371] ata1: hard resetting link
[765693.696040] ata2: softreset failed (1st FIS failed)
[765693.696796] ata2: hard resetting link
[765700.044176] ata1: failed to reset engine (errno=-5)
[765700.047532] ata2: failed to reset engine (errno=-5)
[765729.191781] ata1: softreset failed (1st FIS failed)
[765729.192358] ata1: limiting SATA link speed to 3.0 Gbps
[765729.192360] ata1: hard resetting link
[765729.195675] ata2: softreset failed (1st FIS failed)
[765729.202016] ata2: limiting SATA link speed to 3.0 Gbps
[765729.202062] ata2: hard resetting link
[765734.703638] ata1: failed to reset engine (errno=-5)
[765734.710791] ata2: failed to reset engine (errno=-5)
[765734.862820] ata1: softreset failed (device not ready)
[765734.863631] ata1: reset failed, giving up
[765734.864146] ata1.00: disabled
[765734.870799] ata2: softreset failed (device not ready)
[765734.876558] ata2: reset failed, giving up
[765734.882792] ata2.00: disabled
[765735.363261] sd 0:0:0:0: [sda] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=94s
[765735.363266] sd 0:0:0:0: [sda] tag#6 Sense Key : Not Ready [current]
[765735.363268] sd 0:0:0:0: [sda] tag#6 Add. Sense: Logical unit not ready, hard reset required
[765735.363271] sd 0:0:0:0: [sda] tag#6 CDB: Write(10) 2a 00 1a 0f 9b e0 00 00 40 00
[765735.363272] blk_update_request: I/O error, dev sda, sector 437230560 op 0x1:(WRITE) flags 0x700 phys_seg 8 prio class 0
[765735.363880] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_840_EVO_250GB_S1DBNSAF825898A-part1 error=5 type=2 offset=223860998144 size=32768 flags=180880
[765735.363892] sd 0:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=119s
[765735.363893] sd 0:0:0:0: [sda] tag#7 Sense Key : Not Ready [current]
[765735.363895] sd 0:0:0:0: [sda] tag#7 Add. Sense: Logical unit not ready, hard reset required
[765735.363896] sd 0:0:0:0: [sda] tag#7 CDB: Write(10) 2a 00 13 db 6c 10 00 00 08 00
[765735.363897] blk_update_request: I/O error, dev sda, sector 333147152 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[765735.364469] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_840_EVO_250GB_S1DBNSAF825898A-part1 error=5 type=2 offset=170570293248 size=4096 flags=180880
[765735.364480] sd 0:0:0:0: [sda] tag#22 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=119s
[765735.364482] sd 0:0:0:0: [sda] tag#22 Sense Key : Not Ready [current]
[765735.364483] sd 0:0:0:0: [sda] tag#22 Add. Sense: Logical unit not ready, hard reset required
[765735.364484] sd 0:0:0:0: [sda] tag#22 CDB: Write(10) 2a 00 13 db 6c 18 00 00 08 00
[765735.364485] blk_update_request: I/O error, dev sda, sector 333147160 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[765735.365031] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_840_EVO_250GB_S1DBNSAF825898A-part1 error=5 type=2 offset=170570297344 size=4096 flags=180880
[765735.365037] ata1: EH complete
[765735.365108] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[765735.365117] sd 0:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 00 5c 3a 58 00 00 08 00
[765735.365119] blk_update_request: I/O error, dev sda, sector 6044248 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[765735.365235] sda: detected capacity change from 488397168 to 0
[765735.365268] sd 0:0:0:0: [sda] tag#26 access beyond end of device
[765735.365273] blk_update_request: I/O error, dev sda, sector 2576 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[765735.365280] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_840_EVO_250GB_S1DBNSAF825898A-part1 error=5 type=1 offset=270336 size=8192 flags=b08c1
[765735.365297] sd 0:0:0:0: [sda] tag#27 access beyond end of device
[765735.365299] blk_update_request: I/O error, dev sda, sector 488379408 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[765735.365304] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_840_EVO_250GB_S1DBNSAF825898A-part1 error=5 type=1 offset=250049208320 size=8192 flags=b08c1
[765735.365315] sd 0:0:0:0: [sda] tag#28 access beyond end of device
[765735.365316] blk_update_request: I/O error, dev sda, sector 488379920 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[765735.365320] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_840_EVO_250GB_S1DBNSAF825898A-part1 error=5 type=1 offset=250049470464 size=8192 flags=b08c1
[765735.366325] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_840_EVO_250GB_S1DBNSAF825898A-part1 error=5 type=2 offset=3093606400 size=4096 flags=180880
[765735.373358] sd 0:0:0:0: [sda] tag#23 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[765735.373362] sd 0:0:0:0: [sda] tag#23 CDB: Write(10) 2a 00 1c 47 e3 18 00 00 10 00
[765735.373364] blk_update_request: I/O error, dev sda, sector 474473240 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[765735.374597] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_840_EVO_250GB_S1DBNSAF825898A-part1 error=5 type=2 offset=242929250304 size=8192 flags=180880
[765735.383227] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=119s
[765735.383231] sd 1:0:0:0: [sdb] tag#0 Sense Key : Not Ready [current]
[765735.383232] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Logical unit not ready, hard reset required
[765735.383234] sd 1:0:0:0: [sdb] tag#0 CDB: Write(10) 2a 00 13 db 6c 10 00 00 08 00
[765735.383235] blk_update_request: I/O error, dev sdb, sector 333147152 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[765735.384000] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=170570293248 size=4096 flags=180880
[765735.384007] sd 1:0:0:0: [sdb] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=119s
[765735.384008] sd 1:0:0:0: [sdb] tag#1 Sense Key : Not Ready [current]
[765735.384010] sd 1:0:0:0: [sdb] tag#1 Add. Sense: Logical unit not ready, hard reset required
[765735.384011] sd 1:0:0:0: [sdb] tag#1 CDB: Write(10) 2a 00 13 db 6c 28 00 00 08 00
[765735.384012] blk_update_request: I/O error, dev sdb, sector 333147176 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[765735.384672] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=170570305536 size=4096 flags=180880
[765735.384682] sd 1:0:0:0: [sdb] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=94s
[765735.384683] sd 1:0:0:0: [sdb] tag#17 Sense Key : Not Ready [current]
[765735.384685] sd 1:0:0:0: [sdb] tag#17 Add. Sense: Logical unit not ready, hard reset required
[765735.384686] sd 1:0:0:0: [sdb] tag#17 CDB: Write(10) 2a 00 1a 0f 9b e0 00 00 40 00
[765735.384687] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=223860998144 size=32768 flags=180880
[765735.384691] ata2: EH complete
[765735.384709] sd 1:0:0:0: [sdb] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[765735.384713] sd 1:0:0:0: [sdb] tag#18 CDB: Write(10) 2a 00 1c 47 e3 18 00 00 10 00
[765735.384715] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=242929250304 size=8192 flags=180880
[765735.384778] sd 1:0:0:0: [sdb] tag#31 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[765735.384785] sd 1:0:0:0: [sdb] tag#31 CDB: Write(10) 2a 00 00 5c 3a 58 00 00 08 00
[765735.384789] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=3093606400 size=4096 flags=180880
[765735.384804] sdb: detected capacity change from 488397168 to 0
[765735.384851] sd 1:0:0:0: [sdb] tag#1 access beyond end of device
[765735.384874] sd 1:0:0:0: [sdb] tag#22 access beyond end of device
[765735.384917] sd 1:0:0:0: [sdb] tag#21 access beyond end of device
[765735.384921] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=106290814976 size=36864 flags=40080c80
[765735.384943] sd 1:0:0:0: [sdb] tag#24 access beyond end of device
[765735.384944] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=124823715840 size=12288 flags=40080c80
[765735.384984] sd 1:0:0:0: [sdb] tag#19 access beyond end of device
[765735.384988] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=162255282176 size=12288 flags=40080c80
[765735.385006] sd 1:0:0:0: [sdb] tag#26 access beyond end of device
[765735.385008] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=170570297344 size=8192 flags=40080c80
[765735.385051] sd 1:0:0:0: [sdb] tag#16 access beyond end of device
[765735.385058] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=170570309632 size=20480 flags=40080c80
[765735.385081] sd 1:0:0:0: [sdb] tag#23 access beyond end of device
[765735.385085] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=176249569280 size=4096 flags=180880
[765735.385631] sd 0:0:0:0: [sda] tag#21 access beyond end of device
[765735.385748] sd 1:0:0:0: [sdb] tag#4 access beyond end of device
[765735.385813] sd 0:0:0:0: [sda] tag#25 access beyond end of device
[765735.385860] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=1 offset=270336 size=8192 flags=b08c1
[765735.385942] sd 1:0:0:0: [sdb] tag#21 access beyond end of device
[765735.386719] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=98040389632 size=8192 flags=180880
[765735.387526] sd 1:0:0:0: [sdb] tag#2 access beyond end of device
[765735.393871] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=1 offset=250049208320 size=8192 flags=b08c1
[765735.393881] sd 1:0:0:0: [sdb] tag#3 access beyond end of device
[765735.394460] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=1 offset=250049470464 size=8192 flags=b08c1
[765735.396768] WARNING: Pool 'AppData' has encountered an uncorrectable I/O failure and has been suspended.

After this I tried a reboot, but that hung as well, so at this point I power-cycled the server.

This log shows a lot of errors for ata1 and ata2, but those are on 0000:84:00.0 while I executed the reset on 0000:85:00.0, which leaves me a bit confused. And nowhere in the log do I see errors pertaining to the drives on the second controller, so I assume the second controller executed the reset OK.
 
I encountered such an error as well. You should check that the PCIe device is not labelled in the VM settings.

  1. Ensure you power down your system entirely, then power it up again
  2. Check the IOMMU settings: BIOS + GRUB + /etc/modules
  3. Boot up and ensure your PVE host can read all the drives with lsblk
  4. Check the SATA controller driver (should be AHCI)
  5. Check that /sys/bus/pci/devices/.../reset_method contains only bus
  6. Check that the PCIe device is not labelled in the VM settings
  7. Start the VM

Code:
vim /etc/default/grub
# amd
# amd_iommu=on by default
GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt pcie_acs_override=downstream,multifunction initcall_blacklist=sysfb_init vfio_iommu_type1.allow_unsafe_interrupts=1"
update-grub


cat /etc/modules
# gives "vfio vfio_iommu_type1 vfio_pci vfio_virqfd"
dmesg | grep 'remapping'
# gives "AMD-Vi: Interrupt remapping enabled"

# miscellaneous settings
echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > /etc/modprobe.d/iommu_unsafe_interrupts.conf
echo "options kvm ignore_msrs=1 report_ignored_msrs=0" > /etc/modprobe.d/kvm.conf

echo "bus" > /sys/bus/pci/devices/0000:xx:xx0.0/reset_method
cat /sys/bus/pci/devices/0000:xx:xx0.0/reset_method
# gives "bus"

# check that the PCIe device is not labelled in the VM settings
 
