AMD Epyc PCIe-Passthrough FLR error.

youpko

Member
Mar 7, 2022
Greetings all,

I recently acquired some server hardware and was planning on running Proxmox on it. While learning how to use Proxmox I ran into some trouble when trying to pass through the AMD Epyc internal SATA controller to a VM.

I have an AMD Epyc 7282 on a Gigabyte MZ01-CE1; there are 2 SATA controllers in this configuration. When I start the VM, the system locks up with an FLR timeout. I discovered that this is unfortunately a problem with AMD systems; I found a thread on servethehome.com that describes the same problem.

After searching through the Proxmox forum I found a thread that goes into detail on how to compile the Proxmox kernel with a patch that disables the FLR function. I followed this successfully and I can now start the VM without crashing the system.

But as this is my first time working with this stuff, I would like to know what the implications are of running a custom compiled kernel with FLR disabled.
 
Thanks for the reply, I will take a look at that.

I am quite a novice at Linux, and not having to compile the kernel myself would be very nice.
 
Hi, youpko, did you ever find a solution for this problem?
My system is an Epyc 7282 on an H11SSL-i.
I got the error "softreset failed" when trying to pass the SATA controller through to my TrueNAS VM.
 
With a custom compiled kernel it worked fine as far as I tested it. But I have no way to tell how stable or how reliable it is.

After that I did not have time for anything else, but I just recently started working on this again.

I did one test based on the information leesteken mentioned, but without success so far.

  1. Updated Proxmox to kernel version 5.15.39
  2. Identified the controller I want to pass through
    Code:
    root@pve:# lspci | grep SATA
    84:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
    85:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
  3. Checked which reset methods are available for the PCI device
    Code:
    root@pve:# cat /sys/bus/pci/devices/0000\:85\:00.0/reset_method
    flr bus
  4. Disabled the flr method
    Code:
    root@pve:# echo bus > /sys/bus/pci/devices/0000\:85\:00.0/reset_method
  5. I repeated step 3 here to check if flr was disabled
    Code:
    root@pve:# cat /sys/bus/pci/devices/0000\:85\:00.0/reset_method
    bus
All that was easy to configure, but after creating a VM and passing the second SATA controller (PCI device 85:00.0) to it, everything went wrong.
Upon starting the VM, the first SATA controller (PCI device 84:00.0) lost its disks and I got a bunch of read errors, because my VM disk images are on the first SATA controller.
I have not figured out why the first controller throws errors when I pass through the second one.

But the above steps did prevent the system from crashing with the FLR reset timeout that I had before. So this is at least a step in the right direction.

I did remember that the Proxmox PCIe passthrough wiki recommends disabling the device so the host won't use it. But both SATA controllers have the same device ID, so I can't blacklist one of them, only both. There is probably a workaround for this, but I have not found it yet.
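
One possible workaround for the identical device IDs, offered as a hedged sketch rather than a tested recipe: the kernel's per-device `driver_override` sysfs attribute selects a driver by PCI address instead of by vendor:device ID. The names `bind_to_vfio` and `SYSFS_ROOT` are made up for this example. It assumes the `vfio-pci` module is already loaded and that nothing on the host is using disks on the controller being rebound.

```shell
#!/bin/bash
# Sketch: hand ONE of two identical controllers to vfio-pci, selected by address.
# SYSFS_ROOT and bind_to_vfio are illustrative names, not part of Proxmox.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

bind_to_vfio() {
    local addr="$1"                                  # e.g. 0000:85:00.0
    local dev="$SYSFS_ROOT/bus/pci/devices/$addr"

    [ -d "$dev" ] || { echo "no such PCI device: $addr" >&2; return 1; }

    # From now on only vfio-pci is allowed to bind to this one device.
    echo "vfio-pci" > "$dev/driver_override"

    # Detach the driver that is currently bound (ahci on the host), if any.
    if [ -e "$dev/driver/unbind" ]; then
        echo "$addr" > "$dev/driver/unbind"
    fi

    # Ask the kernel to probe the device again; the override steers it to vfio-pci.
    echo "$addr" > "$SYSFS_ROOT/bus/pci/drivers_probe"
}
```

Called as `bind_to_vfio 0000:85:00.0` (as root), this would leave 84:00.0 on the host's ahci driver. Whether it also avoids the disk loss described above is not something this sketch can promise.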
 
Thank you very much.

I think a shell script (that reorders the reset_method) is better than a custom compiled kernel, which makes it hard to "apt upgrade".

I tried setting the 'bus' reset method yesterday, and it worked fine.

The H11SSL has an onboard NVMe port, so I can pass both controllers through to the VM.

I will test one controller for the VM and one for the host later.

Good luck!
 
Hi, youpko,

I tried passing my second controller through to the VM and keeping the first for the host, and both worked fine.

Sorry, I have no idea what causes your issue.
Good luck to you!
 
Something very much like this happens when devices are in the same IOMMU group. A group cannot be shared between VMs, or between a VM and the host, because devices in a group are not properly isolated from each other. Are you using pcie_acs_override (which lies to the kernel about the groups)? If not, please show the IOMMU groups using this command:
Code:
for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nns "${d##*/}"; done
EDIT: The command got mangled by copy-paste from search.
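
A narrower check than listing every group, sketched under assumptions: each PCI device in sysfs has an `iommu_group` symlink whose target ends in the group number, so two devices can be compared directly. The function names and the `SYSFS_ROOT` variable are illustrative only, and the result is only as trustworthy as the groups the kernel reports (so not with pcie_acs_override active).

```shell
#!/bin/bash
# Sketch: compare the IOMMU groups of two PCI devices via their sysfs symlinks.
# SYSFS_ROOT, iommu_group_of and same_iommu_group are illustrative names.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

iommu_group_of() {
    local link="$SYSFS_ROOT/bus/pci/devices/$1/iommu_group"
    [ -L "$link" ] || { echo "no IOMMU group for $1" >&2; return 1; }
    # The symlink target ends in .../iommu_groups/<number>
    basename "$(readlink "$link")"
}

# Returns 0 if both devices share a group, 1 if not, 2 if a lookup failed.
same_iommu_group() {
    local a b
    a=$(iommu_group_of "$1") || return 2
    b=$(iommu_group_of "$2") || return 2
    [ "$a" = "$b" ]
}
```

For example, `same_iommu_group 0000:84:00.0 0000:85:00.0 && echo "shared group"` prints nothing when the two controllers are in separate groups.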
 
I should have said that the two controllers are in different groups.

The output of the command you sent (I don't see IOMMU groups in the list):
Code:
84:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
85:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)

I found this script via Google that gives a nice overview of the PCIe devices (the Proxmox web UI shows the same):
Code:
Group 28:       [1022:7901] [R] 84:00.0  SATA controller                          FCH SATA Controller [AHCI mode]
Group 29:       [1022:7901] [R] 85:00.0  SATA controller                          FCH SATA Controller [AHCI mode]

So IOMMU groups shouldn't be a problem, I think.
 
Sorry, I did not notice that the command got mangled by copy and paste from search. And you are sure you are not using pcie_acs_override? Sorry for asking twice, but this is essential for determining the actual groups.
 
Sorry, I missed the part about `pcie_acs_override`. I did not set that.

This is my grub file:
Code:
root@pve:~# cat /etc/default/grub
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on"
GRUB_CMDLINE_LINUX=""
 
Thanks for confirming that the group information is valid. Note that you don't need amd_iommu=on because it is on by default. You can also check the current kernel parameters with cat /proc/cmdline. Sorry, but I'm out of ideas. Maybe the two controllers are not as isolated as they are presented to be?
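
The same check can be scripted, as a small sketch: the helper below reports whether a given parameter appears on the kernel command line, matching whole parameter names so that `iommu` does not match `amd_iommu=on`. The name `has_kernel_param` and the optional file argument are assumptions made for this example.

```shell
#!/bin/bash
# Sketch: test whether a kernel parameter is present on the command line.
# has_kernel_param is an illustrative name; the file argument defaults to /proc/cmdline.
has_kernel_param() {
    local name="$1" file="${2:-/proc/cmdline}"
    # Split each word at "=" and compare the part before it with the wanted name.
    awk -v n="$name" '
        { for (i = 1; i <= NF; i++) { split($i, kv, "="); if (kv[1] == n) found = 1 } }
        END { exit !found }
    ' "$file"
}
```

For example, `has_kernel_param pcie_acs_override && echo "ACS override is active"` stays silent on a host booted with the command line shown in this thread.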
 
This looks fine as well, right?
Code:
root@pve:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.15.39-1-pve root=/dev/mapper/pve-root ro quiet amd_iommu=on

While I was compiling the custom kernel, I read around the internet that quite a few people had weird issues with the EPYC onboard SATA controllers.
But I have only one server, which is not really production but is in use, so testing time is a bit limited.
 
Are you using the latest BIOS version? Often enough BIOS updates can improve compatibility with PCI Passthrough.
 
Yes, I have the latest version of the BIOS and BMC firmware.
The BIOS version is R34, and according to the Gigabyte page that is the latest.

Tomorrow I will test it again and capture some logs to see if they reveal something useful.
 
One thing I forgot to ask: Is writing to the /sys/bus/pci/devices/.../reset_method file persistent across reboots?
 
I wrote a shell script like this:
Bash:
root@pve00:/usr/local/bin# vim pve-pre-hook.sh
#!/bin/bash

# Set the reset_method of every SATA controller with ID 1022:7901 to "bus"
#echo "bus" > /sys/bus/pci/devices/0000:44:00/reset_method
lspci -n | grep 1022:7901 | awk '{print $1}' | while read -r _id; do
    echo "set reset_method for PCI ${_id}"
    echo "bus" > "/sys/bus/pci/devices/0000:${_id}/reset_method"
done

and inserted a command in /usr/lib/systemd/system/pve-guests.service, so the script runs before the guests start:
Bash:
root@pve00:/usr/local/bin# vim /usr/lib/systemd/system/pve-guests.service
...
ExecStartPre=-/usr/local/bin/pve-pre-hook.sh
...

I do not know if there is a more elegant solution, but this worked fine.

Hope it helps you.
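
A possibly more upgrade-friendly variant, as a sketch only: files under /usr/lib/systemd/system belong to the package and may be replaced on upgrade, whereas systemd also reads drop-in files from /etc/systemd/system/<unit>.d/. The function name, the drop-in file name `reset-method.conf` and the `etc_root` argument are assumptions for this example; the hook path is the one from the post above.

```shell
#!/bin/bash
# Sketch: add the pre-start hook through a drop-in instead of editing the packaged unit.
# install_prehook_dropin and reset-method.conf are illustrative names.
install_prehook_dropin() {
    local hook="${1:-/usr/local/bin/pve-pre-hook.sh}"
    local etc_root="${2:-/etc}"
    local dir="$etc_root/systemd/system/pve-guests.service.d"

    mkdir -p "$dir"
    # The leading "-" tells systemd to ignore a failing hook, as in the original line.
    cat > "$dir/reset-method.conf" <<EOF
[Service]
ExecStartPre=-$hook
EOF
}
```

After `install_prehook_dropin`, a `systemctl daemon-reload` is needed for systemd to pick up the new file. An ExecStartPre line in a drop-in is added to the ones the unit already has rather than replacing them.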
 
I thought I would try a manual reset of the SATA controller and see what happens.
First I checked which disks belong to which controller:
Code:
root@pve:~# ls -al /sys/block/sd*
lrwxrwxrwx 1 root root 0 Jul  6 13:23 /sys/block/sda -> ../devices/pci0000:80/0000:80:08.2/0000:84:00.0/ata1/host0/target0:0:0/0:0:0:0/block/sda
lrwxrwxrwx 1 root root 0 Jul  6 13:23 /sys/block/sdb -> ../devices/pci0000:80/0000:80:08.2/0000:84:00.0/ata2/host1/target1:0:0/1:0:0:0/block/sdb
lrwxrwxrwx 1 root root 0 Jul  6 13:23 /sys/block/sdc -> ../devices/pci0000:80/0000:80:08.3/0000:85:00.0/ata9/host8/target8:0:0/8:0:0:0/block/sdc
lrwxrwxrwx 1 root root 0 Jul  6 13:23 /sys/block/sdd -> ../devices/pci0000:80/0000:80:08.3/0000:85:00.0/ata10/host9/target9:0:0/9:0:0:0/block/sdd
lrwxrwxrwx 1 root root 0 Jul  6 13:23 /sys/block/sde -> ../devices/pci0000:80/0000:80:08.3/0000:85:00.0/ata11/host10/target10:0:0/10:0:0:0/block/sde
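
The mapping can also be pulled out of those symlinks with a short helper, sketched here under the assumption that the last domain:bus:device.function component in the resolved path is the controller (which holds for the paths listed above). `controller_of` and `SYSFS_ROOT` are illustrative names.

```shell
#!/bin/bash
# Sketch: print the PCI address of the controller a block device hangs off.
# SYSFS_ROOT and controller_of are illustrative names.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

controller_of() {
    local disk="$1" path addr
    path=$(readlink -f "$SYSFS_ROOT/block/$disk") || return 1
    # Keep the last component that looks like 0000:85:00.0 in the resolved path.
    addr=$(echo "$path" | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]' | tail -n 1)
    [ -n "$addr" ] && echo "$addr"
}
```

A loop such as `for d in /sys/block/sd*; do echo "${d##*/} $(controller_of "${d##*/}")"; done` then lists every disk next to its controller.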

Then I ran a PCI device reset:
Code:
root@pve:~# echo 1 > /sys/bus/pci/devices/0000\:85\:00.0/reset

After a minute or so I got the following from dmesg -w:
Code:
[765672.687949] ata1.00: exception Emask 0x0 SAct 0x4000c0 SErr 0xd0000 action 0x6 frozen
[765672.688005] ata1: SError: { PHYRdyChg CommWake 10B8B }
[765672.688036] ata1.00: failed command: WRITE FPDMA QUEUED
[765672.688066] ata1.00: cmd 61/40:30:e0:9b:0f/00:00:1a:00:00/40 tag 6 ncq dma 32768 out
                         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[765672.688147] ata1.00: status: { DRDY }
[765672.688169] ata1.00: failed command: WRITE FPDMA QUEUED
[765672.688952] ata1.00: cmd 61/08:38:10:6c:db/00:00:13:00:00/40 tag 7 ncq dma 4096 out
                         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[765672.689828] ata1.00: status: { DRDY }
[765672.690261] ata1.00: failed command: WRITE FPDMA QUEUED
[765672.690817] ata1.00: cmd 61/08:b0:18:6c:db/00:00:13:00:00/40 tag 22 ncq dma 4096 out
                         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[765672.691766] ata1.00: status: { DRDY }
[765672.692187] ata1: hard resetting link
[765672.692198] ata2.00: exception Emask 0x0 SAct 0x20003 SErr 0xd0000 action 0x6 frozen
[765672.692814] ata2: SError: { PHYRdyChg CommWake 10B8B }
[765672.693334] ata2.00: failed command: WRITE FPDMA QUEUED
[765672.693812] ata2.00: cmd 61/08:00:10:6c:db/00:00:13:00:00/40 tag 0 ncq dma 4096 out
                         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[765672.694752] ata2.00: status: { DRDY }
[765672.695198] ata2.00: failed command: WRITE FPDMA QUEUED
[765672.695679] ata2.00: cmd 61/08:08:28:6c:db/00:00:13:00:00/40 tag 1 ncq dma 4096 out
                         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[765672.696681] ata2.00: status: { DRDY }
[765672.697220] ata2.00: failed command: WRITE FPDMA QUEUED
[765672.697766] ata2.00: cmd 61/40:88:e0:9b:0f/00:00:1a:00:00/40 tag 17 ncq dma 32768 out
                         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[765672.698901] ata2.00: status: { DRDY }
[765672.699470] ata2: hard resetting link
[765678.547743] ata1: failed to reset engine (errno=-5)
[765678.863958] ata2: failed to reset engine (errno=-5)
[765683.192279] ata1: softreset failed (1st FIS failed)
[765683.192868] ata1: hard resetting link
[765683.196459] ata2: softreset failed (1st FIS failed)
[765683.197247] ata2: hard resetting link
[765689.552462] ata1: failed to reset engine (errno=-5)
[765689.555634] ata2: failed to reset engine (errno=-5)
[765693.691804] ata1: softreset failed (1st FIS failed)
[765693.692371] ata1: hard resetting link
[765693.696040] ata2: softreset failed (1st FIS failed)
[765693.696796] ata2: hard resetting link
[765700.044176] ata1: failed to reset engine (errno=-5)
[765700.047532] ata2: failed to reset engine (errno=-5)
[765729.191781] ata1: softreset failed (1st FIS failed)
[765729.192358] ata1: limiting SATA link speed to 3.0 Gbps
[765729.192360] ata1: hard resetting link
[765729.195675] ata2: softreset failed (1st FIS failed)
[765729.202016] ata2: limiting SATA link speed to 3.0 Gbps
[765729.202062] ata2: hard resetting link
[765734.703638] ata1: failed to reset engine (errno=-5)
[765734.710791] ata2: failed to reset engine (errno=-5)
[765734.862820] ata1: softreset failed (device not ready)
[765734.863631] ata1: reset failed, giving up
[765734.864146] ata1.00: disabled
[765734.870799] ata2: softreset failed (device not ready)
[765734.876558] ata2: reset failed, giving up
[765734.882792] ata2.00: disabled
[765735.363261] sd 0:0:0:0: [sda] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=94s
[765735.363266] sd 0:0:0:0: [sda] tag#6 Sense Key : Not Ready [current]
[765735.363268] sd 0:0:0:0: [sda] tag#6 Add. Sense: Logical unit not ready, hard reset required
[765735.363271] sd 0:0:0:0: [sda] tag#6 CDB: Write(10) 2a 00 1a 0f 9b e0 00 00 40 00
[765735.363272] blk_update_request: I/O error, dev sda, sector 437230560 op 0x1:(WRITE) flags 0x700 phys_seg 8 prio class 0
[765735.363880] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_840_EVO_250GB_S1DBNSAF825898A-part1 error=5 type=2 offset=223860998144 size=32768 flags=180880
[765735.363892] sd 0:0:0:0: [sda] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=119s
[765735.363893] sd 0:0:0:0: [sda] tag#7 Sense Key : Not Ready [current]
[765735.363895] sd 0:0:0:0: [sda] tag#7 Add. Sense: Logical unit not ready, hard reset required
[765735.363896] sd 0:0:0:0: [sda] tag#7 CDB: Write(10) 2a 00 13 db 6c 10 00 00 08 00
[765735.363897] blk_update_request: I/O error, dev sda, sector 333147152 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[765735.364469] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_840_EVO_250GB_S1DBNSAF825898A-part1 error=5 type=2 offset=170570293248 size=4096 flags=180880
[765735.364480] sd 0:0:0:0: [sda] tag#22 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=119s
[765735.364482] sd 0:0:0:0: [sda] tag#22 Sense Key : Not Ready [current]
[765735.364483] sd 0:0:0:0: [sda] tag#22 Add. Sense: Logical unit not ready, hard reset required
[765735.364484] sd 0:0:0:0: [sda] tag#22 CDB: Write(10) 2a 00 13 db 6c 18 00 00 08 00
[765735.364485] blk_update_request: I/O error, dev sda, sector 333147160 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[765735.365031] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_840_EVO_250GB_S1DBNSAF825898A-part1 error=5 type=2 offset=170570297344 size=4096 flags=180880
[765735.365037] ata1: EH complete
[765735.365108] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[765735.365117] sd 0:0:0:0: [sda] tag#0 CDB: Write(10) 2a 00 00 5c 3a 58 00 00 08 00
[765735.365119] blk_update_request: I/O error, dev sda, sector 6044248 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[765735.365235] sda: detected capacity change from 488397168 to 0
[765735.365268] sd 0:0:0:0: [sda] tag#26 access beyond end of device
[765735.365273] blk_update_request: I/O error, dev sda, sector 2576 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[765735.365280] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_840_EVO_250GB_S1DBNSAF825898A-part1 error=5 type=1 offset=270336 size=8192 flags=b08c1
[765735.365297] sd 0:0:0:0: [sda] tag#27 access beyond end of device
[765735.365299] blk_update_request: I/O error, dev sda, sector 488379408 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[765735.365304] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_840_EVO_250GB_S1DBNSAF825898A-part1 error=5 type=1 offset=250049208320 size=8192 flags=b08c1
[765735.365315] sd 0:0:0:0: [sda] tag#28 access beyond end of device
[765735.365316] blk_update_request: I/O error, dev sda, sector 488379920 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[765735.365320] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_840_EVO_250GB_S1DBNSAF825898A-part1 error=5 type=1 offset=250049470464 size=8192 flags=b08c1
[765735.366325] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_840_EVO_250GB_S1DBNSAF825898A-part1 error=5 type=2 offset=3093606400 size=4096 flags=180880
[765735.373358] sd 0:0:0:0: [sda] tag#23 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[765735.373362] sd 0:0:0:0: [sda] tag#23 CDB: Write(10) 2a 00 1c 47 e3 18 00 00 10 00
[765735.373364] blk_update_request: I/O error, dev sda, sector 474473240 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[765735.374597] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_840_EVO_250GB_S1DBNSAF825898A-part1 error=5 type=2 offset=242929250304 size=8192 flags=180880
[765735.383227] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=119s
[765735.383231] sd 1:0:0:0: [sdb] tag#0 Sense Key : Not Ready [current]
[765735.383232] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Logical unit not ready, hard reset required
[765735.383234] sd 1:0:0:0: [sdb] tag#0 CDB: Write(10) 2a 00 13 db 6c 10 00 00 08 00
[765735.383235] blk_update_request: I/O error, dev sdb, sector 333147152 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[765735.384000] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=170570293248 size=4096 flags=180880
[765735.384007] sd 1:0:0:0: [sdb] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=119s
[765735.384008] sd 1:0:0:0: [sdb] tag#1 Sense Key : Not Ready [current]
[765735.384010] sd 1:0:0:0: [sdb] tag#1 Add. Sense: Logical unit not ready, hard reset required
[765735.384011] sd 1:0:0:0: [sdb] tag#1 CDB: Write(10) 2a 00 13 db 6c 28 00 00 08 00
[765735.384012] blk_update_request: I/O error, dev sdb, sector 333147176 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[765735.384672] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=170570305536 size=4096 flags=180880
[765735.384682] sd 1:0:0:0: [sdb] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=94s
[765735.384683] sd 1:0:0:0: [sdb] tag#17 Sense Key : Not Ready [current]
[765735.384685] sd 1:0:0:0: [sdb] tag#17 Add. Sense: Logical unit not ready, hard reset required
[765735.384686] sd 1:0:0:0: [sdb] tag#17 CDB: Write(10) 2a 00 1a 0f 9b e0 00 00 40 00
[765735.384687] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=223860998144 size=32768 flags=180880
[765735.384691] ata2: EH complete
[765735.384709] sd 1:0:0:0: [sdb] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[765735.384713] sd 1:0:0:0: [sdb] tag#18 CDB: Write(10) 2a 00 1c 47 e3 18 00 00 10 00
[765735.384715] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=242929250304 size=8192 flags=180880
[765735.384778] sd 1:0:0:0: [sdb] tag#31 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
[765735.384785] sd 1:0:0:0: [sdb] tag#31 CDB: Write(10) 2a 00 00 5c 3a 58 00 00 08 00
[765735.384789] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=3093606400 size=4096 flags=180880
[765735.384804] sdb: detected capacity change from 488397168 to 0
[765735.384851] sd 1:0:0:0: [sdb] tag#1 access beyond end of device
[765735.384874] sd 1:0:0:0: [sdb] tag#22 access beyond end of device
[765735.384917] sd 1:0:0:0: [sdb] tag#21 access beyond end of device
[765735.384921] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=106290814976 size=36864 flags=40080c80
[765735.384943] sd 1:0:0:0: [sdb] tag#24 access beyond end of device
[765735.384944] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=124823715840 size=12288 flags=40080c80
[765735.384984] sd 1:0:0:0: [sdb] tag#19 access beyond end of device
[765735.384988] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=162255282176 size=12288 flags=40080c80
[765735.385006] sd 1:0:0:0: [sdb] tag#26 access beyond end of device
[765735.385008] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=170570297344 size=8192 flags=40080c80
[765735.385051] sd 1:0:0:0: [sdb] tag#16 access beyond end of device
[765735.385058] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=170570309632 size=20480 flags=40080c80
[765735.385081] sd 1:0:0:0: [sdb] tag#23 access beyond end of device
[765735.385085] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=176249569280 size=4096 flags=180880
[765735.385631] sd 0:0:0:0: [sda] tag#21 access beyond end of device
[765735.385748] sd 1:0:0:0: [sdb] tag#4 access beyond end of device
[765735.385813] sd 0:0:0:0: [sda] tag#25 access beyond end of device
[765735.385860] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=1 offset=270336 size=8192 flags=b08c1
[765735.385942] sd 1:0:0:0: [sdb] tag#21 access beyond end of device
[765735.386719] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=2 offset=98040389632 size=8192 flags=180880
[765735.387526] sd 1:0:0:0: [sdb] tag#2 access beyond end of device
[765735.393871] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=1 offset=250049208320 size=8192 flags=b08c1
[765735.393881] sd 1:0:0:0: [sdb] tag#3 access beyond end of device
[765735.394460] zio pool=AppData vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_250GB_S21PNSBG702814V-part1 error=5 type=1 offset=250049470464 size=8192 flags=b08c1
[765735.396768] WARNING: Pool 'AppData' has encountered an uncorrectable I/O failure and has been suspended.

After this I tried a reboot, but that hung as well, so at this point I power-cycled the server.

This log shows a lot of errors for ata1 and ata2, but those are on 0000:84:00.0, while I executed the reset on 0000:85:00.0, which leaves me a bit confused. Nowhere in the log do I see errors pertaining to drives on the second controller, so I assume that the second controller executed the reset OK.
 
I encountered the same error. You should check that PCI-E is not enabled in the VM settings.

  1. Power the system down entirely, then power it up again
  2. Check the IOMMU settings in the BIOS, GRUB, and /etc/modules
  3. Boot up and make sure the PVE host can read all the drives with lsblk
  4. Check the SATA controller mode (should be AHCI)
  5. Check that /sys/bus/pci/devices/.../reset_method contains only bus
  6. Check that PCI-E is not enabled in the VM settings
  7. Start the VM

Code:
vim /etc/default/grub
# amd
# amd_iommu=on by default
GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt pcie_acs_override=downstream,multifunction initcall_blacklist=sysfb_init vfio_iommu_type1.allow_unsafe_interrupts=1"
update-grub


cat /etc/modules 
# gives "vfio vfio_iommu_type1 vfio_pci vfio_virqfd"
dmesg | grep 'remapping'
# gives "AMD-Vi: Interrupt remapping enabled"

# miscellaneous settings
echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > /etc/modprobe.d/iommu_unsafe_interrupts.conf
echo "options kvm ignore_msrs=1 report_ignored_msrs=0" > /etc/modprobe.d/kvm.conf

echo "bus" >  /sys/bus/pci/devices/0000:xx:xx0.0/reset_method
cat /sys/bus/pci/devices/0000:xx:xx0.0/reset_method
# gives "bus"

# check pci-e is not labelled in vm settings
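
Step 5 of the checklist can be automated, as a sketch: the helper below walks the PCI devices in sysfs and prints every controller with ID 1022:7901 (the ID reported earlier in this thread) whose reset_method is not exactly "bus". `controllers_not_bus_only` and `SYSFS_ROOT` are illustrative names.

```shell
#!/bin/bash
# Sketch: list 1022:7901 SATA controllers whose reset_method is not exactly "bus".
# SYSFS_ROOT and controllers_not_bus_only are illustrative names.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

controllers_not_bus_only() {
    local dev method
    for dev in "$SYSFS_ROOT"/bus/pci/devices/*; do
        [ -r "$dev/reset_method" ] || continue
        # sysfs stores the IDs as hex strings such as 0x1022 and 0x7901.
        [ "$(cat "$dev/vendor" 2>/dev/null)" = "0x1022" ] || continue
        [ "$(cat "$dev/device" 2>/dev/null)" = "0x7901" ] || continue
        method=$(cat "$dev/reset_method")
        if [ "$method" != "bus" ]; then
            echo "${dev##*/}: $method"
        fi
    done
}
```

Empty output from `controllers_not_bus_only` means every matching controller is already limited to the bus reset.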
 