Alright, this is a pretty niche issue but I'm hoping someone will be able to help.
I have an 8 bay QNAP TL-D800s DAS plugged into my Proxmox machine (Minisforum MS01 with latest BIOS and Intel microcode) and I pass through the PCIe QXP-800eS-A1164 card (two ASM1164 chips) to an Unraid VM.
Everything works fine with spinning disks.
But as soon as I also connect a SATA SSD (Samsung PM883) into the DAS and try to write to it, the device starts to fail and I have to reboot the VM to get the SSD recognized again.
The error I'm getting is:
Code:
[ 277.760008] ata7.00: exception Emask 0x0 SAct 0xffe000 SErr 0x0 action 0x6 frozen
[ 277.760412] ata7.00: failed command: WRITE FPDMA QUEUED
[ 277.760417] ata7.00: cmd 61/00:b8:c0:bc:93/0a:00:01:00:00/40 tag 23 ncq dma 1310720 ou
res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 277.760437] ata7.00: status: { DRDY }
...
[ 277.760453] ata7: hard resetting link
[ 278.952990] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 284.212765] ata7.00: qc timeout after 5000 msecs (cmd 0xec)
[ 284.212804] ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 284.212815] ata7.00: revalidation failed (errno=-5)
[ 284.212829] ata7: hard resetting link
[ 285.356967] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 295.478282] ata7.00: qc timeout after 10000 msecs (cmd 0xec)
[ 295.478322] ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 295.478334] ata7.00: revalidation failed (errno=-5)
[ 295.478346] ata7: limiting SATA link speed to 3.0 Gbps
[ 295.478358] ata7: hard resetting link
[ 296.670816] ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Now, I've searched the web up and down, and this type of failure is usually attributed to a hardware issue, e.g.:
* SATA cable
* failing disk
* HBA card
* motherboard/BIOS
However, I'm quite sure this isn't bad hardware; I've tried different SSDs and different cables, for instance.
But the main reason I know it's not a hardware issue: I've booted Unraid directly on bare metal, without Proxmox, and there the SSD works fine.
The SSD also works if I pass it through as a SCSI block device to the Unraid VM (i.e. the ASM1164 chip stays on the Proxmox host and is not exposed to the VM). But throughput drops to less than a third (122 MB/s vs. 450 MB/s), and I'd have to do the same for the other 3 SATA slots on this chip, which I eventually want to fill with HDDs, so block device passthrough is not a great option.
So this must be some issue with PCI passthrough, and strangely it only affects SSDs, not the HDDs in the DAS.
I've tried disabling NCQ in Unraid to no avail, as well as different file systems (XFS, BTRFS, ZFS), but the ASM1164 chip always breaks down after writing a few gigabytes, even if I just dd to the unpartitioned block device. I've also tried different generations of SATA SSDs (all Samsung, though: SM883, SM863a, PM883, each naturally with different firmware).
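For reference, in case anyone wants to reproduce the NCQ test: one way to disable NCQ per device at runtime is via sysfs (sdX is a placeholder for the SSD's device node), and the libata.force kernel parameter is the boot-time equivalent:

```shell
# Runtime: drop the queue depth to 1, which effectively disables NCQ
# for that device (sdX is a placeholder for the SSD's device node)
cat /sys/block/sdX/device/queue_depth   # default is usually 31 or 32
echo 1 > /sys/block/sdX/device/queue_depth

# Boot-time: force NCQ off for a specific ATA port via the kernel
# command line, e.g. for ata7 (matching the port in the error log above):
#   libata.force=7:noncq
```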
As far as I can tell, IOMMU groups are working correctly, although the PCI bridge has the same group ID as the SATA controller I'm passing through (groups 15 & 18, see troubleshooting info below).
It doesn't make a difference whether I pass through the two ASM1164 chips as mapped devices or as raw devices with all flags enabled. It also doesn't matter which of the two physical cables the SSD uses, which SATA slot it's plugged into, or whether it's a single unmounted SSD or several in a pool. As soon as I write to an SSD, the device/pool fails and locks up while the HDDs keep working (the 4 HDDs are on the second cable & chip).
I also just noticed that the SSD does not lock up if I bypass the page cache by using the direct flag in dd:
Code:
dd if=/dev/zero of=./perf-test-dd bs=4k iflag=fullblock,count_bytes oflag=direct count=4G
Why would that be? Any ideas what this could be or what else I could try?
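Since oflag=direct bypasses the page cache entirely, one (speculative) way I could think of to narrow this down is to keep buffered I/O but cap the kernel's dirty-writeback thresholds, so writeback happens in small bursts instead of multi-gigabyte flushes:

```shell
# Show current thresholds (the *_ratio defaults are percentages of RAM)
sysctl vm.dirty_ratio vm.dirty_background_ratio

# Temporarily cap dirty memory to small absolute values; the *_bytes
# sysctls override the *_ratio ones and reset on reboot
sysctl -w vm.dirty_background_bytes=16777216   # start writeback at 16 MiB
sysctl -w vm.dirty_bytes=67108864              # hard limit at 64 MiB
```

If buffered writes survive with these limits, the failure would seem tied to the size of the queued write bursts rather than the total volume written.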
I know passing through a DAS and virtualizing Unraid is a bit of an esoteric use case but I'm trying to avoid buying additional hardware in a homelab scenario and feel like 99% of my setup works really well. Until I throw in the SSD as cache drive, that is ...
Leaving some more troubleshooting info here:
Code:
# On newest PVE version
root@ms01-1:~# pveversion
pve-manager/8.3.2/3e76eec21c4a14a7 (running kernel: 6.8.12-5-pve)
# Enabled iommu both in BIOS and Proxmox
root@ms01-1:~# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.12-5-pve root=/dev/mapper/pve-root ro quiet intel_iommu=on iommu=pt
# Group 15 is slot 1, Group 18 is slot 2
root@ms01-1:~# for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU group %s ' "$n"; lspci -nns "${d##*/}"; done
...
IOMMU group 14 01:00.0 PCI bridge [0604]: ASMedia Technology Inc. ASM2812 6-Port PCIe x4 Gen3 Packet Switch [1b21:2812] (rev 01)
IOMMU group 15 02:00.0 PCI bridge [0604]: ASMedia Technology Inc. ASM2812 6-Port PCIe x4 Gen3 Packet Switch [1b21:2812] (rev 01)
IOMMU group 15 03:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1164 Serial ATA AHCI Controller [1b21:1164] (rev 02)
IOMMU group 16 02:02.0 PCI bridge [0604]: ASMedia Technology Inc. ASM2812 6-Port PCIe x4 Gen3 Packet Switch [1b21:2812] (rev 01)
IOMMU group 17 02:03.0 PCI bridge [0604]: ASMedia Technology Inc. ASM2812 6-Port PCIe x4 Gen3 Packet Switch [1b21:2812] (rev 01)
IOMMU group 18 02:08.0 PCI bridge [0604]: ASMedia Technology Inc. ASM2812 6-Port PCIe x4 Gen3 Packet Switch [1b21:2812] (rev 01)
IOMMU group 18 06:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1164 Serial ATA AHCI Controller [1b21:1164] (rev 02)
...
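# (Speculative check) The bridge and controller sharing a group suggests
# the ASM2812 downstream ports may not advertise ACS; this can be verified
# with lspci (bus addresses taken from the listing above):
lspci -vvv -s 02:00.0 | grep -i 'Access Control'
lspci -vvv -s 02:08.0 | grep -i 'Access Control'
# No output means no ACS capability, which would explain the shared groups
# (harmless for passthrough as long as nothing else sits behind the bridge).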
# The unraid VM with both PCI devices passed through
root@ms01-1:/etc/pve/qemu-server# cat /etc/pve/qemu-server/100.conf
agent: 1
args: -cpu 'host,+kvm_pv_unhalt,+kvm_pv_eoi,hv_vendor_id=NV43FIX,kvm=off'
bios: ovmf
boot: order=usb0
cores: 8
cpu: host,hidden=1,flags=+pcid
efidisk0: local-lvm:vm-100-disk-2,efitype=4m,size=4M
hostpci0: 0000:03:00,pcie=1
hostpci1: 0000:06:00,pcie=1
machine: q35
memory: 16384
meta: creation-qemu=9.0.2,ctime=1733583787
name: unraid
net0: virtio=BC:24:11:2D:8A:85,bridge=vmbr0
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-100-disk-0,cache=writeback,discard=on,iothread=1,size=512G,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=5533f6d3-41bc-48ed-8a7c-2c5ee0ce1ff5
sockets: 1
startup: order=5
usb0: mapping=unraid-usb
vmgenid: ae50c49c-ea1b-4fd1-828b-9843e0947c4c
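One more thing still on my list to rule out (an untested guess on my part, not a confirmed fix for the ASM1164): ASMedia SATA controllers are sometimes reported to misbehave with link power management, so disabling PCIe ASPM and SATA LPM might be worth a shot:

```shell
# Speculative: disable PCIe ASPM at boot by appending to the kernel cmdline:
#   intel_iommu=on iommu=pt pcie_aspm=off

# And at runtime (inside the Unraid VM), pin every SATA link to
# max_performance so it never enters a partial/slumber power state:
for h in /sys/class/scsi_host/host*/link_power_management_policy; do
    echo max_performance > "$h"
done
```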