Pool 'rpool' has encountered an uncorrectable I/O failure after GPU installation

thermosiphonas

I bought a GT 1030 to pass through to my Windows VM, but when I install it and boot Proxmox I get the message
WARNING: Pool 'rpool' has encountered an uncorrectable I/O failure and has been suspended
and the system hangs. If I remove the GPU, everything boots as normal.

I have tried not starting the VMs to see if there is a conflict, but I still get the error message.
 
Are you using an NVMe drive for the rpool? Maybe the PCIe slot and the M.2 slot are mutually exclusive (sharing PCIe lanes)?
Maybe the BIOS renumbers the PCI IDs, and that somehow changes an ID of the vdev of your rpool?
It's not a common Proxmox issue; maybe it's a Linux incompatibility with your motherboard (you could search outside this forum)?
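
One way to check the renumbering theory (a sketch; run it from the Proxmox host or a live system, once without and once with the GPU installed, and compare the addresses):
Code:
# show the PCI address of the NVMe controller
lspci -nn | grep -Ei 'non-volatile|nvme'

# full device list, useful for a before/after diff
lspci -nn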
 
I am using an NVMe drive for the rpool.
I am leaning towards the BIOS renumbering the PCI IDs as a possible cause, but how can I fix that?
 
Code:
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:49 with 0 errors on Sun Jan 14 00:24:50 2024
config:

        NAME                               STATE     READ WRITE CKSUM
        rpool                              ONLINE       0     0     0
          nvme-eui.0025385491b053e9-part3  ONLINE       0     0     0

errors: No known data errors

That is without the GPU installed.
 
nvme-eui.0025385491b053e9-part3 ONLINE 0 0 0
I never use /dev/disk/by-id/nvme-eui.* but I think those should be stable.

Have you tried any other PCIe device instead of your GT1030, to see if it matters?
Have you tried another PCIe slot for the GT1030, to see if it matters?
Have you tried another M.2 slot for the NVMe, to see if it matters?
Does the motherboard BIOS see the NVMe when the GT1030 is installed? I guess so, since it boots from it.
Have you tried updating the motherboard BIOS? Have you done any other troubleshooting, and can you provide a link to the motherboard manual?
 
I don't have another PCIe device available, unfortunately.
The motherboard is an ITX one (ASRock B550M-ITX/ac) and only has one NVMe slot and one PCIe slot. It sees the NVMe when the GPU is installed. It boots up to a point and then hangs.
It is updated to the latest BIOS version, and I have tried loading the default BIOS settings (it booted OK without the GPU).

Here's a link to the manual
https://download.asrock.com/Manual/B550M-ITXac.pdf
 
Boot the system with a Linux Live CD/USB (or the latest Ubuntu installer, but don't install it) with the GT1030 installed and check ls /dev/disk/by-id/ to see whether the nvme-eui name is different.
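
For example (a sketch; ls -l also shows which kernel device node each stable name points to):
Code:
ls -l /dev/disk/by-id/ | grep nvme-eui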
 
Did you make changes to your Proxmox installation in preparation for adding the GT1030? Like anything that depends on the PCI ID, or automatically starting a VM with passthrough? Try disabling all of that, and definitely don't start VMs with passthrough or automatically while troubleshooting this.
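
For example, you can turn off autostart from the CLI (a sketch; 100 is a hypothetical VM ID, check yours with qm list):
Code:
# list all VMs and their IDs
qm list

# disable "start at boot" for VM 100
qm set 100 --onboot 0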
 
Boot the system with a Linux Live CD/USB (or the latest Ubuntu installer, but don't install it) with the GT1030 installed and check ls /dev/disk/by-id/ to see whether the nvme-eui name is different.
I will try this in the afternoon when I get home.

I didn't make any changes to my Proxmox installation (I don't know if something needs to be changed). I have a NAS VM with HDD passthrough (I pass through just the HDDs, not the whole controller), but I have disabled start at boot for all my VMs, and the system still hangs.
 
Maybe you can find the logs (use journalctl and scroll with the arrow keys) from the minute before (and up to) the freeze/I/O errors and share them?
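
For example (a sketch; -e jumps to the end of the log so you can scroll back up from there):
Code:
journalctl -b -e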
 
Booted with a Linux Live USB. Both with and without the GPU, ls /dev/disk/by-id outputs
Code:
nvme-eui.0025385491b053e9
nvme-eui.0025385491b053e9-part1
nvme-eui.0025385491b053e9-part2
nvme-eui.0025385491b053e9-part3
which is the same as in Proxmox.


journalctl -b outputs the contents of the attached file (it was too long to include in the post)
 


Booted with a Linux Live USB. Both with and without the GPU, ls /dev/disk/by-id outputs
Code:
nvme-eui.0025385491b053e9
nvme-eui.0025385491b053e9-part1
nvme-eui.0025385491b053e9-part2
nvme-eui.0025385491b053e9-part3
which is the same as in Proxmox.
That rules that out.
journalctl -b outputs the contents of the attached file (it was too long to include in the post)
Why did you run journalctl -b instead of journalctl? A log of a working boot does not give clues about why it fails. Please show the log of the minute before the "encountered an uncorrectable I/O failure". Do something like: add the GPU, boot Proxmox until it freezes with the error, power off the system, remove the GPU, boot Proxmox normally, and check the log of the previous boot (the one that led to the issue).
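
For example (a sketch; -b -1 selects the previous boot):
Code:
# list the boots the journal knows about
journalctl --list-boots

# show the log of the previous (failed) boot, jumping to the end
journalctl -b -1 -e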
 
Sorry, I meant to write journalctl -b -1. That's the one included in the .txt file. It is definitely shorter than a successful boot, and the timestamps agree with when it stopped, but I couldn't find the phrase "encountered an uncorrectable I/O failure" anywhere in it, nor in any of the logs (journalctl and syslog) going back two days.
I have also attached the journalctl output of the last successful boot.
 


but I couldn't find the phrase "encountered an uncorrectable I/O failure" anywhere in it, nor in any of the logs (journalctl and syslog) going back two days.
Once the I/O error occurs, there is no way to write to the drive anymore, so there are no more logs.
The NVMe device PCI ID does shift by one due to the GPU (as do several other devices), but you cannot change that (and it's not a problem for other people).
The message Device: /dev/nvme0, number of Error Log entries increased from 5510 to 5532 is harmless; my 970 EVO Plus also does that when there are no errors (and it works fine with various AMD GPUs).
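
If you want to look at the drive's own error log directly, something like this should work (a sketch, assuming smartmontools and nvme-cli are installed):
Code:
# SMART overview, including the error-log entry counter
smartctl -a /dev/nvme0

# dump the actual NVMe error-log entries
nvme error-log /dev/nvme0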

Maybe search the internet for known issues with your NVMe, motherboard, and GPU, as nothing else stands out.
I don't know what else to suggest except updating the firmware of the motherboard and the NVMe drive, and testing other NVMe drives and other GPU (or just other PCIe) devices.

Or maybe it's something stupid, like the SSD overheating when the GPU blocks the (already limited) airflow (or raises the case temperature), so it stops working (or becomes too slow for ZFS). If you have free space on another drive, you could turn your ZFS pool into a mirror (be careful not to make it a stripe) and see if it keeps working and logging when the NVMe half gives errors.
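
A sketch of how attaching a mirror could look (the second disk name is hypothetical; note that zpool attach mirrors an existing device, while zpool add would create the stripe you want to avoid):
Code:
# attach a second device to the existing vdev, turning it into a mirror
zpool attach rpool nvme-eui.0025385491b053e9-part3 /dev/disk/by-id/ata-OTHER_DISK-part1

# the vdev should now show up as mirror-0
zpool status rpool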
 
