PCIe NVMe timeout after upgrading to pve-manager 7.2-7 / PVE kernel 5.15.39-3

danjb

New Member
Aug 8, 2022
Flower Mound, TX
I have a Supermicro H12SSL motherboard with two M.2 PCIe slots, populated with two Seagate FireCuda 530 M.2 2280 PCIe NVMe drives, one 2TB and one 4TB. The system has an EPYC 7543P processor and boots Proxmox off a ZFS RAID-1 pair of SATA-attached SSDs. All assets (ISOs, templates, etc.) are on the 2TB NVMe, and all VMs/containers are on the 4TB NVMe. This was running the base Proxmox 7.2 install from May without issue.
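
For reference, in the working state the layout looks roughly like this from the shell (rpool is the installer's default pool name; adjust for your setup):

    zpool status rpool    # the ZFS RAID-1 boot pool on the two SATA SSDs
    pvesm status          # the 2TB asset storage and the 4TB LVM-thin VM storage
    nvme list             # both FireCuda 530s, when everything is healthy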

On Friday I clicked update to move to the latest pve-manager 7.2-7 / PVE kernel 5.15.39-3. On reboot, there is a long delay during boot with a series of messages like "nvme nvme1: I/O QID 1 timeout, aborting" and "nvme nvme1: I/O 24 QID 0 timeout, reset controller". Later in the logs, "nvme nvme1: Abort status: 0x371" appears. Once the system finishes booting, the 4TB NVMe device is not present anywhere: no controller device, no block device, nothing. The Seagate firmware utility is also unable to locate the device in the system.
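
For completeness, this is roughly how I confirmed the device is gone after boot (assuming nvme-cli and pciutils are installed):

    journalctl -k -b | grep -i nvme      # the timeout/abort messages quoted above
    nvme list                            # only the 2TB FireCuda shows up
    lspci -nn | grep -i 'non-volatile'   # check whether the controller even enumerates on the bus
    ls -l /dev/nvme*                     # no nvme1 character or block device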

I tried moving the 4TB device to the other M.2 slot, and tried running with just the 4TB NVMe and no 2TB NVMe; no difference. Thinking heat might be the culprit (the FireCuda 530s run hot), I put a big heatsink on the 4TB, and then even tried laying an ice pack against the heatsink. Still the same: the 2TB is recognized fine (no heatsink), while the 4TB times out and is ignored (big heatsink, then heatsink plus ice pack). Unfortunately, selecting the original "Proxmox Virtual Environment (5.15.35-1-pve)" entry at the boot menu (I assume that is the kernel from the May 7.2 release) makes no difference either; the 4TB NVMe is still not recognized.
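
One thing I have not tried yet: I have seen similar NVMe timeout reports blamed on power-state transitions (APST), with disabling them via the kernel command line as the suggested workaround. Since my install boots ZFS via proxmox-boot-tool, I believe that would look something like the following (the root= part is the installer default and may differ; on GRUB setups the parameter would go in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub instead):

    # /etc/kernel/cmdline -- all on one line
    root=ZFS=rpool/ROOT/pve-1 boot=zfs nvme_core.default_ps_max_latency_us=0

    # then rewrite the boot entries
    proxmox-boot-tool refresh

Is that worth trying before I reinstall?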

Moving the 4TB NVMe to an external USB enclosure, it works fine: VM images start off it like normal (although much slower than when PCIe-attached). Also, booting the original Proxmox May 7.2 install image from USB, the 4TB works fine when plugged into the PCIe slot alongside the 2TB NVMe. Additionally, booting Arch Linux off USB, both PCIe-attached NVMes are recognized with no problem. I upgraded both NVMes to the latest Seagate firmware, with no difference.
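
From the Arch live environment, both drives check out (commands assume nvme-cli and smartmontools are available on the live ISO):

    nvme list                               # both controllers enumerate here
    nvme id-ctrl /dev/nvme1 | grep '^fr '   # confirm the new Seagate firmware revision took
    smartctl -x /dev/nvme1                  # check for media errors or critical warnings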

I am thinking of just installing the PVE 7.2 base fresh again on the RAID-1 SSD pair. My assumption is that something got borked during the upgrade that affects even the old 5.15.35-1 kernel remaining in the boot menu. Hopefully a fresh install will get me back to the happy pre-5.15.39-3 days. Do I need to do anything special to preserve the VM images on the NVMe? I should just be able to register that LVM-thin volume group with the newly installed PVE and all the images will be there, right? Or is there any part of the PVE config that I need to preserve across installations to be able to run those VMs again?
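
Concretely, is a plan like the following sound? (The storage ID, VG/pool, and node names below are placeholders; vgs and lvs will show the real ones.)

    # Before wiping: the VM/CT definitions live in /etc/pve (pmxcfs) on the boot pool,
    # not on the NVMe, so save them somewhere that survives the reinstall
    # (the USB stick path is just an example).
    tar czf /mnt/usb/pve-etc-backup.tgz -C / etc/pve

    # After the fresh 7.2 install: re-register the existing thin pool.
    pvesm add lvmthin <storage_id> --vgname <vg_name> --thinpool <pool_name> --content images,rootdir

    # Restore the guest definitions.
    tar xzf /mnt/usb/pve-etc-backup.tgz -C /tmp
    cp /tmp/etc/pve/nodes/<node_name>/qemu-server/*.conf /etc/pve/qemu-server/
    cp /tmp/etc/pve/nodes/<node_name>/lxc/*.conf /etc/pve/lxc/

From what I understand, the storage has to be re-added under the same storage ID as before, since the .conf files reference disks as <storage_id>:vm-<vmid>-disk-<n>. Is that right?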
 
