PVE server hanging on reboot

macstibs

New Member
Jul 22, 2020
Looking for some help diagnosing my problem and maybe even finding a solution.

I have a Supermicro X9DRi-LN4F+ (dual CPU, 32GB) that was running PVE flawlessly until I took it offline to add a GPU for a Windows VM. The PVE install is running off a USB stick rather than an SSD.
There's also an LSI MegaRAID 9240-4i, but it's only used for storage and all the drives are passed through as single-drive RAID 0s into a ZFS pool. Last but not least, there's a Supermicro NVMe-to-PCIe card with a single NVMe drive that I use for VMs.

I made a bunch of bone-headed mistakes along the way and now can't seem to unbork it or even determine what specifically is borked.

1. I moved the two existing cards around to improve airflow off the GPU fans and RAID heatsink and didn't note where the cards were originally.

2. I added the GPU without adding the necessary blacklist options to GRUB.

3. When it didn't boot, I pulled the GPU card and tried to reset the cards to their original slots (but I'm not 100% sure I've got it right).

4. Even after pulling the GPU, the server still won't come back up.

5. Using the Rescue option from the 6.2.1 installer CD hangs.

6. Using the E option from the boot menu and changing the options from quiet to debug sheds a little light but not much.
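
For anyone following along, the edit in step 6 amounts to swapping one kernel parameter for another on the boot line. A rough Python sketch of that swap, just to make it concrete; the function name and the sample boot line are made up for illustration and are not taken from this server:

```python
def swap_quiet_for_debug(cmdline: str) -> str:
    """Return the kernel command line with 'quiet' replaced by 'debug'."""
    # Split on whitespace so only the standalone 'quiet' parameter is touched.
    params = ["debug" if p == "quiet" else p for p in cmdline.split()]
    # If there was no 'quiet' to replace, append 'debug' so the result is verbose anyway.
    if "debug" not in params:
        params.append("debug")
    return " ".join(params)


# Hypothetical boot line, for illustration only.
example = "root=/dev/mapper/pve-root ro quiet"
print(swap_quiet_for_debug(example))
```

Done by hand at the boot menu, the change only applies to that one boot, so nothing on disk is modified.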

The last entries from the console during boot are:
Starting Load/save random seed
Started Load/save random seed
*Looooong pause*
"sent watchdog=1 notification"

So I'm not sure what the problem is, but I don't think the random seed is the cause of the hang, because the "Started" entry should mean it completed its startup.

It's probable that moving the cards around or adding the GPU screwed something up.

I don't think it's the USB boot drive, because the rescue CD is failing too. It's possible, though, that the USB drive was originally 5.2, was upgraded over time to 6.1 or 6.2 using the CLI package manager, and is now borked like the CD (i.e., the 6.2.1 CD rescue option has the same bug with UEFI installs as the updated USB system).

Any help would be mightily appreciated.
 
Hoping someone might have suggestions on this.

Pretty much at a loss of what to even try to fix it.
 
Just want to leave a note on this in case anyone else needs help.

As best as I can tell, somewhere between GRUB, the kernel and a botched apt update, the system broke.

I reinstalled from the 6.2 CD onto a new drive, chrooted into the old drive, and recovered the config.db along with a handful of other files. No loss of data.
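
In case it helps someone doing the same recovery, here is a minimal Python sketch of the file-copy part. It is not exactly what was done above: it assumes the old root filesystem is simply mounted somewhere (no chroot), and the mount point, destination directory, and function name are all hypothetical. The only path taken as given is `var/lib/pve-cluster/config.db`, which is where PVE normally keeps that database.

```python
import shutil
from pathlib import Path


def recover_files(old_root, dest, rel_paths):
    """Copy selected files from a mounted old root into dest, keeping their relative paths."""
    old_root, dest = Path(old_root), Path(dest)
    copied = []
    for rel in rel_paths:
        src = old_root / rel
        if not src.is_file():
            # Skip anything that is missing on the old drive instead of failing.
            continue
        target = dest / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, target)  # copy2 also preserves timestamps and permissions
        copied.append(target)
    return copied


if __name__ == "__main__":
    # Hypothetical mount point and destination; adjust to the actual system.
    for path in recover_files("/mnt/oldroot", "/root/recovered",
                              ["var/lib/pve-cluster/config.db"]):
        print("recovered", path)
```

Copying into a separate directory first, rather than straight over the new install, leaves room to inspect the files before putting anything back in place.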

The GPU install was coincidental, not the root cause. After configuring the passthrough parameters, the GPU is now isolated in a Windows 10 VM.
 
