A PVE v7 install gets corrupted to the point of hanging regularly (would like some input)

kernull

A very remote PVE host of mine would go totally unresponsive, sometimes only minutes after a restart. After zero luck troubleshooting remotely, I waited until I could travel and get hands-on, only to discover that I was just as confused troubleshooting locally...

After much cursing and testing I discovered a single bad address in RAM and removed 2 of the 4 DIMMs (trying to retain DDR4 speeds), then attempted a reboot to see what damage running with this bad address for who knows how long had done...

Everything seems to lock up in a fashion that does not allow for any sort of kernel debugging. I attempted kdump and netconsole, but no luck.

Is there any way to repair a PVE installation?

Also, before I ask that, I should disclose that I may have made things worse (as usual): I installed PVE 9 on another flash drive to see what things would look like and attempted to import the zpool from the NVMe drives, and I was quickly reminded that I don't know enough about ZFS to make decisions like this...

Anyone have any informed suggestions?

thanks for reading.
 
Hi @kernull

thanks for posting in the forum!

First of all, I have to stress that PVE 7 has been EOL since 07/2024 and as such poses a security risk. Please consider upgrading to PVE 9.

To answer your question, we need a few more details on the error you are currently experiencing.
Did you identify and replace the bad memory stick, or is it still installed?

> Also, before I ask that, I should disclose that I may have made things worse (as usual): I installed PVE 9 on another flash drive to see what things would look like and attempted to import the zpool from the NVMe drives, and I was quickly reminded that I don't know enough about ZFS to make decisions like this...
Importing the zpool without applying any form of feature upgrades or similar shouldn't impair the pool, so no worries there.

> Is there any way to repair a PVE installation?
In general the preferred route for a corrupt install is a config backup and reinstall of the system, since it is difficult to reliably determine every last corrupt file.
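As a rough illustration of what a config backup could look like when the old root filesystem is mounted from a rescue system: the sketch below tars up a handful of files. The function name `backup_pve_config`, the mount point and the list of paths are assumptions for illustration, not an official procedure. One thing worth knowing is that `/etc/pve` is a FUSE mount backed by `/var/lib/pve-cluster/config.db`, so on a system that is not running PVE the `/etc/pve` directory looks empty and the database file is the thing to copy.

```shell
#!/bin/sh
# Sketch only: archive a few config files from a mounted (not running) PVE root.
# $1 = where the old root filesystem is mounted, $2 = archive to create.
backup_pve_config() {
    root="$1"
    out="$2"
    # Candidate paths relative to the old root (assumed list, extend as needed).
    candidates="var/lib/pve-cluster/config.db etc/network/interfaces etc/hosts etc/hostname etc/fstab"
    existing=""
    for p in $candidates; do
        # Keep only the paths that actually exist on this system.
        if [ -e "$root/$p" ]; then
            existing="$existing $p"
        fi
    done
    if [ -z "$existing" ]; then
        echo "nothing to back up under $root" >&2
        return 1
    fi
    # -C keeps the paths inside the archive relative to the old root.
    # $existing is left unquoted on purpose so it splits into separate paths.
    tar -czf "$out" -C "$root" $existing
}
```

Usage would be something like `backup_pve_config /mnt /path/to/usb/pve-config.tar.gz`, with both paths adjusted to your setup.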

Yours sincerely
Jonas
 
> Everything seems to lock up in a fashion that does not allow for any sort of kernel debugging. I attempted kdump and netconsole, but no luck.

Do you mean the system hangs on boot?

If you removed RAM, maybe it locks up because of a RAM shortage?

You can prevent the VMs from starting by editing the entry in the GRUB boot menu and adding the following kernel parameter (after or instead of the quiet parameter):

systemd.mask=pve-guests.service
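To make the placement concrete, here is a small sketch of how the `linux` line might look before and after the edit. The kernel path and `root=` value are placeholders, not taken from this system; only the parameter itself comes from the post above. In the GRUB menu you would press `e` on the boot entry, change the line by hand, and boot with Ctrl+X or F10. The change applies to that one boot only.

```shell
#!/bin/sh
# Placeholder boot line; your kernel version and root= value will differ.
before='linux /boot/vmlinuz-<version> root=<your-root> ro quiet'

# Swap the trailing "quiet" for the mask parameter, as you would by hand in the GRUB editor.
after=$(printf '%s\n' "$before" | sed 's/ quiet$/ systemd.mask=pve-guests.service/')

printf '%s\n' "$after"
```

Once the system is up, `cat /proc/cmdline` should show the parameter and `systemctl status pve-guests.service` should report the unit as masked.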
 
> Is there any way to repair a PVE installation?
Maybe.

Bad RAM can be very consequential with a ZFS filesystem (as in, it can cause severe data corruption). I cannot stress enough that you should absolutely make sure your system is stable and the RAM is healthy before proceeding.

The first order of business is to boot a live CD with ZFS support (I usually use https://github.com/nchevsky/systemrescue-zfs) and import the pool. You can and should do so read-only. As @j.theisen suggested, resist the urge to zpool upgrade!

If it imports cleanly, run a scrub (a scrub needs the pool imported read-write, so export it and import it again without readonly first). If the scrub completes and zpool status reports no errors, zpool export the pool and attempt to boot.

If it boots, hurray! You're done. If it doesn't, back up your VMs, reinstall from scratch, and restore your VMs. Yes, it's a bit of a pain, but that is the most direct way to regain function. How to back up VMs, you ask? The simplest way is the live CD method; you can then use zfs snapshot/send or dd, whichever is supported by your backup destination.
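The import, scrub and export sequence could look roughly like this from the live CD. Treat it as a sketch under assumptions: the pool name `rpool` (the Proxmox installer default) and the alternate root `/mnt` are guesses about this system, and the `run` wrapper with `DRY_RUN` is my own addition so the script only prints the commands unless you opt in. As far as I know a scrub is refused on a pool that was imported read-only, so the sketch takes a read-only first look and then re-imports writable for the scrub.

```shell
#!/bin/sh
# Sketch only. POOL and ALTROOT are assumptions; adjust to your system.
POOL="${POOL:-rpool}"
ALTROOT="${ALTROOT:-/mnt}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        # Stop at the first failing command instead of carrying on.
        "$@" || { echo "command failed: $*" >&2; exit 1; }
    fi
}

# 1. Read-only first look: nothing is written to the pool.
#    -N skips mounting datasets, -R sets an alternate root,
#    -f may be needed because the pool was last used by a different host.
run zpool import -f -N -o readonly=on -R "$ALTROOT" "$POOL"
run zpool status -v "$POOL"
run zpool export "$POOL"

# 2. Writable import so the scrub can run. Do NOT run "zpool upgrade".
run zpool import -f -N -R "$ALTROOT" "$POOL"
run zpool scrub "$POOL"
# The scrub runs in the background; re-run status until it reports completion
# (on OpenZFS 2.0 or newer, "zpool wait -t scrub" can block until it is done).
run zpool status -v "$POOL"

# 3. Only once the scrub has finished without errors: export and try to boot.
run zpool export "$POOL"
```

Run it once as-is to review the printed commands, then with `DRY_RUN=0` if they look right for your pool.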
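For the backup step, here is a sketch of both variants mentioned above. The dataset `rpool/data` (where a default Proxmox ZFS install keeps VM disks), the snapshot name `rescue`, the destination `/mnt/backup` and the function names are all assumptions for illustration. Note that creating a new snapshot needs the pool imported writable, while dd can read a zvol from a read-only import.

```shell
#!/bin/sh
# Sketch only. DATASET, SNAP and DEST are assumptions; adjust to your system.
DATASET="${DATASET:-rpool/data}"
SNAP="${SNAP:-rescue}"
DEST="${DEST:-/mnt/backup}"

# File name for the send stream, e.g. /mnt/backup/rpool_data@rescue.zfs
stream_file() {
    printf '%s/%s@%s.zfs\n' "$DEST" "$(printf '%s' "$DATASET" | tr '/' '_')" "$SNAP"
}

# Variant 1: recursive snapshot plus replication stream written to a file.
backup_with_send() {
    zfs snapshot -r "$DATASET@$SNAP" || return 1
    zfs send -R "$DATASET@$SNAP" > "$(stream_file)"
}

# Variant 2: raw image of a single VM disk (zvol), e.g. vm-100-disk-0.
backup_with_dd() {
    dd if="/dev/zvol/$DATASET/$1" of="$DEST/$1.raw" bs=1M status=progress
}
```

Nothing runs until you call a function, for example `backup_with_send` or `backup_with_dd vm-100-disk-0` (the disk name here is hypothetical). A stream written by the first variant can later be fed to `zfs receive` on the reinstalled system.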