Node won't come back up after restart

WoRie

Member
Sep 11, 2022
Hi,

I'm running a 3 node PVE cluster in my homelab, with pvenode02 running an Intel 14900k on an ASUS W680 ACE IPMI board with an RTX4090 passed through.

Two weeks ago I ran a kernel update on all nodes to the latest kernel, 6.12.8-pve I believe. After that, this node didn't come back up, showing segfaults in various modules. Googling turned up several possible causes and fixes, most of them pointing to kernel, memory, or CPU problems.

I booted with an older kernel, but was unable to resolve the issue. I ran memtest, which came So I bit the bullet and removed the node from the cluster, wiped the disk and reinstalled everything, rejoined it under the old name and old IP setup (which thankfully doesn't result in hundreds of SSH fingerprint errors anymore) and got it back up and running.

Since the CPU hasn't been replaced by Intel so far for the instability issues, I was hesitant at first but finally ran a microcode update for it. After a reboot I'm now back where I started. The host either won't boot at all or locks up after some runtime with varying errors. Please find attached a syslog dump I was able to copy from the console before the most recent lockup. The older 6.12.4 kernel doesn't work either.

Could someone please confirm that the CPU is the likely culprit here? This setup ran pretty stable for the last 6 months and has only now become unstable, which is what I would expect if the CPU degraded over time.
 

Attachments

Hello WoRie! I see that you have the latest BIOS update - that's very good, especially considering the CPU issues from Intel. Also very good that you installed the newest microcode update from the Debian repositories. The journal log you posted does indeed show system instability. This could be caused by faulty RAM, but just as well by a degraded CPU.

So could you please run memtest86+ to see if you see any RAM errors?

Also, just wondering, when did you install the BIOS update? Was it after the issues started happening?
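
As a side note on reading a journal dump like the one posted: a rough way to see which kinds of faults dominate is to count a few well-known kernel error signatures. The following is only a minimal sketch. The function name and the pattern list are illustrative assumptions, not taken from the attached log, and a high count narrows down nothing on its own.

```python
import re
from collections import Counter

# Illustrative signatures only (assumption): common kernel message fragments,
# not strings copied from the log attached to this thread.
SIGNATURES = {
    "segfault": re.compile(r"segfault at", re.IGNORECASE),
    "general protection fault": re.compile(r"general protection fault", re.IGNORECASE),
    "machine check": re.compile(r"\bmce:|machine check", re.IGNORECASE),
    "kernel oops/bug": re.compile(r"\bOops:|kernel BUG at|BUG: "),
}


def count_signatures(log_text: str) -> Counter:
    """Count how many log lines match each error signature."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

If the journal is persistent, the kernel messages of the previous boot can be saved with `journalctl -k -b -1` and the resulting text passed to `count_signatures`. Faults scattered across many unrelated processes and modules fit failing hardware better than a single buggy driver, but that is a hint, not proof.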
 
Hi, thanks for the reply. I somehow "stopped caring mid sentence" while typing the above ;) It should have read that I ran Memtest without issues. This was the second thing I did after switching the kernel.

I installed the BIOS update only recently, but before that I didn't run any hard overclocks on this hardware. Possibly the factory voltages from ASUS were too high even with everything on auto, so the CPU may have destroyed itself over time.

As I stated, I reopened my ticket with Intel and the CPU is due to be picked up in two days. So I'll see then if this fixes my issues. I will leave the system as is apart from that (with my cluster degraded and only 2 of 3 nodes currently available).
 
Thanks for the update! Good to know that memtest ran without issues. While this does not rule out the RAM with 100% certainty, I would assume it is not the problem. Of course, if you have some RAM sticks you can swap in for testing, feel free to do that anyway.

It's already a bad sign that reverting the kernel update didn't help, even though the system worked without issues until now. Something else you can try is installing the opt-in kernel 6.11, but I somehow doubt this will help. It's probably worth trying before swapping the hardware, though.

Otherwise, the only thing left to try at this point is to swap RAM / CPU / motherboard and see if anything helps. I would say it's probably a CPU issue, but you'll have to test whether that is actually true.

Good luck with replacing the CPU! I hope this will fix your issues. Don't forget to keep the BIOS up-to-date and the microcode updates installed even for the new CPU, just in case ;)
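
To double-check after a reboot that a microcode update is actually loaded, one option is to look at the `microcode` field in `/proc/cpuinfo`. Here is a small sketch, assuming an x86 Linux host where that field is present. The helper name is made up for illustration.

```python
import re


def microcode_revisions(cpuinfo_text: str) -> set[str]:
    """Return the distinct microcode revisions listed in /proc/cpuinfo text."""
    # Each logical CPU reports its own "microcode : 0x..." line.
    return set(
        re.findall(r"^microcode\s*:\s*(\S+)", cpuinfo_text, flags=re.MULTILINE)
    )
```

On the node itself this could be called as `microcode_revisions(open("/proc/cpuinfo").read())`. Normally all cores report the same revision, so the set should contain a single value, which can then be compared against the revision the vendor lists for the fix.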