Hi,
I'm running a 3-node PVE cluster in my homelab. One node, pvenode02, runs an Intel 14900K on an ASUS W680 ACE IPMI board with an RTX 4090 passed through.
Two weeks ago I ran a kernel update on all nodes to the latest kernel, 6.12.8-pve I believe. After the update, this node didn't come back up and showed segfaults in various modules. While searching, I came across several possible causes and solutions, most of them pointing to kernel, memory, or CPU problems.
I booted with an older kernel, but was unable to resolve the issue. I ran memtest, which came So I bit the bullet and removed the node from the cluster, wiped the disk, reinstalled everything, and rejoined it under the old name and old IP setup (which thankfully no longer results in hundreds of SSH fingerprint errors). That got it back up and running.
Since Intel hasn't yet replaced this CPU over the known instability issues, I was hesitant at first but finally applied a microcode update. After a reboot I'm now back where I started: the host either won't boot at all or locks up after some runtime, with a different error each time. Please find attached a syslog dump I was able to copy from the console before the most recent lockup. The older 6.12.4 kernel doesn't work either.
Could someone please confirm whether the CPU is the likely culprit here? This setup ran stably for the last 6 months and has only now become unstable, which is what I would expect if the CPU degraded over time.