There are a lot of threads about nodes fencing themselves for unknown reasons. Here's my story, with my working solution. Hopefully someone finds this in the future and saves some of the hair on their head.
I had been running Proxmox for a solid year on two tiny Beelink SER5 Pros (AMD Ryzen 7 5700U) with 64GB of RAM in each, with extreme success, plus one SSD per node for a single-disk ZFS pool. Unbelievably smooth, using just the single onboard NIC that reports as an RTL8111/8168/8211/8411, with a Raspberry Pi filling out the quorum. Zero issues whatsoever; this was a workhorse of a 2-node cluster!
Then, six months ago, I swapped the Pi for another SER5 Pro (AMD Ryzen 7 5825U) with an RTL8125, and the problems began. The new node would randomly fence itself, ending up in a state that required physically pulling power to restart it. It was still powered, but with only the power light lit on the front: no video output and no network activity. I replaced the system NVMe and my VM SSD, then the RAM, and eventually even the whole system, and still kept having the problem. The cluster was kept fully up to date the entire time.
As a stopgap, I added a NetBooter NP-05B switched PDU to automatically power-cycle the outlet whenever the node went offline. I then noticed the lockup was happening quite often, sometimes a couple of times a day, and it didn't seem to correlate with backup or replication schedules.
I added two USB RTL8125 NICs to each node (I know the problems with such adapters), bonded them with LACP using layer3+4 hashing (plus a CLI change for LACP fast rate), and allocated the bond for VM and management traffic. I reserved the onboard NICs for corosync ONLY, on a private VLAN. Still, the node would fence itself, and the onboard NIC would show as disconnected on my switch. It didn't make much sense at all.
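For anyone wanting to replicate the bond setup described above, it looks roughly like this in /etc/network/interfaces. This is a sketch, not my exact config: the enx... interface names are placeholders for the USB NICs, and the address/gateway are examples, so adjust for your hardware and network.

```
# Hypothetical USB NIC names -- check `ip link` for yours
auto bond0
iface bond0 inet manual
    bond-slaves enx0a0b0c0d0e01 enx0a0b0c0d0e02
    bond-mode 802.3ad
    bond-xmit-hash-policy layer3+4
    bond-lacp-rate fast
    bond-miimon 100

# Bridge for VM and management traffic on top of the bond
auto vmbr0
iface vmbr0 inet static
    address 192.168.1.10/24
    gateway 192.168.1.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0
```

The matching LACP (802.3ad) port channel has to be configured on the switch side as well, or the bond will not come up cleanly.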
Solution:
There were 2 changes I made that stabilized the system:
1) Lowered the onboard VRAM (iGPU memory) reservation from the default of 4GB down to 1GB in the BIOS.
2) Added the following line to /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet processor.max_cstate=1"
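For anyone following along, the standard Debian/Proxmox way to apply that change (assuming a GRUB-booted system; ZFS-root installs that boot via systemd-boot edit /etc/kernel/cmdline and run proxmox-boot-tool refresh instead):

```
# Append processor.max_cstate=1 to the GRUB_CMDLINE_LINUX_DEFAULT line
nano /etc/default/grub

# Regenerate the GRUB config, then reboot for it to take effect
update-grub
reboot
```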
This stabilized the machine completely, and it has been running for almost a month without issue.
I then removed just the GRUB customization, and within two hours the node had fenced itself again.
I don't know why the VRAM reservation made a difference, but it did: if I don't have both of these changes in place, I get the fencing. It's possible I am hitting two different problems simultaneously. For other system types, I suspect only the GRUB modification would be necessary.
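If you want to confirm the C-state limit actually took effect after a reboot, this is one way to check (paths are standard Linux cpuidle sysfs; output varies by platform):

```
# The parameter should appear in the running kernel's command line
cat /proc/cmdline

# List the idle states the kernel is actually using;
# with processor.max_cstate=1 the deeper ACPI C-states should be absent
cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
```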
TLDR:
A Ryzen 7 5825U with an RTL8125 was fencing itself. I isolated corosync traffic, lowered the VRAM allocation, and set a GRUB kernel parameter, and my lockups are a thing of the past.