I'm running a cluster at home. I know the recommended minimum is 3 nodes in a cluster, but I'm running two; I simply can't afford a third at this time, so please overlook that.
I have two nodes: node1 has been a fucking champ, while node2 has mucked up twice now. They are identical hardware and only about a year old.
About a month or two ago (I forget the exact date), node2 buggered up real good, effectively the same way it is now: kernel panics whenever I tried to migrate VMs or do anything fairly intensive. At the time I ran rigorous CPU, RAM, and other hardware stress tests to try to identify a cause, and found no observable hardware failure. So what did I do? I removed the node from the cluster, re-installed Proxmox VE on the host, re-added it to the cluster, and it had been operating like a champ since... until today.
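For context, the testing was along these lines (a rough sketch of the sort of commands I used; the durations and sizes here are illustrative, not my exact invocations):

    stress-ng --cpu 0 --timeout 4h --verify        # hammer all CPU cores, with result verification
    stress-ng --vm 2 --vm-bytes 80% --timeout 4h   # sustained memory pressure via VM stressors
    memtester 8G 3                                 # userspace RAM test, 8 GiB x 3 passes

plus an overnight MemTest86+ run from a boot USB. All of it came back clean.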
I am seeing the same symptoms today as before, with a bit more detail: sometimes it's a kernel panic, sometimes a hard reset of node2. Since the reinstall roughly 1-2 months ago, I had been able to safely migrate VMs back and forth online (online migration seems to be what triggers the vomiting), and all other expected functionality appeared to be just fine.
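To be concrete, the operation that reliably sets it off is an online migration, i.e. something like this (the VMID and direction are just examples):

    qm migrate 100 node2 --online    # live-migrate VM 100 to node2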
I looked at the logs and saw some oddities in last night's nightly backups. I'm unsure if that's the cause or just the canary. I woke up this morning to find 3 of my VMs stuck at GRUB, as if they had been hard-reset, while node2 itself was still responding, which struck me as odd. I initially assumed a backup had failed partway through and reset the VMs (less than ideal, but better than it could be). Then I began interacting with the VMs, and the kernel panics and node2 hard resets started happening.
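If anyone wants to see those backup oddities, I was planning to pull them out of the journal with something like this (time window approximate):

    journalctl --since yesterday | grep -i vzdump    # backup job messages from the last night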
I'm sure there are logs and other details I can paste here, but I'm not certain which ones are desirable in this case.
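In the meantime, here is what I figured would be most useful to grab, assuming the journal is persistent across reboots (which I believe it is by default on PVE):

    journalctl --list-boots    # confirm how many unclean boots there have been
    journalctl -k -b -1        # kernel messages from the previous boot, right before the reset
    dmesg                      # current boot's kernel ring buffer
    pveversion -v              # exact package versions on the node

Happy to post the output of any or all of these.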
I figure it would be prudent to try to find the cause of this, so that either I can fix it on my end or a fix can be made upstream, instead of simply nuking the node and re-installing it again. Both nodes are fully updated as of this morning, and the symptoms persist.
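One more data point I can dig up: whether a recent package upgrade lines up with when the crashes started. On Debian-based PVE that history lives in apt's log, e.g.:

    grep -E 'Start-Date|Upgrade:' /var/log/apt/history.log    # recent upgrades with their timestamps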
Please advise.