Hi all,
I was running into random crashes after updating Proxmox to 8.2 on my 3-node cluster (HP EliteDesk 800 G4 machines).
I was planning to ask a question on the forum, but after struggling with it for more than 2 days I have managed to solve it (hopefully).
So I am sharing my solution (with debugging steps) in the hope it will help some of you.
1. Problem description
I run a 3-node cluster where every node is identical (HP EliteDesk 800 G4, 16 GB memory).
After I upgraded to 8.2 (from 7.15), I ran into weird "crashes": some nodes were randomly restarting.
I searched the forum and elsewhere, but could not find a remedy.
2. Problem identification
At first I thought it might be a power supply issue, but since it was happening randomly on all 3 nodes, it was unlikely that I had 3 different faulty transformers.
To test whether power management was the cause, I decided to reduce the load on 1 server: I migrated the instances running on it to the 2 other nodes and watched what happened.
Result: once idle, this node never ran longer than 20 minutes before rebooting. Nothing in the logs, nothing in the kernel logs, nothing in journalctl...
I looked everywhere. But since the node rebooted faster when idle than under load, it was clearly linked to power management.
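For anyone who wants to look at the same thing on their own node, here is a minimal sketch that lists the idle states (C-states) the kernel exposes for CPU 0. It assumes the standard Linux cpuidle sysfs layout under /sys/devices/system/cpu/cpu0/cpuidle. list_cstates is a helper name made up for this example, not a Proxmox command, and this is not part of what I originally ran.

```shell
#!/bin/sh
# list_cstates: hypothetical helper, prints "stateN name usage" per idle state.
# $1 is the cpuidle directory (defaults to CPU 0 on the running system).
list_cstates() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for s in "$dir"/state*; do
        # Skip silently if the directory or the glob does not exist.
        [ -r "$s/name" ] || continue
        printf '%s %s %s\n' "${s##*/}" "$(cat "$s/name")" "$(cat "$s/usage" 2>/dev/null)"
    done
}

list_cstates
```

The usage column counts how often each state was entered, so on an idle node the deepest states should show the largest numbers.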
3. Problem solution
In my particular case, I had to set the maximum C-state to 7 in /etc/default/grub (I edited it with nano):
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on intel_idle.max_cstate=7 i915.enable_dc=0 ahci.mobile_lpm_policy=1"
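Here is the same line again with a comment per parameter, as a rough guide to what each one does. The descriptions are my understanding of the kernel parameters, not something I verified one by one; I only know that this combination fixed my nodes, so the other parameters may or may not matter for you.

```shell
# /etc/default/grub (excerpt)
# quiet                     : reduce kernel messages during boot
# intel_iommu=on            : enable the Intel IOMMU (used for PCI passthrough)
# intel_idle.max_cstate=7   : limit the deepest C-state the intel_idle driver may use
# i915.enable_dc=0          : disable display power-saving states in the i915 driver
# ahci.mobile_lpm_policy=1  : SATA link power management set to max performance
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on intel_idle.max_cstate=7 i915.enable_dc=0 ahci.mobile_lpm_policy=1"
```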
Then run update-grub, followed by proxmox-boot-tool refresh, and then reboot.
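After the reboot it is worth checking that the running kernel really received the parameter. A minimal sketch, assuming /proc/cmdline is readable; get_param is a helper name made up for this example, not an existing tool.

```shell
#!/bin/sh
# get_param: hypothetical helper, prints the value of one kernel parameter.
# $1 = kernel command line string, $2 = parameter name
get_param() {
    for word in $1; do
        case "$word" in
            "$2"=*) printf '%s\n' "${word#*=}"; return 0 ;;
        esac
    done
    return 1
}

# Check what the currently running kernel was booted with.
if [ -r /proc/cmdline ]; then
    get_param "$(cat /proc/cmdline)" intel_idle.max_cstate \
        || echo "intel_idle.max_cstate is not set on this boot"
fi
```

If the parameter does not show up, the node may not be booting through GRUB at all, in which case editing /etc/default/grub has no effect.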
In my case, the server that used to reboot within 20 minutes has now been online for more than 24 hours.
I hope it helps.