Hi all,
I am looking for some assistance in understanding stability issues on my home cluster (at worst, this info can hopefully help someone else find a temporary fix for their environment). I am less knowledgeable than I wish I were at troubleshooting Linux systems and have been throwing spaghetti at the wall (hopefully in a controlled enough fashion to figure out where the issue lies).
The primary symptom of the issue:
Complete host system hang after short periods of runtime. Usually 1-3 hours of uptime, but the shortest time experienced was about 10 minutes. (Potentially aggravated by running VMs, but I can't verify that beyond anecdote.)
- Display continues to function (showing the frozen screen) if a monitor was already plugged in and powered on, but plugging one in while running headless results in the monitor falling into power-saving mode.
- Keyboard input is not registered, including attempts to switch TTYs and Ctrl+Alt+Delete spam.
- On hang, the host IP becomes unreachable and all VMs (if any were running) drop off the network, but the network switch still shows 1000/Full negotiated and the link layer up.
- 2x Lenovo M715Q Tiny
- (https://www.lenovo.com/us/en/p/desktops/thinkcentre/m-series-tiny/thinkcentre-m715q-tiny/11tc1mt715q)
- Each with:
- Ryzen 5 2400GE CPU/APU
- 2 x 8 GB DDR4 SODIMMs (16GB)
- 1x 256 GB SSD
- Latest manufacturer BIOS update
- I had a small VM acting as a quorum witness, but I have been tearing down and rebuilding the cluster on occasion, so it is not currently present. I had no issues with it when it was there.
- Proxmox VE 9.1.1 – UEFI Install (Haven’t tried BIOS)
- One was an upgrade from 8.x
- One is a direct installation from 9.0
- Currently one system is installed on an LVM partition, the other is installed on a ZFS partition
- Various Kernels:
- 6.8.12-9-pve – Functional with processor.max_cstate=1, failure without.
- 6.8.12-15-pve – Functional with processor.max_cstate=1, failure without.
- 6.14.11-2-pve – Non-functional even with processor.max_cstate=1
- 6.17.2-2-pve – Non-functional even with processor.max_cstate=1
- Usually clustered, but during troubleshooting the issue still occurs with the node standalone.
This is an issue I have been fighting since I first decided to run Proxmox VE on these systems. That was before 9.0 was out, so I was on a base version of 8 on both nodes. The initial information I found suggested that C-state switching on the Zen 1/2 architecture was not entirely stable in early supporting kernel versions, and folks had found stability by disabling the deeper C-states.
- My first step was to disable C6 in the BIOS and see if we had any luck. – This yielded no observable change.
- Next, I modified /etc/default/grub to add “processor.max_cstate=1” to GRUB_CMDLINE_LINUX_DEFAULT, regenerated the config, and rebooted, only to find that /sys/module/processor/parameters/max_cstate showed the change hadn’t taken effect. Unsurprisingly, this yielded no observable change. (Unless I am misremembering.)
- After a little RTFM, I modified /etc/kernel/cmdline, appended “processor.max_cstate=1” to the entry, ran “pve-efiboot-tool refresh”, and rebooted. Reading /sys/module/processor/parameters/max_cstate then returned 1 as expected. Though I didn’t love the idea of increased power consumption, this was a valid workaround. (Systemd-boot installs take their kernel command line from /etc/kernel/cmdline rather than the GRUB config, which I gather is why the GRUB edit never stuck; the full sequence is sketched below.)
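For anyone wanting to reproduce the workaround, the sequence boils down to:

    nano /etc/kernel/cmdline                            # append processor.max_cstate=1 to the single line
    pve-efiboot-tool refresh                            # sync the updated cmdline to the boot partition(s)
    reboot
    cat /sys/module/processor/parameters/max_cstate     # should now print 1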
It is a home lab, so on release of 9.0 I went ahead and updated my nodes. Immediately the issue seemed to return. I tried rebuilding a node from a fresh install in case I had messed up the upgrade, but had no luck. I then set max_cstate again in /etc/kernel/cmdline (or it was already set, I can’t remember) and verified that it read 1. I figured it might be the kernel, since that was a major change between versions, but I’ve been too busy to deal with it, so I let it sit.
Well, the new kernel option got me curious, so I hopped back into my home lab hoping the new kernel would make things stable again. No luck. Since it was now top of mind, I figured I’d also roll the kernel back to a previous version and see what happened. After pinning the kernel with “pve-efiboot-tool kernel pin 6.8.12-15-pve --next-boot” and verifying /sys/module/processor/parameters/max_cstate returns 1, the system has been stable for 24 hours.
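For reference, the pin-and-verify steps were roughly:

    pve-efiboot-tool kernel list                             # show installed and pinned kernels
    pve-efiboot-tool kernel pin 6.8.12-15-pve --next-boot    # pin for the next boot only
    reboot
    uname -r                                                 # confirm 6.8.12-15-pve is running
    cat /sys/module/processor/parameters/max_cstate          # confirm the C-state cap is still in effect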
You may have noticed a lack of any mention of journalctl, dmesg, or syslog. This is where I wish I were better at Linux administration and could use a little help: there are still open questions, and I don’t know what to look for in the logs.
Since the logs don’t seem to persist through a reboot, I can set up a remote syslog collector. But journalctl and dmesg both appear to clear on reboot, so I don’t know what messages lead up to the fault. I do know that the one time I left a monitor plugged in until the system froze, I got the following message multiple times, which I assume is related to CPU interrupts:
“[Timestamp] perf: interrupt took too long (XXXX > YYYY), lowering kernel.perf_event_max_sample_rate to ZZZZZ”
- Where:
- XXXX is the time the interrupt took,
- YYYY is the threshold,
- and ZZZZZ is the sample rate.
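As far as I can tell, that message on its own is common and usually harmless: the perf subsystem noticed its sampling interrupts were running long and throttled its own sample rate in response. On a box that later hangs it may still hint that interrupts are being delayed. The current ceiling can be checked (and the warning quieted by lowering the rate yourself) with sysctl:

    sysctl kernel.perf_event_max_sample_rate             # read the current ceiling
    sysctl -w kernel.perf_event_max_sample_rate=1000     # optionally lower it to stop the warnings

Separately, to make sure the next hang actually leaves evidence: the journal is persistent once /var/log/journal exists, but with a hard hang the last few seconds of messages often never get flushed to disk, which is why a remote target helps. The addresses and interface below are placeholders for my network, not values from any documentation:

    mkdir -p /var/log/journal                            # enables persistent journald storage
    systemctl restart systemd-journald
    journalctl -b -1 -p warning                          # after a crash: warnings and worse from the previous boot

    # netconsole streams kernel messages over UDP as they happen, bypassing the disk
    # format: source-port@source-ip/interface,target-port@target-ip/target-mac
    modprobe netconsole netconsole=6665@192.168.1.20/vmbr0,6666@192.168.1.50/aa:bb:cc:dd:ee:ff
    # on the collector: nc -u -l 6666

If the freeze is a full hardware lockup, even netconsole may miss the final moment, but it usually catches more than the on-disk journal does.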
I am aware that interrupts are what wake the processor out of C-states, so a shallower (more active) C-state reducing interrupt latency makes sense to me. (I have yet to cycle through the C-states to figure out whether 2/3/4 are stable; a sketch of how I plan to do that follows the questions below.) But I would still love to know two things:
- The R5 2400GE was a relatively mainstream CPU and seems to be supported in modern kernels without widely reported issues. What interaction between the kernel and Proxmox could be causing this, and is capping the maximum C-state the most appropriate fix?
- What changed between the 6.8.x and later kernels to make it only work on 6.8?
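For the C-state bisection mentioned above, the cpuidle sysfs interface should let me disable individual idle states at runtime instead of rebooting with a different max_cstate each time. A sketch, with the caveat that the sysfs state index does not map 1:1 to ACPI C-state numbers (state0 is usually the polling state), so check the name files first:

    grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name    # list the idle states the driver exposes
    for cpu in /sys/devices/system/cpu/cpu[0-9]*; do           # disable one state (e.g. state3) on all cores
        echo 1 > "$cpu/cpuidle/state3/disable"
    done
    cat /sys/devices/system/cpu/cpu0/cpuidle/state3/usage      # entry counter; shows whether a state is used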
Any guidance y’all can provide in my efforts to find these answers is appreciated!
(Edited Spelling and Clarity - 11/29)
(Edited Another Typo - 12/1)