Host server keeps restarting randomly

Maxvds

New Member
Mar 2, 2023
Hi,

I am encountering a weird issue. I have a couple of identical servers, but one of them keeps randomly restarting with one of the following errors:
- The watchdog timer power cycled the system.
- CPU 1 machine check error detected.
The interval varies: sometimes it happens every day, sometimes once a month. I have contacted Dell, and they think it is an OS error, but the Proxmox logs do not indicate any issues. I have already reinstalled Proxmox on a new disk without any luck. Every server is exactly the same (same hardware, same firmware, same CPU and RAM).

It is running Proxmox 7.3-4 on a PowerEdge R7525.

Have any of you encountered this issue before, and what could be done to rule out OS issues or to fix it?

Kind regards,
Max
 
Hi,

Since you do not see anything interesting in the syslog, I would check that the firmware on the server is up to date and whether the server was running hot when it rebooted. Otherwise, please provide us with more information about your Proxmox VE server:
- Is it a cluster? If yes:
* Do you have HA running?
* What does the network configuration of the cluster look like?
- Did you check the dmesg output?
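
The temperature and hardware-event checks mentioned above could be done from the host roughly as follows. This is only a sketch: it assumes `ipmitool` is installed and can talk to the BMC (iDRAC), and the helper name `sel_reset_events` is made up for illustration.

```shell
# Hypothetical helper: show BMC event-log entries that mention a watchdog
# or machine-check event. Assumes ipmitool is installed and can reach the BMC.
sel_reset_events() {
    ipmitool sel elist | grep -i -E 'watchdog|machine check'
}

# Usage (as root on the affected host):
#   sel_reset_events
#   ipmitool sdr type Temperature   # current temperature sensor readings
```

Comparing the timestamps of these entries with the last lines of the syslog before each reboot may help to tell a hardware-triggered reset from an OS-triggered one.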
 
Hi Moayad,

Thanks for the swift response. Yes, it is a cluster with HA enabled. Yes, I did check the dmesg logs, but they simply stop at the moment the server reboots. There are no errors before the reboot.

Network config:
1677762123303.png
 
Can you please provide us with the Corosync configuration (`/etc/pve/corosync.conf`) and the syslog from the time the system rebooted? You can limit the syslog to a specific time range using the journalctl CLI, e.g.:

Bash:
journalctl --since "2023-03-02 00:00" --until "2023-03-02 07:45" > /tmp/Syslog.log
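
If the exact reboot time is not known, the boot list can be used to pick the window. A small sketch along the lines of the command above; the wrapper name `journal_window` is hypothetical, and reading a previous boot's log assumes the journal is stored persistently.

```shell
# Hypothetical helper: dump the journal for a time window to a file.
# Arguments: start time, end time, output file.
journal_window() {
    journalctl --since "$1" --until "$2" > "$3"
}

# Usage:
#   journalctl --list-boots     # shows when each recorded boot started and ended
#   journal_window "2023-03-02 00:00" "2023-03-02 07:45" /tmp/Syslog.log
#   journalctl -k -b -1         # kernel messages of the previous boot
```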
 
Not C-states altogether, only C-state 6. For your specific BIOS you need to read the manual or search online for what the option is actually called.
Disable C-states completely only if nothing else helps or your board has no option for typical power at idle (that option disables C-state 6, which is new and AMD-only so far; also, kernel 6.1 does not work correctly with C-state 6).
 
There is no option to disable only Cstate 6. Thanks for the extra exploration!
 
I will try to explain it in more detail, in my non-native English...

With Zen (IIRC Zen gen 1 or 2, at least a few years ago already), AMD introduced the new C-state 6, which saves power (not to be confused with Cool'n'Quiet). C-state 6 is entered when your CPU is idle, even for a short time. The VRMs or the power supply chain (I don't know exactly which) are instructed to deliver less power until the CPU asks for more again when it comes under load. In my observation, the Linux kernel, even 6.1, does not handle C6 correctly yet. The problem seems to occur exactly at the point where the CPU comes under load and needs more power again but does not get it, so the cores stall and the system crashes.

I don't have a Dell server with AMD CPUs, but HPE has such a setting, which is confusingly named but does exactly that: it disables C6. Sorry, I don't remember exactly what the setting is called on HPE servers with Epyc; it is one of the power modes. I am pretty sure Dell has something equivalent, but it may also be confusingly named, with no mention of C6. Desktop/workstation boards sometimes call it Typical Idle Power; I have only once seen it named correctly, "disable Cstate 6", on an ASRock Rack board.

I guess this will take some trial and error plus searching forums specific to your model.

I don't know whether my explanation is 100% technically correct, but you get the idea. Try it and you will enjoy your AMDs.

The best option is to disable C-state 6 (whatever the option is called in your BIOS/UEFI).
If it is really not available, you could run a load script that keeps about 1% load on the CPU all the time (an ugly workaround, but you still have frequency stepping _and_ turbo boost).
As a last resort, disable C-states completely; you lose turbo boost and your cores always run at full base frequency, which eats energy. Maybe the next kernel will have full support for C6.

Just to clarify: it is not a Proxmox issue; any Linux flavor and older Windows versions are affected. I don't know about bleeding-edge kernel versions, as I have never tried them.
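
To see which idle states the kernel currently uses, the cpuidle entries in sysfs can be listed. This is only a sketch of the OS-side view, not the BIOS setting described above: the helper name `list_idle_states` is made up, and which of the listed states corresponds to the C6 behavior discussed here depends on the platform and idle driver, so treat that mapping as an assumption to verify.

```shell
# Hypothetical helper: list the idle states the kernel exposes for one CPU,
# with their "disable" flag (0 = in use, 1 = disabled).
# Optional argument: cpuidle directory (defaults to that of cpu0).
list_idle_states() {
    local dir state
    dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for state in "$dir"/state*; do
        [ -r "$state/name" ] || continue
        printf '%s %s disable=%s\n' \
            "$(basename "$state")" "$(cat "$state/name")" "$(cat "$state/disable")"
    done
}

# Usage:
#   list_idle_states
# To switch one state off until the next reboot (as root; "state2" is only
# an example, check the names printed above first):
#   for f in /sys/devices/system/cpu/cpu*/cpuidle/state2/disable; do echo 1 > "$f"; done
```

Disabling a state this way does not survive a reboot, so it is mainly useful for testing whether the restarts stop before changing firmware settings.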
 
Hi,

Thank you for the syslog and corosync config.

Based on the Corosync config you provided, we highly recommend adding a separate network and/or a new ring for Corosync [0].
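
A quick way to check how many links the current config defines is to look for the `ringN_addr` keys. A minimal sketch; the helper name `corosync_links` is hypothetical.

```shell
# Hypothetical helper: print the distinct ringN_addr keys found in a
# corosync.conf, one per line. Only ring0_addr means a single link;
# two or more keys mean redundant links are configured.
corosync_links() {
    grep -o -E 'ring[0-9]+_addr' "${1:-/etc/pve/corosync.conf}" | sort -u
}

# Usage:
#   corosync_links
#   pvecm status      # current quorum and membership information
```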

In the syslog you provided, there are a number of "watchdog update failed - Broken pipe" messages from the HA stack. Can you try restarting the pve-ha-lrm service and then check its status?

Bash:
systemctl restart pve-ha-crm.service
systemctl status pve-ha-crm.service
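
To see how often the watchdog message appears in an exported log, before and after the restart, something like the following could be used. A sketch only; the helper name `count_watchdog_failures` is made up, and the file path is the one from the journalctl example earlier in the thread.

```shell
# Hypothetical helper: count lines containing "watchdog update failed"
# in a log file given as the first argument.
count_watchdog_failures() {
    grep -c 'watchdog update failed' "$1"
}

# Usage:
#   count_watchdog_failures /tmp/Syslog.log
#   systemctl status watchdog-mux.service   # the watchdog multiplexer used by HA
```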


[0] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#pvecm_redundancy
 
Really? How much longer will it take before you finally pin that C-state 6 fix on the forum? (Not necessarily my wording; make a sticky with a clean explanation of your own.) Otherwise, enjoy the same questions every few days. People apparently don't use the search, and you Proxmox mods ask the same questions over and over again. More than 95% of these failures with AMD are fixed by simply disabling C6 in the BIOS/UEFI or by keeping the CPU loaded so it never goes idle.
 
Just curious, have you found the option yet? It would be good to have it documented here for the future: what it is called in your motherboard's BIOS/UEFI settings. Thanks in advance.
 
