[SOLVED] Proxmox restarting regularly since 7.3/7.4 upgrade

Mar 14, 2021
Hi, I'm running the latest Proxmox non-commercial distribution (7.4-3) after upgrading a few weeks ago from the 7.2-2 I had been running since June 2022. I had a few challenges doing the upgrade on my ASUS motherboard, and had to disable C-states in the BIOS just to get the machine to boot reliably.

System info:
ASUS PRIME X570-P Motherboard
AMD Ryzen 5 5600G
64 GB RAM

Current installed packages
Code:
proxmox-archive-keyring/stable,now 2.2 all [installed]
proxmox-backup-client/stable,now 2.4.1-1 amd64 [installed]
proxmox-backup-file-restore/stable,now 2.4.1-1 amd64 [installed]
proxmox-backup-restore-image/stable,now 0.3.1 amd64 [installed]
proxmox-kernel-helper/stable,now 7.4-1 all [installed,automatic]
proxmox-mail-forward/stable,now 0.1.1-1 amd64 [installed,automatic]
proxmox-mini-journalreader/stable,now 1.3-1 amd64 [installed]
proxmox-offline-mirror-docs/stable,now 0.5.1-1 all [installed,automatic]
proxmox-offline-mirror-helper/stable,now 0.5.1-1 amd64 [installed,automatic]
proxmox-ve/stable,now 7.4-1 all [installed]
proxmox-websocket-tunnel/stable,now 0.1.0-1 amd64 [installed,automatic]
proxmox-widget-toolkit/stable,now 3.6.5 all [installed]
psmisc/stable,now 23.4-2 amd64 [installed]
pve-cluster/stable,now 7.3-3 amd64 [installed]
pve-container/stable,now 4.4-3 all [installed]
pve-docs/stable,now 7.4-2 all [installed]
pve-edk2-firmware/stable,now 3.20230228-2 all [installed]
pve-firewall/stable,now 4.3-1 amd64 [installed]
pve-firmware/stable,now 3.6-4 all [installed]
pve-ha-manager/stable,now 3.6.0 amd64 [installed]
pve-i18n/stable,now 2.12-1 all [installed]
pve-kernel-5.13.19-2-pve/stable,now 5.13.19-4 amd64 [installed]
pve-kernel-5.13.19-6-pve/stable,now 5.13.19-15 amd64 [installed,automatic]
pve-kernel-5.13/stable,now 7.1-9 all [installed]
pve-kernel-5.15.104-1-pve/stable,now 5.15.104-1 amd64 [installed,automatic]
pve-kernel-5.15.85-1-pve/stable,now 5.15.85-1 amd64 [installed,auto-removable]
pve-kernel-5.15/stable,now 7.4-1 all [installed,automatic]
pve-kernel-6.1.15-1-pve/stable,now 6.1.15-1 amd64 [installed,automatic]
pve-kernel-6.2.6-1-pve/stable,now 6.2.6-1 amd64 [installed,automatic]
pve-kernel-6.2.9-1-pve/stable,now 6.2.9-1 amd64 [installed,automatic]
pve-kernel-6.2/stable,now 7.4-1 all [installed]
pve-lxc-syscalld/stable,now 1.2.2-1 amd64 [installed]
pve-manager/stable,now 7.4-3 amd64 [installed]
pve-qemu-kvm/stable,now 7.2.0-8 amd64 [installed]
pve-xtermjs/stable,now 4.16.0-1 amd64 [installed]

I'm running 4 guest VMs, a mixture of Linux and one Windows guest. There is a 500 GB Western Digital NVMe drive as the system disk. The VMs are stored on that disk, with most also accessing NFS stores on a NAS over ethernet.

I've seen from some other threads (https://forum.proxmox.com/threads/proxmox-keeps-crashing.117837/) that there are some existing issues with AMD CPUs. As such I've already tried a bunch of things, none of which appear to help:
  • disabled global c-states in the BIOS
  • tried opt-in kernels 5.19, 6.1 & 6.2 (none seem to make any difference)
  • other BIOS tweaks (enabled/disabled PSS Support, enabled IOMMU, disabled Precision Boost Overdrive)
  • updated BIOS to latest version 4602 from 4021 (ASUS BIOS versions)
I don't see anything that looks particularly helpful in /var/log/messages at the time of the crash. Here is the output of `journalctl -p err`:

Code:
-- Boot f4d9ec9a2bc540699a3f07b9651f43e2 --
Apr 08 09:20:33 bigboy upsd[1290]: not listening on 127.0.0.1 port 3493
Apr 08 09:21:33 bigboy kernel: usb 3-4.2: 3:3: cannot set freq 24000 to ep 0x82
-- Boot 8b11a55bd0e54c4b8c2627b169feb56a --
Apr 08 10:01:01 bigboy kernel: usb 3-4.2: 3:3: cannot set freq 24000 to ep 0x82
Apr 08 10:01:09 bigboy upsd[1287]: not listening on 127.0.0.1 port 3493

(the machine is running NUT, with a CyberPower UPS connected)
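For anyone trying to quantify how often a host like this restarts, a rough sketch follows. It assumes the journal is persistent (as the multi-boot output above suggests) and that boot boundaries appear as `-- Boot <id> --` marker lines like the ones shown. The helper name `count_boots` is made up for illustration.

```shell
#!/bin/sh
# count_boots: count "-- Boot <id> --" marker lines read from stdin.
# (Hypothetical helper; the marker format is taken from the output above.)
count_boots() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c '^-- Boot [0-9a-f][0-9a-f]* --$' || true
}

# Demo on the two markers quoted in this post:
count_boots <<'EOF'
-- Boot f4d9ec9a2bc540699a3f07b9651f43e2 --
Apr 08 09:20:33 bigboy upsd[1290]: not listening on 127.0.0.1 port 3493
-- Boot 8b11a55bd0e54c4b8c2627b169feb56a --
Apr 08 10:01:01 bigboy kernel: usb 3-4.2: 3:3: cannot set freq 24000 to ep 0x82
EOF

# On the host itself, something along these lines (not run here):
#   journalctl -p err --no-pager | count_boots
#   journalctl --list-boots          # one line per recorded boot
#   journalctl -b -1 -n 50           # last 50 lines before the previous boot ended
```

The last lines of the previous boot (`-b -1`) are usually the most interesting place to look, although a hard reset often leaves nothing useful there.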

The Proxmox host is now restarting a number of times per day. Sometimes it will last an hour, sometimes 4-10 hours. Prior to the 7.x upgrade, it was very stable - the more I tweak it, the more unstable it appears to become. I'm not sure where else to go from here, appreciate any suggestions and assistance.
 
I've now tried:

* removing every USB device except for the keyboard.
* running a two-pass memtest86 test, with no memory errors.

No change to reboot behaviour.
 
I've done some further reading / googling and found a few posts from Ubuntu and Debian users having issues with X570 and/or Ryzen CPUs (here and here), and these have helped me solve my problem. I'm not sure why this was fine for so long but started to be an issue with 7.3/7.4. My installation history is as follows:

28 Jun 2022 - 7.2-2
3 Mar 2023 - 7.3-1
5 Apr 2023 - 7.4-2

I began to have unexplained reboots from 7.3-1. I upgraded my BIOS and moved to Proxmox 7.4-2 to try to fix those issues, but the rebooting went from a few times a day to a few times an hour.

After much tearing out of hair, I found a fix!

SOLUTION
I've tweaked the kernel command line by editing the /etc/default/grub file, replacing the existing GRUB_CMDLINE_LINUX_DEFAULT value:

Code:
GRUB_CMDLINE_LINUX_DEFAULT="quiet pci=assign-busses apicmaintimer idle=poll reboot=cold,hard"

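and then regenerated the GRUB configuration by running `update-grub` (this does not rebuild the kernel; the new parameters take effect on the next boot).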
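If you would rather append the parameters to whatever is already on that line instead of replacing the value outright, a minimal sketch is below. `add_grub_params` is a hypothetical helper, not a Proxmox or GRUB tool. It assumes the file contains a single `GRUB_CMDLINE_LINUX_DEFAULT="..."` line and that the parameters contain no `|`, `&` or backslash characters.

```shell
#!/bin/sh
# add_grub_params FILE PARAM...
# Append each PARAM to GRUB_CMDLINE_LINUX_DEFAULT in FILE unless it is
# already present. A copy of the original is kept as FILE.bak.
add_grub_params() {
    file="$1"; shift
    # Extract the current value between the double quotes.
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")
    new="$current"
    for param in "$@"; do
        case " $new " in
            *" $param "*) ;;                      # already there, skip
            *) new="${new:+$new }$param" ;;       # append with a space
        esac
    done
    cp "$file" "$file.bak"
    sed "s|^GRUB_CMDLINE_LINUX_DEFAULT=.*|GRUB_CMDLINE_LINUX_DEFAULT=\"$new\"|" \
        "$file.bak" > "$file"
}

# Intended use on the host, as root (not run here):
#   add_grub_params /etc/default/grub pci=assign-busses apicmaintimer idle=poll reboot=cold,hard
#   update-grub
```

On hosts that boot through systemd-boot (for example ZFS-on-root installs), the command line lives in /etc/kernel/cmdline and is applied with `proxmox-boot-tool refresh` instead, so this sketch would not apply there.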

Previously, I also installed the opt-in kernels, so I'm currently running the most recent 6.2 version (6.2.9-1-pve). I'm not sure whether this is required, but it's not hurting.
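To confirm which kernel is actually running and whether the new parameters made it onto the command line after a reboot, something like the following should do. `check_cmdline` is a hypothetical helper name; /proc/cmdline and `uname -r` are standard.

```shell
#!/bin/sh
# check_cmdline FILE PARAM...
# Print "PARAM: yes|no" for each PARAM depending on whether it appears as a
# whole word in FILE. Returns non-zero if any PARAM is missing.
check_cmdline() {
    file="$1"; shift
    missing=0
    for param in "$@"; do
        # One word per line, then an exact fixed-string line match.
        if tr ' ' '\n' < "$file" | grep -Fxq -- "$param"; then
            echo "$param: yes"
        else
            echo "$param: no"
            missing=1
        fi
    done
    return "$missing"
}

# Running kernel release, e.g. 6.2.9-1-pve on the system described here.
uname -r

# Parameters from the SOLUTION section above.
if [ -r /proc/cmdline ]; then
    check_cmdline /proc/cmdline pci=assign-busses apicmaintimer idle=poll reboot=cold,hard \
        || echo "some parameters are not active (reboot pending, or a different boot loader?)"
fi
```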

I'm no expert, but at a pinch I'd say the `apicmaintimer` directive (more here) is largely doing the heavy lifting here. Here's what each of the new directives is apparently doing:

  • pci=assign-busses: This option tells the kernel to always assign all PCI bus numbers itself, overriding whatever the firmware has set up.
  • apicmaintimer: As described in the kernel's older x86-64 boot options documentation, this option runs timekeeping from the local APIC timer instead of the PIT/HPET interrupt. It is meant for systems where the PIT/HPET interrupts are unreliable.
  • idle=poll: This option forces a polling idle loop, so idle CPUs spin instead of entering low-power sleep states. This can slightly improve wake-up latency, but it uses considerably more power and makes the system run hotter.
  • reboot=cold,hard: This option sets the reboot mode used when a reboot is requested. Both cold and hard are mode values (as opposed to warm), so it changes how the machine is reset rather than how it behaves while running.
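Since `idle=poll` and the BIOS C-state setting both concern CPU idle states, it can be informative to look at what the kernel exposes before and after the change. The sketch below reads the standard cpuidle sysfs entries. `list_idle_states` is a made-up helper name, and what appears in the output depends on the platform and kernel, so treat it as a way to compare two boots rather than as a definitive test.

```shell
#!/bin/sh
# list_idle_states [DIR]
# Print "stateN: NAME" for every cpuidle state found under DIR
# (defaults to the entries for CPU 0).
list_idle_states() {
    dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for state in "$dir"/state*; do
        # Skip silently if the directory or the name file does not exist.
        [ -r "$state/name" ] || continue
        printf '%s: %s\n' "${state##*/}" "$(cat "$state/name")"
    done
}

# Active cpuidle driver, if the kernel exposes one.
if [ -r /sys/devices/system/cpu/cpuidle/current_driver ]; then
    printf 'cpuidle driver: %s\n' "$(cat /sys/devices/system/cpu/cpuidle/current_driver)"
fi

list_idle_states
```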

After making the above changes, I'm over 24 hours with zero crashes. I wasn't able to stay stable for more than 2 hours after the April 5th update to 7.4-2. Overall, this appears to be a fairly nasty upstream Debian/AMD issue. Is there a best way to log this as a bug so it can be addressed and others don't have the same experience I did?
 

Well done!
I hope you are still up and firing?
 
I had this same problem with "pve-manager/8.0.4/d258a813cfa6b390 (running kernel: 6.2.16-18-pve)": my system was sometimes randomly rebooting and freezing without any explanation. After applying the boot parameters you shared I have no more random reboots or freezes and my system has an update. I just came here to thank you for sharing those boot parameters. So thank you for having shared them here!
 