Workaround for core watchdog issue on ASRock X670E motherboards

johnwbyrd

Member
Mar 28, 2021
9
2
8
55
Here are some breadcrumbs for anyone debugging random reboot issues on Proxmox 8.3.1 or later.

tl:dr; If you're experiencing random unpredictable reboots on a Proxmox rig, try DISABLING (not leaving at Auto) your Core Watchdog Timer in the BIOS.

I have built a Proxmox 8.3 rig with the following specs:
  • CPU: AMD Ryzen 9 7950X3D 4.2 GHz 16-Core Processor
  • CPU Cooler: Noctua NH-D15 82.5 CFM CPU Cooler
  • Motherboard: ASRock X670E Taichi Carrara EATX AM5 Motherboard
  • Memory: 2 x G.Skill Trident Z5 Neo 64 GB (2 x 32 GB) DDR5-6000 CL30 Memory
  • Storage: 4 x Samsung 990 Pro 4 TB M.2-2280 PCIe 4.0 X4 NVME Solid State Drive
  • Storage: 4 x Toshiba MG10 512e 20 TB 3.5" 7200 RPM Internal Hard Drive
  • Video Card: Gigabyte GAMING OC GeForce RTX 4090 24 GB Video Card
  • Case: Corsair 7000D AIRFLOW Full-Tower ATX PC Case — Black
  • Power Supply: be quiet! Dark Power Pro 13 1600 W 80+ Titanium Certified Fully Modular ATX Power Supply
This particular rig, when updated to the latest Proxmox with GPU passthrough as documented at https://pve.proxmox.com/wiki/PCI_Passthrough , showed a behavior where the system would randomly reboot under load, with no indications as to why it was rebooting. Nothing in the Proxmox system log indicated that a hard reboot was about to occur; it merely occurred, and the system would come back up immediately, and attempt to recover the filesystem.

At first I suspected the PCI Passthrough of the video card, which seems to be the source of a lot of crashes for a lot of users. But the crashes were replicable even without using the video card.

After an embarrassing amount of bisection and testing, it turned out that for this particular motherboard (ASRock X670E Taichi Carrarra), there exists a setting Advanced\AMD CBS\CPU Common Options\Core Watchdog\Core Watchdog Timer Enable in the BIOS, whose default setting (Auto) seems to be to ENABLE the Core Watchdog Timer, hence causing sudden reboots to occur at unpredictable intervals on Debian, and hence Proxmox as well.

The workaround is to set the Core Watchdog Timer Enable setting to Disable. In my case, that caused the system to become stable under load.

Because of these types of misbehaviors, I now only use zfs as a root file system for Proxmox. zfs played like a champ through all these random reboots, and never corrupted filesystem data once.

In closing, I'd like to send shame to ASRock for sticking this particular footgun into the default settings in the BIOS for its X670E motherboards. Additionally, I'd like to warn all motherboard manufacturers against enabling core watchdog timers by default in their respective BIOSes.
 
Last edited:
  • Like
Reactions: _gabriel

About

The Proxmox community has been around for many years and offers help and support for Proxmox VE, Proxmox Backup Server, and Proxmox Mail Gateway.
We think our community is one of the best thanks to people like you!

Get your subscription!

The Proxmox team works very hard to make sure you are running the best software and getting stable updates and security enhancements, as well as quick enterprise support. Tens of thousands of happy customers have a Proxmox subscription. Get yours easily in our online shop.

Buy now!