New hardware causing random reboots

Kiwa

New Member
Feb 6, 2024
Hello. After almost a week of despair and hundreds of configurations tried, I'm asking for help in case someone who has had the same problem can tell me how to solve it.
I've built several Proxmox servers on different PCs and servers and I never had this problem.

Problem
At first everything worked fine. I installed Proxmox, installed the AMD CPU Microcode and put this node into the cluster.
Then I migrated the virtual machine from the Mini-PC to this new one.
I did a lot of tests and configuration changes to get passthrough working for the Intel graphics card, and once I got it working the errors started:
The PC reboots itself without showing any errors in the log.
It is very easy to replicate the bug because no matter what I do in Proxmox (migrate or boot a VM) it will reboot itself.

From all the times it restarted I was able to capture some lines:

Aug 02 14:31:41 kiwa kernel: mce: Uncorrected hardware memory error in user-access at 1fa3d2680
Aug 02 14:31:41 kiwa kernel: Memory failure: 0x1fa3d2: recovery action for unsplit thp: Ignored
Aug 02 14:31:41 kiwa kernel: mce: Memory error not recovered
Aug 02 14:31:41 kiwa kernel: mce: [Hardware Error]: Machine check events logged
Aug 02 14:31:41 kiwa kernel: [Hardware Error]: Uncorrected, software restartable error.
Aug 02 14:31:41 kiwa kernel: [Hardware Error]: CPU:5 (19:21:2) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|-|Poison|-]: 0xbc00080001010135
Aug 02 14:31:41 kiwa kernel: [Hardware Error]: Error Addr: 0x00000001fa3d2680
Aug 02 14:31:41 kiwa kernel: [Hardware Error]: IPID: 0x001000b000000000
Aug 02 14:31:41 kiwa kernel: [Hardware Error]: Load Store Unit Ext. Error Code: 1
Aug 02 14:31:41 kiwa kernel: [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
-----------------------------------------
Aug 04 23:49:21 kiwa kernel: mce: Uncorrected hardware memory error in user-access at 21e2fca80
Aug 04 23:49:21 kiwa kernel: mce: [Hardware Error]: Machine check events logged
Aug 04 23:49:21 kiwa kernel: [Hardware Error]: Uncorrected, software containable error.
Aug 04 23:49:21 kiwa kernel: [Hardware Error]: CPU:16 (19:21:2) MC1_STATUS[-|UE|MiscV|AddrV|-|TCC|-|-|Poison|-]: 0xbc800800060c0859
Aug 04 23:49:21 kiwa kernel: [Hardware Error]: Error Addr: 0x000000021e2fca80
Aug 04 23:49:21 kiwa kernel: [Hardware Error]: IPID: 0x000100b000000000
Aug 04 23:49:21 kiwa kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 12
Aug 04 23:49:21 kiwa kernel: [Hardware Error]: cache level: L1, mem/io: IO, mem-tx: IRD, part-proc: SRC (no timeout)
Aug 04 23:49:21 kiwa kernel: Memory failure: 0x21e2fc: Sending SIGBUS to CPU 3/KVM:2010 due to hardware memory corruption
Aug 04 23:49:21 kiwa kernel: Memory failure: 0x21e2fc: recovery action for dirty LRU page: Recovered
Aug 04 23:49:21 kiwa QEMU[1902]: kvm: Guest MCE Memory Error at QEMU addr 0x7bac982fc000 and GUEST addr 0x1004fc000 of type BUS_MCEERR_AR injected
Aug 04 23:49:21 kiwa QEMU[1902]: kvm: Guest MCE Memory Error at QEMU addr 0x7bac982fc000 and GUEST addr 0x1004fc000 of type BUS_MCEERR_AR injected
-----------------------------------------
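For anyone reading the two machine-check excerpts above, here is a minimal sketch of how the raw STATUS values can be unpacked. The bit positions are the ones I understand the Linux kernel to use in its MCI_STATUS_* definitions for AMD CPUs, and the page-frame step assumes 4 KiB pages; the function names are only illustrative and not part of any tool.

```python
# Sketch only: bit positions are assumed to match the Linux kernel's
# MCI_STATUS_* definitions for AMD CPUs; function names are hypothetical.
FLAGS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57, "TCC": 55,
    "Deferred": 44, "Poison": 43,
}


def decode_status(status: int) -> dict:
    """Split a 64-bit MCA STATUS value into flags and error codes."""
    flags = [name for name, bit in FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        # Bits 21:16 hold the extended error code ("Ext. Error Code" in the log).
        "ext_error_code": (status >> 16) & 0x3F,
        # Bits 15:0 hold the architectural error code.
        "error_code": status & 0xFFFF,
    }


def page_frame(addr: int, page_shift: int = 12) -> int:
    """Physical address to page frame number, assuming 4 KiB pages."""
    return addr >> page_shift


if __name__ == "__main__":
    # STATUS and "Error Addr" values copied from the two log excerpts.
    events = [
        (0xBC00080001010135, 0x1FA3D2680),
        (0xBC800800060C0859, 0x21E2FCA80),
    ]
    for status, addr in events:
        info = decode_status(status)
        print(f"{status:#018x} flags={'|'.join(info['flags'])} "
              f"xec={info['ext_error_code']} pfn={page_frame(addr):#x}")
```

Both values decode to an uncorrected (UC) error with the Poison bit set, and the computed page frames match the "Memory failure: 0x1fa3d2" and "0x21e2fc" lines in the log.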
Strange log: I never issued a reboot. About 11 minutes after the last SSH session I started a VM migration with no VM running, and the node rebooted about 4 minutes later without finishing it.
Aug 05 19:30:04 kiwa pmxcfs[1169]: [status] notice: received log
Aug 05 19:30:05 kiwa pmxcfs[1169]: [status] notice: received log
Aug 05 19:30:05 kiwa pmxcfs[1169]: [status] notice: received log
Aug 05 19:30:05 kiwa sshd[36826]: Accepted publickey for root from 192.168.10.12 port 57944 ssh2: RSA SHA256:1vypSvYRlCXwmy91GD6n/vhd/lyY+c15TA2npcNylBk
Aug 05 19:30:05 kiwa sshd[36826]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Aug 05 19:30:05 kiwa systemd[1]: Created slice user-0.slice - User Slice of UID 0.
Aug 05 19:30:05 kiwa systemd[1]: Starting user-runtime-dir@0.service - User Runtime Directory /run/user/0...
Aug 05 19:30:05 kiwa systemd-logind[926]: New session 5 of user root.
Aug 05 19:30:05 kiwa systemd[1]: Finished user-runtime-dir@0.service - User Runtime Directory /run/user/0.
Aug 05 19:30:05 kiwa systemd[1]: Starting user@0.service - User Manager for UID 0...
Aug 05 19:30:05 kiwa (systemd)[36829]: pam_unix(systemd-user:session): session opened for user root(uid=0) by (uid=0)
Aug 05 19:30:05 kiwa systemd[36829]: Queued start job for default target default.target.
Aug 05 19:30:05 kiwa systemd[36829]: Created slice app.slice - User Application Slice.
Aug 05 19:30:05 kiwa systemd[36829]: Reached target paths.target - Paths.
Aug 05 19:30:05 kiwa systemd[36829]: Reached target timers.target - Timers.
Aug 05 19:30:05 kiwa systemd[36829]: Starting dbus.socket - D-Bus User Message Bus Socket...
Aug 05 19:30:05 kiwa systemd[36829]: Listening on dirmngr.socket - GnuPG network certificate management daemon.
Aug 05 19:30:05 kiwa systemd[36829]: Listening on gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
Aug 05 19:30:05 kiwa systemd[36829]: Listening on gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
Aug 05 19:30:05 kiwa systemd[36829]: Listening on gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
Aug 05 19:30:05 kiwa systemd[36829]: Listening on gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
Aug 05 19:30:05 kiwa systemd[36829]: Listening on dbus.socket - D-Bus User Message Bus Socket.
Aug 05 19:30:05 kiwa systemd[36829]: Reached target sockets.target - Sockets.
Aug 05 19:30:05 kiwa systemd[36829]: Reached target basic.target - Basic System.
Aug 05 19:30:05 kiwa systemd[36829]: Reached target default.target - Main User Target.
Aug 05 19:30:05 kiwa systemd[36829]: Startup finished in 88ms.
Aug 05 19:30:05 kiwa systemd[1]: Started user@0.service - User Manager for UID 0.
Aug 05 19:30:05 kiwa systemd[1]: Started session-5.scope - Session 5 of User root.
Aug 05 19:30:05 kiwa sshd[36826]: pam_env(sshd:session): deprecated reading of user environment enabled
Aug 05 19:30:05 kiwa login[36851]: pam_unix(login:session): session opened for user root(uid=0) by root(uid=0)
Aug 05 19:30:05 kiwa login[36856]: ROOT LOGIN on '/dev/pts/0' from '192.168.10.12'
Aug 05 19:30:12 kiwa sshd[36826]: Received disconnect from 192.168.10.12 port 57944:11: disconnected by user
Aug 05 19:30:12 kiwa sshd[36826]: Disconnected from user root 192.168.10.12 port 57944
Aug 05 19:30:12 kiwa sshd[36826]: pam_unix(sshd:session): session closed for user root
Aug 05 19:30:12 kiwa systemd-logind[926]: Session 5 logged out. Waiting for processes to exit.
Aug 05 19:30:12 kiwa systemd[1]: session-5.scope: Deactivated successfully.
Aug 05 19:30:12 kiwa systemd-logind[926]: Removed session 5.
Aug 05 19:30:12 kiwa pmxcfs[1169]: [status] notice: received log
Aug 05 19:30:23 kiwa systemd[1]: Stopping user@0.service - User Manager for UID 0...
Aug 05 19:30:23 kiwa systemd[36829]: Activating special unit exit.target...
Aug 05 19:30:23 kiwa systemd[36829]: Stopped target default.target - Main User Target.
Aug 05 19:30:23 kiwa systemd[36829]: Stopped target basic.target - Basic System.
Aug 05 19:30:23 kiwa systemd[36829]: Stopped target paths.target - Paths.
Aug 05 19:30:23 kiwa systemd[36829]: Stopped target sockets.target - Sockets.
Aug 05 19:30:23 kiwa systemd[36829]: Stopped target timers.target - Timers.
Aug 05 19:30:23 kiwa systemd[36829]: Closed dbus.socket - D-Bus User Message Bus Socket.
Aug 05 19:30:23 kiwa systemd[36829]: Closed dirmngr.socket - GnuPG network certificate management daemon.
Aug 05 19:30:23 kiwa systemd[36829]: Closed gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
Aug 05 19:30:23 kiwa systemd[36829]: Closed gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
Aug 05 19:30:23 kiwa systemd[36829]: Closed gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
Aug 05 19:30:23 kiwa systemd[36829]: Closed gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
Aug 05 19:30:23 kiwa systemd[36829]: Removed slice app.slice - User Application Slice.
Aug 05 19:30:23 kiwa systemd[36829]: Reached target shutdown.target - Shutdown.
Aug 05 19:30:23 kiwa systemd[36829]: Finished systemd-exit.service - Exit the Session.
Aug 05 19:30:23 kiwa systemd[36829]: Reached target exit.target - Exit the Session.
Aug 05 19:30:23 kiwa systemd[1]: user@0.service: Deactivated successfully.
Aug 05 19:30:23 kiwa systemd[1]: Stopped user@0.service - User Manager for UID 0.
Aug 05 19:30:23 kiwa systemd[1]: Stopping user-runtime-dir@0.service - User Runtime Directory /run/user/0...
Aug 05 19:30:23 kiwa systemd[1]: run-user-0.mount: Deactivated successfully.
Aug 05 19:30:23 kiwa systemd[1]: user-runtime-dir@0.service: Deactivated successfully.
Aug 05 19:30:23 kiwa systemd[1]: Stopped user-runtime-dir@0.service - User Runtime Directory /run/user/0.
Aug 05 19:30:23 kiwa systemd[1]: Removed slice user-0.slice - User Slice of UID 0.
Aug 05 19:41:38 kiwa pvedaemon[1335]: <root@pam> starting task UPID:kiwa:00009823:0013657B:66B10ED2:qmigrate:109:root@pam:
Aug 05 19:44:55 kiwa pmxcfs[1169]: [status] notice: received log
-- Reboot --
Aug 05 19:45:55 kiwa kernel: Linux version 6.8.8-4-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.8-4 (2024-07-26T11:15Z) ()
Aug 05 19:45:55 kiwa kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.8-4-pve root=/dev/mapper/pve-root ro quiet
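To make the "no shutdown messages before the reboot" pattern easier to spot, here is a rough sketch that scans saved journal text for the "-- Reboot --" marker and reports the last line logged before it and the gap to the first line of the next boot. It assumes the syslog-style timestamps shown above, which carry no year, so the year is an assumption; the function names are made up for illustration.

```python
from datetime import datetime

# Assumption: the stamps carry no year, so one is supplied (2024, per the thread).
ASSUMED_YEAR = 2024


def parse_stamp(line: str) -> datetime:
    """Parse a leading 'Aug 05 19:44:55' stamp (English month abbreviations)."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(
        f"{ASSUMED_YEAR} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
    )


def reboot_gaps(lines, marker="-- Reboot --"):
    """Return (last line before the reboot, seconds until the next boot's first line)."""
    gaps = []
    last_line = None
    after_marker = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == marker:
            after_marker = last_line is not None
            continue
        if after_marker:
            seconds = (parse_stamp(line) - parse_stamp(last_line)).total_seconds()
            gaps.append((last_line, seconds))
            after_marker = False
        last_line = line
    return gaps


if __name__ == "__main__":
    # Lines copied from the log above.
    sample = [
        "Aug 05 19:44:55 kiwa pmxcfs[1169]: [status] notice: received log",
        "-- Reboot --",
        "Aug 05 19:45:55 kiwa kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.8-4-pve root=/dev/mapper/pve-root ro quiet",
    ]
    for last, seconds in reboot_gaps(sample):
        print(f"{seconds:.0f}s gap after: {last}")
```

If the last line before each marker is ordinary activity rather than a systemd shutdown sequence, the reset was most likely not requested by the OS, which is what the excerpt above shows.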



Hardware
  • Proxmox kernel version: 6.8.8-4-pve
  • CPU microcode: 0x0a20120e
  • CPU: Ryzen 9 5900X
  • MB: ASUS TUF Gaming B550M-Plus WiFi II (BIOS is updated)
  • RAM (first one tested): Corsair Vengeance LPX DDR4 3600 CL18 4x16GB (CMK32GBX4M2D3600C18) (optimized for Intel, XMP only, running at 2100 MHz because XMP doesn't work: the BIOS crashes).
  • RAM (currently using): Corsair Vengeance RGB RT DDR4 3600 CL16 4x16GB (CMN32GX4M2Z3600C16) (optimized for AMD, can run at 3600 MHz with DOCP)
  • GPU: Intel Arc A380 (crashes with and without it installed)
  • PSU: Sharkoon SilentStorm SFX Gold 500W
  • M2-1: Samsung 990 PRO 1TB Gen4 (Proxmox installed here)
  • M2-2: Kingston 2TB

What I tested
After checking many forum threads with people who had the same problem, I've tried several things.

  • Since I had made a lot of changes (BIOS, GRUB...), I first reverted many of them, but that didn't solve anything.
  • I removed the Intel GPU and it kept doing the same thing.
  • So I migrated the VM back to the Mini-PC, reset the BIOS, and formatted and reinstalled Proxmox on the new PC. I did everything again without the GPU, but while migrating the VM it rebooted again without finishing the process. So the GPU isn't the problem.
  • For testing I installed a Windows 11 VM and it keeps crashing the node (4 cores, CPU type host, and 8GB RAM, with and without the Intel GPU).
  • I installed some new RAM that is compatible with AMD and it keeps crashing, with and without OC.
  • A 4-hour Memtest run reports PASS with no errors.
  • I disabled C-States with no change.
  • Things I still have to try:
    • Once the 4-hour Memtest finishes, I will try disabling the Boost Clock Override in the BIOS.
      • Reboots after approx. 10 min with the Windows 11 VM on BIOS defaults, changing only the virtualization setting and disabling Boost Clock. Log:
Aug 05 16:02:37 kiwa pvestatd[1315]: auth key pair too old, rotating..
Aug 05 16:05:13 kiwa systemd[1]: Starting systemd-tmpfiles-clean.service - Cleanup of Temporary Directories...
Aug 05 16:05:13 kiwa systemd[1]: systemd-tmpfiles-clean.service: Deactivated successfully.
Aug 05 16:05:13 kiwa systemd[1]: Finished systemd-tmpfiles-clean.service - Cleanup of Temporary Directories.
Aug 05 16:05:13 kiwa systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deactivated successfully.
Aug 05 16:06:33 kiwa pmxcfs[1174]: [status] notice: received log
-- Reboot --
Aug 05 16:09:51 kiwa kernel: Linux version 6.8.8-4-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC PMX 6.8.8-4 (2024-07-26T11:15Z) ()
Aug 05 16:09:51 kiwa kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.8-4-pve root=/dev/mapper/pve-root ro quiet
  • Do a clean install again without installing the CPU microcode and without adding the node to the cluster, with full BIOS defaults except virtualization ON.
    • It crashes too.
  • Format and install Windows 11 to check whether it is a Proxmox or a hardware problem.
    • Reboots and BSODs, even without the M.2 drives and installing to a SATA HDD.
  • Try another PSU
    • Reboots too, in any OS.
  • Test with 2 or 1 RAM modules
    • Reboots too, in any OS.
  • 8 August 2024: Ordered a new motherboard of the same model to check which is broken, the CPU or the MB.
12 August 2024
The motherboard was not broken. The CPU is broken. I bought a new 5900x and it works without any problem.
 
I think the hw error logs:

Aug 04 23:49:21 kiwa kernel: mce: Uncorrected hardware memory error in user-access at 21e2fca80
Aug 04 23:49:21 kiwa QEMU[1902]: kvm: Guest MCE Memory Error at QEMU addr 0x7bac982fc000 and GUEST addr 0x1004fc000 of type BUS_MCEERR_AR injected

indicate that either the memory or the cpu is the problem here... since you already changed the memory, i'd lean to the cpu
usually if it's a software problem you get at least a bit of logs/dump/etc. not a straight reboot

another thing that's possible is the PSU, if it can't keep up with whatever the PC is doing or is faulty

if the change from 'it's working' to 'it crashes' was that you put additional load onto it, that would support this theory
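As a small aid for this kind of diagnosis, here is a hedged sketch that tallies machine-check lines per logical CPU and MCA bank from saved journal text. The regular expression only targets the "[Hardware Error]: CPU:N (...) MCx_STATUS" format seen in the logs in this thread, and the names are illustrative.

```python
import re
from collections import Counter

# Assumption: matches only the AMD MCE line format quoted in this thread.
MCE_LINE = re.compile(
    r"\[Hardware Error\]: CPU:(\d+) \(([0-9a-fA-F:]+)\) MC(\d+)_STATUS"
)


def tally_mce(lines):
    """Count machine-check events keyed by (logical CPU, MCA bank)."""
    counts = Counter()
    for line in lines:
        match = MCE_LINE.search(line)
        if match:
            cpu = int(match.group(1))
            bank = int(match.group(3))
            counts[(cpu, bank)] += 1
    return counts


if __name__ == "__main__":
    # Lines copied from the first post.
    sample = [
        "Aug 02 14:31:41 kiwa kernel: [Hardware Error]: CPU:5 (19:21:2) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|-|Poison|-]: 0xbc00080001010135",
        "Aug 04 23:49:21 kiwa kernel: [Hardware Error]: CPU:16 (19:21:2) MC1_STATUS[-|UE|MiscV|AddrV|-|TCC|-|-|Poison|-]: 0xbc800800060c0859",
    ]
    for (cpu, bank), count in sorted(tally_mce(sample).items()):
        print(f"CPU {cpu}, bank MC{bank}: {count} event(s)")
```

In the logs posted here the two events come from different logical CPUs (5 and 16) and different banks (MC0 and MC1), and both still occurred after the memory had been swapped.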
 
Yes, it's definitely the CPU. Maybe the motherboard too.
I've tried another PSU, removing the M.2 drives, and using a SATA HDD.
With default BIOS settings, installing Windows 11 or 10 either reboots in the first steps where you create the user, or gives a blue screen, "WHEA uncorrectable error", at every startup.

I've filed a CPU warranty replacement because I don't have another motherboard to test with.

Thanks for your reply.
I found it very strange that Proxmox gave so many problems when I have never had them.
 
I've received the new 5900X CPU and everything is running fine for now.
The problem was the CPU, it was broken.
 