Proxmox Install Appears To Crash At Random Times

mhayhurst

Hello everyone!

I've been using Proxmox on NUCs for a while now. I recently purchased a new ASRock NUC Box 1165G7 and installed Proxmox 7.1 as a ZFS mirror on two internal disks (a SATA SSD and an NVMe M.2). I started experiencing odd behavior where pings and the Proxmox UI would stop responding. This happens anywhere from a few minutes to several hours after power cycling the NUC Box. The last time it happened I was connected via SSH and these error messages appeared on the CLI:


Bash:
Message from syslogd@proxmox1 at Jan 25 13:43:40 ...
 kernel:[  700.197051] watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [kvm:2070]

Message from syslogd@proxmox1 at Jan 25 13:43:40 ...
 kernel:[  700.197051] watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [kvm:2070]

Message from syslogd@proxmox1 at Jan 25 13:44:08 ...
 kernel:[  728.197189] watchdog: BUG: soft lockup - CPU#6 stuck for 52s! [kvm:2070]

Message from syslogd@proxmox1 at Jan 25 13:44:08 ...
 kernel:[  728.197189] watchdog: BUG: soft lockup - CPU#6 stuck for 52s! [kvm:2070]

Message from syslogd@proxmox1 at Jan 25 13:44:13 ...
 kernel:[  733.317921] traps: PANIC: double fault, error_code: 0x0

Message from syslogd@proxmox1 at Jan 25 13:44:13 ...
 kernel:[  733.317921] traps: PANIC: double fault, error_code: 0x0

I read this might be a memory issue, so I booted my Proxmox installation USB and selected "Test memory", but that does nothing except reboot my machine. Is there another way to test the memory, or is there something else I should be looking at?
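
For anyone triaging messages like the ones above, here is a minimal sketch that pulls out which CPU and process each soft lockup names and the longest reported stall. It assumes the log text has been saved or pasted as a string. The function name `summarize_lockups` and the regular expression are illustrative, not part of any Proxmox tool, and the pattern is only known to match the line format shown in this post.

```python
import re
from collections import defaultdict

# Matches kernel soft-lockup lines in the format shown above, e.g.
#   kernel:[  700.197051] watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [kvm:2070]
# (assumption: other kernels may word this message differently)
LOCKUP_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<proc>[^:\]]+):(?P<pid>\d+)\]"
)


def summarize_lockups(log_text):
    """Return {(cpu, process, pid): longest stuck time in seconds}."""
    worst = defaultdict(int)
    for match in LOCKUP_RE.finditer(log_text):
        key = (int(match["cpu"]), match["proc"], int(match["pid"]))
        worst[key] = max(worst[key], int(match["secs"]))
    return dict(worst)


# Sample taken from the messages in this post (duplicates omitted).
SAMPLE = """\
 kernel:[  700.197051] watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [kvm:2070]
 kernel:[  728.197189] watchdog: BUG: soft lockup - CPU#6 stuck for 52s! [kvm:2070]
 kernel:[  733.317921] traps: PANIC: double fault, error_code: 0x0
"""

if __name__ == "__main__":
    for (cpu, proc, pid), secs in summarize_lockups(SAMPLE).items():
        print(f"CPU#{cpu} {proc}[{pid}] stuck for up to {secs}s")
```

If the same CPU and the same `kvm` process show up every time, that points at a guest or virtualization path rather than a random core, though it does not by itself rule out hardware.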
 
I've had similar "soft lockup - CPU stuck" errors when using faulty USB disks.
You can create a memtest86+ USB stick, boot into it and let it run overnight.
 
The memory test ran for 14 hours and passed with 0 errors, but Proxmox is still crashing, so I don't know what's wrong. Since you experienced this problem with faulty USB drives, is there a test I could perform on my SSD and NVMe M.2 drives?
 
OK, try playing with these BIOS settings...

Okay. I had already disabled "CPU C States Support", but even after trying various combinations of this setting and its sub-settings, Proxmox still crashes. Does this seem like something is dying? Before enabling "CPU C States Support" I rebooted, and within minutes Proxmox crashed, this time with:

kvm [2268]: vcpu5, guest rIP 0xffffffffb35cccf2 vmx: unexpected exit reason 0x3
set kvm_intel.dump_invalid_vmcs=1 to dump internal KVM state.

When this first started happening, Proxmox would have an uptime of more than 24 hours; now Proxmox (with no load) crashes within 1-15 minutes after each reboot. So it appears to be getting worse.
 
I don't know, maybe a bug in the BIOS or faulty hardware...
Can you give the new kernel a try?

I updated to pve-kernel-5.15, rebooted, and within 5 minutes Proxmox crashed. I'm not sure what else to do. I have Proxmox 6.4-13 installed on an Intel NUC (using similar hardware) and it's been a workhorse with several VMs running and no problems. I'm thinking about returning this ASRock NUC and getting another Intel NUC. The only downside is that the Intel NUC has one NIC while the ASRock has two, which I want for pfSense.
 
Just heard back from ASRock:

'Thank you for contacting ASRock IPC Technical Support.

Regarding your question, we have tested NUC BOX-1165G7 in our lab.

The system could run Linux Ubuntu 21.04(Kernel 5.11) without any problem.

NUC BOX-1165G7 is Intel Tiger Lake Platform, and it requires at least kernel 5.7 or later to work properly.

Since Linux is an open source OS, we suggest install the required kernel and verifying with your configuration to check if it could meet the requirement.'


I'll let them know I am running Kernel 5.15 and see what they say. They also provided a download link to BIOS P1.50C which is not available on their main website to my knowledge. I'm going to update the BIOS to P1.50C as well.
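
As a quick sanity check against the "kernel 5.7 or later" requirement in ASRock's reply, here is a minimal sketch that compares the running kernel release with that minimum. The helper names `parse_kernel_release` and `meets_minimum` are made up for illustration, and the release string `5.15.19-1-pve` mentioned in the comments is a hypothetical example of the format, not a value taken from this machine.

```python
import platform
import re

# Minimum kernel quoted in ASRock's reply above.
MIN_KERNEL = (5, 7)


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '5.15.19-1-pve'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(release, minimum=MIN_KERNEL):
    """Compare as integer tuples; plain string comparison would rank 5.15 below 5.7."""
    return parse_kernel_release(release) >= minimum


if __name__ == "__main__":
    release = platform.release()  # same value `uname -r` reports on Linux
    status = "meets" if meets_minimum(release) else "is below"
    print(f"{release} {status} the {MIN_KERNEL[0]}.{MIN_KERNEL[1]} minimum")
```

Passing this check only shows the kernel is new enough by ASRock's stated requirement; as this thread shows, crashes can still occur on 5.15.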
 
Proxmox seems to work very well with the ASRock 4X4 BOX-4800U AMD Ryzen. I'm disappointed in ASRock's customer service as they have stopped responding to my emails. In conclusion, I would say ASRock's release of the NUC BOX-1165G7 is premature and that's why I was not able to use a virtualization environment like Proxmox.
 
Oh good, anyway you gain more room for VMs with the 4800U ;)
 
Yes, you are right! I may have spoken too soon, as ASRock responded to my email. They were closed for Chinese New Year, which I had forgotten about. ASRock stated they installed Proxmox 7.1 on a NUC BOX-1165G7 in their lab and will be testing it, so it's possible they will find the problem and release a fix in their next BIOS update.
 
