Proxmox goes unresponsive (crashes)

simpleone71

Aug 28, 2024

I have been trying out Proxmox for around two months now. I have two "servers": Dell OptiPlex 7090s, each with an i7-11700T, 128GB RAM, a ZFS mirror on two Samsung PM863a 3.84TB drives, and an Intel I226-V card (in addition to the integrated Dell LAN). One runs perfectly, and the other crashes around every two days. When it crashes, it gets segfaults. VM-wise, I am running two Windows Server 2019 VMs and one Ubuntu server on it. It is not stressed in the least. When it crashes, the three VMs are unresponsive, and even pveproxy crashes, leaving the machine completely unresponsive. I have to pull the power or power-cycle it using vPro. Also, when it crashes, it does not even write what happened to the logs. I had to set up a syslog server on another machine to learn that it is getting the segfaults. Here is what I have tried so far.

1) Changed the CPU (tried two other CPUs, for a total of three)
2) Stress-tested the CPU (passed)
3) Ran memtest for over 24 hours (passed)
4) Swapped out the four 16GB RAM sticks
5) Checked SMART health on the Samsung SSDs (passed, no warnings)
6) Ran a ZFS scrub on the drives (completed with no errors)
7) Tried two other machines (Dell OptiPlex 5080 and Dell OptiPlex 5090)

I have TSO and GSO disabled on the NICs, as leaving them enabled caused other issues.
I am running the latest version with all the latest updates.
I have the latest agent and VirtIO drivers installed on the VMs.
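
For readers who want to try the same workaround, here is a minimal sketch of disabling both offloads with `ethtool`. The interface name `enp1s0`, the function name and the `DRY_RUN` switch are placeholders for illustration, not details from the setup described here. The setting is lost on reboot unless it is also added to the network configuration.

```shell
# Sketch: turn off TCP segmentation offload (TSO) and generic
# segmentation offload (GSO) on one interface.
# With DRY_RUN set, the command is only printed and nothing changes.
disable_offloads() {
    iface="$1"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "ethtool -K $iface tso off gso off"
    else
        ethtool -K "$iface" tso off gso off
    fi
}

# Example (placeholder interface name, needs root):
#   disable_offloads enp1s0
# Check the result with:
#   ethtool -k enp1s0 | grep -E 'tcp-segmentation|generic-segmentation'
```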

The only items I have not swapped out (as I do not have any more) are the two Samsung PM863a drives, but I have run tests on them and even wiped them, recreated the pool, and restored from backups. They have a high health status (around 85%).

I have attached screenshots from the syslog server with the errors. The machine crashes randomly, usually when nothing is even happening. I do use PBS for backups and they run with no issues during the night. It does not seem to crash related to that.

Can anyone look at these logs and let me know if you see anything? It is so frustrating that the two machines are identical, yet one crashes or goes unresponsive while the other runs more VMs and is rock solid. Somehow I feel it is related to one of the VMs running, but I am not sure why. One is a domain controller, one is Bitwarden, and one is ScreenConnect. I am also running a secondary domain controller on the other Proxmox server, and that is working great.

Attachments: 1.png to 8.png (screenshots from the syslog server)

Just crashed again
It's typically a hardware issue, but you have already tested the hardware and probably cannot replace more parts just for testing. It may be a Linux kernel incompatibility, which has happened to others with kernel 6.8.x: try the latest 6.8.12 kernel (no-subscription repository) or the previous 6.5 kernel.
 
Hello,

Which kernel version were you running at the time of the crash? You can check the current version with

Code:
uname -a

and you can see the last 20 booted kernels via

Code:
last reboot -F -n20

Are you using `mdraid`? You can check the system logs to verify as some appliances use it.
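
One quick way to check is sketched below, under the assumption that the host exposes `/proc/mdstat` (the file only exists when the md driver is loaded). The function name `has_md_arrays` is made up for this example.

```shell
# Sketch: report whether the kernel currently has any md (software RAID)
# arrays. The path is a parameter only so the function can also be tried
# against a saved copy of the file.
has_md_arrays() {
    mdstat="${1:-/proc/mdstat}"
    if [ ! -r "$mdstat" ]; then
        echo "md driver not loaded"
        return 1
    fi
    # Array lines look like "md0 : active raid1 sdb1[1] sda1[0]".
    if grep -q '^md[0-9a-z_]* :' "$mdstat"; then
        echo "md arrays present"
    else
        echo "no md arrays"
        return 1
    fi
}

# Example:
#   has_md_arrays
```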

Is it possible for you to retrieve the logs in the pictures as plain text? This would be extremely helpful.

simpleone71 said: "it does not even write what happened to the logs. I have had to set up a syslog server on another machine and configure that to learn it is getting the segfaults."

Out of curiosity, how did you set this up? If I understood correctly, the system logs just end abruptly before the next boot?

As recommended above, you can try another kernel version, either 6.5 or the latest version in the 6.8 series. Our documentation [1] explains how to pin a specific kernel version. Another thing that might help is installing the latest microcode updates, see [2].

[1] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#sysboot_kernel_pin
[2] https://pve.proxmox.com/pve-docs/pve-admin-guide.html#_set_up_early_os_microcode_updates
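
A rough sketch of what [1] and [2] boil down to on the command line, assuming a PVE 8 host with the no-subscription repository enabled and, for the microcode package, the Debian `non-free-firmware` component. The helper names and the `DRY_RUN` switch are only for illustration. The version passed to `pin_kernel` should be copied from the output of `proxmox-boot-tool kernel list`.

```shell
# Sketch: install an older kernel series, pin one version, and install
# Intel microcode. With DRY_RUN set, commands are printed, not executed.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

install_kernel_series() {
    run apt install "proxmox-kernel-$1"     # e.g. series 6.5
}

pin_kernel() {
    run proxmox-boot-tool kernel pin "$1"   # exact version from "kernel list"
}

install_intel_microcode() {
    run apt install intel-microcode         # needs non-free-firmware enabled
}

# Example (as root, then reboot):
#   install_kernel_series 6.5
#   proxmox-boot-tool kernel list
#   pin_kernel <version copied from the list>
#   install_intel_microcode
```

`proxmox-boot-tool kernel unpin` goes back to booting the newest installed kernel.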
 
Thanks, I appreciate any assistance.

uname -a
Linux pve01 6.8.12-1-pve #1 SMP PREEMPT_DYNAMIC PMX 6.8.12-1 (2024-08-05T16:17Z) x86_64 GNU/Linux

last reboot -F -n20
reboot system boot 6.8.12-1-pve Thu Aug 29 19:28:28 2024 still running
reboot system boot 6.8.12-1-pve Wed Aug 28 07:37:50 2024 still running
reboot system boot 6.8.12-1-pve Mon Aug 26 07:02:56 2024 still running
reboot system boot 6.8.12-1-pve Fri Aug 23 15:33:54 2024 - Sat Aug 24 20:07:30 2024 (1+04:33)
reboot system boot 6.8.12-1-pve Thu Aug 22 07:41:17 2024 - Sat Aug 24 20:07:30 2024 (2+12:26)
reboot system boot 6.8.12-1-pve Thu Aug 22 06:53:15 2024 - Thu Aug 22 07:40:48 2024 (00:47)
reboot system boot 6.8.12-1-pve Wed Aug 21 17:02:49 2024 - Thu Aug 22 07:40:48 2024 (14:37)
reboot system boot 6.8.12-1-pve Sat Aug 17 19:59:35 2024 - Wed Aug 21 17:02:19 2024 (3+21:02)
reboot system boot 6.8.12-1-pve Sat Aug 17 17:00:35 2024 - Wed Aug 21 17:02:19 2024 (4+00:01)
reboot system boot 6.8.12-1-pve Thu Aug 15 18:31:03 2024 - Wed Aug 21 17:02:19 2024 (5+22:31)
reboot system boot 6.8.12-1-pve Thu Aug 15 08:00:47 2024 - Thu Aug 15 17:58:35 2024 (09:57)
reboot system boot 6.8.12-1-pve Tue Aug 13 08:25:20 2024 - Thu Aug 15 17:58:35 2024 (2+09:33)
reboot system boot 6.8.12-1-pve Tue Aug 13 07:15:02 2024 - Tue Aug 13 08:24:51 2024 (01:09)
reboot system boot 6.8.12-1-pve Mon Aug 12 17:55:07 2024 - Tue Aug 13 08:24:51 2024 (14:29)
reboot system boot 6.8.12-1-pve Mon Aug 12 17:39:33 2024 - Mon Aug 12 17:54:38 2024 (00:15)
reboot system boot 6.8.12-1-pve Mon Aug 12 10:51:12 2024 - Mon Aug 12 17:35:50 2024 (06:44)
reboot system boot 6.8.12-1-pve Mon Aug 12 09:38:09 2024 - Mon Aug 12 17:35:50 2024 (07:57)
reboot system boot 6.8.12-1-pve Sat Aug 10 18:30:50 2024 - Mon Aug 12 17:35:50 2024 (1+23:05)
reboot system boot 6.8.12-1-pve Sat Aug 10 18:19:17 2024 - Sat Aug 10 18:29:52 2024 (00:10)
reboot system boot 6.8.12-1-pve Sat Aug 10 18:04:11 2024 - Sat Aug 10 18:17:55 2024 (00:13)
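
To make the restart pattern easier to read, here is a small sketch that counts boot records per calendar day from `last reboot -F` output. The function name `boots_per_day` is made up, and it assumes the full-date layout shown above (weekday, month, day, time, year).

```shell
# Sketch: count "reboot system boot" records per day.
# Reads `last reboot -F` output on stdin and prints "YEAR MONTH DAY COUNT",
# in the same order the days first appear in the input.
boots_per_day() {
    awk '$1 == "reboot" && $2 == "system" && $3 == "boot" {
             day = $9 " " $6 " " $7          # year, month, day of the boot
             if (!(day in count)) order[++n] = day
             count[day]++
         }
         END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }'
}

# Example:
#   last reboot -F -n20 | boots_per_day
```

Run against the listing above, this reports four boots on Aug 12 and three on Aug 10.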


I installed a syslog server on a VM, installed rsyslog on the PVE host, and configured it to send its logs to the syslog server.
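
For anyone who wants to reproduce this, a minimal sketch of the forwarding side on the PVE host follows. It is not necessarily the exact configuration used here: the server address `192.0.2.10`, port 514, the function name and the drop-in file name are all placeholders.

```shell
# Sketch: write an rsyslog drop-in that forwards all messages to a
# remote syslog server over TCP.
write_forward_conf() {
    server="$1"                                  # address of the syslog server
    conf="${2:-/etc/rsyslog.d/90-remote.conf}"   # hypothetical drop-in name
    {
        echo '# Forward every facility and priority; @@ means TCP, a single @ means UDP'
        echo "*.* @@$server:514"
    } > "$conf"
}

# Example (as root):
#   write_forward_conf 192.0.2.10
#   systemctl restart rsyslog
```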

Here are the text logs from a previous crash. I will look at installing the 6.5 kernel to see what happens.
 
Crashed again. Log file attached.

Hopefully someone will see something that points me to other things to try. I feel I have wasted almost two months of my life testing and trying different things to eliminate items. I am starting to think that Proxmox is just not an enterprise-ready solution. I use a lot of open source software in my home lab and do appreciate Proxmox for limited home usage. I am testing alternatives to VMware as potential replacements at my job. I understand that an OptiPlex is not enterprise hardware, but in my experience enterprise hardware is much more complex, and this keeps crashing on simpler hardware. I will not post again if no one has any thoughts, and I appreciate the one comment.
 
Maybe there is a difference in the BIOS version or in another device's firmware.

Can you compare the BIOS version between the stable and the unstable machine?

Also, please check the temperature of all the components.
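
Both checks can be done from the shell on each host. A minimal sketch, assuming the kernel exposes the usual sysfs paths (`/sys/class/dmi/id/bios_version` and `/sys/class/thermal/thermal_zone*/temp`, the latter in millidegrees Celsius). The function names are only for illustration.

```shell
# Sketch: print the BIOS version and the thermal zone temperatures.
# The directory arguments exist only so the functions can be pointed at
# a copy of the sysfs files.
bios_version() {
    cat "${1:-/sys/class/dmi/id}/bios_version"
}

print_temps() {
    root="${1:-/sys/class/thermal}"
    for zone in "$root"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue              # skip if nothing matched
        milli=$(cat "$zone/temp")                    # millidegrees Celsius
        name=$(cat "$zone/type" 2>/dev/null || echo unknown)
        echo "$name $((milli / 1000))C"
    done
}

# Example (run on both hosts and compare):
#   bios_version
#   print_temps
```

Thermal zones do not cover every component. `sensors` (from lm-sensors) and `smartctl -A` on each drive give more detail.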
 
The BIOS versions are the same and are the latest from Dell. The temps are within normal ranges. Although these are small form factor PCs, I have added an additional Noctua fan in the front to keep them very cool.
 
It has crashed again. I am done and will move on to my next test product for converting a datacenter off of VMware.

Thanks to the few that offered some things to check for.

Moderators can close this thread as unsolved.
 
I was/am using a Dell OptiPlex 7010 (i5, 16GB of memory) with a ZFS mirror of two 500GB "spinning" Seagate SATA drives. I only wanted to use it to run two VMs. I installed VE 8.2 twice and was able to create my two VMs, but each time it hung with a fatal error on the console after some time. I then installed VE 7.x (whatever the last release of 7 was) and it has been running fine for months. I had also attempted diagnostics and a Dell BIOS upgrade, but I gave up on V8 for that specific Dell hardware. I installed V8.2 on a new Supermicro server and that works fine. P.S. Installing Proxmox twice was very easy, but VM #1 was a Windows Server 2022 machine with Veeam B&R CE and VM #2 was a Veeam Linux Hardened Repository, and those took a long time to recreate each time. I am thankful that V7 is stable, but I can't use Veeam B&R for Proxmox VE unless I am on V8.2. It doesn't matter, though: Veeam is backing up three laptops and one desktop to a hardened repository made up of four 2TB drives, and I am happy with that for home use.