I have been trying out Proxmox for around two months now. I have two "servers": Dell Optiplex 7090s, each with an i7-11700T, 128GB RAM, a ZFS mirror on two Samsung PM863a 3.84TB drives, and an Intel I226-V card (in addition to the integrated Dell LAN). One runs perfectly, and the other crashes about every two days with segfaults. VM-wise, it runs two Windows Server 2019 VMs and one Ubuntu server, and it is not stressed in the least. When it crashes, all three VMs become unresponsive and even pveproxy crashes, leaving the machine completely unresponsive. I have to pull the power or power-cycle it using vPro. It also does not write anything about the crash to its own logs; I had to set up a syslog server on another machine and forward the logs there to learn that it is getting segfaults. Here is what I have tried so far:
1) Changed the CPU (tried two other CPUs for a total of three)
2) Stress tested the CPU (passed)
3) Ran memtest for over 24 hours (passed)
4) Swapped out the four 16GB RAM sticks
5) Checked SMART health on the Samsung SSDs (passed, with no warnings)
6) Ran a ZFS scrub on the drives (completed with no errors)
7) Tried using two other machines (Dell Optiplex 5080 and Dell Optiplex 5090)
I do have TSO and GSO disabled on the NICs, as they caused me other issues
I am running the latest version with all the latest updates
I do have the latest guest agent and VirtIO drivers installed on the VMs
The only items I have not swapped out (as I do not have any more) are the two Samsung PM863a drives, but I have run tests on them and even wiped them, recreated the pool, and restored from backups. They report a high health status (around 85%).
I have attached screenshots from the syslog server with the errors. The machine crashes randomly, usually when nothing is even happening. I do use PBS for backups, and they run with no issues during the night. The crashes do not seem to be related to that.
Can anyone look at these logs and let me know if you see anything? It is so frustrating that the two machines are identical, yet one crashes or goes unresponsive while the other runs more VMs and is rock solid. Somehow I feel it is related to one of the VMs running, but I am not sure why. One is a domain controller, one is Bitwarden, and one is ScreenConnect. I am also running a secondary domain controller on the other Proxmox server, and that is working great.
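In case it helps anyone reading the forwarded logs, here is a rough Python sketch for tallying kernel segfault lines by process and by the binary or library they faulted in. It assumes the usual kernel message shape (`name[pid]: segfault at ... ip ... sp ... error ... in module[...]`). The function name `summarize_segfaults` and the sample lines are made up for illustration and are not taken from my actual logs.

```python
import re
from collections import Counter

# Usual kernel segfault message shape (assumed), e.g.
#   name[1234]: segfault at 0 ip 00007f... sp 00007f... error 4 in libc.so.6[...]
SEGFAULT_RE = re.compile(
    r"(?P<proc>[^\s\[\]]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<module>[^\s\[]+))?"
)


def summarize_segfaults(lines):
    """Count segfault lines per (process name, faulting module) pair."""
    counts = Counter()
    for line in lines:
        match = SEGFAULT_RE.search(line)
        if match:
            # The module is missing when the kernel could not map the address.
            counts[(match.group("proc"), match.group("module") or "?")] += 1
    return counts


# Hypothetical sample lines, NOT real log output from the crashing host.
SAMPLE = [
    "Jan  1 03:12:45 pve2 kernel: [172800.123456] pvestatd[1401]: segfault at 0 "
    "ip 00007f3a1c2b4e10 sp 00007ffd5e1a2b30 error 4 in libc.so.6[7f3a1c200000+155000]",
    "Jan  1 03:12:46 pve2 kernel: [172801.000001] kvm[2290]: segfault at 7f10 "
    "ip 000055d0c0a1b2c3 sp 00007ffc0a1b2c3d error 6 in qemu-system-x86_64[55d0c0a00000+600000]",
    "Jan  1 03:12:47 pve2 kernel: [172802.000002] pvestatd[1402]: segfault at 8 "
    "ip 00007f3a1c2b4e10 sp 00007ffd5e1a2b30 error 4 in libc.so.6[7f3a1c200000+155000]",
    "Jan  1 03:13:00 pve2 systemd[1]: Started Session 12 of user root.",
]

if __name__ == "__main__":
    for (proc, module), n in summarize_segfaults(SAMPLE).most_common():
        print(f"{n:3d}  {proc}  in {module}")
```

If many unrelated processes fault in different libraries around the same time, that would point more toward hardware or the host kernel than toward any single VM; if it is always the same process and module, that would narrow things down. To run it against a real file, pass an open file handle in place of `SAMPLE`.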