Proxmox Server randomly crashes/restarts

AlexanderR

Well-Known Member
Jan 19, 2019
Hello Everybody,

We are seeing strange behaviour on one Proxmox server running version 6.0.7: the server randomly restarts or crashes. So far we haven't found anything unusual in the log files.

The crashes occur roughly 3-4 weeks apart.

Is there a way to investigate the root cause of the crashes?

I would really appreciate any help and can of course provide any additional information or log files needed.


Best regards,

Alex
 
Hi,
As a first step you could check the RAM by running an extended memory test. How does the system crash? Are you still able to connect via SSH? Do you get a kernel dump?
 
How can I get a kernel dump?

In fact the server is not freezing; it is just restarting. I don't know whether the restart is graceful or not, because it is a remote system.

We plan to run a RAM test this coming weekend.
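One way to tell whether a restart was graceful is to look at the boot and shutdown records that `last -x` prints from wtmp. Below is a minimal Python sketch, assuming `last -x` lists entries newest first and labels them `reboot` and `shutdown` in the first column. The function name `unclean_boots` is made up for this example. A `reboot` entry whose next older record is another `reboot` (with no `shutdown` in between) suggests the earlier session ended without a clean shutdown.

```python
def entry_kind(line):
    """Return the first column of a `last -x` line, or '' for blank lines."""
    fields = line.split()
    return fields[0] if fields else ""


def unclean_boots(last_x_lines):
    """Return the boot lines that were NOT preceded by a shutdown record.

    `last_x_lines` is the output of `last -x`, newest entry first
    (assumption: this is the ordering your version of `last` uses).
    """
    # Keep only boot and shutdown records; ignore logins, runlevel changes, etc.
    events = [ln for ln in last_x_lines if entry_kind(ln) in ("reboot", "shutdown")]

    suspects = []
    for i, line in enumerate(events):
        if entry_kind(line) != "reboot":
            continue
        # The next item in the list is the next OLDER event.
        if i + 1 < len(events) and entry_kind(events[i + 1]) == "reboot":
            suspects.append(line)
    return suspects
```

To use it on the affected host, feed it the real output, for example `subprocess.run(["last", "-x"], capture_output=True, text=True).stdout.splitlines()`. Any line it returns marks a boot that followed an unclean stop, and its timestamp is a good place to start reading the logs (with a persistent journal, `journalctl -b -1` shows the previous boot).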
 
Hi,

I had a similar problem on a system with AMD Ryzen 7, but it was not CPU related.

The machine has now been running for 5 days without restarting.

In my case, there were several hardware issues. One was memory related and it took about 3 hours running memtest to find the first error.

The second seems to be related to one of the SATA disks.

I am still investigating and will wait at least 10 days before closing this case.

You can run memtest and a long SMART self-test (smartctl -t long) just to double-check.
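For the disk side of that check, here is a small Python sketch that wraps the relevant `smartctl` calls. The helper names and the device path in the usage note are only examples. The flags themselves (`-t long`, `-l selftest`, `-H`) are standard smartmontools options. It needs root and the smartmontools package installed.

```python
import subprocess


def smart_commands(device):
    """Build the smartctl command lines for a long self-test on `device`."""
    return {
        "start": ["smartctl", "-t", "long", device],    # start the long self-test
        "log": ["smartctl", "-l", "selftest", device],  # show self-test results
        "health": ["smartctl", "-H", device],           # overall health verdict
    }


def run_step(device, step):
    """Run one of the steps above and return smartctl's text output."""
    cmd = smart_commands(device)[step]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout
```

Typical use would be `print(run_step("/dev/sda", "start"))`, waiting for the time that smartctl estimates, then `print(run_step("/dev/sda", "log"))`. The long test runs inside the drive itself, so the host stays usable while it runs.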

Regards,

Ricardo Jorge
 
For a quick answer, see the last paragraph.

I had similar issues with my home lab server:

2 Socket Intel Xeon CPU E5645 - total 12 cores 24 threads
64GB ECC RAM (4x16GB)
SuperMicro X8DTI-F
2x WD RED 1TB
TOSHIBA 4GB N300
Samsung SSD
Qualcomm Atheros AR93xx Wireless Network Adapter
Marvell 88SE9120 SATA Controller
VIA VL805 USB 3.0 Host Controller

I normally run 4-7 VMs on this machine. The main ones are:

pfSense router/firewall/access point (with pci pass-through)
FreeNAS with mirroring (with onboard Intel 82801JI SATA Controller pass through - WD Reds)
Virtualmin (CentOS) web server
Windows multipurpose vm

After trying many things I narrowed it down to power issues. It was either an unstable motherboard (which didn't seem likely, given the sudden onset after years of it being rock stable) or a bad PSU. I changed the PSU to a Corsair RM850x 850W. The issues went away for many months. I didn't have any issues with high loads or with starting and stopping VMs after that. Then, all of a sudden, the issues came back one day while executing a CPU-demanding task. After that I could not start all my VMs. Starting one or two was OK, but if I tried to do a normal boot with only my main 4 VMs, I would get an instant reboot. I could not bring myself to blame the recently bought PSU.

You see, what I failed to mention so far is that I am a moron :p (slapping myself).

Also, all this equipment was being powered by a slightly undersized and maybe failing EATON 5E 850VA UPS (I know, I was just being cheap). Changing the power supply to the RM850x may have masked the UPS issue in the first place. The power supply I had before the RM850x still works in other computers without issues.

Removing the UPS from the equation solved everything. I hope the exposure of my stupidity makes someone else act smarter and save some time :)
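To check whether a UPS is undersized: its VA rating is not the same as the watts it can deliver. Real power is roughly VA times the power factor. The Python sketch below works through that arithmetic. The 0.6 power factor, the 80% utilisation limit, and the 450 W load are assumptions for illustration only, not measurements from the machine above. Take the real watt rating from the UPS datasheet and measure the actual draw at the wall.

```python
def ups_real_power_watts(va_rating, power_factor):
    """Approximate real power (W) a UPS can deliver from its VA rating."""
    return va_rating * power_factor


def utilisation(load_watts, va_rating, power_factor):
    """Fraction of the UPS's real-power capacity used by the load."""
    return load_watts / ups_real_power_watts(va_rating, power_factor)


def is_undersized(load_watts, va_rating, power_factor, max_utilisation=0.8):
    """True if the load exceeds the chosen safety margin.

    max_utilisation=0.8 is a rule-of-thumb assumption, not a vendor figure.
    """
    return utilisation(load_watts, va_rating, power_factor) > max_utilisation


# Hypothetical numbers: an 850 VA unit with an assumed 0.6 power factor
# and an assumed 450 W peak draw for server, disks and network gear.
capacity = ups_real_power_watts(850, 0.6)
print(f"capacity ~{capacity:.0f} W, undersized: {is_undersized(450, 850, 0.6)}")
```

The takeaway is that a load which looks comfortable next to the "850" on the label can sit close to the unit's real limit once the power factor is applied, and peaks such as several VMs starting at once can push it over.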
 