Proxmox server randomly freezing and won't accept any input

Dannyzee

New Member
Mar 7, 2018
Hi,

We are new users of Proxmox with Ceph, which works fine so far, but we have a strange problem.
Once every few days our Proxmox management server freezes completely and stops responding.

The servers are Supermicro machines with IPMI. When we check via IPMI, the console no longer accepts input either. I have to reboot the server completely to get it working again, until it freezes once more. I haven't been able to find a pattern yet, so it seems completely random.

We also have 3 monitor servers with exactly the same hardware and configuration, and these don't freeze at all.

atop also logs right up until the time of the freeze, but we don't see any significant process running other than what was already running.

Any ideas or tricks we can try to fix this? The Ceph/Proxmox configuration is still in a test stage so it is not urgent, but we would like it to work so we can use the full functionality of the Proxmox/Ceph environment.
 
Hi,
"The servers are Supermicro machines with IPMI. When we check via IPMI, the console no longer accepts input either. I have to reboot the server completely to get it working again, until it freezes once more. I haven't been able to find a pattern yet, so it seems completely random."
If the IPMI stops responding, you have a HW problem.
No OS can break the IPMI.
 
Hello Wolfgang,

Thank you for your reply. That is what we are thinking too, but do you have an idea how we can check this?
I have run multiple Memtests, which completed without any faults.

Is there a way to check what exactly is broken?
 
"I have done multiple Memtests which went totally fine without any faults"
Working memory does not prove that the IPMI is not corrupt.

"Is there a way to check what exactly is broken?"
Contact your HW vendor. Some have special tools for this task.
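
Before going to the vendor, it may be worth reading the BMC's own event log, since it is recorded independently of the OS. The following is only a sketch, assuming `ipmitool` is installed on some machine that can reach the BMC over the network. `BMC_HOST`, `BMC_USER`, and the keyword list in `filter_hw_events` are placeholders of mine, not values from this thread.

```shell
#!/bin/sh
# Sketch only: BMC_HOST and BMC_USER are placeholders, not values from this thread.
BMC_HOST="${BMC_HOST:-bmc.example.invalid}"
BMC_USER="${BMC_USER:-changeme}"

# Read the BMC's System Event Log over the network.
# -E makes ipmitool take the password from the IPMI_PASSWORD environment variable.
read_sel() {
    ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -E sel elist
}

# Keep only lines that mention common hardware-fault keywords.
# The keyword list is a guess; adjust it to what your board actually logs.
filter_hw_events() {
    grep -Ei 'ecc|machine check|critical|uncorrectable|thermal|voltage' || true
}

# Usage, from any machine that can reach the BMC:
#   read_sel | filter_hw_events
```

An empty result does not prove the hardware is healthy; it only means the BMC recorded nothing matching those keywords.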
 
"If the IPMI stops responding, you have a HW problem.
No OS can break the IPMI."

In my experience, yes, it is possible.

In the BIOS there is an option to enable/disable IPMI for the OS. Keep it disabled.
 
The IPMI was still responding; it was just the OS that did not respond, so pressing any key did literally nothing.
 
Re-reading this made me think: IPMI works, it is just the console that no longer responds. Is there another way we can check this? I don't think it is HW related, as the IPMI itself works fine. When we go to the DC and attach a keyboard, the OS doesn't respond either until we reboot.
 
Do you see any errors in the syslog?
Turn on core dumps and make journald persistent.

How to make journald persistent:
Code:
# journald (default Storage=auto) logs persistently once this directory exists
mkdir -p /var/log/journal
systemctl restart systemd-journald

For core dumps, see
https://pve.proxmox.com/wiki/Enable_Core_Dump_systemd

Maybe you will get a hint this way.
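
To check that the persistent journal is in place, and to read what was logged just before the next freeze, something like the following could be used. This is a minimal sketch: the `JOURNAL_DIR` variable and the two helper functions are illustrative names of mine, while the `journalctl` options are standard ones.

```shell
#!/bin/sh
# Sketch: confirm journald will keep logs across the forced reboot.
JOURNAL_DIR="${JOURNAL_DIR:-/var/log/journal}"

# With journald's default Storage=auto, logs persist only if this directory exists.
journal_is_persistent() {
    [ -d "$1" ]
}

# Show what the previous boot logged just before it stopped.
# Not called automatically; run it after the next freeze and reboot.
show_previous_boot() {
    journalctl --list-boots                      # one line per recorded boot
    journalctl -b -1 -n 200 --no-pager           # last 200 lines of the previous boot
    journalctl -k -b -1 -p warning --no-pager    # kernel messages, warning and worse
}

if journal_is_persistent "$JOURNAL_DIR"; then
    echo "persistent journal: yes ($JOURNAL_DIR)"
else
    echo "persistent journal: no ($JOURNAL_DIR missing)"
fi
```

If the previous boot's log simply stops with no error at the end, that is itself a hint that the machine hung hard rather than a service crashing.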
 
Hello,

SSH is not working anymore. We can't log in.
When we log in to IPMI and open the iKVM HTML console, it doesn't accept input either until we reboot; after that it works fine.

The whole system freezes but IPMI works fine. We also went to the datacentre once when it happened, and it was the same problem: the machine was on and running, but SSH was frozen and the console wouldn't accept any input. Our other Ceph machines work fine; it is only this manager machine. Our monitors and storage servers run fine.
 
Hi,

The server is freezing again. How can I "read" these core dumps? I have installed gdb, but that is as far as my knowledge of core dumps goes.
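
For anyone else wondering the same thing, here is a rough sketch of how a core file is usually opened with gdb. It assumes the dumps land in `/var/lib/coredumps` (the directory used later in this thread); `newest_core` and `print_backtrace` are hypothetical helper names, and `/path/to/program` stands for whichever binary actually crashed.

```shell
#!/bin/sh
# Sketch: find the newest core file and print a backtrace from it.
CORE_DIR="${CORE_DIR:-/var/lib/coredumps}"

# Print the name of the most recently modified entry in a directory (empty if none).
newest_core() {
    ls -t "$1" 2>/dev/null | head -n 1
}

# Print backtraces of all threads. $1 = program that crashed, $2 = core file.
# The program path must match the binary that produced the core.
print_backtrace() {
    gdb -batch -ex 'thread apply all bt' "$1" "$2"
}

core=$(newest_core "$CORE_DIR")
if [ -n "$core" ]; then
    echo "newest core file: $CORE_DIR/$core"
    echo "inspect with: print_backtrace /path/to/program $CORE_DIR/$core"
else
    echo "no core files in $CORE_DIR"
fi
```

Keep in mind that a core file of this kind records a crashed process. If the whole machine hangs without any process crashing, there may be nothing to read.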
 
Hi,

I have checked, but there are no core dumps:

/var/lib/coredumps# ls -la

total 8
drwxr-xr-x 2 root root 4096 Mar 9 09:39 .
drwxr-xr-x 44 root root 4096 Mar 9 09:39 ..


Whenever I shut down PVE it doesn't produce core dumps either, and I followed every step of the guide exactly.
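
One way to tell whether core dumps are being generated at all is to check the relevant settings and then crash a throwaway process on purpose. The sketch below assumes the `/var/lib/coredumps` directory shown above; `core_limit_allows_dumps` and `trigger_test_crash` are hypothetical helper names, not part of the guide.

```shell
#!/bin/sh
# Sketch: check the settings that commonly stop core files from appearing.

# Succeeds if a core size limit (as printed by `ulimit -c`) allows dumps.
core_limit_allows_dumps() {
    [ "$1" = "unlimited" ] || [ "$1" -gt 0 ] 2>/dev/null
}

# Where the kernel writes core files (empty if not readable on this system).
echo "core_pattern: $(cat /proc/sys/kernel/core_pattern 2>/dev/null)"

if core_limit_allows_dumps "$(ulimit -c)"; then
    echo "core size limit for this shell: ok"
else
    echo "core size limit for this shell is 0; no core file will be written"
fi

# Deliberately crash a throwaway process to see whether a core file shows up.
# Not called automatically.
trigger_test_crash() {
    sleep 60 &
    kill -s SEGV "$!"
    wait "$!" 2>/dev/null
    ls -la /var/lib/coredumps
}
```

The limit printed here applies to processes started from this shell; services started by systemd have their own limit, so a working test crash does not guarantee that every daemon will dump.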
 
I am having random freezes as well, with the exact same symptoms as you.
I am also running a Supermicro board with 2 Xeon E5 v3 processors.

I cleaned up all error messages first, such as NTP, CPU microcode update, GPU power states, etc.

It still freezes randomly. The longest it has lasted is 24 hours, which is a new record for me.

I followed the core dump write-up as well and it didn't work for me.
Perhaps that write-up needs an update.
 
I'm seeing this as well on one of our Supermicro boards, which is running 2x Xeon E5-2640v4 on an X10DRi motherboard. I wrote it off as an anomaly, but it now happens about once every 1-3 months. I have recorded outages on 7/10/2017, 9/1/2018, 1/2/2018 and 28/3/2018.

PVE and the VMs lock up. IPMI is still functional and reads 30fps through the Java console (so capture is still functioning), but it's just a black screen, unlike a regular kernel panic where I'd expect to see the panic output. The only thing I can do is force a reboot via IPMI.

Fortunately it's running on very fast SSDs so I can often have the whole thing rebooted and VMs running in less than 5 minutes.

Unfortunately, nothing of substance makes it into any of the logs.
 
We are experiencing the same problems with a Supermicro Board X9DRi. Are there any new findings on this topic?
 
