Total system freeze, how to debug?

May 28, 2018
I have a single machine running the latest Proxmox (recently updated, too), but it freezes completely about once a month.
The machine is a very simple setup (no cluster) and runs our office software. Some limited downtime is not a big problem.

This machine has:
- ZFS
- 6 WD Red drives
- Intel Optane SSD as ZIL
- Latest-generation Xeon E-2176G CPU
- Supermicro server board MBD-X11SCA-F-O
- Intel 10 GbE NIC
- 64 GB of ECC RAM


When it freezes, nothing works anymore:
- No SSH
- No web UI
- Not a single VM works
- Not a single Docker container works (I have installed Docker on Proxmox)

One thing keeps working, and that is the Nginx instance I have on the host system.
I'm using it to forward domains to the Docker containers...
That Nginx returns 502 Bad Gateway errors, though, because the Docker containers behind it are obviously dead.

When looking at syslog, the last messages were:

Dec 8 00:21:00 proxmox systemd[1]: Starting Proxmox VE replication runner...
Dec 8 00:21:00 proxmox systemd[1]: pvesr.service: Succeeded.
Dec 8 00:21:00 proxmox systemd[1]: Started Proxmox VE replication runner.
Dec 8 00:22:00 proxmox systemd[1]: Starting Proxmox VE replication runner...
Dec 8 00:22:00 proxmox systemd[1]: pvesr.service: Succeeded.
Dec 8 00:22:00 proxmox systemd[1]: Started Proxmox VE replication runner.
Dec 8 00:23:00 proxmox systemd[1]: Starting Proxmox VE replication runner...
Dec 8 00:23:00 proxmox systemd[1]: pvesr.service: Succeeded.
Dec 8 00:23:00 proxmox systemd[1]: Started Proxmox VE replication runner.

After this I rebooted the machine entirely and everything came back up nicely.

So... what should be my next steps in trying to figure this out?
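A reasonable first step is making sure kernel messages from before a freeze actually survive the reboot, so there is more to look at than the truncated syslog above. A minimal sketch, assuming systemd-journald with its default `Storage=auto` setting (on newer Debian/Proxmox releases the journal is already persistent, so this may be a no-op):

```shell
# journald stores the journal on disk only if this directory exists;
# otherwise it lives in /run and is lost on every reboot.
mkdir -p /var/log/journal

# After the next freeze and forced reboot, inspect the tail of the
# previous boot's kernel messages:
#   journalctl -b -1 -k | tail -n 100
```

If the machine hard-locks before anything is flushed to disk, even a persistent journal may end as abruptly as the syslog excerpt above, which is why a console capture (monitor, BMC, or netconsole) is discussed further down the thread.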
 
Hi,

do you use an rpool with swap on that pool?
That used to be the default ZFS setup.
If so, remove the swap from the rpool and move it to the ZIL device.
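To check whether this applies, a quick sketch (the dataset name `rpool/swap` is the historical installer default and is an assumption here):

```shell
# If swap is backed by a ZFS zvol, swapon shows a /dev/zdN device:
swapon --show || true

# List zvols; look for something like rpool/swap
# (the || true keeps this from failing on non-ZFS systems):
zfs list -t volume 2>/dev/null || true

# To stop using a zvol-backed swap (hypothetical dataset name):
#   swapoff /dev/zvol/rpool/swap
# then remove the matching swap entry from /etc/fstab.
```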
 
What version of Proxmox VE are you using?
Code:
pveversion -v
 
output is:

Code:
root@proxmox:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.10-1-pve)
pve-manager: 6.1-3 (running version: 6.1-3/37248ce6)
pve-kernel-5.3: 6.0-12
pve-kernel-helper: 6.0-12
pve-kernel-5.0: 6.0-11
pve-kernel-4.15: 5.4-6
pve-kernel-5.3.10-1-pve: 5.3.10-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-5
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-9
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.1-2
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-1
pve-cluster: 6.1-2
pve-container: 3.0-14
pve-docs: 6.1-3
pve-edk2-firmware: 2.20191002-1
pve-firewall: 4.0-9
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-3
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 3.13.2-1
qemu-server: 6.1-2
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2
 
There are no known problems in this version.

Can you please send me the disk layout,
plus arc_summary and the memory usage?

Code:
lsblk
arc_summary
free -htl
 
Hi Wolfgang, thanks for your reply.
Here is the info you requested.
Adding it as an attachment because the forum won't let me post long messages.
 

Attachments

  • proxmox.txt (24.6 KB)
Hi @wolfgang. Is there anything else I can do?

Your ARC cache is set to use 32 GB of your RAM (50%). Does anything run around the time the server crashes that could suddenly cause the containers to require extra RAM?

Have you been able to connect a monitor to the server when it crashes and see if there is any output?
 
@sg90 unfortunately I have no idea about any extra load on the server. It seems like a long shot, as this server is hardly doing anything at this time. It's not fully in use yet (due to lack of time on my part).

I also don't have a monitor attached, but will try to find one.

Is there no way to get the screen contents from a log file or so?
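There are ways to capture the console without a physical monitor. The board listed above (X11SCA-F) should have an onboard BMC (the -F variant), so IPMI serial-over-LAN is one option; netconsole, which streams kernel printk output over UDP, is another. A sketch with placeholder addresses and credentials:

```shell
# Option 1: serial-over-LAN via the onboard BMC (run from another machine;
# the BMC IP and ADMIN/PASSWORD credentials below are placeholders):
#   ipmitool -I lanplus -H 192.168.1.50 -U ADMIN -P PASSWORD sol activate

# Option 2: netconsole sends kernel messages over UDP to a log host.
# Placeholder source IP/interface and target; the receiver runs: nc -u -l 6666
NETCONSOLE="netconsole=@192.168.1.5/eno1,6666@192.168.1.10/"
echo "modprobe $NETCONSOLE"    # run the printed command on the Proxmox host
```

Both operate below SSH and the web UI, so they can keep emitting output (including a panic trace) after everything else has died.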

I can also reduce the ARC RAM if you think it's too much, but since the server is not doing much, I can't think of any reason why 64GB of total RAM wouldn't be enough. Is there a way to detect out-of-memory issues? They should be in syslog, no?
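On the OOM question: yes, the kernel's OOM killer logs to the kernel ring buffer and the journal/syslog, so after a reboot it can be checked like this:

```shell
# Look for OOM-killer activity in the current boot's kernel log; the
# "|| echo" fallback makes an empty result explicit instead of silent:
dmesg 2>/dev/null | grep -iE "out of memory|oom-kill" \
    || echo "no OOM events in current boot"

# Same check against the previous boot (needs a persistent journal):
journalctl -k -b -1 2>/dev/null | grep -i "oom" \
    || echo "no OOM events in previous boot"
```

One caveat: a hard freeze can strike before anything is written out, in which case the logs end abruptly with no OOM trace at all, exactly like the syslog excerpt earlier in the thread.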
 

It depends on the crash. If the kernel is panicking because it can't evict the ARC quickly enough for the memory demand, a screen really is the best way to see this.

You could try reducing the ARC to, say, 25% of the available RAM and monitor the machine, but having the screen output during the crash should help you see better what it may be.
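For reference, the ARC ceiling is the `zfs_arc_max` module parameter (in bytes). A sketch of capping it at 25% of 64 GiB; the paths follow the standard OpenZFS-on-Debian layout:

```shell
# 25% of 64 GiB, in bytes:
ARC_MAX=$((64 * 1024 * 1024 * 1024 / 4))
echo "$ARC_MAX"    # prints 17179869184

# Persist it as a module option, then run `update-initramfs -u` and reboot:
#   echo "options zfs zfs_arc_max=$ARC_MAX" > /etc/modprobe.d/zfs.conf
# Or apply it at runtime without a reboot:
#   echo "$ARC_MAX" > /sys/module/zfs/parameters/zfs_arc_max
```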
 