Total system freeze, how to debug?

May 28, 2018
I have a single machine running the latest Proxmox (recently updated, too), but it freezes completely about once a month.
The machine is a very simple setup (no cluster) and runs our office software. Some limited downtime is not a big problem.

This machine has:
- ZFS
- 6 WD Red drives
- Intel Optane SSD as ZIL
- Latest-generation Xeon E-2176G CPU
- Supermicro server board MBD-X11SCA-F-O
- Intel 10 GbE NIC
- 64 GB of ECC RAM


When it freezes, nothing works anymore:
- No SSH
- No web UI
- Not a single VM works
- Not a single Docker container works (I have installed Docker on Proxmox)

One thing keeps working, and that is the Nginx instance I have on the host system.
I'm using it to forward domains to the Docker containers...
That Nginx returns 502 Bad Gateway errors, though, because the Docker containers behind it are obviously dead.

When looking at syslog, the last messages were:

Dec 8 00:21:00 proxmox systemd[1]: Starting Proxmox VE replication runner...
Dec 8 00:21:00 proxmox systemd[1]: pvesr.service: Succeeded.
Dec 8 00:21:00 proxmox systemd[1]: Started Proxmox VE replication runner.
Dec 8 00:22:00 proxmox systemd[1]: Starting Proxmox VE replication runner...
Dec 8 00:22:00 proxmox systemd[1]: pvesr.service: Succeeded.
Dec 8 00:22:00 proxmox systemd[1]: Started Proxmox VE replication runner.
Dec 8 00:23:00 proxmox systemd[1]: Starting Proxmox VE replication runner...
Dec 8 00:23:00 proxmox systemd[1]: pvesr.service: Succeeded.
Dec 8 00:23:00 proxmox systemd[1]: Started Proxmox VE replication runner.

After this I rebooted the machine entirely and everything came back up nicely.

So... what should be my next steps in trying to figure this out?
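A reasonable first step is making sure kernel messages from before a freeze actually survive the reboot, so there is more to look at than the truncated syslog above. A minimal sketch, assuming systemd-journald with its default `Storage=auto` setting (on newer Debian/Proxmox releases the journal is already persistent, so this may be a no-op):

```shell
# journald stores the journal on disk only if this directory exists;
# otherwise it lives in /run and is lost on every reboot.
mkdir -p /var/log/journal

# After the next freeze and forced reboot, inspect the tail of the
# previous boot's kernel messages:
#   journalctl -b -1 -k | tail -n 100
```

If the machine hard-locks before anything is flushed to disk, even a persistent journal may end as abruptly as the syslog excerpt above, which is why a console capture (monitor, BMC, or netconsole) is discussed further down the thread.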
 
Hi,

do you use an rpool with swap on that pool?
That used to be the default ZFS setup.
If so, remove the swap from the rpool and move it to the ZIL device.
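To check whether this applies, a quick sketch (the dataset name `rpool/swap` is the historical installer default and is an assumption here):

```shell
# If swap is backed by a ZFS zvol, swapon shows a /dev/zdN device:
swapon --show || true

# List zvols; look for something like rpool/swap
# (the || true keeps this from failing on non-ZFS systems):
zfs list -t volume 2>/dev/null || true

# To stop using a zvol-backed swap (hypothetical dataset name):
#   swapoff /dev/zvol/rpool/swap
# then remove the matching swap entry from /etc/fstab.
```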
 
What version of Proxmox VE are you using?
Code:
pveversion -v
 
output is:

Code:
root@proxmox:~# pveversion -v
proxmox-ve: 6.1-2 (running kernel: 5.3.10-1-pve)
pve-manager: 6.1-3 (running version: 6.1-3/37248ce6)
pve-kernel-5.3: 6.0-12
pve-kernel-helper: 6.0-12
pve-kernel-5.0: 6.0-11
pve-kernel-4.15: 5.4-6
pve-kernel-5.3.10-1-pve: 5.3.10-1
pve-kernel-5.0.21-5-pve: 5.0.21-10
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve4
criu: 3.11-3
glusterfs-client: 5.5-3
ifupdown: 0.8.35+pve1
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.13-pve1
libpve-access-control: 6.0-5
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-9
libpve-guest-common-perl: 3.0-3
libpve-http-server-perl: 3.0-3
libpve-storage-perl: 6.1-2
libqb0: 1.0.5-1
libspice-server1: 0.14.2-4~pve6+1
lvm2: 2.03.02-pve3
lxc-pve: 3.2.1-1
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.1-1
pve-cluster: 6.1-2
pve-container: 3.0-14
pve-docs: 6.1-3
pve-edk2-firmware: 2.20191002-1
pve-firewall: 4.0-9
pve-firmware: 3.0-4
pve-ha-manager: 3.0-8
pve-i18n: 2.0-3
pve-qemu-kvm: 4.1.1-2
pve-xtermjs: 3.13.2-1
qemu-server: 6.1-2
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.2-pve2
 
There are no known problems in this version.

Can you please send me the disk layout,
plus arc_summary and the memory usage?

Code:
lsblk
arc_summary
free -htl
 
Hi Wolfgang, thanks for your reply.
Here is the info you requested.
Adding it as an attachment because the forum won't let me post long messages.
 

Attachments

  • proxmox.txt (24.6 KB)
Hi @wolfgang. Is there anything else I can do?

Your ARC cache is set to use 32 GB of your RAM (50%). Does anything run around the time the server crashes that could suddenly cause the containers to require extra RAM?

Have you been able to connect a monitor to the server when it crashes and see if there is any output?
 
@sg90 unfortunately I have no idea about any extra load on the server. It seems like a long shot, as this server is hardly doing anything at this time. It's not fully in use yet (due to lack of time on my part).

I also don't have a monitor attached, but will try to find one.

Is there no way to get the screen contents from a log file or so?
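There are ways to capture the console without a physical monitor. The board listed above (X11SCA-F) should have an onboard BMC (the -F variant), so IPMI serial-over-LAN is one option; netconsole, which streams kernel printk output over UDP, is another. A sketch with placeholder addresses and credentials:

```shell
# Option 1: serial-over-LAN via the onboard BMC (run from another machine;
# the BMC IP and ADMIN/PASSWORD credentials below are placeholders):
#   ipmitool -I lanplus -H 192.168.1.50 -U ADMIN -P PASSWORD sol activate

# Option 2: netconsole sends kernel messages over UDP to a log host.
# Placeholder source IP/interface and target; the receiver runs: nc -u -l 6666
NETCONSOLE="netconsole=@192.168.1.5/eno1,6666@192.168.1.10/"
echo "modprobe $NETCONSOLE"    # run the printed command on the Proxmox host
```

Both operate below SSH and the web UI, so they can keep emitting output (including a panic trace) after everything else has died.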

I can also reduce the ARC RAM if you think it's too much, but since the server is not doing much, I can't think of any reason why 64GB of total RAM wouldn't be enough. Is there a way to detect out-of-memory issues? They should be in syslog, no?
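On the OOM question: yes, the kernel's OOM killer logs to the kernel ring buffer and the journal/syslog, so after a reboot it can be checked like this:

```shell
# Look for OOM-killer activity in the current boot's kernel log; the
# "|| echo" fallback makes an empty result explicit instead of silent:
dmesg 2>/dev/null | grep -iE "out of memory|oom-kill" \
    || echo "no OOM events in current boot"

# Same check against the previous boot (needs a persistent journal):
journalctl -k -b -1 2>/dev/null | grep -i "oom" \
    || echo "no OOM events in previous boot"
```

One caveat: a hard freeze can strike before anything is written out, in which case the logs end abruptly with no OOM trace at all, exactly like the syslog excerpt earlier in the thread.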
 

It depends on the crash. If the kernel is panicking because it can't evict the ARC quickly enough for the memory demand, a screen really is the best way to see this.

You could try reducing the ARC to, say, 25% of the available RAM and monitor the machine, but having the screen output during the crash should help you see better what it may be.
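For reference, the ARC ceiling is the `zfs_arc_max` module parameter (in bytes). A sketch of capping it at 25% of 64 GiB; the paths follow the standard OpenZFS-on-Debian layout:

```shell
# 25% of 64 GiB, in bytes:
ARC_MAX=$((64 * 1024 * 1024 * 1024 / 4))
echo "$ARC_MAX"    # prints 17179869184

# Persist it as a module option, then run `update-initramfs -u` and reboot:
#   echo "options zfs zfs_arc_max=$ARC_MAX" > /etc/modprobe.d/zfs.conf
# Or apply it at runtime without a reboot:
#   echo "$ARC_MAX" > /sys/module/zfs/parameters/zfs_arc_max
```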
 