Troubleshooting a crashing PVE server?

ZPrime · Jul 13, 2022

Hello all... I have an Intel 8th-gen i5 system (Protectli FW6D) running the current stable PVE.
Linux 5.15.39-1-pve #1 SMP PVE 5.15.39-1 (Wed, 22 Jun 2022 17:22:00 +0200)
It has two disks - M.2 SATA-based SSD (Kingston), and a separate 2.5" SATA SSD (Samsung 840 Pro).
I have the two disks in a ZFS mirror that is the root for the OS as well as VM storage. (I don't need a lot of space on this system!)

I am using PCI-passthrough to push 3 of the system's 6 Intel I210 NICs down to a q35-based guest, running OPNsense (firewall distribution). This VM is my main internet router. Another NIC is setup with Linux bridging to be able to service other VMs (not yet active), and I have a NIC dedicated entirely to the PVE host itself. I think ip link output is helpful here:

Code:

root@pve:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
5: enp4s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:e0:67:25:16:c3 brd ff:ff:ff:ff:ff:ff
6: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP mode DEFAULT group default qlen 1000
    link/ether 00:e0:67:25:16:c4 brd ff:ff:ff:ff:ff:ff
7: enp6s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:e0:67:25:16:c5 brd ff:ff:ff:ff:ff:ff
8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 32:e8:b1:59:dc:21 brd ff:ff:ff:ff:ff:ff
9: tap100i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 42:8b:b2:b4:f0:04 brd ff:ff:ff:ff:ff:ff

The only "non-standard" thing going on is that I have Nut installed on the PVE host (using the official debian package), an APC UPS is attached via USB, and I have the guest VM talking to Nut on the host, as well as a Synology NAS (both in client mode).

I've now had this whole setup live for about a month, and I've had two separate hard-crashes. I notice it because the internet goes down.

The PVE host is totally unresponsive to keyboard input, won't answer ping, ctrl-alt-del doesn't reboot it, all I can do is hold the power button until it shuts off.

What steps should I take next to try to figure out what is causing this crash?
/var/log/syslog doesn't show anything particularly obvious leading up to the freeze/lockup.

Code:

Jul 12 22:17:12 pve smartd[2051]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 51 to 52
Jul 12 22:17:12 pve smartd[2051]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 63
Jul 12 22:43:00 pve postfix/qmgr[2585]: 64C8612196: from=<root@pve.redacted>, size=1827, nrcpt=1 (queue active)
Jul 12 22:43:01 pve postfix/smtp[854328]: connect to aspmx.l.google.com[2607:f8b0:4001:c11::1b]:25: Network is unreachable
Jul 12 22:43:31 pve postfix/smtp[854328]: connect to aspmx.l.google.com[108.177.121.26]:25: Connection timed out
Jul 12 22:43:31 pve postfix/smtp[854328]: connect to alt1.aspmx.l.google.com[2607:f8b0:4023:401::1b]:25: Network is unreachable
Jul 12 22:44:01 pve postfix/smtp[854328]: connect to alt1.aspmx.l.google.com[173.194.77.27]:25: Connection timed out
Jul 12 22:44:01 pve postfix/smtp[854328]: connect to alt2.aspmx.l.google.com[2607:f8b0:4002:c03::1b]:25: Network is unreachable
Jul 12 22:44:01 pve postfix/smtp[854328]: 64C8612196: to=<me@mydomain.redacted>, relay=none, delay=244083, delays=244023/0.01/61/0, dsn=4.4.1, status=deferred (connect to alt2.aspmx.l.google.com[2607:f8b0:4002:c03::1b]:25: Network is unreachable)
Jul 12 22:47:12 pve smartd[2051]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 52 to 53
Jul 12 22:47:12 pve smartd[2051]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 62

Literally the last thing in the log before I cycled power is this smartd alert about a small temperature change. I'm not sure if the postfix errors are related, either...

dcsapak · Jul 13, 2022

a complete hard crash without any kernel/console logging is most often some kind of hardware failure (cpu/memory/psu/etc.)

first thing i'd try is to update all relevant bios/firmware versions

then, you could try to setup a remote syslog and see if the host will log something over the network (e.g. if the disks are the problem)
you could also try boot a different kernel from the time being (to rule that out), e.g. a 5.13 kernel

another option to debug would be to remove as much of hardware dependency of the vms as possible (e.g. removing pci passthrough) since those
things can also lead to problems

ZPrime · Jul 14, 2022

dcsapak said:
a complete hard crash without any kernel/console logging is most often some kind of hardware failure (cpu/memory/psu/etc.)

This is my worry as well... the system isn't that old but I don't think it has any warranty left. Protectli is just buying white-label stuff from China and having their own name slapped on the side, for the most part.

dcsapak said:
first thing i'd try is to update all relevant bios/firmware versions

Unfortunately it's all current, I just did that before putting this into "production." (Making it my main firewall at home, where I have the most demanding user of all, the spouse!)

dcsapak said:
then, you could try to setup a remote syslog and see if the host will log something over the network (e.g. if the disks are the problem)
you could also try boot a different kernel from the time being (to rule that out), e.g. a 5.13 kernel

The console didn't show anything of value either - it was displaying info still when I turned on the monitor, but no kernel messages or anything sitting there for me to stare at.

I would expect if remote syslog was going to show anything, the console display would too?

dcsapak said:
another option to debug would be to remove as much of hardware dependency of the vms as possible (e.g. removing pci passthrough) since those
things can also lead to problems

See, I was doing PCI-passthrough because I figured it was better for network performance (running a virtualized router). I've heard complaints about performance with virtualized NICs. Another thought I had - is it possible for there to be weird interactions if the VM's support of the Q35 "hardware" isn't great? OPNsense just recently moved to FreeBSD 13 as the core platform, and 13 only recently added full support for Q35, I believe. Something with the qemu drivers not supporting the Q35 interfaces until FBSD13 came out... I know they said to use the i440 "hardware" in the past, but I never noticed that and went straight for Q35 when I setup this VM (and I guess it only worked because I was running the current OPNsense with FBSD13 below it).

For now, I pulled the system back out of "production" and moved back to my trusty old Supermicro Atom C2558 box, running OPNsense on bare metal. Need to try to come up with a way to test the PVE installation further without disrupting my WAN access at home...

Dunuin · Jul 14, 2022

To check that the RAM isn't the problem you could boot into memtest86+ and let it run for several hours.

Troubleshooting a crashing PVE server?

ZPrime

New Member

dcsapak

Proxmox Staff Member

ZPrime

New Member

Dunuin

Distinguished Member

We value your privacy