Hello all... I have an Intel 8th-gen i5 system (Protectli FW6D) running the current stable PVE.
Linux 5.15.39-1-pve #1 SMP PVE 5.15.39-1 (Wed, 22 Jun 2022 17:22:00 +0200)
It has two disks: an M.2 SATA-based SSD (Kingston) and a separate 2.5" SATA SSD (Samsung 840 Pro).
I have the two disks in a ZFS mirror that is the root for the OS as well as VM storage. (I don't need a lot of space on this system!)
I am using PCI passthrough to push three of the system's six Intel I210 NICs down to a q35-based guest running OPNsense (a firewall distribution); this VM is my main internet router. Another NIC is set up with Linux bridging to serve other VMs (not yet active), and one NIC is dedicated entirely to the PVE host itself. I think the ip link output is helpful here:
Code:
root@pve:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
5: enp4s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 00:e0:67:25:16:c3 brd ff:ff:ff:ff:ff:ff
6: enp5s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UP mode DEFAULT group default qlen 1000
link/ether 00:e0:67:25:16:c4 brd ff:ff:ff:ff:ff:ff
7: enp6s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 00:e0:67:25:16:c5 brd ff:ff:ff:ff:ff:ff
8: vmbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
link/ether 32:e8:b1:59:dc:21 brd ff:ff:ff:ff:ff:ff
9: tap100i0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc mq master vmbr0 state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 42:8b:b2:b4:f0:04 brd ff:ff:ff:ff:ff:ff
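For completeness, the passthrough-related part of the OPNsense VM config looks roughly like this (the PCI addresses and MAC are from memory, so treat them as approximate rather than exact):
Code:
# /etc/pve/qemu-server/100.conf (excerpt, addresses approximate)
machine: q35
hostpci0: 0000:01:00.0,pcie=1
hostpci1: 0000:02:00.0,pcie=1
hostpci2: 0000:03:00.0,pcie=1
net0: virtio=AA:BB:CC:DD:EE:FF,bridge=vmbr0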
The only "non-standard" thing going on is that I have Nut installed on the PVE host (using the official debian package), an APC UPS is attached via USB, and I have the guest VM talking to Nut on the host, as well as a Synology NAS (both in client mode).
I've now had this whole setup live for about a month, and in that time I've had two separate hard crashes. I notice because the internet goes down. The PVE host is totally unresponsive: it ignores keyboard input, won't answer ping, Ctrl-Alt-Del doesn't reboot it, and all I can do is hold the power button until it shuts off.
What steps should I take next to try to figure out what is causing this crash?
/var/log/syslog doesn't show anything particularly obvious leading up to the freeze/lockup.
Code:
Jul 12 22:17:12 pve smartd[2051]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 51 to 52
Jul 12 22:17:12 pve smartd[2051]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 64 to 63
Jul 12 22:43:00 pve postfix/qmgr[2585]: 64C8612196: from=<root@pve.redacted>, size=1827, nrcpt=1 (queue active)
Jul 12 22:43:01 pve postfix/smtp[854328]: connect to aspmx.l.google.com[2607:f8b0:4001:c11::1b]:25: Network is unreachable
Jul 12 22:43:31 pve postfix/smtp[854328]: connect to aspmx.l.google.com[108.177.121.26]:25: Connection timed out
Jul 12 22:43:31 pve postfix/smtp[854328]: connect to alt1.aspmx.l.google.com[2607:f8b0:4023:401::1b]:25: Network is unreachable
Jul 12 22:44:01 pve postfix/smtp[854328]: connect to alt1.aspmx.l.google.com[173.194.77.27]:25: Connection timed out
Jul 12 22:44:01 pve postfix/smtp[854328]: connect to alt2.aspmx.l.google.com[2607:f8b0:4002:c03::1b]:25: Network is unreachable
Jul 12 22:44:01 pve postfix/smtp[854328]: 64C8612196: to=<me@mydomain.redacted>, relay=none, delay=244083, delays=244023/0.01/61/0, dsn=4.4.1, status=deferred (connect to alt2.aspmx.l.google.com[2607:f8b0:4002:c03::1b]:25: Network is unreachable)
Jul 12 22:47:12 pve smartd[2051]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 52 to 53
Jul 12 22:47:12 pve smartd[2051]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 63 to 62
Literally the last thing in the log before I cycled power is a smartd alert about a small temperature change. I'm also not sure whether the postfix errors are related...
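If it helps, this is roughly how I plan to capture logs around the next crash. My understanding is that journalctl -b -1 only shows the previous boot once persistent journaling is enabled, so something like:
Code:
# enable persistent journal storage so logs survive the next hard reset
mkdir -p /var/log/journal
systemctl restart systemd-journald

# after the next crash, look at the tail end of the previous boot
journalctl -b -1 -e
journalctl -b -1 -p warning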