Folks, my PVE server keeps crashing at random, with no pattern I can identify. I initially suspected my UPS/PSU, but I have replaced both to no avail.
Below is what happened in the last 24 hours. This is mission-critical for me, as I self-host multiple services and depend heavily on this server while I am away.
I did some basic troubleshooting before writing this post; I will share my thoughts and some of the logs below:
Logs
$ less /var/log/syslog
Code:
Jun 1 00:14:01 matrix systemd[1]: Started Proxmox VE replication runner.
Jun 1 00:15:00 matrix systemd[1]: Starting Proxmox VE replication runner...
Jun 1 00:15:01 matrix systemd[1]: pvesr.service: Succeeded.
Jun 1 00:15:01 matrix systemd[1]: Started Proxmox VE replication runner.
Jun 1 00:15:01 matrix CRON[13306]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jun 1 00:16:00 matrix systemd[1]: Starting Proxmox VE replication runner...
Jun 1 00:16:01 matrix systemd[1]: pvesr.service: Succeeded.
Jun 1 00:16:01 matrix systemd[1]: Started Proxmox VE replication runner.
Jun 1 17:52:22 matrix systemd-modules-load[553]: Inserted module 'vfio'
Jun 1 17:52:22 matrix systemd-modules-load[553]: Inserted module 'vfio_pci'
Jun 1 17:52:22 matrix systemd-modules-load[553]: Inserted module 'wireguard'
Jun 1 17:52:22 matrix systemd-modules-load[553]: Inserted module 'coretemp'
Jun 1 17:52:22 matrix systemd-modules-load[553]: Inserted module 'iscsi_tcp'
Jun 1 17:52:22 matrix kernel: [ 0.000000] Linux version 5.4.114-1-pve (build@proxmox) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.114-1 (Sun, 09 May 2021 17:13:05 +0200) ()
Jun 1 17:52:22 matrix systemd-modules-load[553]: Inserted module 'ib_iser'
Jun 1 17:52:22 matrix kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.114-1-pve root=/dev/mapper/pve-root ro quiet intremap=off intel_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1 pcie_acs_override=downstream intel_idle.max_cstate=1
Jun 1 17:52:22 matrix systemd[1]: Starting Flush Journal to Persistent Storage...
Jun 1 17:52:22 matrix kernel: [ 0.000000] KERNEL supported cpus:
Jun 1 17:52:22 matrix kernel: [ 0.000000] Intel GenuineIntel
$ journalctl
Code:
Jun 01 00:15:01 matrix systemd[1]: Started Proxmox VE replication runner.
Jun 01 00:15:01 matrix CRON[13305]: pam_unix(cron:session): session opened for user root by (uid=0)
Jun 01 00:15:01 matrix CRON[13306]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jun 01 00:15:01 matrix CRON[13305]: pam_unix(cron:session): session closed for user root
Jun 01 00:16:00 matrix systemd[1]: Starting Proxmox VE replication runner...
Jun 01 00:16:01 matrix systemd[1]: pvesr.service: Succeeded.
Jun 01 00:16:01 matrix systemd[1]: Started Proxmox VE replication runner.
-- Reboot --
Jun 01 17:52:19 matrix kernel: Linux version 5.4.114-1-pve (build@proxmox) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.114-1 (Sun, 09 May 2021 17:13:05 +0200) ()
Jun 01 17:52:19 matrix kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.114-1-pve root=/dev/mapper/pve-root ro quiet intremap=off intel_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1 pcie_acs_
Jun 01 17:52:19 matrix kernel: KERNEL supported cpus:
Jun 01 17:52:19 matrix kernel: Intel GenuineIntel
Jun 01 17:52:19 matrix kernel: AMD AuthenticAMD
Jun 01 17:52:19 matrix kernel: Hygon HygonGenuine
Jun 01 17:52:19 matrix kernel: Centaur CentaurHauls
Jun 01 17:52:19 matrix kernel: zhaoxin Shanghai
$ less /var/log/messages
Code:
May 31 23:33:55 matrix kernel: [43870.045187] , receive & transmit flow control ON
May 31 23:33:55 matrix kernel: [43870.045298] vmbr0: port 1(eno1) entered blocking state
May 31 23:33:55 matrix kernel: [43870.045308] vmbr0: port 1(eno1) entered forwarding state
Jun 1 00:00:03 matrix rsyslogd: [origin software="rsyslogd" swVersion="8.1901.0" x-pid="997" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Jun 1 17:52:22 matrix kernel: [ 0.000000] Linux version 5.4.114-1-pve (build@proxmox) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.114-1 (Sun, 09 May 2021 17:13:05 +0200) ()
Jun 1 17:52:22 matrix kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.114-1-pve root=/dev/mapper/pve-root ro quiet intremap=off intel_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1 pcie_acs_override=downstream intel_idle.max_cstate=1
Jun 1 17:52:22 matrix kernel: [ 0.000000] KERNEL supported cpus:
Jun 1 17:52:22 matrix kernel: [ 0.000000] Intel GenuineIntel
$ less /var/log/debug
Code:
May 31 11:22:57 matrix kernel: [ 2.461263] sd 0:0:2:0: [sdc] Mode Sense: 7f 00 10 08
May 31 11:22:57 matrix kernel: [ 2.465010] sd 0:0:4:0: [sde] Mode Sense: 7f 00 10 08
May 31 11:22:57 matrix kernel: [ 2.465232] sd 0:0:5:0: [sdf] Mode Sense: 7f 00 10 08
May 31 11:22:57 matrix kernel: [ 2.513432] sd 0:0:3:0: [sdd] Mode Sense: 7f 00 10 08
May 31 11:22:57 matrix kernel: [ 2.874144] sd 0:0:1:0: [sdb] Mode Sense: 7f 00 10 08
May 31 11:22:57 matrix kernel: [ 4.326440] sd 1:0:0:0: [sdg] Mode Sense: 03 00 00 00
Jun 1 17:52:22 matrix kernel: [ 0.005306] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
Jun 1 17:52:22 matrix kernel: [ 0.005309] e820: remove [mem 0x000a0000-0x000fffff] usable
Jun 1 17:52:22 matrix kernel: [ 0.005321] MTRR default type: uncachable
Jun 1 17:52:22 matrix kernel: [ 0.005323] MTRR fixed ranges enabled:
Jun 1 17:52:22 matrix kernel: [ 0.005324] 00000-9FFFF write-back
Jun 1 17:52:22 matrix kernel: [ 0.005326] A0000-BFFFF uncachable
Jun 1 17:52:22 matrix kernel: [ 0.005327] C0000-CBFFF write-protect
Jun 1 17:52:22 matrix kernel: [ 0.005328] CC000-D3FFF write-back
$ less /var/log/auth.log
Code:
May 31 23:59:01 matrix CRON[9516]: pam_unix(cron:session): session opened for user root by (uid=0)
May 31 23:59:01 matrix CRON[9516]: pam_unix(cron:session): session closed for user root
Jun 1 00:05:01 matrix CRON[10940]: pam_unix(cron:session): session opened for user root by (uid=0)
Jun 1 00:05:01 matrix CRON[10940]: pam_unix(cron:session): session closed for user root
Jun 1 00:15:01 matrix CRON[13305]: pam_unix(cron:session): session opened for user root by (uid=0)
Jun 1 00:15:01 matrix CRON[13305]: pam_unix(cron:session): session closed for user root
Jun 1 17:52:22 matrix systemd-logind[995]: New seat seat0.
Jun 1 17:52:22 matrix systemd-logind[995]: Watching system buttons on /dev/input/event0 (Power Button)
Jun 1 17:52:22 matrix systemd-logind[995]: Watching system buttons on /dev/input/event1 (Avocent USB Composite Device-0)
Jun 1 17:52:23 matrix sshd[1164]: Server listening on 0.0.0.0 port 22.
Jun 1 17:52:23 matrix sshd[1164]: Server listening on :: port 22.
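One way to pin down the outage window in a log like the syslog excerpt above is to look for the longest silence between consecutive timestamps. The following is only a minimal Python sketch and was not part of the troubleshooting described in this post. The function name `largest_gap` and the `year` parameter are made up for illustration. Syslog timestamps carry no year, so the default of 2021 is an assumption based on the kernel build date shown in the logs.

```python
import re
from datetime import datetime

# Traditional syslog timestamp at the start of a line, e.g. "Jun 1 00:16:01".
STAMP = re.compile(r"^([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s")


def largest_gap(lines, year=2021):
    """Return (gap, line_before, line_after) for the longest silence, or None.

    `year` is an assumption: classic syslog lines do not record it.
    """
    best = None
    prev_time = prev_line = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # skip wrapped continuation lines, "-- Reboot --", etc.
        month, day, clock = match.groups()
        when = datetime.strptime(f"{year} {month} {day} {clock}",
                                 "%Y %b %d %H:%M:%S")
        if prev_time is not None:
            gap = when - prev_time
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_time, prev_line = when, line
    return best


# Three lines copied from the syslog excerpt in this post.
sample = [
    "Jun 1 00:16:00 matrix systemd[1]: Starting Proxmox VE replication runner...",
    "Jun 1 00:16:01 matrix systemd[1]: Started Proxmox VE replication runner.",
    "Jun 1 17:52:22 matrix systemd-modules-load[553]: Inserted module 'vfio'",
]
gap, before, after = largest_gap(sample)
print(gap)     # 17:36:21
print(before)  # last line written before the silence
print(after)   # first line written after it
```

On the three sample lines this reports a gap of 17:36:21 between the last replication-runner entry and the first boot message. To scan a whole file, the same function could be fed an open file handle, for example `largest_gap(open("/var/log/syslog"))`. Note that it only finds the silent window; it says nothing about what caused it.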