PVE server keeps crashing randomly, need help debugging.

v01d4lph4

New Member
Jun 1, 2021
Folks, my PVE server keeps crashing randomly without any pattern. Initially, I had doubts about my UPS/PSU, but I have replaced both of them to no avail.

Here's what happened in the last 24 hours. This is mission-critical for me, as I self-host multiple services and depend heavily on this server while I am not around.
[Attached screenshot: 1622556866097.png]

I did some basic troubleshooting before writing this post; I will share my thoughts and some of the logs below.

Logs


$ less /var/log/syslog
Code:
Jun  1 00:14:01 matrix systemd[1]: Started Proxmox VE replication runner.
Jun  1 00:15:00 matrix systemd[1]: Starting Proxmox VE replication runner...
Jun  1 00:15:01 matrix systemd[1]: pvesr.service: Succeeded.
Jun  1 00:15:01 matrix systemd[1]: Started Proxmox VE replication runner.
Jun  1 00:15:01 matrix CRON[13306]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jun  1 00:16:00 matrix systemd[1]: Starting Proxmox VE replication runner...
Jun  1 00:16:01 matrix systemd[1]: pvesr.service: Succeeded.
Jun  1 00:16:01 matrix systemd[1]: Started Proxmox VE replication runner.
Jun  1 17:52:22 matrix systemd-modules-load[553]: Inserted module 'vfio'
Jun  1 17:52:22 matrix systemd-modules-load[553]: Inserted module 'vfio_pci'
Jun  1 17:52:22 matrix systemd-modules-load[553]: Inserted module 'wireguard'
Jun  1 17:52:22 matrix systemd-modules-load[553]: Inserted module 'coretemp'
Jun  1 17:52:22 matrix systemd-modules-load[553]: Inserted module 'iscsi_tcp'
Jun  1 17:52:22 matrix kernel: [    0.000000] Linux version 5.4.114-1-pve (build@proxmox) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.114-1 (Sun, 09 May 2021 17:13:05 +0200) ()
Jun  1 17:52:22 matrix systemd-modules-load[553]: Inserted module 'ib_iser'
Jun  1 17:52:22 matrix kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.114-1-pve root=/dev/mapper/pve-root ro quiet intremap=off intel_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1 pcie_acs_override=downstream intel_idle.max_cstate=1
Jun  1 17:52:22 matrix systemd[1]: Starting Flush Journal to Persistent Storage...
Jun  1 17:52:22 matrix kernel: [    0.000000] KERNEL supported cpus:
Jun  1 17:52:22 matrix kernel: [    0.000000]   Intel GenuineIntel


$ journalctl
Code:
Jun 01 00:15:01 matrix systemd[1]: Started Proxmox VE replication runner.
Jun 01 00:15:01 matrix CRON[13305]: pam_unix(cron:session): session opened for user root by (uid=0)                                                                                                             
Jun 01 00:15:01 matrix CRON[13306]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jun 01 00:15:01 matrix CRON[13305]: pam_unix(cron:session): session closed for user root                                                                                                                        
Jun 01 00:16:00 matrix systemd[1]: Starting Proxmox VE replication runner...
Jun 01 00:16:01 matrix systemd[1]: pvesr.service: Succeeded.                                                                                                                                                    
Jun 01 00:16:01 matrix systemd[1]: Started Proxmox VE replication runner.
-- Reboot --                                                                                                                                                                                                    
Jun 01 17:52:19 matrix kernel: Linux version 5.4.114-1-pve (build@proxmox) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.114-1 (Sun, 09 May 2021 17:13:05 +0200) ()
Jun 01 17:52:19 matrix kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.114-1-pve root=/dev/mapper/pve-root ro quiet intremap=off intel_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1 pcie_acs_override=downstream intel_idle.max_cstate=1
Jun 01 17:52:19 matrix kernel: KERNEL supported cpus:
Jun 01 17:52:19 matrix kernel:   Intel GenuineIntel                                                                                                                                                             
Jun 01 17:52:19 matrix kernel:   AMD AuthenticAMD
Jun 01 17:52:19 matrix kernel:   Hygon HygonGenuine                                                                                                                                                             
Jun 01 17:52:19 matrix kernel:   Centaur CentaurHauls
Jun 01 17:52:19 matrix kernel:   zhaoxin   Shanghai

$ less /var/log/messages
Code:
May 31 23:33:55 matrix kernel: [43870.045187] , receive & transmit flow control ON
May 31 23:33:55 matrix kernel: [43870.045298] vmbr0: port 1(eno1) entered blocking state
May 31 23:33:55 matrix kernel: [43870.045308] vmbr0: port 1(eno1) entered forwarding state
Jun  1 00:00:03 matrix rsyslogd:  [origin software="rsyslogd" swVersion="8.1901.0" x-pid="997" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Jun  1 17:52:22 matrix kernel: [    0.000000] Linux version 5.4.114-1-pve (build@proxmox) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.114-1 (Sun, 09 May 2021 17:13:05 +0200) ()
Jun  1 17:52:22 matrix kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.114-1-pve root=/dev/mapper/pve-root ro quiet intremap=off intel_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1 pcie_acs_override=downstream intel_idle.max_cstate=1
Jun  1 17:52:22 matrix kernel: [    0.000000] KERNEL supported cpus:
Jun  1 17:52:22 matrix kernel: [    0.000000]   Intel GenuineIntel

$ less /var/log/debug
Code:
May 31 11:22:57 matrix kernel: [    2.461263] sd 0:0:2:0: [sdc] Mode Sense: 7f 00 10 08
May 31 11:22:57 matrix kernel: [    2.465010] sd 0:0:4:0: [sde] Mode Sense: 7f 00 10 08
May 31 11:22:57 matrix kernel: [    2.465232] sd 0:0:5:0: [sdf] Mode Sense: 7f 00 10 08
May 31 11:22:57 matrix kernel: [    2.513432] sd 0:0:3:0: [sdd] Mode Sense: 7f 00 10 08
May 31 11:22:57 matrix kernel: [    2.874144] sd 0:0:1:0: [sdb] Mode Sense: 7f 00 10 08
May 31 11:22:57 matrix kernel: [    4.326440] sd 1:0:0:0: [sdg] Mode Sense: 03 00 00 00
Jun  1 17:52:22 matrix kernel: [    0.005306] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
Jun  1 17:52:22 matrix kernel: [    0.005309] e820: remove [mem 0x000a0000-0x000fffff] usable
Jun  1 17:52:22 matrix kernel: [    0.005321] MTRR default type: uncachable
Jun  1 17:52:22 matrix kernel: [    0.005323] MTRR fixed ranges enabled:
Jun  1 17:52:22 matrix kernel: [    0.005324]   00000-9FFFF write-back
Jun  1 17:52:22 matrix kernel: [    0.005326]   A0000-BFFFF uncachable
Jun  1 17:52:22 matrix kernel: [    0.005327]   C0000-CBFFF write-protect
Jun  1 17:52:22 matrix kernel: [    0.005328]   CC000-D3FFF write-back

$ less /var/log/auth.log
Code:
May 31 23:59:01 matrix CRON[9516]: pam_unix(cron:session): session opened for user root by (uid=0)
May 31 23:59:01 matrix CRON[9516]: pam_unix(cron:session): session closed for user root
Jun  1 00:05:01 matrix CRON[10940]: pam_unix(cron:session): session opened for user root by (uid=0)
Jun  1 00:05:01 matrix CRON[10940]: pam_unix(cron:session): session closed for user root
Jun  1 00:15:01 matrix CRON[13305]: pam_unix(cron:session): session opened for user root by (uid=0)
Jun  1 00:15:01 matrix CRON[13305]: pam_unix(cron:session): session closed for user root
Jun  1 17:52:22 matrix systemd-logind[995]: New seat seat0.
Jun  1 17:52:22 matrix systemd-logind[995]: Watching system buttons on /dev/input/event0 (Power Button)
Jun  1 17:52:22 matrix systemd-logind[995]: Watching system buttons on /dev/input/event1 (Avocent USB Composite Device-0)
Jun  1 17:52:23 matrix sshd[1164]: Server listening on 0.0.0.0 port 22.
Jun  1 17:52:23 matrix sshd[1164]: Server listening on :: port 22.
 

Have already tried

* Reseated the RAM and CPU; also tried booting with each stick of RAM individually and checked POST for errors.
* Reduced the load to only a couple of VMs.
* Tried simulating stress with:
Code:
  # 20 CPU workers plus 10 memory workers of 4 GiB each, stop after 10 minutes
  $ stress --cpu 20 --vm 10 --vm-bytes 4096M --timeout 600s
  # variations with fewer, larger memory workers, left running open-ended
  $ stress --cpu 18 --vm 4 --vm-bytes 8192M
  $ stress --cpu 20 --vm 6 --vm-bytes 8192M
* Read multiple related threads on Stack Overflow and the Proxmox forums.
* Enabled rasdaemon for extensive logging (install/enable sketch after the output below):
Code:
$ ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.                                                                                                                                                                                 

$ ras-mc-ctl --layout
       +-----------------------------------------------------------------------+
       |                mc0                |                mc1                |
       | channel0  | channel1  | channel2  | channel0  | channel1  | channel2  |
-------+-----------------------------------------------------------------------+
slot2: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
slot1: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
slot0: | 16384 MB  | 16384 MB  |     0 MB  | 16384 MB  | 16384 MB  |     0 MB  |
-------+-----------------------------------------------------------------------+

$ ras-mc-ctl --summary
No Memory errors.
                                                                                                                                                                                                                
No PCIe AER errors.
                                                                                                                                                                                                                
No Extlog errors.
No MCE errors.
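
For reference, enabling rasdaemon on a Debian-based PVE host is roughly the following (standard Debian package assumed):
Code:
# install and start the RAS event daemon
$ apt install rasdaemon
$ systemctl enable --now rasdaemon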


-----

Is there any other information I need to check? Would appreciate any pointers.
 
Was the server really offline for ~18 hours, or did it actually shut down much later? That would point to a faulty/full disk: if the logs could not be written to disk, the real error would be masked. The quick checks sketched below could help rule that out.
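A possible sketch of those checks (smartmontools assumed installed; /dev/sda is a placeholder for the actual boot disk):
Code:
# free space on the filesystem holding the logs
$ df -h /var/log
# overall SMART health verdict for the disk
$ smartctl -H /dev/sda
# full SMART attributes; watch Reallocated_Sector_Ct and Current_Pending_Sector
$ smartctl -a /dev/sda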
Any events in the IPMI/iDRAC/etc.? (if the machine has one)
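If the board does have a BMC, its event log can also be read from the host, assuming ipmitool is installed and the ipmi_si/ipmi_devintf kernel modules are loaded:
Code:
# dump the BMC system event log (power loss, thermal trips, ECC events, ...)
$ ipmitool sel list
# current sensor readings: temperatures, voltages, fan speeds
$ ipmitool sdr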

I'd probably set up some remote syslog service so that the logs still get sent somewhere even when the disks are not working anymore; a minimal example follows.
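
For example, with rsyslog (already running on this host, per the /var/log/messages snippet above), a single drop-in file is enough; 192.168.1.50 is a placeholder for the collector's address:
Code:
# /etc/rsyslog.d/90-remote.conf
# single @ = UDP, double @@ = TCP
*.* @192.168.1.50:514

# apply the change
$ systemctl restart rsyslog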