PVE server keeps crashing randomly, need help debugging.

v01d4lph4

New Member
Jun 1, 2021
Folks, my PVE server keeps crashing at random, with no discernible pattern. Initially I suspected my UPS/PSU, but I have replaced both to no avail.

Here's what happened in the last 24 hours. This is mission-critical for me: I self-host multiple services and depend heavily on this server while I'm not around.
[Screenshot: uptime/crash timeline from the last 24 hours]

I have done some basic troubleshooting before writing this post; my thoughts and some of the logs are below. Note the long gap in each log between the last entry and the next boot at 17:52:

Logs


$ less /var/log/syslog
Code:
Jun  1 00:14:01 matrix systemd[1]: Started Proxmox VE replication runner.
Jun  1 00:15:00 matrix systemd[1]: Starting Proxmox VE replication runner...
Jun  1 00:15:01 matrix systemd[1]: pvesr.service: Succeeded.
Jun  1 00:15:01 matrix systemd[1]: Started Proxmox VE replication runner.
Jun  1 00:15:01 matrix CRON[13306]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jun  1 00:16:00 matrix systemd[1]: Starting Proxmox VE replication runner...
Jun  1 00:16:01 matrix systemd[1]: pvesr.service: Succeeded.
Jun  1 00:16:01 matrix systemd[1]: Started Proxmox VE replication runner.
Jun  1 17:52:22 matrix systemd-modules-load[553]: Inserted module 'vfio'
Jun  1 17:52:22 matrix systemd-modules-load[553]: Inserted module 'vfio_pci'
Jun  1 17:52:22 matrix systemd-modules-load[553]: Inserted module 'wireguard'
Jun  1 17:52:22 matrix systemd-modules-load[553]: Inserted module 'coretemp'
Jun  1 17:52:22 matrix systemd-modules-load[553]: Inserted module 'iscsi_tcp'
Jun  1 17:52:22 matrix kernel: [    0.000000] Linux version 5.4.114-1-pve (build@proxmox) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.114-1 (Sun, 09 May 2021 17:13:05 +0200) ()
Jun  1 17:52:22 matrix systemd-modules-load[553]: Inserted module 'ib_iser'
Jun  1 17:52:22 matrix kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.114-1-pve root=/dev/mapper/pve-root ro quiet intremap=off intel_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1 pcie_acs_override=downstream intel_idle.max_cstate=1
Jun  1 17:52:22 matrix systemd[1]: Starting Flush Journal to Persistent Storage...
Jun  1 17:52:22 matrix kernel: [    0.000000] KERNEL supported cpus:
Jun  1 17:52:22 matrix kernel: [    0.000000]   Intel GenuineIntel


$ journalctl
Code:
Jun 01 00:15:01 matrix systemd[1]: Started Proxmox VE replication runner.
Jun 01 00:15:01 matrix CRON[13305]: pam_unix(cron:session): session opened for user root by (uid=0)                                                                                                             
Jun 01 00:15:01 matrix CRON[13306]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jun 01 00:15:01 matrix CRON[13305]: pam_unix(cron:session): session closed for user root                                                                                                                        
Jun 01 00:16:00 matrix systemd[1]: Starting Proxmox VE replication runner...
Jun 01 00:16:01 matrix systemd[1]: pvesr.service: Succeeded.                                                                                                                                                    
Jun 01 00:16:01 matrix systemd[1]: Started Proxmox VE replication runner.
-- Reboot --                                                                                                                                                                                                    
Jun 01 17:52:19 matrix kernel: Linux version 5.4.114-1-pve (build@proxmox) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.114-1 (Sun, 09 May 2021 17:13:05 +0200) ()
Jun 01 17:52:19 matrix kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.114-1-pve root=/dev/mapper/pve-root ro quiet intremap=off intel_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1 pcie_acs_override=downstream intel_idle.max_cstate=1
Jun 01 17:52:19 matrix kernel: KERNEL supported cpus:
Jun 01 17:52:19 matrix kernel:   Intel GenuineIntel                                                                                                                                                             
Jun 01 17:52:19 matrix kernel:   AMD AuthenticAMD
Jun 01 17:52:19 matrix kernel:   Hygon HygonGenuine                                                                                                                                                             
Jun 01 17:52:19 matrix kernel:   Centaur CentaurHauls
Jun 01 17:52:19 matrix kernel:   zhaoxin   Shanghai

$ less /var/log/messages
Code:
May 31 23:33:55 matrix kernel: [43870.045187] , receive & transmit flow control ON
May 31 23:33:55 matrix kernel: [43870.045298] vmbr0: port 1(eno1) entered blocking state
May 31 23:33:55 matrix kernel: [43870.045308] vmbr0: port 1(eno1) entered forwarding state
Jun  1 00:00:03 matrix rsyslogd:  [origin software="rsyslogd" swVersion="8.1901.0" x-pid="997" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Jun  1 17:52:22 matrix kernel: [    0.000000] Linux version 5.4.114-1-pve (build@proxmox) (gcc version 8.3.0 (Debian 8.3.0-6)) #1 SMP PVE 5.4.114-1 (Sun, 09 May 2021 17:13:05 +0200) ()
Jun  1 17:52:22 matrix kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.114-1-pve root=/dev/mapper/pve-root ro quiet intremap=off intel_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1 pcie_acs_override=downstream intel_idle.max_cstate=1
Jun  1 17:52:22 matrix kernel: [    0.000000] KERNEL supported cpus:
Jun  1 17:52:22 matrix kernel: [    0.000000]   Intel GenuineIntel

$ less /var/log/debug
Code:
May 31 11:22:57 matrix kernel: [    2.461263] sd 0:0:2:0: [sdc] Mode Sense: 7f 00 10 08
May 31 11:22:57 matrix kernel: [    2.465010] sd 0:0:4:0: [sde] Mode Sense: 7f 00 10 08
May 31 11:22:57 matrix kernel: [    2.465232] sd 0:0:5:0: [sdf] Mode Sense: 7f 00 10 08
May 31 11:22:57 matrix kernel: [    2.513432] sd 0:0:3:0: [sdd] Mode Sense: 7f 00 10 08
May 31 11:22:57 matrix kernel: [    2.874144] sd 0:0:1:0: [sdb] Mode Sense: 7f 00 10 08
May 31 11:22:57 matrix kernel: [    4.326440] sd 1:0:0:0: [sdg] Mode Sense: 03 00 00 00
Jun  1 17:52:22 matrix kernel: [    0.005306] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
Jun  1 17:52:22 matrix kernel: [    0.005309] e820: remove [mem 0x000a0000-0x000fffff] usable
Jun  1 17:52:22 matrix kernel: [    0.005321] MTRR default type: uncachable
Jun  1 17:52:22 matrix kernel: [    0.005323] MTRR fixed ranges enabled:
Jun  1 17:52:22 matrix kernel: [    0.005324]   00000-9FFFF write-back
Jun  1 17:52:22 matrix kernel: [    0.005326]   A0000-BFFFF uncachable
Jun  1 17:52:22 matrix kernel: [    0.005327]   C0000-CBFFF write-protect
Jun  1 17:52:22 matrix kernel: [    0.005328]   CC000-D3FFF write-back

$ less /var/log/auth.log
Code:
May 31 23:59:01 matrix CRON[9516]: pam_unix(cron:session): session opened for user root by (uid=0)
May 31 23:59:01 matrix CRON[9516]: pam_unix(cron:session): session closed for user root
Jun  1 00:05:01 matrix CRON[10940]: pam_unix(cron:session): session opened for user root by (uid=0)
Jun  1 00:05:01 matrix CRON[10940]: pam_unix(cron:session): session closed for user root
Jun  1 00:15:01 matrix CRON[13305]: pam_unix(cron:session): session opened for user root by (uid=0)
Jun  1 00:15:01 matrix CRON[13305]: pam_unix(cron:session): session closed for user root
Jun  1 17:52:22 matrix systemd-logind[995]: New seat seat0.
Jun  1 17:52:22 matrix systemd-logind[995]: Watching system buttons on /dev/input/event0 (Power Button)
Jun  1 17:52:22 matrix systemd-logind[995]: Watching system buttons on /dev/input/event1 (Avocent USB Composite Device-0)
Jun  1 17:52:23 matrix sshd[1164]: Server listening on 0.0.0.0 port 22.
Jun  1 17:52:23 matrix sshd[1164]: Server listening on :: port 22.
 

What I've already tried

* Reseated the RAM and CPU; booted with each RAM stick individually and checked POST for errors.
* Reduced the load to only a couple of VMs.
* Simulated load with stress:
Code:
  $ stress --cpu 20 --vm 10 --vm-bytes 4096M --timeout 600s
  $ stress --cpu 18 --vm 4 --vm-bytes 8192M
  $ stress --cpu 20 --vm 6 --vm-bytes 8192M
* Read multiple related threads on Stack Overflow and the Proxmox forums.
* Enabled rasdaemon for extended error logging:
Code:
$ ras-mc-ctl --status
ras-mc-ctl: drivers are loaded.                                                                                                                                                                                 

$ ras-mc-ctl --layout
       +-----------------------------------------------------------------------+
       |                mc0                |                mc1                |
       | channel0  | channel1  | channel2  | channel0  | channel1  | channel2  |
-------+-----------------------------------------------------------------------+
slot2: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
slot1: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
slot0: | 16384 MB  | 16384 MB  |     0 MB  | 16384 MB  | 16384 MB  |     0 MB  |
-------+-----------------------------------------------------------------------+

$ ras-mc-ctl --summary
No Memory errors.
No PCIe AER errors.
No Extlog errors.
No MCE errors.
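
The summary counters can also be cross-checked against rasdaemon's raw event database and its service log, in case something was recorded around crash time:
Code:
# dump every event rasdaemon has recorded (MC, PCIe AER, MCE)
$ ras-mc-ctl --errors

# rasdaemon's own service log
$ journalctl -u rasdaemon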


-----

Is there any other information I should check? I'd appreciate any pointers.
 
Was the server really offline for ~18 hours, or did it actually shut down much later? The latter would point to a faulty/full disk: if the logs could no longer be written to disk, the real error would be masked.
Any events in the IPMI/iDRAC/etc., if the machine has one?
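
Both checks are quick from the shell. The Avocent device in auth.log suggests there is a BMC; this assumes ipmitool is installed:
Code:
# is /var/log (i.e. the root LV here) actually full?
$ df -h /var/log

# read the BMC's system event log
$ ipmitool sel elist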

I'd probably set up a remote syslog service so the logs can still be sent somewhere even when the disks are no longer working.
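
A minimal sketch of that, using the rsyslog instance PVE already ships and assuming a collector reachable at 192.168.1.10 (placeholder address):
Code:
# /etc/rsyslog.d/90-remote.conf on the PVE host
# forward everything; one @ = UDP, two @@ = TCP (192.168.1.10:514 is hypothetical)
*.* @@192.168.1.10:514

# then reload rsyslog
$ systemctl restart rsyslog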
 
