Debugging Proxmox VE kernel panic / crash

northernbrewer

New Member
Sep 22, 2023
Hi,

I have a recurring problem: my Proxmox server hits a kernel panic and crashes. I have not been able to reproduce it consistently. Sometimes two weeks pass between crashes, sometimes less than 24 hours.

I have tried swapping hardware components: RAM, CPU, motherboard and so on. I think I have replaced every component except the RAID controller. I also did a fresh install of Proxmox and restored the containers and VMs from backup into the fresh install.

I tried to enable crash dumps, but after a crash nothing shows up in /var/crash. I followed the steps and verification in this guide:
https://www.cyberciti.biz/faq/how-to-on-enable-kernel-crash-dump-on-debian-linux/

I have attached a photo of the kernel panic if anyone can make sense of it.

Any tips on how to debug this? Or how to get the crash dump mechanism to actually write dumps?


To me the journal doesn't show anything useful:

Code:
Sep 21 23:09:07 pve kernel: audit: type=1400 audit(1695330547.115:129): apparmor="DENIED" operation="mount" class="mount" info="failed perms check" error=-13>
Sep 21 23:17:01 pve CRON[810666]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Sep 21 23:17:01 pve CRON[810667]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
Sep 21 23:17:01 pve CRON[810666]: pam_unix(cron:session): session closed for user root
Sep 21 23:38:15 pve kernel: mce: [Hardware Error]: Machine check events logged
Sep 21 23:39:50 pve audit[906578]: AVC apparmor="DENIED" operation="mount" class="mount" info="failed perms check" error=-13 profile="lxc-107_</var/lib/lxc>">
Sep 21 23:39:50 pve kernel: audit: type=1400 audit(1695332390.318:130): apparmor="DENIED" operation="mount" class="mount" info="failed perms check" error=-13>
Sep 22 00:00:07 pve systemd[1]: Starting dpkg-db-backup.service - Daily dpkg database backup service...
Sep 22 00:00:07 pve systemd[1]: Starting logrotate.service - Rotate log files...
Sep 22 00:00:07 pve systemd[1]: dpkg-db-backup.service: Deactivated successfully.
Sep 22 00:00:07 pve systemd[1]: Finished dpkg-db-backup.service - Daily dpkg database backup service.
Sep 22 00:00:07 pve systemd[1]: Reloading pveproxy.service - PVE API Proxy Server...
Sep 22 00:00:07 pve pveproxy[986078]: send HUP to 3421
Sep 22 00:00:07 pve pveproxy[3421]: received signal HUP
Sep 22 00:00:07 pve pveproxy[3421]: server closing
Sep 22 00:00:07 pve pveproxy[3421]: server shutdown (restart)
Sep 22 00:00:07 pve systemd[1]: Reloaded pveproxy.service - PVE API Proxy Server.
Sep 22 00:00:07 pve systemd[1]: Reloading spiceproxy.service - PVE SPICE Proxy Server...
Sep 22 00:00:07 pve spiceproxy[986085]: send HUP to 3435
Sep 22 00:00:07 pve spiceproxy[3435]: received signal HUP
Sep 22 00:00:07 pve spiceproxy[3435]: server closing
Sep 22 00:00:07 pve spiceproxy[3435]: server shutdown (restart)
Sep 22 00:00:07 pve systemd[1]: Reloaded spiceproxy.service - PVE SPICE Proxy Server.
Sep 22 00:00:07 pve pvefw-logger[2283]: received terminate request (signal)
Sep 22 00:00:07 pve systemd[1]: Stopping pvefw-logger.service - Proxmox VE firewall logger...
Sep 22 00:00:07 pve pvefw-logger[2283]: stopping pvefw logger
Sep 22 00:00:07 pve spiceproxy[3435]: restarting server
Sep 22 00:00:07 pve spiceproxy[3435]: starting 1 worker(s)
Sep 22 00:00:07 pve spiceproxy[3435]: worker 986094 started
Sep 22 00:00:07 pve pveproxy[3421]: restarting server
Sep 22 00:00:07 pve pveproxy[3421]: starting 3 worker(s)
Sep 22 00:00:07 pve pveproxy[3421]: worker 986095 started
Sep 22 00:00:07 pve pveproxy[3421]: worker 986096 started
Sep 22 00:00:07 pve pveproxy[3421]: worker 986097 started
Sep 22 00:00:08 pve systemd[1]: pvefw-logger.service: Deactivated successfully.
Sep 22 00:00:08 pve systemd[1]: Stopped pvefw-logger.service - Proxmox VE firewall logger.
Sep 22 00:00:08 pve systemd[1]: Starting pvefw-logger.service - Proxmox VE firewall logger...
Sep 22 00:00:08 pve pvefw-logger[986104]: starting pvefw logger
Sep 22 00:00:08 pve systemd[1]: Started pvefw-logger.service - Proxmox VE firewall logger.
Sep 22 00:00:08 pve systemd[1]: logrotate.service: Deactivated successfully.
Sep 22 00:00:08 pve systemd[1]: Finished logrotate.service - Rotate log files.
Sep 22 00:00:12 pve spiceproxy[3436]: worker exit
Sep 22 00:00:12 pve spiceproxy[3435]: worker 3436 finished
Sep 22 00:00:12 pve pveproxy[241796]: worker exit
Sep 22 00:00:12 pve pveproxy[231159]: worker exit
Sep 22 00:00:12 pve pveproxy[257281]: worker exit
Sep 22 00:00:12 pve pveproxy[3421]: worker 231159 finished
Sep 22 00:00:12 pve pveproxy[3421]: worker 257281 finished
Sep 22 00:00:12 pve pveproxy[3421]: worker 241796 finished
-- Boot d3ba03c45a0d4a08926f1f6a037c0185 --
Sep 22 05:59:29 pve kernel: Linux version 6.2.16-12-pve (build@proxmox) (gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT>
Sep 22 05:59:29 pve kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-6.2.16-12-pve root=/dev/mapper/pve-root ro nomodeset crashkernel=384M-:128M
Sep 22 05:59:29 pve kernel: KERNEL supported cpus:
Sep 22 05:59:29 pve kernel:   Intel GenuineIntel
Sep 22 05:59:29 pve kernel:   AMD AuthenticAMD
Sep 22 05:59:29 pve kernel:   Hygon HygonGenuine
Sep 22 05:59:29 pve kernel:   Centaur CentaurHauls
Sep 22 05:59:29 pve kernel:   zhaoxin   Shanghai
Sep 22 05:59:29 pve kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
Sep 22 05:59:29 pve kernel: BIOS-provided physical RAM map:
Sep 22 05:59:29 pve kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000005dfff] usable

Thank you
 

Attachments

  • IMG_1076_2.jpg (476.6 KB)
Any tips on how to debug this?
Serial console logging and/or setting up netconsole to get the complete output of your crash if crashdump is not working.
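
As a rough sketch of the netconsole route: the kernel's netconsole module takes one parameter of the form [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]. The snippet below only assembles and prints the modprobe line so it can be checked before running it. The IP addresses, the interface name eno1, and the MAC address are placeholders I made up, not values from this thread, and netconsole_param is just a hypothetical helper name.

```shell
#!/bin/sh
# Build the netconsole= parameter in the format the kernel module expects:
#   [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
# 6665 and 6666 are the module's default source and target UDP ports.
netconsole_param() {
    # $1 = IP of the crashing host, $2 = its network interface,
    # $3 = IP of the receiving machine, $4 = MAC of the receiver (or gateway)
    printf '6665@%s/%s,6666@%s/%s\n' "$1" "$2" "$3" "$4"
}

# All four values are placeholders (assumptions); replace them with your own.
PARAM=$(netconsole_param 192.168.1.10 eno1 192.168.1.20 aa:bb:cc:dd:ee:ff)

# Printed rather than executed, so the line can be reviewed first.
echo "modprobe netconsole netconsole=$PARAM"
```

On the receiving machine something like "nc -u -l 6666" (or "nc -u -l -p 6666", depending on the netcat variant) should show the messages. Raising the console log level on the Proxmox host with "dmesg -n 8" makes sure everything is forwarded.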

I followed the steps and verification in this guide:
So, via sysrq, you get a crash and a crashdump, but your current problem does not create a crashdump? Then it's bad ...
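
To check that part, a test crash could look roughly like the sketch below. It first reads /sys/kernel/kexec_crash_loaded (1 means a crash kernel is loaded) and only triggers the panic when asked explicitly. The function names, the KEXEC_FLAG override, and the --yes-really-crash switch are hypothetical helpers for this example, not part of any tool.

```shell
#!/bin/sh
# Path is overridable only so the check can be tried without a real crash kernel.
KEXEC_FLAG="${KEXEC_FLAG:-/sys/kernel/kexec_crash_loaded}"

# Succeeds only if the kernel reports that a crash (kdump) kernel is loaded.
crash_kernel_loaded() {
    [ -r "$KEXEC_FLAG" ] && [ "$(cat "$KEXEC_FLAG")" = "1" ]
}

# Panics the machine on purpose; without the explicit switch it is a dry run.
force_test_crash() {
    if ! crash_kernel_loaded; then
        echo "no crash kernel loaded; a panic would not produce a dump" >&2
        return 1
    fi
    if [ "${1:-}" != "--yes-really-crash" ]; then
        echo "dry run: would run 'echo c > /proc/sysrq-trigger'"
        return 0
    fi
    sync                          # flush disk buffers before going down
    echo c > /proc/sysrq-trigger  # sysrq "c": crash the kernel immediately
}
```

Run as root, "force_test_crash --yes-really-crash" takes the host down hard, so shut down the VMs and containers first. After the reboot there should be a new dump under /var/crash if dumping works.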
 
Serial console logging and/or setting up netconsole to get the complete output of your crash if crashdump is not working.


So, via sysrq, you get a crash and a crashdump, but your current problem does not create a crashdump? Then it's bad ...
Thank you for your pointers.

My motherboard does not have a serial port, but I will try netconsole.

I didn't actually crash the system with sysrq to verify crash logging. All the other validation steps passed, so I assumed crash dumping was working. I will force a crash with sysrq to see whether a dump gets written.
 
