Kernel crash issue

tlex

Member
Mar 9, 2021
I'm having an issue where Proxmox crashes from time to time.
Not being a Linux specialist, I came across this post mentioning how to enable kdump:
https://forum.proxmox.com/threads/random-proxmox-server-hang-no-vms-no-web-gui.58823/

Code:
# apt install kdump-tools

Select no for kexec reboots
Select yes for enabling kdump-tools

# $EDITOR /etc/default/grub

Add 'nmi_watchdog=1' to the end of 'GRUB_CMDLINE_LINUX_DEFAULT'

# $EDITOR /etc/default/grub.d/kdump-tools.cfg

Change 128M to 256M at the end of the line

# update-grub
# reboot
# cat /sys/kernel/kexec_crash_loaded
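Expect 1 once the crash kernel is loaded; 0 means it is not.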

Now the thing is, I did not "Select no for kexec reboots", and I was wondering how I can re-run that configuration GUI.
I tried to uninstall and reinstall kdump-tools, but I don't get the configuration menu anymore...

I know this must be really trivial, but I just can't figure it out... any help appreciated :)
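(For reference: kdump-tools asks these questions via debconf, so re-running the package configuration should bring the menu back. A sketch, assuming the package is still installed:)

Code:
# dpkg-reconfigure kdump-tools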
 
OK, so I was able to reconfigure it, but now it does not seem to be active:
Code:
kdump-config status
no crashkernel= parameter in the kernel cmdline ... failed!
Invalid symlink : /var/lib/kdump/initrd.img ... failed!
Invalid symlink : /var/lib/kdump/vmlinuz ... failed!
current state   : Not ready to kdump

and
cat /sys/kernel/kexec_crash_loaded
returns: 0

Could anyone point me to a working tutorial on how to enable it?
I really need to find out why the system is crashing from time to time :)
 
On a hunch - is the system maybe booted with systemd-boot? (That's the case if you're running ZFS as the filesystem for '/' and booting with UEFI.) See:
https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysboot

In that case, you need to edit /etc/kernel/cmdline instead of /etc/default/grub.
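For example, a minimal sequence for the systemd-boot case (a sketch; it uses the crashkernel value and the refresh tool that appear elsewhere in this thread):

Code:
# $EDITOR /etc/kernel/cmdline

Append 'crashkernel=384M-:256M' to the single line of options

# pve-efiboot-tool refresh
# reboot

(On newer installations the refresh tool is called proxmox-boot-tool.)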

If that's not the case, my next guess would be that you either did not reboot, or did not regenerate the GRUB config (`update-grub`).

I hope this helps!
 
You are right, I do run ZFS with UEFI.
Right now, the files contain:

/etc/kernel/cmdline:
root=ZFS=rpool/ROOT/pve-1 boot=zfs

/etc/default/grub:
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="Proxmox Virtual Environment"
GRUB_CMDLINE_LINUX_DEFAULT="quiet crashkernel=128M"
GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/pve-1 boot=zfs amd_iommu=on iommu=pt nmi_watchdog=1 crashkernel=384M-:256M"

/etc/default/grub.d/kdump-tools.cfg:
GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT crashkernel=384M-:256M"

/etc/default/kdump-tools:
USE_KDUMP=1
KDUMP_KERNEL=/var/lib/kdump/vmlinuz
#KDUMP_INITRD=/var/lib/kdump/initrd.img
KDUMP_COREDIR="/var/crash"

Any idea how it should be configured?
 
I tried to edit as follows:

/etc/kernel/cmdline:
root=ZFS=rpool/ROOT/pve-1 boot=zfs nmi_watchdog=1 crashkernel=384M-:256M

Then I ran pve-efiboot-tool refresh and rebooted.

Now kdump seems to run:
Code:
root@pve:~# kdump-config show
DUMP_MODE:        kdump
USE_KDUMP:        1
KDUMP_COREDIR:    /var/crash
crashkernel addr: 0xa4000000
   /var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinuz-5.13.19-1-pve
kdump initrd:
current state:    ready to kdump
kexec command:
  /sbin/kexec -p --command-line="initrd=\EFI\proxmox\5.13.19-1-pve\initrd.img-5.13.19-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs nmi_watchdog=1 reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll nousb ata_piix.prefer_ms_hyperv=0" /var/lib/kdump/vmlinuz

Does that mean that the next time the system crashes I will find a log file under /var/crash?
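(One way to test this without waiting for a real crash - the same trigger is used later in this thread - is to force a panic via sysrq. Warning: the host goes down immediately:)

Code:
# sync; echo c > /proc/sysrq-trigger

If kdump is working, the dump ends up in a timestamped directory under /var/crash after the automatic reboot.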
 
If I try to enable KDUMP_INITRD=/var/lib/kdump/initrd.img in the file /etc/default/kdump-tools, I get this error after rebooting when running kdump-config status:

Invalid symlink : /var/lib/kdump/initrd.img ... failed!
current state   : Not ready to kdump

The file /var/lib/kdump/initrd.img does exist.
Any idea?
 
Hi,
what is the status after running kdump-config load?
 
Code:
kdump-config status
Invalid symlink : /var/lib/kdump/initrd.img ... failed!
current state   : ready to kdump

kdump-config load
Cannot change symbolic links when kdump is loaded ... failed!

kdump-config show
DUMP_MODE:        kdump
USE_KDUMP:        1
KDUMP_COREDIR:    /var/crash
crashkernel addr: 0xa4000000
   /var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinuz-5.13.19-1-pve
kdump initrd:
current state:    ready to kdump
kexec command:
  /sbin/kexec -p --command-line="initrd=\EFI\proxmox\5.13.19-1-pve\initrd.img-5.13.19-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs nmi_watchdog=1 reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll nousb ata_piix.prefer_ms_hyperv=0" /var/lib/kdump/vmlinuz

I'm definitely missing something :(
 
kdump-config load
Cannot change symbolic links when kdump is loaded ... failed!
Please try to:
1. kdump-config unload
2. add the setting KDUMP_INITRD=/var/lib/kdump/initrd.img (if it's not there already).
3. kdump-config load.
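As a shell sequence, roughly (assuming the commented-out #KDUMP_INITRD line from the /etc/default/kdump-tools shown earlier just needs uncommenting):

Code:
# kdump-config unload
# $EDITOR /etc/default/kdump-tools

Uncomment 'KDUMP_INITRD=/var/lib/kdump/initrd.img'

# kdump-config load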
 
After adding KDUMP_INITRD=/var/lib/kdump/initrd.img and running kdump-config load, I got:
Code:
kdump-config load
Creating symlink /var/lib/kdump/vmlinuz.
kdump-tools: Generating /var/lib/kdump/initrd.img-5.13.19-1-pve
mkinitramfs: failed to determine device for /
mkinitramfs: workaround is MODULES=most, check: grep -r MODULES /var/lib/kdump/initramfs-tools
Error please report bug on initramfs-tools
Include the output of 'mount' and 'cat /proc/mounts'
update-initramfs: failed for /var/lib/kdump/initrd.img-5.13.19-1-pve with 1.
Creating symlink /var/lib/kdump/initrd.img.
Invalid symlink : /var/lib/kdump/initrd.img ... failed!
Creating symlink /var/lib/kdump/initrd.img.
/etc/default/kdump-tools: KDUMP_INITRD does not exist: /var/lib/kdump/initrd.img ... failed!

Code:
kdump-config status
Invalid symlink : /var/lib/kdump/initrd.img ... failed!
current state   : Not ready to kdump
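(Side note: the failing output above itself suggests a diagnostic worth running here:)

Code:
# grep -r MODULES /var/lib/kdump/initramfs-tools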
 
EDIT3: Better, less invasive workaround.

This indeed seems to be a bug with mkinitramfs not finding the correct modules it needs to load for ZFS. Adding MODULES=most to /usr/share/initramfs-tools/conf-hooks.d/zfs seems to be a workaround.
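A sketch of applying that (assuming kdump-config load regenerates the kdump initrd, as the load output earlier in this thread shows):

Code:
# echo 'MODULES=most' >> /usr/share/initramfs-tools/conf-hooks.d/zfs
# kdump-config unload
# kdump-config load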

EDIT2: I had to increase the memory for the crashkernel to 512M (and issue proxmox-boot-tool refresh and reboot), or it wouldn't succeed after a panic.

EDIT: Sorry, I need to re-check this workaround. It might break booting! EDIT2: After more testing, I think it was not the above workaround that made my test machine unbootable, but it happened after building the usual initrd (i.e. not the kdump one) with MODULES=dep (don't do that unless you know what you're doing ;)) Still, it's always good to have a backup plan!

It seems like Ubuntu made a proper fix in their initramfs-tools package, but it apparently didn't end up upstream (or at least not in Debian yet).
 
I'm glad that you caught it before I tested :)
 
Sorry, didn't see your response yet (didn't reload the page). See the EDIT2 in my previous post ;)
If I may ask, where did you change the crashkernel to 512M?
Is it like this:
/etc/kernel/cmdline:
root=ZFS=rpool/ROOT/pve-1 boot=zfs nmi_watchdog=1 crashkernel=384M-:512M
EDIT1:
Or should it be like this to fit all RAM possibilities?
root=ZFS=rpool/ROOT/pve-1 boot=zfs nmi_watchdog=1 crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M
 
Yes, you can also use a more flexible option. I simply used crashkernel=384M-:512M for my test VM with 6GB RAM, but I can't predict what value will work for you.
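For reference, the range syntax reads as "if total RAM falls within this range, reserve that much"; the range containing the installed RAM applies. So on a 64G machine (like the one appearing later in this thread) the long example line reserves 768M:

Code:
crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M

(64G falls in '8G-:', i.e. 8G and up, so 768M is reserved)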
 
So I got my first reboot/crash since I installed kdump.
I can see the dmesg log and the dump file, but I don't know how to read the dump itself... I was wondering if someone here could help troubleshoot based on these files' content?


/var/crash/202112250917/dmesg.202112250917 is in the attachment.
 

Attachments

  • dmesg.txt (115.2 KB)
I'm still trying to troubleshoot why my small Proxmox server keeps crashing randomly.
I have kdump installed and running, and now I'm trying to follow this guide in order to (maybe) understand the last crash that I had yesterday:
https://www.linuxjournal.com/content/oops-debugging-kernel-panics-0

I'm having issues with the instructions where the command says to install the kernel source:
apt source linux-image-`uname -r`

That command was not working, so I ran:
apt-get install pve-headers-`uname -r`

Now I'm at the step where they say to run the command "crash dump.201902261006 /usr/lib/debug/vmlinux-4.9.0-8-amd64"
I don't know what to replace "/usr/lib/debug/vmlinux-4.9.0-8-amd64" with. Any idea?

Running 5.15.12-1-pve here.

Other than that, the machine was pretty stable with Windows. It has ECC RAM that I also checked with memtest.

I've been searching for months now, trying to replace some hardware to see if I could isolate the issue, without success.
Any help would really be appreciated.


Code:
dmesg --level=err
[ 1.275707] ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.GP17.VGA.LCD._BCM.AFN7], AE_NOT_FOUND (20210730/psargs-330)
[ 1.282792] ACPI Error: Aborting method \_SB.PCI0.GP17.VGA.LCD._BCM due to previous error (AE_NOT_FOUND) (20210730/psparse-529)
[ 5.228096] ================================================================================
[ 5.228098] UBSAN: invalid-load in drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:6044:84
[ 5.228101] load of value 56 is not a valid value for type '_Bool'
[ 5.229342] ================================================================================

Code:
dmesg --level=warn
[ 0.507384] pci 0000:00:00.2: can't derive routing for PCI INT A
[ 0.507386] pci 0000:00:00.2: PCI INT A: not connected
[ 0.604194] device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
[ 0.604255] platform eisa.0: EISA: Cannot allocate resource for mainboard
[ 0.604257] platform eisa.0: Cannot allocate resource for EISA slot 1
[ 0.604259] platform eisa.0: Cannot allocate resource for EISA slot 2
[ 0.604260] platform eisa.0: Cannot allocate resource for EISA slot 3
[ 0.604262] platform eisa.0: Cannot allocate resource for EISA slot 4
[ 0.604263] platform eisa.0: Cannot allocate resource for EISA slot 5
[ 0.604265] platform eisa.0: Cannot allocate resource for EISA slot 6
[ 0.604266] platform eisa.0: Cannot allocate resource for EISA slot 7
[ 0.604268] platform eisa.0: Cannot allocate resource for EISA slot 8
[ 1.277132] Initialized Local Variables for Method [_BCM]:
[ 1.278543] Local0: 00000000b3d4e529 <Obj> Integer 00000000000000FF
[ 1.279261] Local1: 00000000bd04d367 <Obj> Integer 0000000000000000
[ 1.280641] Initialized Arguments for Method [_BCM]: (1 arguments defined for method invocation)
[ 1.281360] Arg0: 000000006b1377a8 <Obj> Integer 0000000000000064
[ 1.368792] usb: port power management may be unreliable
[ 1.775913] ata2.00: supports DRM functions and may not be fully accessible
[ 1.778032] ata3.00: supports DRM functions and may not be fully accessible
[ 1.784474] ata2.00: supports DRM functions and may not be fully accessible
[ 1.785187] ata3.00: supports DRM functions and may not be fully accessible
[ 3.367194] spl: loading out-of-tree module taints kernel.
[ 3.369157] znvpair: module license 'CDDL' taints kernel.
[ 3.370272] Disabling lock debugging due to kernel taint
[ 4.086337] systemd-journald[1016]: File /var/log/journal/6646a341960f4315875730a404b9f973/system.journal corrupted or uncleanly shut down, renaming and replacing.
[ 4.357298] amdgpu 0000:0a:00.0: amdgpu: PSP runtime database doesn't exist
[ 5.227268] amdgpu: SRAT table not found
[ 5.228104] CPU: 3 PID: 1158 Comm: systemd-udevd Tainted: P O 5.15.12-1-pve #1
[ 5.228107] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570M Pro4, BIOS P3.60 08/11/2021
[ 5.228109] Call Trace:
[ 5.228112] <TASK>
[ 5.228114] dump_stack_lvl+0x4a/0x5f
[ 5.228121] dump_stack+0x10/0x12
[ 5.228122] ubsan_epilogue+0x9/0x45
[ 5.228124] __ubsan_handle_load_invalid_value.cold+0x44/0x49
[ 5.228127] create_stream_for_sink.cold+0x5d/0xbb [amdgpu]
[ 5.228311] create_validate_stream_for_sink+0x59/0x150 [amdgpu]
[ 5.228467] amdgpu_dm_connector_mode_valid+0x54/0x1a0 [amdgpu]
[ 5.228613] ? drm_connector_list_update+0x186/0x1f0 [drm]
[ 5.228630] drm_connector_mode_valid+0x3b/0x60 [drm_kms_helper]
[ 5.228643] drm_helper_probe_single_connector_modes+0x3b5/0x880 [drm_kms_helper]
[ 5.228652] ? krealloc+0x87/0xd0
[ 5.228656] drm_client_modeset_probe+0x2bf/0x1600 [drm]
[ 5.228671] ? ktime_get_mono_fast_ns+0x52/0xb0
[ 5.228675] __drm_fb_helper_initial_config_and_unlock+0x49/0x500 [drm_kms_helper]
[ 5.228685] ? mutex_lock+0x13/0x40
[ 5.228688] drm_fb_helper_initial_config+0x43/0x50 [drm_kms_helper]
[ 5.228697] amdgpu_fbdev_init+0xd8/0x110 [amdgpu]
[ 5.228804] amdgpu_device_init.cold+0x193a/0x1d51 [amdgpu]
[ 5.228949] ? pci_read+0x53/0x60
[ 5.228953] ? pci_read_config_word+0x27/0x40
[ 5.228956] ? do_pci_enable_device.part.0+0xbc/0xe0
[ 5.228959] amdgpu_driver_load_kms+0x6d/0x330 [amdgpu]
[ 5.229062] amdgpu_pci_probe+0x11e/0x1a0 [amdgpu]
[ 5.229165] local_pci_probe+0x4b/0x90
[ 5.229168] pci_device_probe+0x115/0x1f0
[ 5.229170] really_probe+0x21e/0x420
[ 5.229174] __driver_probe_device+0x115/0x190
[ 5.229176] driver_probe_device+0x23/0xc0
[ 5.229178] __driver_attach+0xbd/0x1d0
[ 5.229180] ? __device_attach_driver+0x110/0x110
[ 5.229182] bus_for_each_dev+0x7e/0xc0
[ 5.229184] driver_attach+0x1e/0x20
[ 5.229186] bus_add_driver+0x135/0x200
[ 5.229188] driver_register+0x91/0xf0
[ 5.229190] __pci_register_driver+0x68/0x70
[ 5.229192] amdgpu_init+0x7c/0x1000 [amdgpu]
[ 5.229284] ? 0xffffffffc30aa000
[ 5.229286] do_one_initcall+0x48/0x1d0
[ 5.229290] ? kmem_cache_alloc_trace+0x19e/0x2e0
[ 5.229293] do_init_module+0x62/0x2a0
[ 5.229296] load_module+0x2711/0x2b10
[ 5.229299] __do_sys_finit_module+0xbf/0x120
[ 5.229301] __x64_sys_finit_module+0x1a/0x20
[ 5.229303] do_syscall_64+0x5c/0xc0
[ 5.229305] ? syscall_exit_to_user_mode+0x27/0x50
[ 5.229307] ? do_syscall_64+0x69/0xc0
[ 5.229309] ? do_syscall_64+0x69/0xc0
[ 5.229310] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 5.229313] RIP: 0033:0x7f299496d9b9
[ 5.229315] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a7 54 0c 00 f7 d8 64 89 01 48
[ 5.229318] RSP: 002b:00007ffda64da4a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 5.229322] RAX: ffffffffffffffda RBX: 000056031321ce00 RCX: 00007f299496d9b9
[ 5.229324] RDX: 0000000000000000 RSI: 00007f2994af8e2d RDI: 0000000000000019
[ 5.229325] RBP: 0000000000020000 R08: 0000000000000000 R09: 00005603131d9b60
[ 5.229327] R10: 0000000000000019 R11: 0000000000000246 R12: 00007f2994af8e2d
[ 5.229329] R13: 0000000000000000 R14: 0000560313217050 R15: 000056031321ce00
[ 5.229330] </TASK>
[ 9306.003685] hrtimer: interrupt took 2360 ns



Code:
cat dmesg.202202032254 | grep -i crash
[    0.000000] Command line: initrd=\EFI\proxmox\5.15.12-1-pve\initrd.img-5.15.12-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs nmi_watchdog=1 crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M
[    0.005387] Reserving 768MB of memory at 1216MB for crashkernel (System RAM: 64893MB)
[    0.123409] Kernel command line: initrd=\EFI\proxmox\5.15.12-1-pve\initrd.img-5.15.12-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs nmi_watchdog=1 crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M
[    4.245382] pstore: Using crash dump compression: deflate

Code:
cat dmesg.202202032254 | grep -i err
[ 0.458039] ACPI: Using IOAPIC for interrupt routing
[ 0.479973] ACPI: PCI: Interrupt link LNKA configured for IRQ 0
[ 0.480017] ACPI: PCI: Interrupt link LNKB configured for IRQ 0
[ 0.480054] ACPI: PCI: Interrupt link LNKC configured for IRQ 0
[ 0.480100] ACPI: PCI: Interrupt link LNKD configured for IRQ 0
[ 0.480142] ACPI: PCI: Interrupt link LNKE configured for IRQ 0
[ 0.480176] ACPI: PCI: Interrupt link LNKF configured for IRQ 0
[ 0.480211] ACPI: PCI: Interrupt link LNKG configured for IRQ 0
[ 0.480245] ACPI: PCI: Interrupt link LNKH configured for IRQ 0
[ 0.511330] AMD-Vi: Interrupt remapping enabled
[ 1.163763] acpi_cpufreq: overriding BIOS provided _PSD data
[ 1.164826] RAS: Correctable Errors collector initialized.
[ 1.275539] ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.GP17.VGA.LCD._BCM.AFN7], AE_NOT_FOUND (20210730/psargs-330)
[ 1.280476] ACPI Error: Aborting method \_SB.PCI0.GP17.VGA.LCD._BCM due to previous error (AE_NOT_FOUND) (20210730/psparse-529)
[ 1.336414] igb 0000:05:00.0: Using MSI-X interrupts. 2 rx queue(s), 2 tx queue(s)
[ 4.452081] EDAC MC0: Giving out device to module amd64_edac controller F17h_M60h: DEV 0000:00:18.3 (INTERRUPT)
[11266.623843] hrtimer: interrupt took 6600 ns
[21304.106420] sysvec_apic_timer_interrupt+0x7c/0x90
[21304.108221] asm_sysvec_apic_timer_interrupt+0x12/0x2

Code:
pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.15.12-1-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-5.15: 7.1-8
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.15.12-1-pve: 5.15.12-3
pve-kernel-5.15.7-1-pve: 5.15.7-1
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
proxmox-backup-client: 2.1.4-1
proxmox-backup-file-restore: 2.1.4-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1
 
I've got random crashes on my PVE host as well, and I have the same question as @tlex. I've installed both kdump and kexec, and I can manually issue a crash using this command:

sync; echo c | tee /proc/sysrq-trigger

Everything works, and the 'dmesg.<something>' and 'dump.<something>' files are stored in '/var/crash'.

I can use this command to quickly scan the file 'dmesg.<something>' for interesting lines:

cd /var/crash/<date_of_crash>
grep -iE 'code1|error|warning|failed|hardware|crash' dmesg.<something>

However, just like @tlex, I can't use the 'crash' command, because I can't seem to find where to get the kernel debug symbols for Proxmox. On Debian (per Google searches), it seems the command would be this:

apt install linux-image-$(uname -r)-dbg

Or maybe this:

apt install $(uname -r)-dbg

Once this is installed, I'm supposed to be able to use the 'crash' utility along with the debug symbols to parse the 'dump.<something>' file. However, I can't seem to find what that maps to in Proxmox packages. I've tried taking a stab at it with this:

apt search pve-image-$(uname -r)

But that (and generic permutations of it) doesn't seem to return anything that fits. Distilling this down to this command:

apt search pve | grep -iB1 debug

This generates some candidates, maybe, but they seem specific to components of Proxmox, and not necessarily kernel debug symbols:

libpve-rs-perl-dbgsym/stable 0.6.1 amd64
  debug symbols for libpve-rs-perl
--
lxc-pve-dbgsym/stable 4.0.12-1 amd64
  debug symbols for lxc-pve
--
pve-cluster-dbgsym/stable 7.2-1 amd64
  debug symbols for pve-cluster
--
pve-firewall-dbgsym/stable 4.2-5 amd64
  debug symbols for pve-firewall
--
pve-ha-manager-dbgsym/stable 3.3-4 amd64
  debug symbols for pve-ha-manager
--
pve-lxc-syscalld-dbgsym/stable 1.1.1-1 amd64
  debug symbols for pve-lxc-syscalld
--
pve-qemu-kvm-dbg/stable 6.2.0-10 amd64
  pve qemu debugging symbols
--
pve-xtermjs-dbgsym/stable 4.16.0-1 amd64
  debug symbols for pve-xtermjs

For reference, I'm running Proxmox from an ISO install and at this version:

root@proxmox:/var/crash/202206261001# pveversion
pve-manager/7.2-5/12f1e639 (running kernel: 5.15.35-3-pve)

Any help on what to install would be appreciated, even if it's a recommendation to add a Debian repo and get the Debian kernel debug symbols. If I do have to go down this route, it seems Debian 11.3 (Bullseye) runs a 5.10.x kernel while Proxmox runs 5.15.x; I don't know if these two versions even track together or if suggesting this route is nonsense. I'm just trying to get to where I can use the 'crash' utility in Proxmox to track down some random, but very frequent, kernel panics.

Oh, and if it matters, I've monitored the heat in my system and also did a complete 4-pass memtest (it took over 12 hours). No issues there.

Thank you in advance for any input.
 
Hi,
Unfortunately, the dbgsym packages for the kernel are very big and we do not build them automatically. You'll have to build the kernel yourself and enable the pkg.pve-kernel.debug build profile. See here.
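A very rough sketch of what that build could look like (the repository URL and the DEB_BUILD_PROFILES mechanism are my assumptions based on the usual Debian packaging conventions; follow the linked instructions for the authoritative steps):

Code:
# git clone git://git.proxmox.com/git/pve-kernel.git
# cd pve-kernel
# DEB_BUILD_PROFILES="pkg.pve-kernel.debug" make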
 
@fiona, in the process of troubleshooting some PVE 8.2 random freezes, I'm currently looking into enabling kernel crash dumps. I followed the steps from this thread, but I do not get a /var/crash/<DATE> directory for the crash successfully triggered with sync; echo c | tee /proc/sysrq-trigger.
Any idea? Do I need the dbgsym packages? And if yes, is there an easy way to get them now?
Code:
root@PMX8:/var/crash# dmesg | grep crash
[    0.000000] Command line: initrd=\EFI\proxmox\6.8.4-2-pve\initrd.img-6.8.4-2-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs amd_iommu=on iommu=pt pcie_aspm.policy=performance crashkernel=384M-:512M
[    0.004996] crashkernel reserved: 0x0000000055000000 - 0x0000000075000000 (512 MB)
[    0.401744] Kernel command line: initrd=\EFI\proxmox\6.8.4-2-pve\initrd.img-6.8.4-2-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs amd_iommu=on iommu=pt pcie_aspm.policy=performance crashkernel=384M-:512M
[    1.608775] pstore: Using crash dump compression: deflate

Code:
root@PMX8:/var/crash# ls
kdump_lock  kexec_cmd
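(For comparison, after a successful capture there should be a timestamped subdirectory holding the dmesg and dump files, as seen earlier in this thread - roughly:)

Code:
root@pve:/var/crash# ls
202112250917  kdump_lock  kexec_cmd
root@pve:/var/crash# ls 202112250917
dmesg.202112250917  dump.202112250917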

Code:
root@PMX8:/var/crash# kdump-config show
DUMP_MODE:              kdump
USE_KDUMP:              1
KDUMP_COREDIR:          /var/crash
crashkernel addr: 0x55000000
   /var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinuz-6.8.4-2-pve
kdump initrd:
   /var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-6.8.4-2-pve
current state:    ready to kdump


kexec command:
  /sbin/kexec -p --command-line="initrd=\EFI\proxmox\6.8.4-2-pve\initrd.img-6.8.4-2-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs amd_iommu=on iommu=pt pcie_aspm.policy=performance reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll usbcore.nousb" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz

EDIT: I looked at the IPMI, but the KVM console shut down completely after triggering via sync; echo c | tee /proc/sysrq-trigger.
As far as I understood, IPMI should still be able to show the kernel dump etc., but it just goes down instantly.
 