Kernel crash issue

tlex

I'm having an issue where Proxmox will crash from time to time.
Not being a Linux specialist, I came across this post mentioning how to enable kdump:
https://forum.proxmox.com/threads/random-proxmox-server-hang-no-vms-no-web-gui.58823/

Code:
# apt install kdump-tools

Select no for kexec reboots
Select yes for enabling kdump-tools

# $EDITOR /etc/default/grub

Add 'nmi_watchdog=1' to the end of 'GRUB_CMDLINE_LINUX_DEFAULT'

# $EDITOR /etc/default/grub.d/kdump-tools.cfg

Change 128M to 256M at the end of the line

# update-grub
# reboot
# cat /sys/kernel/kexec_crash_loaded
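(should print 1 once the crash kernel is loaded; 0 means it is not)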

Now the thing is, I did not "Select no for kexec reboots", and I was wondering how I can restart that configuration GUI?
I tried to uninstall and reinstall kdump-tools, but I don't get the configuration menu anymore...

I know this must be really trivial, but I just can't figure it out.. any help appreciated :)
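EDIT: apparently the prompts come from debconf, so re-running the configuration dialog should be possible with the standard Debian mechanism:

Code:
dpkg-reconfigure kdump-tools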
 
OK, so I was able to reconfigure it, but now it does not seem to be active:
Code:
kdump-config status
no crashkernel= parameter in the kernel cmdline ... failed!
Invalid symlink : /var/lib/kdump/initrd.img ... failed!
Invalid symlink : /var/lib/kdump/vmlinuz ... failed!
current state   : Not ready to kdump

and `cat /sys/kernel/kexec_crash_loaded` returns 0.

Could anyone guide me to a working tutorial on how to enable it?
I really need to find out why the system is crashing from time to time :)
 
On a hunch: is the system maybe booted with systemd-boot? (This is the case if you're running ZFS as the filesystem for '/' and booting with UEFI.)
https://pve.proxmox.com/pve-docs/chapter-sysadmin.html#sysboot

In that case you need to edit /etc/kernel/cmdline instead of /etc/default/grub.

If that's not the case, my next guess would be that you either did not reboot or did not regenerate the GRUB config (`update-grub`).
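
For the systemd-boot case the steps would look roughly like this (a sketch; the crashkernel value is just the 256M one used earlier in this thread, adapt as needed):

Code:
# $EDITOR /etc/kernel/cmdline
(append: nmi_watchdog=1 crashkernel=384M-:256M)
# proxmox-boot-tool refresh
(called pve-efiboot-tool on older installations)
# reboot
# cat /sys/kernel/kexec_crash_loaded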

I hope this helps!
 
You are right, I do run ZFS as the root filesystem with UEFI.
Right now, the files contain:

/etc/kernel/cmdline:
root=ZFS=rpool/ROOT/pve-1 boot=zfs

/etc/default/grub:
GRUB_DEFAULT=0
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="Proxmox Virtual Environment"
GRUB_CMDLINE_LINUX_DEFAULT="quiet crashkernel=128M"
GRUB_CMDLINE_LINUX="root=ZFS=rpool/ROOT/pve-1 boot=zfs amd_iommu=on iommu=pt nmi_watchdog=1 crashkernel=384M-:256M"

/etc/default/grub.d/kdump-tools.cfg:
GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT crashkernel=384M-:256M"

/etc/default/kdump-tools:
USE_KDUMP=1
KDUMP_KERNEL=/var/lib/kdump/vmlinuz
#KDUMP_INITRD=/var/lib/kdump/initrd.img
KDUMP_COREDIR="/var/crash"

Any idea how it should be configured?
 
I tried to edit as follows:

/etc/kernel/cmdline:
root=ZFS=rpool/ROOT/pve-1 boot=zfs nmi_watchdog=1 crashkernel=384M-:256M

Then I ran `pve-efiboot-tool refresh` and rebooted.

Now kdump seems to run:
Code:
root@pve:~# kdump-config show
DUMP_MODE: kdump
USE_KDUMP: 1
KDUMP_COREDIR: /var/crash
crashkernel addr: 0xa4000000
/var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinuz-5.13.19-1-pve
kdump initrd:
current state: ready to kdump
kexec command:
/sbin/kexec -p --command-line="initrd=\EFI\proxmox\5.13.19-1-pve\initrd.img-5.13.19-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs nmi_watchdog=1 reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll nousb ata_piix.prefer_ms_hyperv=0" /var/lib/kdump/vmlinuz

Does that mean that the next time the system crashes, I will find a log file under /var/crash?
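
EDIT: if I understand correctly, you don't have to wait for a real crash to verify this; you can force a test panic via SysRq (this of course reboots the host, so only do it on an idle system):

Code:
# as root: force a kernel panic via the SysRq interface
sync; echo c > /proc/sysrq-trigger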
 
If I try to enable
KDUMP_INITRD=/var/lib/kdump/initrd.img in the file /etc/default/kdump-tools,

I get this error after rebooting when running kdump-config status:
Code:
Invalid symlink : /var/lib/kdump/initrd.img ... failed!
current state   : Not ready to kdump

The file /var/lib/kdump/initrd.img does exist.
Any idea?
 
Hi,
what is the status after running kdump-config load?
 
Code:
kdump-config status
Invalid symlink : /var/lib/kdump/initrd.img ... failed!
current state : ready to kdump

kdump-config load
Cannot change symbolic links when kdump is loaded ... failed!

kdump-config show
DUMP_MODE: kdump
USE_KDUMP: 1
KDUMP_COREDIR: /var/crash
crashkernel addr: 0xa4000000
/var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinuz-5.13.19-1-pve
kdump initrd:
current state: ready to kdump
kexec command:
/sbin/kexec -p --command-line="initrd=\EFI\proxmox\5.13.19-1-pve\initrd.img-5.13.19-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs nmi_watchdog=1 reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll nousb ata_piix.prefer_ms_hyperv=0" /var/lib/kdump/vmlinuz

I'm definitely missing something :(
 
kdump-config load
Cannot change symbolic links when kdump is loaded ... failed!
Please try to:
1. kdump-config unload
2. add the setting KDUMP_INITRD=/var/lib/kdump/initrd.img (if it's not there already).
3. kdump-config load.
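
In other words, something like:

Code:
kdump-config unload
$EDITOR /etc/default/kdump-tools    # ensure: KDUMP_INITRD=/var/lib/kdump/initrd.img
kdump-config load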
 
After adding KDUMP_INITRD=/var/lib/kdump/initrd.img and running kdump-config load, I got:
Code:
kdump-config load
Creating symlink /var/lib/kdump/vmlinuz.
kdump-tools: Generating /var/lib/kdump/initrd.img-5.13.19-1-pve
mkinitramfs: failed to determine device for /
mkinitramfs: workaround is MODULES=most, check:
grep -r MODULES /var/lib/kdump/initramfs-tools
Error please report bug on initramfs-tools
Include the output of 'mount' and 'cat /proc/mounts'
update-initramfs: failed for /var/lib/kdump/initrd.img-5.13.19-1-pve with 1.
Creating symlink /var/lib/kdump/initrd.img.
Invalid symlink : /var/lib/kdump/initrd.img ... failed!
Creating symlink /var/lib/kdump/initrd.img.
/etc/default/kdump-tools: KDUMP_INITRD does not exist: /var/lib/kdump/initrd.img ... failed!

Code:
kdump-config status
Invalid symlink : /var/lib/kdump/initrd.img ... failed!
current state : Not ready to kdump
 
EDIT3: Better, less invasive work-around.

This indeed seems to be a bug with mkinitramfs not finding the correct modules it needs to load for ZFS. A workaround is to add MODULES=most to /usr/share/initramfs-tools/conf-hooks.d/zfs.

EDIT2: I had to increase the memory for the crashkernel to 512M (and issue proxmox-boot-tool refresh and reboot), or it wouldn't succeed after a panic.

EDIT: Sorry, I need to re-check this workaround. It might break booting! EDIT2: After more testing, I think it was not the above workaround that made my test machine unbootable; it happened after building the usual initrd (i.e. not the kdump one) with MODULES=dep (don't do that unless you know what you're doing ;)). Still, it's always good to have a backup plan!

It seems like Ubuntu made a proper fix in their initramfs-tools package, but it apparently didn't end up upstream (or at least not in Debian yet).
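
A sketch of the work-around, assuming the conf-hooks file shipped by zfs-initramfs (and note the warnings in the EDITs above before trying it):

Code:
# let mkinitramfs include 'most' modules instead of trying to resolve
# the device backing '/' (which is what fails on ZFS)
echo "MODULES=most" >> /usr/share/initramfs-tools/conf-hooks.d/zfs
kdump-config unload
kdump-config load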
 
I'm glad you caught it before I tested :)
 
Sorry, didn't see your response yet (didn't reload the page). See the EDIT2 in my previous post ;)
If I may ask, where did you change the crashkernel to 512M?
Is it like this?
/etc/kernel/cmdline:
root=ZFS=rpool/ROOT/pve-1 boot=zfs nmi_watchdog=1 crashkernel=384M-:512M
EDIT1:
Or should it be like this, to cover all RAM possibilities?
root=ZFS=rpool/ROOT/pve-1 boot=zfs nmi_watchdog=1 crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M
 
Yes, you can also use a more flexible option. I simply used crashkernel=384M-:512M for my test VM with 6GB RAM, but I can't predict what value will work for you.
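
For reference, the comma-separated form uses the kernel's <start>-<end>:<size> range syntax: whichever range the machine's total RAM falls into determines how much memory is reserved. Reading the example above:

Code:
crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M
# less than 2G RAM -> reserve 128M
# 2G to 6G RAM     -> reserve 256M
# 6G to 8G RAM     -> reserve 512M
# 8G or more RAM   -> reserve 768M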
 
So I got my first reboot/crash since I installed kdump.
I can see the dmesg log and the dump file, but I don't know how to read the dump itself... I was wondering if someone here could help troubleshoot based on these files' content?


/var/crash/202112250917/dmesg.202112250917 is attached.
 


I'm still trying to troubleshoot why my small Proxmox server keeps crashing randomly.
I have kdump installed and running, and now I'm trying to follow this guide in order to understand (maybe) the last crash, which happened yesterday:
https://www.linuxjournal.com/content/oops-debugging-kernel-panics-0

I'm having issues with the step where the instructions say to install the kernel source:
apt source linux-image-`uname -r`

That command was not working, so I ran:
apt-get install pve-headers-`uname -r`

Now I'm at the step where they say to run the command "crash dump.201902261006 /usr/lib/debug/vmlinux-4.9.0-8-amd64".
I don't know what to replace "/usr/lib/debug/vmlinux-4.9.0-8-amd64" with. Any idea?

Running 5.15.12-1-pve here.

Other than that, the machine was pretty stable with Windows. It has ECC RAM that I also checked with memtest.

For months now I've been replacing some hardware at random, trying to isolate the issue, without success.
Any help would really be appreciated.


Code:
dmesg --level=err
[ 1.275707] ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.GP17.VGA.LCD._BCM.AFN7], AE_NOT_FOUND (20210730/psargs-330)
[ 1.282792] ACPI Error: Aborting method \_SB.PCI0.GP17.VGA.LCD._BCM due to previous error (AE_NOT_FOUND) (20210730/psparse-529)
[ 5.228096] ================================================================================
[ 5.228098] UBSAN: invalid-load in drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:6044:84
[ 5.228101] load of value 56 is not a valid value for type '_Bool'
[ 5.229342] ================================================================================

Code:
dmesg --level=warn
[ 0.507384] pci 0000:00:00.2: can't derive routing for PCI INT A
[ 0.507386] pci 0000:00:00.2: PCI INT A: not connected
[ 0.604194] device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
[ 0.604255] platform eisa.0: EISA: Cannot allocate resource for mainboard
[ 0.604257] platform eisa.0: Cannot allocate resource for EISA slot 1
[ 0.604259] platform eisa.0: Cannot allocate resource for EISA slot 2
[ 0.604260] platform eisa.0: Cannot allocate resource for EISA slot 3
[ 0.604262] platform eisa.0: Cannot allocate resource for EISA slot 4
[ 0.604263] platform eisa.0: Cannot allocate resource for EISA slot 5
[ 0.604265] platform eisa.0: Cannot allocate resource for EISA slot 6
[ 0.604266] platform eisa.0: Cannot allocate resource for EISA slot 7
[ 0.604268] platform eisa.0: Cannot allocate resource for EISA slot 8
[ 1.277132] Initialized Local Variables for Method [_BCM]:
[ 1.278543] Local0: 00000000b3d4e529 <Obj> Integer 00000000000000FF
[ 1.279261] Local1: 00000000bd04d367 <Obj> Integer 0000000000000000
[ 1.280641] Initialized Arguments for Method [_BCM]: (1 arguments defined for method invocation)
[ 1.281360] Arg0: 000000006b1377a8 <Obj> Integer 0000000000000064
[ 1.368792] usb: port power management may be unreliable
[ 1.775913] ata2.00: supports DRM functions and may not be fully accessible
[ 1.778032] ata3.00: supports DRM functions and may not be fully accessible
[ 1.784474] ata2.00: supports DRM functions and may not be fully accessible
[ 1.785187] ata3.00: supports DRM functions and may not be fully accessible
[ 3.367194] spl: loading out-of-tree module taints kernel.
[ 3.369157] znvpair: module license 'CDDL' taints kernel.
[ 3.370272] Disabling lock debugging due to kernel taint
[ 4.086337] systemd-journald[1016]: File /var/log/journal/6646a341960f4315875730a404b9f973/system.journal corrupted or uncleanly shut down, renaming and replacing.
[ 4.357298] amdgpu 0000:0a:00.0: amdgpu: PSP runtime database doesn't exist
[ 5.227268] amdgpu: SRAT table not found
[ 5.228104] CPU: 3 PID: 1158 Comm: systemd-udevd Tainted: P O 5.15.12-1-pve #1
[ 5.228107] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570M Pro4, BIOS P3.60 08/11/2021
[ 5.228109] Call Trace:
[ 5.228112] <TASK>
[ 5.228114] dump_stack_lvl+0x4a/0x5f
[ 5.228121] dump_stack+0x10/0x12
[ 5.228122] ubsan_epilogue+0x9/0x45
[ 5.228124] __ubsan_handle_load_invalid_value.cold+0x44/0x49
[ 5.228127] create_stream_for_sink.cold+0x5d/0xbb [amdgpu]
[ 5.228311] create_validate_stream_for_sink+0x59/0x150 [amdgpu]
[ 5.228467] amdgpu_dm_connector_mode_valid+0x54/0x1a0 [amdgpu]
[ 5.228613] ? drm_connector_list_update+0x186/0x1f0 [drm]
[ 5.228630] drm_connector_mode_valid+0x3b/0x60 [drm_kms_helper]
[ 5.228643] drm_helper_probe_single_connector_modes+0x3b5/0x880 [drm_kms_helper]
[ 5.228652] ? krealloc+0x87/0xd0
[ 5.228656] drm_client_modeset_probe+0x2bf/0x1600 [drm]
[ 5.228671] ? ktime_get_mono_fast_ns+0x52/0xb0
[ 5.228675] __drm_fb_helper_initial_config_and_unlock+0x49/0x500 [drm_kms_helper]
[ 5.228685] ? mutex_lock+0x13/0x40
[ 5.228688] drm_fb_helper_initial_config+0x43/0x50 [drm_kms_helper]
[ 5.228697] amdgpu_fbdev_init+0xd8/0x110 [amdgpu]
[ 5.228804] amdgpu_device_init.cold+0x193a/0x1d51 [amdgpu]
[ 5.228949] ? pci_read+0x53/0x60
[ 5.228953] ? pci_read_config_word+0x27/0x40
[ 5.228956] ? do_pci_enable_device.part.0+0xbc/0xe0
[ 5.228959] amdgpu_driver_load_kms+0x6d/0x330 [amdgpu]
[ 5.229062] amdgpu_pci_probe+0x11e/0x1a0 [amdgpu]
[ 5.229165] local_pci_probe+0x4b/0x90
[ 5.229168] pci_device_probe+0x115/0x1f0
[ 5.229170] really_probe+0x21e/0x420
[ 5.229174] __driver_probe_device+0x115/0x190
[ 5.229176] driver_probe_device+0x23/0xc0
[ 5.229178] __driver_attach+0xbd/0x1d0
[ 5.229180] ? __device_attach_driver+0x110/0x110
[ 5.229182] bus_for_each_dev+0x7e/0xc0
[ 5.229184] driver_attach+0x1e/0x20
[ 5.229186] bus_add_driver+0x135/0x200
[ 5.229188] driver_register+0x91/0xf0
[ 5.229190] __pci_register_driver+0x68/0x70
[ 5.229192] amdgpu_init+0x7c/0x1000 [amdgpu]
[ 5.229284] ? 0xffffffffc30aa000
[ 5.229286] do_one_initcall+0x48/0x1d0
[ 5.229290] ? kmem_cache_alloc_trace+0x19e/0x2e0
[ 5.229293] do_init_module+0x62/0x2a0
[ 5.229296] load_module+0x2711/0x2b10
[ 5.229299] __do_sys_finit_module+0xbf/0x120
[ 5.229301] __x64_sys_finit_module+0x1a/0x20
[ 5.229303] do_syscall_64+0x5c/0xc0
[ 5.229305] ? syscall_exit_to_user_mode+0x27/0x50
[ 5.229307] ? do_syscall_64+0x69/0xc0
[ 5.229309] ? do_syscall_64+0x69/0xc0
[ 5.229310] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 5.229313] RIP: 0033:0x7f299496d9b9
[ 5.229315] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a7 54 0c 00 f7 d8 64 89 01 48
[ 5.229318] RSP: 002b:00007ffda64da4a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 5.229322] RAX: ffffffffffffffda RBX: 000056031321ce00 RCX: 00007f299496d9b9
[ 5.229324] RDX: 0000000000000000 RSI: 00007f2994af8e2d RDI: 0000000000000019
[ 5.229325] RBP: 0000000000020000 R08: 0000000000000000 R09: 00005603131d9b60
[ 5.229327] R10: 0000000000000019 R11: 0000000000000246 R12: 00007f2994af8e2d
[ 5.229329] R13: 0000000000000000 R14: 0000560313217050 R15: 000056031321ce00
[ 5.229330] </TASK>
[ 9306.003685] hrtimer: interrupt took 2360 ns



Code:
cat dmesg.202202032254 | grep -i crash
[ 0.000000] Command line: initrd=\EFI\proxmox\5.15.12-1-pve\initrd.img-5.15.12-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs nmi_watchdog=1 crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M
[ 0.005387] Reserving 768MB of memory at 1216MB for crashkernel (System RAM: 64893MB)
[ 0.123409] Kernel command line: initrd=\EFI\proxmox\5.15.12-1-pve\initrd.img-5.15.12-1-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs nmi_watchdog=1 crashkernel=0M-2G:128M,2G-6G:256M,6G-8G:512M,8G-:768M
[ 4.245382] pstore: Using crash dump compression: deflate

Code:
cat dmesg.202202032254 | grep -i err
[ 0.458039] ACPI: Using IOAPIC for interrupt routing
[ 0.479973] ACPI: PCI: Interrupt link LNKA configured for IRQ 0
[ 0.480017] ACPI: PCI: Interrupt link LNKB configured for IRQ 0
[ 0.480054] ACPI: PCI: Interrupt link LNKC configured for IRQ 0
[ 0.480100] ACPI: PCI: Interrupt link LNKD configured for IRQ 0
[ 0.480142] ACPI: PCI: Interrupt link LNKE configured for IRQ 0
[ 0.480176] ACPI: PCI: Interrupt link LNKF configured for IRQ 0
[ 0.480211] ACPI: PCI: Interrupt link LNKG configured for IRQ 0
[ 0.480245] ACPI: PCI: Interrupt link LNKH configured for IRQ 0
[ 0.511330] AMD-Vi: Interrupt remapping enabled
[ 1.163763] acpi_cpufreq: overriding BIOS provided _PSD data
[ 1.164826] RAS: Correctable Errors collector initialized.
[ 1.275539] ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.GP17.VGA.LCD._BCM.AFN7], AE_NOT_FOUND (20210730/psargs-330)
[ 1.280476] ACPI Error: Aborting method \_SB.PCI0.GP17.VGA.LCD._BCM due to previous error (AE_NOT_FOUND) (20210730/psparse-529)
[ 1.336414] igb 0000:05:00.0: Using MSI-X interrupts. 2 rx queue(s), 2 tx queue(s)
[ 4.452081] EDAC MC0: Giving out device to module amd64_edac controller F17h_M60h: DEV 0000:00:18.3 (INTERRUPT)
[11266.623843] hrtimer: interrupt took 6600 ns
[21304.106420] sysvec_apic_timer_interrupt+0x7c/0x90
[21304.108221] asm_sysvec_apic_timer_interrupt+0x12/0x2

Code:
pveversion -v
proxmox-ve: 7.1-1 (running kernel: 5.15.12-1-pve)
pve-manager: 7.1-10 (running version: 7.1-10/6ddebafe)
pve-kernel-5.15: 7.1-8
pve-kernel-helper: 7.1-8
pve-kernel-5.13: 7.1-6
pve-kernel-5.15.12-1-pve: 5.15.12-3
pve-kernel-5.15.7-1-pve: 5.15.7-1
pve-kernel-5.13.19-3-pve: 5.13.19-7
pve-kernel-5.13.19-2-pve: 5.13.19-4
ceph-fuse: 15.2.15-pve1
corosync: 3.1.5-pve2
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown: residual config
ifupdown2: 3.1.0-1+pmx3
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.22-pve2
libproxmox-acme-perl: 1.4.1
libproxmox-backup-qemu0: 1.2.0-1
libpve-access-control: 7.1-6
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.1-2
libpve-guest-common-perl: 4.0-3
libpve-http-server-perl: 4.1-1
libpve-storage-perl: 7.0-15
libqb0: 1.0.5-1
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 4.0.11-1
lxcfs: 4.0.11-pve1
novnc-pve: 1.3.0-1
proxmox-backup-client: 2.1.4-1
proxmox-backup-file-restore: 2.1.4-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.4-5
pve-cluster: 7.1-3
pve-container: 4.1-3
pve-docs: 7.1-2
pve-edk2-firmware: 3.20210831-2
pve-firewall: 4.2-5
pve-firmware: 3.3-4
pve-ha-manager: 3.3-3
pve-i18n: 2.6-2
pve-qemu-kvm: 6.1.0-3
pve-xtermjs: 4.16.0-1
qemu-server: 7.1-4
smartmontools: 7.2-pve2
spiceterm: 3.2-2
swtpm: 0.7.0~rc1+2
vncterm: 1.7-1
zfsutils-linux: 2.1.2-pve1
 
Now I'm at the step where they say to run the command "crash dump.201902261006 /usr/lib/debug/vmlinux-4.9.0-8-amd64"
I don't know what to replace "/usr/lib/debug/vmlinux-4.9.0-8-amd64" with. Any idea ?
I've got random crashes on my PVE host as well, and I have the same question as @tlex. I've installed both kdump and kexec, and I can manually issue a crash using this command:

sync; echo c | tee /proc/sysrq-trigger

Everything works, and the 'dmesg.<something>' and 'dump.<something>' files are stored in '/var/crash'.

I can use this command to non-verbosely parse the file 'dmesg.<something>':

Code:
cd /var/crash/<date_of_crash>
grep -iE 'code1|error|warning|failed|hardware|crash' dmesg.<something>

However, just like @tlex, I can't use the 'crash' command, because I can't seem to find where to install the kernel debug symbols for Proxmox. Using Debian (and Google searches), it seems the command would be this:

apt install linux-image-$(uname -r)-dbg

Or maybe this:

apt install $(uname -r)-dbg

Once this is installed, I'm supposed to be able to use the 'crash' utility along with the debug symbols to parse the 'dump.<something>' file. However, I can't seem to find what that maps to in Proxmox packages. I've tried taking a stab at it with this:

apt search pve-image-$(uname -r)

But that (and generic permutations of that) don't seem to return something that fits. Just distilling this down to this command:

apt search pve | grep -iB1 debug

This generates some candidates, maybe, but they seem specific to components of Proxmox, and not necessarily kernel debug symbols:

Code:
libpve-rs-perl-dbgsym/stable 0.6.1 amd64
  debug symbols for libpve-rs-perl
--
lxc-pve-dbgsym/stable 4.0.12-1 amd64
  debug symbols for lxc-pve
--
pve-cluster-dbgsym/stable 7.2-1 amd64
  debug symbols for pve-cluster
--
pve-firewall-dbgsym/stable 4.2-5 amd64
  debug symbols for pve-firewall
--
pve-ha-manager-dbgsym/stable 3.3-4 amd64
  debug symbols for pve-ha-manager
--
pve-lxc-syscalld-dbgsym/stable 1.1.1-1 amd64
  debug symbols for pve-lxc-syscalld
--
pve-qemu-kvm-dbg/stable 6.2.0-10 amd64
  pve qemu debugging symbols
--
pve-xtermjs-dbgsym/stable 4.16.0-1 amd64
  debug symbols for pve-xtermjs

For reference, I'm running Proxmox from an ISO install and at this version:

Code:
root@proxmox:/var/crash/202206261001# pveversion
pve-manager/7.2-5/12f1e639 (running kernel: 5.15.35-3-pve)

Any help in what to install would be appreciated, even if it's a recommendation to add a Debian repo and get the Debian kernel debug symbols. If I do have to go down this route, then it seems like Debian 11.3 (Bullseye) is running a 5.10.x kernel while Proxmox is running 5.15.x. I don't know if these two versions even track together, or if suggesting this route is nonsense. I'm just trying to get to where I can use the 'crash' utility in Proxmox to track down some random, but very frequent, kernel panics.

Oh, and if it also matters: I've monitored the heat in my system and run a complete 4-pass memtest (it took over 12 hours). No issues there.

Thank you in advance for any input.
 
Hi,
Unfortunately, the dbgsym packages for the kernel are very big and we do not build them automatically. You'll have to build the kernel yourself and enable the pkg.pve-kernel.debug build profile. See here.
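
Very roughly, and untested here (the repository URL is the public pve-kernel git; the build-profile mechanism is the standard Debian one, and the vmlinux path below is just where Debian-style debug packages usually install it):

Code:
git clone git://git.proxmox.com/git/pve-kernel.git
cd pve-kernel
DEB_BUILD_PROFILES="pkg.pve-kernel.debug" make deb

# then, with the resulting debug package installed:
crash /usr/lib/debug/vmlinux-$(uname -r) /var/crash/<date>/dump.<date>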
 
@fiona, in the process of troubleshooting some PVE 8.2 random freezes, I'm currently looking into enabling kernel crash dumps. I followed the steps from this thread, but I do not get a /var/crash/<DATE> directory for the crash successfully triggered with sync; echo c | tee /proc/sysrq-trigger.
Any idea? Do I need to have the dbgsym packages? And if yes, is there an easy way to do that now?
Code:
root@PMX8:/var/crash# dmesg | grep crash
[    0.000000] Command line: initrd=\EFI\proxmox\6.8.4-2-pve\initrd.img-6.8.4-2-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs amd_iommu=on iommu=pt pcie_aspm.policy=performance crashkernel=384M-:512M
[    0.004996] crashkernel reserved: 0x0000000055000000 - 0x0000000075000000 (512 MB)
[    0.401744] Kernel command line: initrd=\EFI\proxmox\6.8.4-2-pve\initrd.img-6.8.4-2-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs amd_iommu=on iommu=pt pcie_aspm.policy=performance crashkernel=384M-:512M
[    1.608775] pstore: Using crash dump compression: deflate

Code:
root@PMX8:/var/crash# ls
kdump_lock  kexec_cmd

Code:
root@PMX8:/var/crash# kdump-config show
DUMP_MODE:              kdump
USE_KDUMP:              1
KDUMP_COREDIR:          /var/crash
crashkernel addr: 0x55000000
   /var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinuz-6.8.4-2-pve
kdump initrd:
   /var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-6.8.4-2-pve
current state:    ready to kdump


kexec command:
  /sbin/kexec -p --command-line="initrd=\EFI\proxmox\6.8.4-2-pve\initrd.img-6.8.4-2-pve root=ZFS=rpool/ROOT/pve-1 boot=zfs amd_iommu=on iommu=pt pcie_aspm.policy=performance reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll usbcore.nousb" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz

EDIT: I looked at the IPMI, but the KVM console shut down completely after triggering via sync; echo c | tee /proc/sysrq-trigger.
As far as I understood, IPMI should still be able to show the kernel dump etc., but it just goes down instantly.
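
EDIT2: for anyone else debugging the same thing, two non-destructive checks (standard kdump-tools/Debian commands; I'm assuming they behave the same on PVE 8):

Code:
kdump-config test              # dry-run: prints what would be executed on a panic
dmesg | grep -i crashkernel    # confirm the memory reservation was actually made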
 
