Random crashes without logs - kdump doesn't appear to be working

skybolt_1

I have a Dell T430 server that I am using for my homelab, running PVE 9.2.3. Ever since installing an Intel Arc A310 graphics card for passthrough into my primary Docker host about two months ago, I have been getting intermittent crashes that leave no journald entries. I've tried the following:
  • Ran healthchecks on the hardware
    • Dell's built-in utility detected a "PCIe - Training Error: Link Degraded maxWidth x8, negWidth x2, slot 4", which was my PERC H310 HBA card. I removed the card, cleaned its contacts, and reinstalled it, and that error has not yet returned.
  • Verified that X2APIC Mode, SR-IOV Global Enable, I/OAT DMA Engine were all enabled in the BIOS
  • Installed an HDMI display emulator on the ARC
  • Attempted to enable kdump
The last step is where I am stuck. I've gone through a number of threads and configured kdump such that it appears to be working properly:

Code:
root@t430-pve:~# kdump-config show
DUMP_MODE:              kdump
USE_KDUMP:              1
KDUMP_COREDIR:          /var/crash
crashkernel addr: 0x4a000000
0x1fff000000
   /var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinuz-7.0.12-1-pve
kdump initrd:
   /var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-7.0.12-1-pve
current state:    ready to kdump

crashkernel suggested size: 905M

kexec command:
  /sbin/kexec -p -s --command-line="BOOT_IMAGE=/vmlinuz-7.0.12-1-pve root=ZFS=/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs nomodeset noresume mitigations=auto intel_iommu=on iommu=pt fsck.mode=auto fsck.repair=yes init_on_alloc=1 init_on_free=1 hpet=disable clocksource=tsc tsc=reliable nmi_watchdog=1 reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll usbcore.nousb" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz

Code:
 kdump-tools.service - Kernel crash dump capture service
     Loaded: loaded (/usr/lib/systemd/system/kdump-tools.service; enabled; preset: enabled)
     Active: active (exited) since Mon 2026-06-29 14:43:34 EDT; 27min ago
 Invocation: b05b4f02b384451e99aa7005cb1d16b6
    Process: 2612 ExecStartPre=/usr/share/kdump-tools/kdump_mem_estimator (code=exited, status=0/SUCCESS)
    Process: 2619 ExecStart=/etc/init.d/kdump-tools start (code=exited, status=0/SUCCESS)
   Main PID: 2619 (code=exited, status=0/SUCCESS)
   Mem peak: 35M
        CPU: 1.050s

Jun 29 14:43:33 t430-pve systemd[1]: Starting kdump-tools.service - Kernel crash dump capture service...
Jun 29 14:43:33 t430-pve kdump-tools[2619]: Starting kdump-tools:
Jun 29 14:43:33 t430-pve kdump-tools[2623]: Creating symlink /var/lib/kdump/vmlinuz.
Jun 29 14:43:33 t430-pve kdump-tools[2623]: Creating symlink /var/lib/kdump/initrd.img.
Jun 29 14:43:34 t430-pve kdump-tools[2623]: loaded kdump kernel.
Jun 29 14:43:34 t430-pve systemd[1]: Finished kdump-tools.service - Kernel crash dump capture service.

However, after a "real" crash, no subfolders with crash data showed up under /var/crash. I attempted to trigger a crash using echo c > /proc/sysrq-trigger which did crash the machine, but again, no subfolders appeared. I tried crashing the machine a second time, and checked the system console. Here's what I found:
[Screenshot: crash result.png]
The console stops responding to keyboard input at this point. It appears that the failure of the pool import may be contributing to the lack of crash data. Any suggestions on how to resolve this?

Thanks!
 
Hi @skybolt_1

thanks for posting on the forum!

Can you please post the content of the following files:
  • /etc/default/grub.d/kdump-tools.cfg
  • /etc/default/kdump-tools
  • /proc/cmdline
Did you update GRUB using proxmox-boot-tool refresh after enabling and configuring everything?

Yours sincerely
Jonas
 
Sure:
/etc/default/grub.d/kdump-tools.cfg
Code:
GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M"

/etc/default/kdump-tools
Code:
USE_KDUMP=1
KDUMP_KERNEL=/var/lib/kdump/vmlinuz
KDUMP_INITRD=/var/lib/kdump/initrd.img
KDUMP_COREDIR="/var/crash"

/proc/cmdline
Code:
BOOT_IMAGE=/vmlinuz-7.0.12-1-pve root=ZFS=/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs nomodeset noresume mitigations=auto intel_iommu=on iommu=pt fsck.mode=auto fsck.repair=yes init_on_alloc=1 init_on_free=1 hpet=disable clocksource=tsc tsc=reliable nmi_watchdog=1 crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M

I did update grub by running proxmox-boot-tool refresh after configuring this.

Thank you!
 
It might be worth setting the crashkernel reservation to a fixed size of 1024M (crashkernel=1024M).
The range-based parameter you have now should pick the right size based on system memory, but the reservation might still fail.
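
To see which size the range list would select for a given amount of RAM, here is a minimal POSIX shell sketch. It assumes the kernel's documented rule that the first range with start <= RAM < end wins; the function names to_mib and pick_crashkernel are made up for illustration, and any @offset suffix is not handled.

```shell
# to_mib SIZE: convert a size with an optional K/M/G/T suffix to MiB.
# An empty argument (open-ended range such as "128G-") yields "inf".
to_mib() {
    case $1 in
        '') echo inf ;;
        *K) echo $(( ${1%K} / 1024 )) ;;
        *M) echo "${1%M}" ;;
        *G) echo $(( ${1%G} * 1024 )) ;;
        *T) echo $(( ${1%T} * 1024 * 1024 )) ;;
        *)  echo $(( $1 / 1048576 )) ;;   # plain byte count
    esac
}

# pick_crashkernel SPEC RAM_MIB: print the size of the first matching range.
# The body runs in a subshell so the IFS change and variables do not leak.
pick_crashkernel() (
    spec=$1
    ram=$2
    # A plain size such as "1024M" has no ranges: it applies unconditionally.
    case $spec in
        *:*) ;;
        *) echo "$spec"; exit 0 ;;
    esac
    IFS=,
    for entry in $spec; do
        range=${entry%%:*}
        size=${entry#*:}
        start=$(to_mib "${range%%-*}")
        end=$(to_mib "${range#*-}")
        if [ "$ram" -ge "$start" ] && { [ "$end" = inf ] || [ "$ram" -lt "$end" ]; }; then
            echo "$size"
            exit 0
        fi
    done
    exit 1   # no range matched, so nothing would be reserved
)
```

For example, `pick_crashkernel "2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M" 65536` evaluates the posted list for a machine with 64 GiB of RAM.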

Apart from that, you might benefit from adding KDUMP_DUMP_DMESG=1 to /etc/default/kdump-tools.
This might provide some additional insight.
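
Applied to the two files posted earlier in the thread, these suggestions would look roughly like the sketch below. It assumes no other lines in either file need to change; the comments are added for illustration.

```shell
# /etc/default/grub.d/kdump-tools.cfg
# Fixed 1024M reservation instead of the memory-range list.
GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT crashkernel=1024M"

# /etc/default/kdump-tools
USE_KDUMP=1
KDUMP_KERNEL=/var/lib/kdump/vmlinuz
KDUMP_INITRD=/var/lib/kdump/initrd.img
KDUMP_COREDIR="/var/crash"
# Also write the kernel log buffer alongside the dump.
KDUMP_DUMP_DMESG=1
```

As with the original setup, run proxmox-boot-tool refresh and reboot so the new command line takes effect.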
 
I made the suggested changes, re-ran proxmox-boot-tool refresh, rebooted, then ran echo c > /proc/sysrq-trigger which resulted in the same "failed to import pool rpool" error in the screenshot I shared in my first post.
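
After a reboot like this, one way to confirm that the new reservation took effect is to read the kexec state from sysfs. A minimal sketch, assuming the standard files /sys/kernel/kexec_crash_loaded and /sys/kernel/kexec_crash_size (the latter in bytes); the function name kdump_status and its optional directory argument are made up for illustration.

```shell
# kdump_status [DIR]: report whether a crash kernel is loaded and how much
# memory is reserved for it. DIR defaults to /sys/kernel.
kdump_status() {
    sys=${1:-/sys/kernel}
    # Fall back to 0 if a file is missing or unreadable.
    loaded=$(cat "$sys/kexec_crash_loaded" 2>/dev/null || echo 0)
    bytes=$(cat "$sys/kexec_crash_size" 2>/dev/null || echo 0)
    echo "loaded=$loaded reserved_mib=$(( bytes / 1048576 ))"
}

kdump_status
```

With crashkernel=1024M in effect and the capture kernel loaded, the expected result is loaded=1 with a reservation of about 1024 MiB. This only confirms the reservation; it does not show whether the capture kernel can import rpool.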
 
If your "rpool" is connected to the Dell PERC H310, then I would try adding the kernel parameter "pcie_aspm=off" for a test.

Those LSI-based HBAs used to have massive issues with ASPM, resulting in basically an unusable card in the worst case.
 
Unfortunately (fortunately?) rpool is totally independent of the PERC; it is two Samsung SATA SSDs connected to the on-board SATA controller in a ZFS mirror.
 
Are you sure that the SSDs are running the latest firmware?
I specifically remember some (FW-related) issues with certain series of Samsung SSDs. I had severe ATA-related error floods some time ago with Samsung Enterprise SSDs (like PM863, SM863 and PM883) on an Intel C612 SATA controller. The disks themselves turned out to be 100% fine in the end. However, I was very close to scrapping them because of the apparent failures.
 
The SSDs (or the chipset?) seemed to have issues with NCQ and TRIM.
Try the kernel parameter "libata.force=noncq,noncqtrim".
 
Yes, I just pulled the drives and checked them with Samsung Magician; they are both on the latest firmware release. These are consumer drives: one is an 850 EVO, the other is an 840 EVO. They are only used for the PVE OS; all of my active workloads are on a separate pair of SAS Intel enterprise SSDs.

If I set the libata noncq / noncqtrim options, those should not negatively affect my SAS SSDs, correct?