Random crashes without logs - kdump doesn't appear to be working

skybolt_1 · Jun 29, 2026

I have a Dell T430 server that I am using for my homelab, running PVE 9.2.3. Ever since installing an Intel ARC 310 graphics card for passthrough into my primary Docker host about two months ago, I have been getting intermittent crashes which don't result in any journald entries. I've tried the following:

- Ran health checks on the hardware
  - Dell's built-in utility detected a "PCIe - Training Error: Link Degraded maxWidth x8, negWidth x2, slot 4" on my PERC 310 HBA card. I removed the card, cleaned its contacts, and reinstalled it, and that error has not returned so far.
- Verified that X2APIC Mode, SR-IOV Global Enable, and I/OAT DMA Engine were all enabled in the BIOS
- Installed an HDMI display emulator on the ARC
- Attempted to enable kdump

The last step is where I am stuck. I've gone through a number of threads and configured kdump such that it appears to be working properly:

Code:

root@t430-pve:~# kdump-config show
DUMP_MODE:              kdump
USE_KDUMP:              1
KDUMP_COREDIR:          /var/crash
crashkernel addr: 0x4a000000
0x1fff000000
   /var/lib/kdump/vmlinuz: symbolic link to /boot/vmlinuz-7.0.12-1-pve
kdump initrd:
   /var/lib/kdump/initrd.img: symbolic link to /var/lib/kdump/initrd.img-7.0.12-1-pve
current state:    ready to kdump

crashkernel suggested size: 905M

kexec command:
  /sbin/kexec -p -s --command-line="BOOT_IMAGE=/vmlinuz-7.0.12-1-pve root=ZFS=/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs nomodeset noresume mitigations=auto intel_iommu=on iommu=pt fsck.mode=auto fsck.repair=yes init_on_alloc=1 init_on_free=1 hpet=disable clocksource=tsc tsc=reliable nmi_watchdog=1 reset_devices systemd.unit=kdump-tools-dump.service nr_cpus=1 irqpoll usbcore.nousb" --initrd=/var/lib/kdump/initrd.img /var/lib/kdump/vmlinuz

Code:

 kdump-tools.service - Kernel crash dump capture service
     Loaded: loaded (/usr/lib/systemd/system/kdump-tools.service; enabled; preset: enabled)
     Active: active (exited) since Mon 2026-06-29 14:43:34 EDT; 27min ago
 Invocation: b05b4f02b384451e99aa7005cb1d16b6
    Process: 2612 ExecStartPre=/usr/share/kdump-tools/kdump_mem_estimator (code=exited, status=0/SUCCESS)
    Process: 2619 ExecStart=/etc/init.d/kdump-tools start (code=exited, status=0/SUCCESS)
   Main PID: 2619 (code=exited, status=0/SUCCESS)
   Mem peak: 35M
        CPU: 1.050s

Jun 29 14:43:33 t430-pve systemd[1]: Starting kdump-tools.service - Kernel crash dump capture service...
Jun 29 14:43:33 t430-pve kdump-tools[2619]: Starting kdump-tools:
Jun 29 14:43:33 t430-pve kdump-tools[2623]: Creating symlink /var/lib/kdump/vmlinuz.
Jun 29 14:43:33 t430-pve kdump-tools[2623]: Creating symlink /var/lib/kdump/initrd.img.
Jun 29 14:43:34 t430-pve kdump-tools[2623]: loaded kdump kernel.
Jun 29 14:43:34 t430-pve systemd[1]: Finished kdump-tools.service - Kernel crash dump capture service.

However, after a "real" crash, no subfolders with crash data showed up under /var/crash. I attempted to trigger a crash using echo c > /proc/sysrq-trigger which did crash the machine, but again, no subfolders appeared. I tried crashing the machine a second time, and checked the system console. Here's what I found:

The console stops responding to keyboard input at this point. It appears that the failure of the pool import may be contributing to the lack of crash data. Any suggestions on how to resolve this?

Thanks!

j.theisen · Jun 30, 2026

Hi @skybolt_1

thanks for posting on the forum!

Can you please post the content of the following files:

/etc/default/grub.d/kdump-tools.cfg
/etc/default/kdump-tools
/proc/cmdline

Did you update GRUB using proxmox-boot-tool refresh after enabling and configuring everything?

Yours sincerely
Jonas

skybolt_1 · Jun 30, 2026

Sure:
/etc/default/grub.d/kdump-tools.cfg

Code:

GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M"

/etc/default/kdump-tools

Code:

USE_KDUMP=1
KDUMP_KERNEL=/var/lib/kdump/vmlinuz
KDUMP_INITRD=/var/lib/kdump/initrd.img
KDUMP_COREDIR="/var/crash"

/proc/cmdline

Code:

BOOT_IMAGE=/vmlinuz-7.0.12-1-pve root=ZFS=/ROOT/pve-1 ro root=ZFS=rpool/ROOT/pve-1 boot=zfs nomodeset noresume mitigations=auto intel_iommu=on iommu=pt fsck.mode=auto fsck.repair=yes init_on_alloc=1 init_on_free=1 hpet=disable clocksource=tsc tsc=reliable nmi_watchdog=1 crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M

I did update grub by running proxmox-boot-tool refresh after configuring this.

Thank you!

j.theisen · Jun 30, 2026

It might be worth setting the crashkernel size to a fixed 1024M.
The ranged parameter should, as written, pick a size based on system memory, but it might still fail.

Apart from that you might benefit from adding KDUMP_DUMP_DMESG=1 to /etc/default/kdump-tools
This might provide some additional insight.
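
For reference, the size a ranged value selects can be checked by hand. Below is a minimal sketch of the selection rule (a range matches when start <= RAM < end), assuming the bounds use only the G suffix and the reservations only the M suffix, as in the value posted above. crashkernel_pick is a made-up helper name, and the RAM sizes in the example calls are purely illustrative because the amount of RAM in the T430 is not stated in this thread.

```shell
# Sketch: which reservation does a ranged crashkernel= value select for a given RAM size?
# Assumption: range bounds are whole GiB with a G suffix, reservations carry an M suffix.
# The function body runs in a subshell, so its variables and IFS change stay local.
crashkernel_pick() (
    spec=$1
    mem_gib=$2
    IFS=,
    for entry in $spec; do
        range=${entry%%:*}            # e.g. "4G-32G" or "128G-"
        size=${entry##*:}             # e.g. "512M"
        lo=${range%%-*}; lo=${lo%G}
        hi=${range#*-};  hi=${hi%G}   # empty for an open-ended range
        if [ "$mem_gib" -ge "$lo" ] && { [ -z "$hi" ] || [ "$mem_gib" -lt "$hi" ]; }; then
            echo "$size"
            return 0
        fi
    done
    return 1                          # no range matched, so nothing would be reserved
)

spec='2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M'
crashkernel_pick "$spec" 16    # prints 512M
crashkernel_pick "$spec" 48    # prints 1024M
```

If the result for the real RAM size is below the "crashkernel suggested size: 905M" reported by kdump-config show, a fixed crashkernel=1024M sidesteps the question.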

skybolt_1 · Jun 30, 2026

I made the suggested changes, re-ran proxmox-boot-tool refresh, rebooted, then ran echo c > /proc/sysrq-trigger which resulted in the same "failed to import pool rpool" error in the screenshot I shared in my first post.
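
Before triggering the next test crash, it may be worth confirming that the changed reservation actually took effect after the reboot. A rough sketch, assuming the standard sysfs files /sys/kernel/kexec_crash_loaded and /sys/kernel/kexec_crash_size are present. kdump_precheck is a hypothetical helper name, and its optional argument exists only so it can be pointed at a fake directory tree for testing.

```shell
# Sketch: report whether a crash kernel is loaded and how much memory is reserved for it.
# Assumption: kexec_crash_loaded contains 0 or 1, kexec_crash_size contains a byte count.
kdump_precheck() (
    sys=${1:-/sys}
    loaded=$(cat "$sys/kernel/kexec_crash_loaded" 2>/dev/null) || loaded=0
    bytes=$(cat "$sys/kernel/kexec_crash_size" 2>/dev/null) || bytes=0
    bytes=${bytes:-0}
    echo "crash kernel loaded: $loaded"
    echo "reserved: $(( $bytes / 1048576 )) MiB"
    # Succeed only if a crash kernel is loaded and memory is actually reserved.
    [ "$loaded" = 1 ] && [ "$bytes" -gt 0 ]
)

# Usage on the host (run as root):
#   kdump_precheck && echo c > /proc/sysrq-trigger
```

With a fixed crashkernel=1024M the second line would be expected to read "reserved: 1024 MiB". A smaller number would suggest the old ranged value is still the one in effect.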

celemine1gig · Jul 1, 2026

If your "rpool" is connected to the Dell PERC H310, then I would try adding the kernel parameter "pcie_aspm=off" for a test.

Those LSI-based HBAs used to have massive issues with ASPM, resulting in basically an unusable card in the worst case.

skybolt_1 · Jul 1, 2026

Unfortunately (fortunately?) rpool is totally independent of the PERC, it is two Samsung SATA SSDs connected to the on-board SATA controller in a ZFS mirror.

celemine1gig · Jul 1, 2026

Are you sure that the SSDs are running the latest firmware?
I specifically remember some (FW-related) issues with certain series of Samsung SSDs. Some time ago I had severe ATA-related error floods with Samsung enterprise SSDs (like PM863, SM863 and PM883) on an Intel C612 SATA controller. The disks themselves turned out to be 100% fine in the end. However, I was very close to scrapping them because of the apparent failures.

celemine1gig · Jul 1, 2026

The SSDs (or the chipset?) seemed to have issues with NCQ and TRIM.
Edit (mistake in the parameter fixed): Try the Proxmox kernel parameter "libata.force=noncq,noncqtrim" (libata.force takes a comma-separated list of values).
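
One way to check whether NCQ is really off after rebooting with that parameter is to look at each disk's queue depth in sysfs, where a depth of 1 should mean no command queuing. A small sketch, assuming the usual /sys/block/sdX/device/queue_depth files. ncq_status is a hypothetical helper name, and the optional argument is only there so it can be run against a fake sysfs tree. It does not show whether queued TRIM is disabled.

```shell
# Sketch: print the queue depth of every sdX disk and flag disks without queuing.
# Note: sd* also matches SAS disks, which libata.force is not expected to touch.
ncq_status() (
    sys=${1:-/sys}
    for f in "$sys"/block/sd*/device/queue_depth; do
        [ -r "$f" ] || continue           # also skips the unexpanded glob if nothing matches
        disk=${f#"$sys"/block/}
        disk=${disk%%/*}
        depth=$(cat "$f")
        if [ "$depth" -le 1 ]; then
            state="no queuing"
        else
            state="queuing enabled"
        fi
        echo "$disk depth=$depth ($state)"
    done
)

# Usage on the host:
#   ncq_status
```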

skybolt_1 · Jul 2, 2026

celemine1gig said:
Are you sure that the SSDs are running the latest firmware?
I specifically remember some (FW-related) issues with certain series of Samsung SSDs. Some time ago I had severe ATA-related error floods with Samsung enterprise SSDs (like PM863, SM863 and PM883) on an Intel C612 SATA controller. The disks themselves turned out to be 100% fine in the end. However, I was very close to scrapping them because of the apparent failures.

Yes, I just pulled the drives and checked them with Samsung Magician, they are both at the latest release. These are consumer drives, one is an 850 EVO, the other is an 840 EVO. They are only used for the PVE OS, all of my active workloads are on a separate pair of SAS Intel enterprise SSDs.

If I set the libata.force options noncq and noncqtrim, those should not negatively affect my SAS SSDs, correct?

celemine1gig · Jul 2, 2026

Interesting!
The "840 EVO" already seems to have a quirk entry in the Linux kernel's libata code.
See here:
https://git.kernel.org/pub/scm/linu...tree/drivers/ata/libata-core.c?h=v7.1.2#n4308
Similar entries also exist for the 850 EVO (denoted as "850*").
See here:
https://git.kernel.org/pub/scm/linu...tree/drivers/ata/libata-core.c?h=v7.1.2#n4313

So, in theory, at least "NCQ Trim" should already be disabled anyway.
And to answer your question: ATA and SAS devices are handled by different subsystems in the kernel (ATA vs. SCSI), so the libata parameters should not affect your SAS SSDs.

skybolt_1 · Jul 2, 2026

OK, thanks! This system has been running well for about two years using these drives so I would have been somewhat surprised if there was an issue w/ support.

I think for my next step I am going to try a fresh install and see if that has any impact on the issue.

celemine1gig · Jul 2, 2026

What does S.M.A.R.T. say about those Samsungs? Two years of ZFS on consumer drives could have taken its toll, depending on how you used them.
Edit: If you already tried the kernel parameters, I had a mistake in there. See the edit above, in the original post.

skybolt_1 · Jul 2, 2026

No issues, tested them with Magician. They never had VMs running on them, just the PVE host, stored some ISOs on there as well.
