Boot failure on NVMe drive on kernel 7.0.12-1

CRCinAU

I'm in the middle of troubleshooting this. I was running a SMART test on my NVMe drive (Seagate FireCuda 510 SSD ZP500GM30001) when the drive became unresponsive.

dmesg showed errors like:
Code:
  [  337.272885] nvme nvme0: I/O tag 757 (12f5) opcode 0x2 (I/O Cmd) QID 2 timeout, aborting req_op:READ(0) size:4096
  [  337.272901] nvme nvme0: I/O tag 681 (82a9) opcode 0x2 (I/O Cmd) QID 6 timeout, aborting req_op:READ(0) size:4096
  [  337.272910] nvme nvme0: I/O tag 896 (2380) opcode 0x2 (I/O Cmd) QID 8 timeout, aborting req_op:READ(0) size:4096
  [  339.060846] nvme nvme0: I/O tag 70 (3046) opcode 0x1 (I/O Cmd) QID 4 timeout, aborting req_op:WRITE(1) size:110592
  [  367.477530] nvme nvme0: I/O tag 757 (12f5) opcode 0x2 (I/O Cmd) QID 2 timeout, reset controller
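For anyone wanting to see at a glance which queues are affected, here is a rough sketch that tallies these timeout lines per queue ID and per action (abort vs. controller reset). The function name `nvme_timeout_summary` is made up for illustration, and the pattern only assumes the message layout shown in the log above.

```shell
# Hypothetical helper: summarise NVMe timeout messages read from stdin.
# Assumes the "QID <n> timeout, ..." layout seen in the dmesg excerpt above.
nvme_timeout_summary() {
    awk '/nvme[0-9]+: I\/O tag .* timeout/ {
        # The queue ID is the field right after the literal word "QID".
        for (i = 1; i <= NF; i++) if ($i == "QID") qid = $(i + 1)
        # Distinguish a controller reset from a plain abort.
        action = ($0 ~ /reset controller/) ? "reset" : "abort"
        count[qid " " action]++
    }
    END { for (k in count) print "QID", k, count[k] }' | sort
}

# Usage (not run here; reading the kernel log may need root):
#   dmesg | nvme_timeout_summary
```

Timeouts spread across several queues, as in the excerpt, point at the whole controller stalling rather than a single stuck request.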

At this point, the system is just about dead in the water.

I tried quite a bit of troubleshooting, but even after a reboot, sometimes the NVMe drive wouldn't even be detected in the BIOS... The mainboard's HDD activity light was just flashing at ~1s on, ~1s off.

After hard powering the system off, resetting the BIOS, and booting into kernel 7.0.6, everything seemed to work just fine. The BIOS NVMe self-test ran without issue. There are no data problems recorded in a `zpool scrub` of my zpool.

I did stumble across this: https://www.mail-archive.com/acpi-bugzilla@lists.sourceforge.net/msg52267.html

I'm not sure if any of the problems are related. I've also never really seen anything like this before - so I'm kind of lost for ideas in troubleshooting.

Anyone seen something similar, or got any ideas on this one?
 
Hi @CRCinAU

thanks for posting on the forum!

I am not currently aware of such issues with the new 7.0.12 kernel. I am also running a similar system (X870E based) to the one mentioned in the other link, and it runs fine.

Can you please confirm that your system is in fact X870E based?
If not, please post your hardware details here, e.g. using dmidecode.

The fact that the drive wasn't even recognized at the BIOS level suggests a deeper issue.
Is your BIOS up to date?
Could the drive be overheating? You can check the current temperature using `smartctl -a /dev/<your-drive>`.
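As a rough sketch of how to pull just the relevant fields out of dmidecode: the helper below filters a single labelled field from dmidecode output. `dmi_field` is a name made up for this example; the `-t baseboard` type and the `-s bios-version` string keyword are standard dmidecode options, but running them needs root.

```shell
# Hypothetical helper: print the value of one labelled field
# (e.g. "Product Name") from dmidecode output read on stdin.
dmi_field() {
    # $1 is the field label; leading tabs/spaces in the output are ignored.
    sed -n "s/^[[:space:]]*$1:[[:space:]]*//p"
}

# Usage (not run here; dmidecode requires root):
#   dmidecode -t baseboard | dmi_field "Manufacturer"
#   dmidecode -t baseboard | dmi_field "Product Name"
#   dmidecode -s bios-version
```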
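A minimal sketch for reading only the temperature, assuming smartctl's NVMe output contains a line of the form `Temperature: <n> Celsius`. The function name `nvme_temp_from_smart` and the device path in the usage comment are assumptions, so adjust them to your system.

```shell
# Hypothetical helper: extract the composite temperature (in Celsius)
# from `smartctl -a` output for an NVMe device, read on stdin.
nvme_temp_from_smart() {
    # Match only the line starting with "Temperature:", not the
    # "Temperature Sensor N:" or warning-time lines.
    awk '/^Temperature:/ { print $2; exit }'
}

# Usage (not run here; needs root, and /dev/nvme0 is an assumed device name):
#   smartctl -a /dev/nvme0 | nvme_temp_from_smart
```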

Yours sincerely
Jonas
 
Thanks - this mainboard is an MSI X570-A PRO (MS-7C37) with BIOS version H.P1.

Looking here, it seems to be a couple of versions out of date: https://www.msi.com/Motherboard/X570-A-PRO/support

I have another system with only an NVMe drive, on a Gigabyte mainboard, and that seems to run 7.0.12 without any issues.

The drive itself doesn't seem to be having any issues at all. After a hard power off and power on, all temps and tests are fine until I boot it with kernel 7.0.12 again.

I need to do some more troubleshooting tomorrow when I'm back home - but right now, I have a queued up change to add `iommu=pt` to the kernel command line. I had that there before, but removed it while testing something else. It would be good to rule out the removal of that being the issue. I just don't want to reboot it now and potentially lose access to it as a running system while I'm not there to kick it if it breaks again :)
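For reference, a small sketch for checking whether a given argument is actually on the running kernel's command line, so it is clear which configuration each boot was tested with. `has_kernel_arg` is a made-up helper name; where the argument is persisted depends on the bootloader, which this thread does not specify.

```shell
# Hypothetical helper: succeed if the given argument appears as a whole
# word on a kernel command line.
#   $1: argument to look for, e.g. iommu=pt
#   $2: optional command line string (defaults to /proc/cmdline)
has_kernel_arg() {
    hka_arg=$1
    hka_cmdline=${2-$(cat /proc/cmdline 2>/dev/null || true)}
    # Pad with spaces so "iommu=pt" does not match e.g. "amd_iommu=pt".
    case " $hka_cmdline " in
        *" $hka_arg "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report the state of the currently running kernel.
if has_kernel_arg iommu=pt; then
    echo "iommu=pt is active"
else
    echo "iommu=pt is not set"
fi

# Assumption: on a GRUB-based setup the argument is usually persisted via
# GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub.
```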
 
Could you post a full boot log under both kernels?
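One possible way to capture these, sketched under the assumption that the system uses systemd's journal: save the kernel messages of the current boot to a file named after the running kernel, once under each kernel. Both function names are made up for illustration.

```shell
# Hypothetical helper: build a log file name that records which kernel
# produced it. Uses the running kernel's release unless one is given.
bootlog_name() {
    printf 'dmesg-%s.txt\n' "${1:-$(uname -r)}"
}

# Hypothetical helper: save the kernel messages of the current boot.
# Needs journalctl and may need root; it is defined here but not called.
save_bootlog() {
    journalctl -k -b 0 --no-pager > "$(bootlog_name)"
}

# Usage: boot each kernel in turn, run `save_bootlog`, then attach both files.
# If the journal is persistent, `journalctl -k -b -1` shows the previous boot.
```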
 
Hi @fabian ,

Thanks for the reply. I'm actually starting to think that it was the removal of `iommu=pt` from the kernel command line that was causing the failure. This system was built quite some time ago, and that option had been in there for a while. I don't recall when or why I added it, but after adding it back to the kernel command line, things seem to be properly stable.

I've added the dmesg.txt from 7.0.12.

That being said, I also updated the BIOS to Version: H.R1.

I won't be physically with the system for the next day and a bit - so don't want to start playing too much unless I'm there to give it a kick if it doesn't come back.
 

Okay. Please report back if the instability returns with `iommu=pt`!