I'm in the middle of troubleshooting this. I was running a SMART self-test on my NVMe drive (Seagate FireCuda 510 SSD, ZP500GM30001) when the drive became unresponsive.
dmesg showed errors like:
Code:
[ 337.272885] nvme nvme0: I/O tag 757 (12f5) opcode 0x2 (I/O Cmd) QID 2 timeout, aborting req_op:READ(0) size:4096
[ 337.272901] nvme nvme0: I/O tag 681 (82a9) opcode 0x2 (I/O Cmd) QID 6 timeout, aborting req_op:READ(0) size:4096
[ 337.272910] nvme nvme0: I/O tag 896 (2380) opcode 0x2 (I/O Cmd) QID 8 timeout, aborting req_op:READ(0) size:4096
[ 339.060846] nvme nvme0: I/O tag 70 (3046) opcode 0x1 (I/O Cmd) QID 4 timeout, aborting req_op:WRITE(1) size:110592
[ 367.477530] nvme nvme0: I/O tag 757 (12f5) opcode 0x2 (I/O Cmd) QID 2 timeout, reset controller
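In case it helps anyone triaging similar output, here is a rough Python sketch that tallies these timeout lines per queue and measures how long a tag waited between its first timeout and the controller reset. The regex is fitted only to the five lines above, so treat the format as an assumption (other kernel versions may word the message differently), and `summarize_nvme_timeouts` is just a name I made up.

```python
import re
from collections import Counter

# Pattern fitted to the dmesg lines quoted above. This is an assumption about
# the message format, not something the kernel guarantees across versions.
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvme (?P<dev>nvme\d+): "
    r"I/O tag (?P<tag>\d+) \((?P<cid>[0-9a-f]+)\) "
    r"opcode (?P<opcode>0x[0-9a-f]+) \(I/O Cmd\) "
    r"QID (?P<qid>\d+) timeout, (?P<action>aborting|reset controller)"
)

SAMPLE = """\
[ 337.272885] nvme nvme0: I/O tag 757 (12f5) opcode 0x2 (I/O Cmd) QID 2 timeout, aborting req_op:READ(0) size:4096
[ 337.272901] nvme nvme0: I/O tag 681 (82a9) opcode 0x2 (I/O Cmd) QID 6 timeout, aborting req_op:READ(0) size:4096
[ 337.272910] nvme nvme0: I/O tag 896 (2380) opcode 0x2 (I/O Cmd) QID 8 timeout, aborting req_op:READ(0) size:4096
[ 339.060846] nvme nvme0: I/O tag 70 (3046) opcode 0x1 (I/O Cmd) QID 4 timeout, aborting req_op:WRITE(1) size:110592
[ 367.477530] nvme nvme0: I/O tag 757 (12f5) opcode 0x2 (I/O Cmd) QID 2 timeout, reset controller
"""


def summarize_nvme_timeouts(log_text):
    """Return (timeouts per queue, count per action, seconds from first timeout to reset per tag)."""
    per_queue = Counter()
    actions = Counter()
    first_seen = {}    # tag -> timestamp of its first timeout line
    reset_delays = {}  # tag -> seconds between first timeout and controller reset
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m is None:
            continue  # not an NVMe timeout line in the assumed format
        ts = float(m["ts"])
        tag = int(m["tag"])
        per_queue[int(m["qid"])] += 1
        actions[m["action"]] += 1
        first_seen.setdefault(tag, ts)
        if m["action"] == "reset controller":
            reset_delays[tag] = ts - first_seen[tag]
    return per_queue, actions, reset_delays


if __name__ == "__main__":
    per_queue, actions, reset_delays = summarize_nvme_timeouts(SAMPLE)
    print("timeouts per QID:", dict(per_queue))
    print("actions:", dict(actions))
    print("seconds from first timeout to reset:", reset_delays)
```

To run it against a live system, the output of `dmesg` could be passed in as `log_text` instead of `SAMPLE`.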
At this point, the system is just about dead in the water.
I tried quite a bit of troubleshooting, but even after a reboot the NVMe drive sometimes wasn't detected in the BIOS at all. The mainboard's HDD activity light just kept flashing, roughly 1s on, 1s off.
After hard powering the system off, resetting the BIOS, and booting into kernel 7.0.6, everything seemed to work just fine. The BIOS NVMe self-test ran without issue, and a `zpool scrub` of my zpool recorded no data errors.
I did stumble across this: https://www.mail-archive.com/acpi-bugzilla@lists.sourceforge.net/msg52267.html
I'm not sure whether any of these problems are related. I've never really seen anything like this before, so I'm short on ideas for troubleshooting.
Has anyone seen something similar, or got any ideas on this one?