Issues after upgrading to 6.17.4-1-pve

daviddanko

New Member
Apr 29, 2024
Today I upgraded the kernel to 6.17.4-1 from 6.17.2-2, and after rebooting, my server didn't come back up. Checking the logs, it seemed to be because of my media disk (the rows starting with ata5):

[screenshot: boot log with repeating ata5 errors]
When I went back to the previous kernel, everything seemed fine, but my TrueNAS instance did not boot up.

When I unplugged my media disk and plugged it back in, these errors kept repeating:

Code:
Dec 19 22:49:09 homeserver-01 kernel: ata5.00: cmd ca/00:10:10:12:40/00:00:00:00:00/e0 tag 22 dma 8192 out
                                               res 51/04:10:10:12:40/00:00:00:00:00/e0 Emask 0x1 (device error)
Dec 19 22:49:09 homeserver-01 kernel: ata5.00: status: { DRDY ERR }
Dec 19 22:49:09 homeserver-01 kernel: ata5.00: error: { ABRT }
Dec 19 22:49:09 homeserver-01 kernel: ahci 10000:e0:17.0: port does not support device sleep
Dec 19 22:49:09 homeserver-01 kernel: ata5.00: supports DRM functions and may not be fully accessible
Dec 19 22:49:09 homeserver-01 kernel: ata5.00: failed to enable AA (error_mask=0x1)
Dec 19 22:49:09 homeserver-01 kernel: ata5.00: supports DRM functions and may not be fully accessible
Dec 19 22:49:09 homeserver-01 kernel: ata5.00: failed to enable AA (error_mask=0x1)
Dec 19 22:49:09 homeserver-01 kernel: ata5.00: configured for UDMA/133 (device error ignored)
Dec 19 22:49:09 homeserver-01 kernel: ahci 10000:e0:17.0: port does not support device sleep
Dec 19 22:49:09 homeserver-01 kernel: ata5: EH complete
Dec 19 22:49:09 homeserver-01 kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Dec 19 22:49:09 homeserver-01 kernel: ata5.00: irq_stat 0x40000001
Dec 19 22:49:09 homeserver-01 kernel: ata5.00: failed command: WRITE DMA EXT
Dec 19 22:49:09 homeserver-01 kernel: ata5.00: cmd 35/00:10:10:84:e0/00:00:e8:00:00/e0 tag 23 dma 8192 out
                                               res 51/04:10:10:84:e0/00:00:e8:00:00/e0 Emask 0x1 (device error)
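
A quick way to watch these repeat live is to follow the kernel log while re-plugging the disk; the grep pattern is just an assumption that ata5 stays the relevant port:

Code:
# follow new kernel messages and filter for the affected port
journalctl -k -f | grep -i ata5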


To me it seemed like my disk had just died. So I disabled my TrueNAS VM, detached the media disk in the VM options, and removed the disk's entry from fstab in Proxmox, but the host still doesn't boot with the latest kernel. It boots fine with 6.17.2-2, however. When I plug the disk in, the Proxmox syslogs show this:

Code:
Dec 19 23:18:27 homeserver-01 kernel: sd 4:0:0:0: [sda] tag#21 CDB: Read(10) 28 00 00 00 00 00 00 01 00 00
Dec 19 23:18:27 homeserver-01 kernel: blk_print_req_error: 2 callbacks suppressed
Dec 19 23:18:27 homeserver-01 kernel: I/O error, dev sda, sector 0 op 0x0:(READ) flags 0x0 phys_seg 32 prio class 2
Dec 19 23:18:27 homeserver-01 kernel: sd 4:0:0:0: [sda] tag#14 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Dec 19 23:18:27 homeserver-01 kernel: sd 4:0:0:0: [sda] tag#14 CDB: Read(10) 28 00 00 00 00 00 00 01 00 00
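
To double-check that nothing still references the disk, something like the following covers the usual spots; the VM ID 100 and the grep patterns are placeholders, not my exact setup:

Code:
# fstab entries that still mention the disk (by device, UUID or label)
grep -iE 'sda|media' /etc/fstab
# the VM config should no longer carry the detached disk
qm config 100 | grep -iE 'sata|scsi|ide'
# stable identifiers, in case something referenced the disk by-id
ls -l /dev/disk/by-id/ | grep -i sda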

Perhaps it's worth mentioning that, just before this, I also upgraded TrueNAS to 25.10.1.

Currently, even though there should be no trace of that disk, the newest kernel still doesn't boot:

[two screenshots: boot hanging on 6.17.4-1-pve]
So my questions are: did my media disk die? If so, and the new kernel was hanging because of the disk, why does the old kernel boot with the faulty disk attached? And why doesn't the new kernel boot even though I removed, I believe, every reference to that disk?
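
In the meantime, pinning the known-good kernel keeps the machine bootable across reboots; a sketch, assuming the package version string matches:

Code:
# list installed kernels and the currently selected one
proxmox-boot-tool kernel list
# pin the kernel that still boots
proxmox-boot-tool kernel pin 6.17.2-2-pve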
 
What SATA controller / HBA are you using?
I had to disable the ROMbar (and upgrade the firmware) for my controller to boot TrueNAS after upgrading from kernel 6.8.14.
 

Code:
root@homeserver-01:~# lspci -nnk | egrep -A3 -i 'sata|raid|sas|storage'
0000:00:0e.0 RAID bus controller [0104]: Intel Corporation Volume Management Device NVMe RAID Controller [8086:467f]
        Subsystem: Dell Device [1028:0be5]
        Kernel driver in use: vmd
        Kernel modules: vmd, ahci
--
10000:e0:17.0 SATA controller [0106]: Intel Corporation Alder Lake-S PCH SATA Controller [AHCI Mode] [8086:7ae2] (rev 11)
        Subsystem: Dell Device [1028:0be5]
        Kernel driver in use: ahci
        Kernel modules: ahci
So I believe I am not affected by that ROMbar issue, right?
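
Mapping which controller the media disk actually sits behind should be visible from the block-device listing; a sketch:

Code:
# HCTL and transport per block device
lsblk -o NAME,HCTL,TRAN,MODEL
# the by-path symlinks encode the PCI address of each disk's controller
ls -l /dev/disk/by-path/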
 
I can't say, as I have no experience with those controllers and no idea what devices you have connected to each. I would try disabling the ROMbar one device at a time (see the sketch below), then look into firmware updates, then roll back the kernel to the earlier version, and if it still fails, investigate further from that point. There is no simple answer to these sorts of cases; it's going to be diagnostics by trial and error.
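
If the controller is passed through to the TrueNAS VM, the ROM BAR can be switched off per PCI entry; a sketch with a placeholder VM ID and PCI address:

Code:
# rombar=0 disables loading the option ROM for that passthrough entry
qm set 100 -hostpci0 0000:00:17.0,rombar=0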

It's fairly easy to check whether the drive itself is faulty: just connect it to any other machine.
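
Before physically moving it, a SMART readout usually gives a quick verdict; assuming smartmontools is installed and the disk still enumerates as /dev/sda:

Code:
# overall health status, attributes and the device error log
smartctl -a /dev/sda
# optionally run a short self-test and re-read the results a few minutes later
smartctl -t short /dev/sda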