Continuous controller and disk resets on Dell T160 with PERC H355

Fantu

Jan 23, 2024
On a new Dell T160 I installed the latest version of Proxmox (updated yesterday).
Yesterday I also migrated the first VM (Windows Server 2019) and noticed that during disk-intensive operations it stalled to the point of blocking; in 2 cases the VM even crashed.
I found the cause in the host logs: resets of the controller/disks. Here are some examples:
kernel: sd 0:0:2:0: [sdb] tag#388 BRCM Debug mfi stat 0x2d, data len requested/completed 0x800/0x0
kernel: sd 0:0:3:0: [sdc] tag#327 BRCM Debug mfi stat 0x2d, data len requested/completed 0x30000/0x0
kernel: sd 0:0:3:0: Power-on or device reset occurred
I have also had this issue on other MegaRAID controllers in older servers with kernel 6.8, and there I solved it by adding the parameters "intel_iommu=on iommu=pt" or by using kernel 6.5.
Here I first tried those parameters and also disabled PCIe power management with "pcie_aspm=off" (based on another search), but that did not solve it.
I also tried installing and booting kernel 6.5 (more exactly 6.5.13-6-pve), but the issue persists and I haven't found anything else to try.
The controller firmware is already up to date and the disks don't seem to have any issues; they are Samsung SSD 870 EVO 1TB, set as JBOD on the controller and used with software RAID1.
Has anyone had a similar problem and can tell me how to fix it or what to try?
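
For anyone wanting to see which disks are affected and how often, here is a minimal Python sketch that tallies the "Power-on or device reset occurred" lines per SCSI address. It assumes the kernel log has been saved as plain text in the same format as the excerpt above; the function name `count_resets` and the `sample` variable are only illustrative.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   kernel: sd 0:0:3:0: Power-on or device reset occurred
RESET_RE = re.compile(r"sd (\d+:\d+:\d+:\d+): Power-on or device reset occurred")


def count_resets(log_text):
    """Return a Counter mapping SCSI address (host:channel:target:lun)
    to the number of reset messages found in log_text."""
    return Counter(m.group(1) for m in RESET_RE.finditer(log_text))


# The three log lines quoted in the post above.
sample = """\
kernel: sd 0:0:2:0: [sdb] tag#388 BRCM Debug mfi stat 0x2d, data len requested/completed 0x800/0x0
kernel: sd 0:0:3:0: [sdc] tag#327 BRCM Debug mfi stat 0x2d, data len requested/completed 0x30000/0x0
kernel: sd 0:0:3:0: Power-on or device reset occurred
"""

for address, count in count_resets(sample).items():
    print(address, count)  # prints: 0:0:3:0 1
```

On a real host the text could come from `journalctl -k` redirected to a file. If the counts are spread over every disk rather than one, that points away from a single bad drive.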
 
https://www.t10.org/lists/asc-num.htm#ASC_29

If this happens regularly, basically your disk, controller or cables (or a combination of them) are bad. I would start with swapping out the disks, then the cables, then the controller. If you're using a carrier or a removable drive slot, make sure you're using proper SAS or Enterprise SATA hardware for that.

You're behind a RAID controller (megaraid) with budget grade hardware; try connecting the disks with a short cable directly to the motherboard's SATA ports instead.
 
It has happened with all the disks, and on rare occasions also with the system disks. I also considered a hardware issue with the drive frame or the SATA connection, but I think that is not very likely and that it is more of a software problem.
The controller is a low-cost model without cache (which is why I didn't use hardware RAID) and the disks are consumer grade (the VM disks are on LVM-thin, not on a copy-on-write filesystem). The budget is limited because this is a small office where at most 5 people work at the same time, and in terms of resources the server would be more than fine if it weren't for this problem.
In my latest test I booted kernel 6.8 again (since 6.5 didn't solve the issue) and tried disabling NCQ (with the libata.force=noncq kernel parameter). Since the latest reboot there have been no resets, though I'm not yet sure it is solved.
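
When testing several kernel parameters across reboots it is easy to lose track of which ones are actually active. The following is a minimal Python sketch that compares a kernel command line against the parameters mentioned in this thread. The `EXPECTED` list, the function names and the sample command line are illustrative assumptions; on a running Linux host the real command line can be read from `/proc/cmdline`.

```python
# Parameters mentioned in this thread; adjust to whatever is being tested.
EXPECTED = ["intel_iommu=on", "iommu=pt", "pcie_aspm=off", "libata.force=noncq"]


def missing_params(cmdline, expected=EXPECTED):
    """Return the expected parameters that do not appear as exact
    whitespace-separated tokens in the given kernel command line."""
    active = set(cmdline.split())
    return [p for p in expected if p not in active]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()


# Hypothetical command line, for illustration only.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet intel_iommu=on iommu=pt"
print(missing_params(sample))  # prints: ['pcie_aspm=off', 'libata.force=noncq']
```

On the host itself, `missing_params(read_cmdline())` gives the same check against the kernel that is actually running, which confirms that an edit to the bootloader configuration really took effect after the reboot.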
 
The message emanates from your hardware, the kernel can't just generate a message like that, it is passing it through. If you use the MegaRAID CLI or iDRAC, you may see similar messages there (provided it has/keeps a log). It's a SCSI message, the bus is being reset for some reason, that's a hardware problem.

The first message is basically saying that the driver expected 0x800 bytes to be transferred to the device, but the MegaRAID controller firmware reported that only 0x0 bytes were transferred.
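
To make that reading concrete, here is a minimal Python sketch that pulls the requested and completed lengths out of such a line and converts them from hex. It assumes the line format is exactly as in the log excerpt quoted earlier; the regular expression and the name `parse_mfi_line` are illustrative, not part of any driver tooling.

```python
import re

# Matches lines such as:
#   kernel: sd 0:0:2:0: [sdb] tag#388 BRCM Debug mfi stat 0x2d,
#   data len requested/completed 0x800/0x0
MFI_RE = re.compile(
    r"sd (?P<addr>\d+:\d+:\d+:\d+): \[(?P<disk>\w+)\] tag#(?P<tag>\d+) "
    r"BRCM Debug mfi stat (?P<stat>0x[0-9a-fA-F]+), "
    r"data len requested/completed "
    r"(?P<req>0x[0-9a-fA-F]+)/(?P<done>0x[0-9a-fA-F]+)"
)


def parse_mfi_line(line):
    """Return the disk name, status code and byte counts from one
    'BRCM Debug mfi stat' line, or None if the line does not match."""
    m = MFI_RE.search(line)
    if m is None:
        return None
    return {
        "disk": m.group("disk"),
        "stat": int(m.group("stat"), 16),
        "requested": int(m.group("req"), 16),
        "completed": int(m.group("done"), 16),
    }


line = ("kernel: sd 0:0:2:0: [sdb] tag#388 BRCM Debug mfi stat 0x2d, "
        "data len requested/completed 0x800/0x0")
info = parse_mfi_line(line)
print(info["disk"], info["requested"], info["completed"])  # prints: sdb 2048 0
```

So 0x800 is 2048 bytes requested with nothing completed; the second line in the original post (0x30000) is a 196608-byte request that also completed zero bytes.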

That holds unless you are manually calling for a bus reset through a tool, plugging hardware in and out, or resetting the controller somehow. My suggestion would be to start by checking the firmware for the disks, the firmware for the controller, the wires, etc.

People are always on a budget, until they have down time and data loss. What does it cost you to have days of downtime and troubleshooting (what is your time worth)? That's how much you should spend on it. If downtime costs you nothing, chuck the machine.
 
I had the exact same issue on a Dell R620 with a perc H355 controller. All the advice I've found on the Internet was incorrect. The solution ended up being as simple as just disabling the built-in SATA controller in the BIOS setup (it has the options for AHCI/RAID/Disabled). It seems like when you use the perc controller in JBOD mode, Linux somehow gets confused between the disks connected to it and the onboard sata controller, and simply disabling the built-in sata controller makes everything work great.
 
Hello everyone! I have just bought a new Dell Server T160 with 2 disks in RAID-1 controlled by PERC H355 controller. I installed Proxmox 8.7 and 9.0.10, and I am facing the same issue.
After installing Proxmox, the system seems to work properly. I can install new VMs and restore backups, but a few minutes later, the VMs freeze and the disks disappear.
Also, I followed the high voltage procedure, and it didn't work.
The hardware is new, and other hypervisors work fine on this server; only Proxmox shows this behavior.

If someone could help me, I would appreciate it.