Proxmox 7.2 Upgrade Broke My RAID

The update completed fine, but the RAID became inaccessible on the first reboot.

Code:
May 11 14:10:48 DL360-7 pvedaemon[1279]: <root@pam> successful auth for user 'root@pam'
May 11 14:10:50 DL360-7 kernel: [   95.195448] hpsa 0000:05:00.0: Controller lockup detected: 0xffff0000 after 30
May 11 14:10:50 DL360-7 kernel: [   95.195460] hpsa 0000:05:00.0: Telling controller to do a CHKPT
May 11 14:10:50 DL360-7 kernel: [   95.195510] hpsa 0000:05:00.0: controller lockup detected: LUN:0000004000000000 CDB:01040000000000000000000000000000
May 11 14:10:50 DL360-7 kernel: [   95.195518] hpsa 0000:05:00.0: Controller lockup detected during reset wait
May 11 14:10:50 DL360-7 kernel: [   95.195523] hpsa 0000:05:00.0: scsi 2:1:0:0: reset logical  failed Direct-Access     HP       LOGICAL VOLUME   RAID-5 SSDSmartPathCap- En- Exp=1
May 11 14:10:50 DL360-7 kernel: [   95.195537] sd 2:1:0:0: Device offlined - not ready after error recovery
May 11 14:10:50 DL360-7 kernel: [   95.195541] sd 2:1:0:0: Device offlined - not ready after error recovery
May 11 14:10:50 DL360-7 kernel: [   95.195544] sd 2:1:0:0: Device offlined - not ready after error recovery
May 11 14:10:50 DL360-7 kernel: [   95.195549] sd 2:1:0:0: Device offlined - not ready after error recovery
May 11 14:10:50 DL360-7 kernel: [   95.195549] hpsa 0000:05:00.0: failed 13 commands in fail_all
May 11 14:10:50 DL360-7 kernel: [   95.195550] sd 2:1:0:0: Device offlined - not ready after error recovery
May 11 14:10:50 DL360-7 kernel: [   95.195552] sd 2:1:0:0: Device offlined - not ready after error recovery
May 11 14:10:50 DL360-7 kernel: [   95.195554] sd 2:1:0:0: Device offlined - not ready after error recovery
May 11 14:10:50 DL360-7 kernel: [   95.195555] sd 2:1:0:0: Device offlined - not ready after error recovery
May 11 14:10:50 DL360-7 kernel: [   95.195557] sd 2:1:0:0: Device offlined - not ready after error recovery
May 11 14:10:50 DL360-7 kernel: [   95.195558] sd 2:1:0:0: Device offlined - not ready after error recovery
May 11 14:10:50 DL360-7 kernel: [   95.195560] sd 2:1:0:0: Device offlined - not ready after error recovery
May 11 14:10:50 DL360-7 kernel: [   95.195561] sd 2:1:0:0: Device offlined - not ready after error recovery
May 11 14:10:50 DL360-7 kernel: [   95.195575] sd 2:1:0:0: [sdb] tag#967 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=74s
May 11 14:10:50 DL360-7 kernel: [   95.195579] sd 2:1:0:0: [sdb] tag#967 CDB: Read(10) 28 00 00 00 00 00 00 01 00 00
May 11 14:10:50 DL360-7 kernel: [   95.195580] blk_update_request: I/O error, dev sdb, sector 0 op 0x0:(READ) flags 0x0 phys_seg 3 prio class 0
May 11 14:10:50 DL360-7 kernel: [   95.195887] sd 2:1:0:0: [sdb] tag#781 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=75s
May 11 14:10:50 DL360-7 kernel: [   95.195890] sd 2:1:0:0: [sdb] tag#781 CDB: Write(10) 2a 00 00 01 29 38 00 00 08 00
May 11 14:10:50 DL360-7 kernel: [   95.195891] blk_update_request: I/O error, dev sdb, sector 76088 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0
May 11 14:10:50 DL360-7 kernel: [   95.196193] Buffer I/O error on dev dm-8, logical block 9255, lost sync page write
May 11 14:10:50 DL360-7 kernel: [   95.196420] sd 2:1:0:0: [sdb] tag#880 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=75s
May 11 14:10:50 DL360-7 kernel: [   95.196423] sd 2:1:0:0: [sdb] tag#880 CDB: Read(10) 28 00 00 00 28 08 00 00 08 00
May 11 14:10:50 DL360-7 kernel: [   95.196424] blk_update_request: I/O error, dev sdb, sector 10248 op 0x0:(READ) flags 0x83700 phys_seg 1 prio class 0
May 11 14:10:50 DL360-7 kernel: [   95.196484] EXT4-fs error (device dm-8): kmmpd:179: comm kmmpd-dm-8: Error writing to MMP block
May 11 14:10:50 DL360-7 kernel: [   95.196504] sd 2:1:0:0: rejecting I/O to offline device
May 11 14:10:50 DL360-7 kernel: [   95.196508] blk_update_request: I/O error, dev sdb, sector 8652800 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
May 11 14:10:50 DL360-7 kernel: [   95.196513] Buffer I/O error on dev dm-8, logical block 1081344, lost sync page write
May 11 14:10:50 DL360-7 kernel: [   95.196521] JBD2: Error -5 detected when updating journal superblock for dm-8-8.
May 11 14:10:50 DL360-7 kernel: [   95.196523] Aborting journal on device dm-8-8.
May 11 14:10:50 DL360-7 kernel: [   95.196545] blk_update_request: I/O error, dev sdb, sector 8652800 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
May 11 14:10:50 DL360-7 kernel: [   95.196549] Buffer I/O error on dev dm-8, logical block 1081344, lost sync page write
May 11 14:10:50 DL360-7 kernel: [   95.196553] JBD2: Error -5 detected when updating journal superblock for dm-8-8.
May 11 14:10:50 DL360-7 kernel: [   95.196729] sd 2:1:0:0: [sdb] tag#881 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=75s
May 11 14:10:50 DL360-7 kernel: [   95.196993] blk_update_request: I/O error, dev sdb, sector 76088 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0
May 11 14:10:50 DL360-7 kernel: [   95.197121] sd 2:1:0:0: [sdb] tag#881 CDB: Read(10) 28 00 00 40 08 08 00 00 20 00
May 11 14:10:50 DL360-7 kernel: [   95.197420] Buffer I/O error on dev dm-8, logical block 9255, lost sync page write
May 11 14:10:50 DL360-7 kernel: [   95.197639] blk_update_request: I/O error, dev sdb, sector 4196360 op 0x0:(READ) flags 0x83700 phys_seg 4 prio class 0
May 11 14:10:50 DL360-7 kernel: [   95.197876] blk_update_request: I/O error, dev sdb, sector 2048 op 0x1:(WRITE) flags 0x3800 phys_seg 1 prio class 0
May 11 14:10:50 DL360-7 kernel: [   95.197982] sd 2:1:0:0: [sdb] tag#882 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=75s
May 11 14:10:50 DL360-7 kernel: [   95.198273] Buffer I/O error on dev dm-8, logical block 0, lost sync page write
May 11 14:10:50 DL360-7 kernel: [   95.198493] sd 2:1:0:0: [sdb] tag#882 CDB: Read(10) 28 00 00 40 08 60 00 00 18 00
May 11 14:10:50 DL360-7 kernel: [   95.198704] EXT4-fs (dm-8): I/O error while writing superblock
May 11 14:10:50 DL360-7 kernel: [   95.198997] blk_update_request: I/O error, dev sdb, sector 4196448 op 0x0:(READ) flags 0x83700 phys_seg 3 prio class 0
May 11 14:10:51 DL360-7 kernel: [   95.200497] sd 2:1:0:0: [sdb] tag#883 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=75s
May 11 14:10:51 DL360-7 kernel: [   95.200499] sd 2:1:0:0: [sdb] tag#883 CDB: Read(10) 28 00 00 80 08 00 00 00 10 00
May 11 14:10:51 DL360-7 kernel: [   95.200501] blk_update_request: I/O error, dev sdb, sector 8390656 op 0x0:(READ) flags 0x83700 phys_seg 2 prio class 0
May 11 14:10:51 DL360-7 kernel: [   95.209562] sd 2:1:0:0: [sdb] tag#884 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=75s
May 11 14:10:51 DL360-7 kernel: [   95.209565] sd 2:1:0:0: [sdb] tag#884 CDB: Read(10) 28 00 00 80 08 18 00 00 60 00
May 11 14:10:51 DL360-7 kernel: [   95.209571] sd 2:1:0:0: [sdb] tag#885 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=75s
May 11 14:10:51 DL360-7 kernel: [   95.209574] sd 2:1:0:0: [sdb] tag#885 CDB: Read(10) 28 00 00 c0 08 00 00 00 18 00
May 11 14:10:51 DL360-7 kernel: [   95.209579] sd 2:1:0:0: [sdb] tag#886 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=75s
May 11 14:10:51 DL360-7 kernel: [   95.209581] sd 2:1:0:0: [sdb] tag#886 CDB: Read(10) 28 00 00 c0 08 20 00 00 10 00
May 11 14:10:51 DL360-7 kernel: [   95.209586] sd 2:1:0:0: [sdb] tag#887 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=75s
May 11 14:10:51 DL360-7 kernel: [   95.209589] sd 2:1:0:0: [sdb] tag#887 CDB: Read(10) 28 00 00 c0 08 38 00 00 18 00
May 11 14:10:51 DL360-7 kernel: [   95.209612] EXT4-fs error (device dm-8): __ext4_find_entry:1612: inode #270714: comm systemd: reading directory lblock 0
May 11 14:10:51 DL360-7 kernel: [   95.209691] EXT4-fs error (device dm-8): ext4_wait_block_bitmap:531: comm ext4lazyinit: Cannot read block bitmap - block_group = 63, block_bitmap = 1572879
May 11 14:10:51 DL360-7 kernel: [   95.219413] Buffer I/O error on dev dm-8, logical block 0, lost sync page write
May 11 14:10:51 DL360-7 kernel: [   95.246873] EXT4-fs (dm-8): I/O error while writing superblock
May 11 14:10:51 DL360-7 kernel: [   95.256218] Buffer I/O error on dev dm-8, logical block 0, lost sync page write
May 11 14:10:51 DL360-7 kernel: [   95.265682] EXT4-fs (dm-8): I/O error while writing superblock
May 11 14:10:51 DL360-7 kernel: [   95.316526] fwbr103i0: port 2(veth103i0) entered disabled state
May 11 14:10:51 DL360-7 kernel: [   95.316634] device veth103i0 left promiscuous mode
May 11 14:10:51 DL360-7 kernel: [   95.316639] fwbr103i0: port 2(veth103i0) entered disabled state
May 11 14:10:51 DL360-7 kernel: [   95.811644] audit: type=1400 audit(1652271051.599:21): apparmor="STATUS" operation="profile_remove" profile="/usr/bin/lxc-start" name="lxc-103_</var/lib/lxc>" pid=1821 comm="apparmor_parser"
May 11 14:10:51 DL360-7 kernel: [   95.840186] EXT4-fs error (device dm-8): ext4_journal_check_start:83: comm lxc-start: Detected aborted journal
May 11 14:10:51 DL360-7 kernel: [   95.850094] Buffer I/O error on dev dm-8, logical block 0, lost sync page write
May 11 14:10:51 DL360-7 kernel: [   95.859939] EXT4-fs (dm-8): I/O error while writing superblock
May 11 14:10:51 DL360-7 kernel: [   95.869691] EXT4-fs (dm-8): Remounting filesystem read-only
May 11 14:10:51 DL360-7 pvestatd[1251]: unable to get PID for CT 103 (not running?)
May 11 14:10:51 DL360-7 pvestatd[1251]: status update time (77.484 seconds)
May 11 14:10:52 DL360-7 kernel: [   97.166754] Buffer I/O error on dev dm-8, logical block 9255, lost sync page write
May 11 14:10:53 DL360-7 kernel: [   97.256338] fwbr103i0: port 1(fwln103i0) entered disabled state
May 11 14:10:53 DL360-7 kernel: [   97.256461] vmbr0: port 2(fwpr103p0) entered disabled state
May 11 14:10:53 DL360-7 kernel: [   97.257012] device fwln103i0 left promiscuous mode
May 11 14:10:53 DL360-7 kernel: [   97.257020] fwbr103i0: port 1(fwln103i0) entered disabled state
May 11 14:10:53 DL360-7 kernel: [   97.275037] device fwpr103p0 left promiscuous mode
May 11 14:10:53 DL360-7 kernel: [   97.275040] vmbr0: port 2(fwpr103p0) entered disabled state
 
Buffer I/O error on dev dm-8 - this looks like a damaged disk. Check them all.
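
If you want to rule out the physical drives themselves, smartmontools can query disks sitting behind an HP Smart Array via the cciss device type. Just a sketch - the drive index (0, 1, ...) depends on how many bays are populated, and the device node here is the logical volume from the log above:

Code:
# one invocation per physical drive behind the hpsa-managed logical volume
smartctl -a -d cciss,0 /dev/sdb
smartctl -a -d cciss,1 /dev/sdb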
Server Diagnostics is not detecting any issues with the drives or the controller.

Re-installed 7.1 and it worked without any problem; the controller and disks are working fine and my data is still intact.

Tried the upgrade again and the same issue occurred.

This is on a DL360 Gen7 with a Smart Array P410i controller.

The upgrade works without issues on the other server, a DL360 Gen8.
 
I have the same problem on a test server, an ML330 G6 with a Smart Array P410 controller.

With kernel 5.13.19-6-pve the RAID works perfectly, but it is inaccessible with kernel 5.15.35-2.

I also tried adding the pvetest repository and kernel 5.15.35-3, but the problem persists:
https://forum.proxmox.com/threads/o...r-proxmox-ve-7-x-available.100936/post-470219

For now I have pinned the boot kernel to 5.13.19-6-pve with the command

Bash:
proxmox-boot-tool kernel pin 5.13.19-6-pve
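
As a side note, the pin can be reverted once a fixed kernel is available, and the tool can list what it currently manages (a quick sketch, assuming a reasonably recent proxmox-boot-tool):

Bash:
# list the kernels proxmox-boot-tool knows about, including any pin
proxmox-boot-tool kernel list
# drop the pin again once a fixed 5.15 kernel works for you
proxmox-boot-tool kernel unpin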
 
I'll add that with both kernels, if I run

root@pve:~# dmidecode | grep -A3 '^System Information'

System Information
Manufacturer: HP
Product Name: ProLiant ML330 G6
Version: Not Specified


and

root@pve:~# lspci -k|grep -i -A2 raid


04:00.0 RAID bus controller: Hewlett-Packard Company Smart Array G6 controllers (rev 01)
Subsystem: Hewlett-Packard Company Smart Array P410
Kernel driver in use: hpsa


but with kernel 5.15.35:

root@pve:~# ssacli ctrl all show config

Smart Array P410 (Error: Not responding)

root@pve:~# ssacli ctrl all show status

Error: Cannot show status for this device.

conversely, with kernel 5.13.19-6:

root@pve:~# ssacli ctrl all show config

Smart Array P410 in Slot 3

Internal Drive Cage at Port 1I, Box 1, OK

Port Name: 1I

Port Name: 2I

Array A (SAS, Unused Space: 0 MB)

logicaldrive 1 (3.64 TB, RAID 1+0, OK)

physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS HDD, 2 TB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS HDD, 2 TB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS HDD, 2 TB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS HDD, 2 TB, OK)


root@pve:~# ssacli ctrl all show status

Smart Array P410 in Slot 3
Controller Status: OK
 
Have you tried upgrading your Smart Array controller firmware?
Thank you for your interest.

I have the latest firmware version for the Smart Array P410 controller:

https://support.hpe.com/connect/s/s...one&softwareId=MTX_5e52f965d84f41c2bb65d33b58

and I have the latest System ROMPaq firmware for the ML330 G6 (W07) servers:
https://support.hpe.com/connect/s/s...037790d01fb4f7b885cd45e7c&tab=revisionHistory

I'll add that I tried installing Proxmox VE 7.2 from a USB key again; the installer finds the RAID LOGICAL_VOLUME, but after starting the installation it freezes while creating partitions.
 
Could you please try disabling intel_iommu and reboot?
(the default changed to on with the 5.15 kernel series, while 7.1 still had it set to off)

Just hit 'e' in the boot loader and append 'intel_iommu=off' to the kernel image line (the one that also contains a 'root=' part).
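
For example, the edited line would end up looking roughly like this (kernel version and root device are placeholders, yours will differ):

Code:
linux /boot/vmlinuz-5.15.35-2-pve root=/dev/mapper/pve-root ro quiet intel_iommu=off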
 
I'm going to chime in and +1 this issue. Different hardware, but it interacted very badly with the RAID controller. Lenovo sent a replacement; after the first boot, that one died too. In the case of Lenovo's 530-8i RAID controller, it locks into a failsafe mode that isn't customer-resolvable. I'm awaiting a reply from Lenovo to see if they'll replace it a second time, but it was definitely a shock to have Proxmox cause a RAID controller to go into a permanent failsafe mode. I'm frustrated with the Lenovo side because of how their controller interacts with the OS and has no recovery path once the issue is triggered. That said, I wasn't expecting an update through the Proxmox subscription repo to cause issues like this. Since Proxmox isn't on the compatibility list and triggered the controller issues, I might be stuck fronting the cost of a new controller.
 
I'm going to chime in and +1 this issue. Different hardware,
Maybe a different issue then...

* please provide the logs of a boot showing the errors (feel free to open a new thread)
* make sure you have the latest firmware updates installed for all components of the system (BIOS, NIC firmware, controller firmware, ...)
* make sure the intel-microcode or amd64-microcode package is installed (see the sketch below)
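
For that last point, on a stock Debian bullseye based PVE install the microcode packages live in the non-free component, so something along these lines should work (a sketch only, adjust the repository line to your own sources setup):

Code:
# enable non-free, where the microcode packages live (assumes stock bullseye sources)
echo "deb http://deb.debian.org/debian bullseye non-free" > /etc/apt/sources.list.d/non-free.list
apt update
apt install intel-microcode    # or amd64-microcode on AMD systems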
 
I have that controller:
Subsystem: Lenovo ThinkSystem RAID 530-8i PCIe 12Gb Adapter
On two machines, and everything works okay with 7.2 and hardware RAID (no ZFS).
 
Could you please try disabling intel_iommu and reboot?
(the default changed to on with the 5.15 kernel series, while 7.1 still had it set to off)

Just hit 'e' in the boot loader and append 'intel_iommu=off' to the kernel image line (the one that also contains a 'root=' part).
Thank you for your interest.

I did as you said and everything works.
The controller and RAID volume are up and working.

So I edited /etc/default/grub with

root@pve:~# nano /etc/default/grub

and changed GRUB_CMDLINE_LINUX_DEFAULT="quiet"
to GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=off"

and I updated grub with

root@pve:~# update-grub

Lastly, I ran the command

root@pve:~# reboot now
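
A quick way to check that the parameter actually took effect after the reboot (just a sketch using standard tools):

root@pve:~# cat /proc/cmdline

root@pve:~# dmesg | grep -i -e dmar -e iommu

The first should now show intel_iommu=off at the end of the command line; the second shows how the kernel reported the IOMMU during boot.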

Is that enough, or should I do more?

I thank you for allowing me to give my very old home server a few more years of life.
 
I also had this problem: RAID broke with the 7.2 upgrade.

Adding 'intel_iommu=off' to my GRUB_CMDLINE_LINUX_DEFAULT followed by `update-grub` and a reboot worked.

Thanks everyone for the guidance on this fix.
 
Maybe a different issue then...

* please provide the logs of a boot showing the errors (feel free to open a new thread)
* make sure you have the latest firmware updates installed for all components of the system (BIOS, NIC firmware, controller firmware, ...)
* make sure the intel-microcode or amd64-microcode package is installed
Currently waiting on a new card from Lenovo. I'm not sure I feel brave enough to try it again with the new card.
* make sure you have the latest firmware updates installed for all components of the system (BIOS, NIC firmware, controller firmware, ...)
Yep! I was running the latest available firmware revision on all devices.

The last update consisted of:
Code:
Start-Date: 2022-05-09  21:04:49
Commandline: apt-get dist-upgrade
Install: pve-kernel-5.15.30-2-pve:amd64 (5.15.30-3, automatic), libgbm1:amd64 (20.3.5-1, automatic), libwayland-server0:amd64 (1.18.0-2~exp1.1, automatic), libdrm-common:amd64 (2.4.104-1, automatic), libdrm2:amd64 (2.4.104-1, automatic), pve-kernel-5.15:amd64 (7.2-1, automatic), libvirglrenderer1:amd64 (0.8.2-5, automatic), libepoxy0:amd64 (1.5.5-1, automatic)

Upgrade: pve-docs:amd64 (7.1-2, 7.2-2), proxmox-widget-toolkit:amd64 (3.4-7, 3.4-10), libpve-rs-perl:amd64 (0.6.0, 0.6.1), pve-firmware:amd64 (3.3-6, 3.4-1), zfs-zed:amd64 (2.1.2-pve1, 2.1.4-pve1), zfs-initramfs:amd64 (2.1.2-pve1, 2.1.4-pve1), liblzma5:amd64 (5.2.5-2, 5.2.5-2.1~deb11u1), spl:amd64 (2.1.2-pve1, 2.1.4-pve1), pve-qemu-kvm:amd64 (6.1.1-2, 6.2.0-5), libnvpair3linux:amd64 (2.1.2-pve1, 2.1.4-pve1), libproxmox-acme-perl:amd64 (1.4.1, 1.4.2), libpve-cluster-api-perl:amd64 (7.1-3, 7.2-1), pve-ha-manager:amd64 (3.3-3, 3.3-4), lxcfs:amd64 (4.0.11-pve1, 4.0.12-pve1), libuutil3linux:amd64 (2.1.2-pve1, 2.1.4-pve1), libpve-storage-perl:amd64 (7.1-1, 7.2-2), libzpool5linux:amd64 (2.1.2-pve1, 2.1.4-pve1), libpve-guest-common-perl:amd64 (4.1-1, 4.1-2), pve-cluster:amd64 (7.1-3, 7.2-1), xz-utils:amd64 (5.2.5-2, 5.2.5-2.1~deb11u1), proxmox-ve:amd64 (7.1-1, 7.2-1), lxc-pve:amd64 (4.0.11-1, 4.0.12-1), novnc-pve:amd64 (1.3.0-2, 1.3.0-3), proxmox-backup-file-restore:amd64 (2.1.5-1, 2.1.8-1), qemu-server:amd64 (7.1-4, 7.2-2), libpve-access-control:amd64 (7.1-6, 7.1-8), pve-container:amd64 (4.1-4, 4.2-1), libproxmox-acme-plugins:amd64 (1.4.1, 1.4.2), pve-i18n:amd64 (2.6-2, 2.7-1), gzip:amd64 (1.10-4, 1.10-4+deb11u1), proxmox-backup-client:amd64 (2.1.5-1, 2.1.8-1), smartmontools:amd64 (7.2-1, 7.2-pve3), pve-kernel-5.13.19-6-pve:amd64 (5.13.19-14, 5.13.19-15), pve-manager:amd64 (7.1-11, 7.2-3), libpve-common-perl:amd64 (7.1-5, 7.1-6), libzfs4linux:amd64 (2.1.2-pve1, 2.1.4-pve1), zlib1g:amd64 (1:1.2.11.dfsg-2, 1:1.2.11.dfsg-2+deb11u1), libpve-u2f-server-perl:amd64 (1.1-1, 1.1-2), pve-kernel-helper:amd64 (7.1-13, 7.2-2), zfsutils-linux:amd64 (2.1.2-pve1, 2.1.4-pve1), libpve-cluster-perl:amd64 (7.1-3, 7.2-1)
End-Date: 2022-05-09  21:06:08

I have that controller:
Subsystem: Lenovo ThinkSystem RAID 530-8i PCIe 12Gb Adapter
On two machines, and everything works okay with 7.2 and hardware RAID (no ZFS).
Maybe a difference in implementation? I had two VDs on the controller - one for the OS and another for the VMs.

I'm sure that the update was the culprit, because the card was operational prior to booting into PVE (multiple reboots into a separate CentOS installation for troubleshooting and firmware updates) and the previous one died on the first boot after the update. After booting into the latest PVE, this Lenovo fault appeared (failure screenshot attached) and shows at each boot. Many operations on the controller appear to be locked when it's in a failed state.
* When booting the latest PVE entry, it failed to boot each time.
* Selecting the previous PVE kernel at the bootloader allowed it to boot, though the controller remained borked.
The array was wiped during the troubleshooting process. If I find the time, I'll revisit PVE and try intel_iommu=off with the latest kernel entry in the bootloader to see whether it boots with the IOMMU disabled.
 

Attachments: 530-8i failure.png, Failed to transition scsi0.jpg
Re-created the issue:
* Created a 100 GB VD for the OS and a second VD with the remaining space for VMs. Although some functionality didn't work through the BMC, I was able to boot into the UEFI configuration on the host to make RAID changes. The controller is still in a failed state, though.
* Installed from proxmox-ve_7.1-2.iso
* Booted successfully
* Registered subscription and updated
* No longer boots after installing updates
intel_iommu=off seems to allow the server to boot. In dmesg, there are multiple fault instances along the lines of:

Code:
[    1.937514] ================================================================================
[    1.937515] UBSAN: array-index-out-of-bounds in drivers/scsi/megaraid/megaraid_sas_fp.c:125:9
[    1.937517] index 1 is out of range for type 'MR_LD_SPAN_MAP [1]'
[    1.937517] CPU: 6 PID: 188 Comm: kworker/6:1H Not tainted 5.15.35-1-pve #1
[    1.937518] Hardware name: Lenovo ThinkSystem ST250 <line truncated>
[    1.937519] Workqueue: kblockd blk_mq_run_work_fn
[    1.937520] Call Trace:
[    1.937520]  <TASK>
[    1.937520]  dump_stack_lvl+0x4a/0x5f
[    1.937521]  dump_stack+0x10/0x12
[    1.937522]  ubsan_epilogue+0x9/0x45
[    1.937523]  __ubsan_handle_out_of_bounds.cold+0x44/0x49
[    1.937525]  get_updated_dev_handle+0x2da/0x350 [megaraid_sas]
[    1.937528]  megasas_build_and_issue_cmd_fusion+0x160a/0x17e0 [megaraid_sas]
[    1.937532]  megasas_queue_command+0x1bf/0x200 [megaraid_sas]
[    1.937536]  scsi_queue_rq+0x3da/0xbe0
[    1.937536]  blk_mq_dispatch_rq_list+0x139/0x800
[    1.937538]  ? sbitmap_get+0xb4/0x1e0
[    1.937539]  ? sbitmap_get+0x1c1/0x1e0
[    1.937541]  blk_mq_do_dispatch_sched+0x2fa/0x340
[    1.937543]  __blk_mq_sched_dispatch_requests+0x101/0x150
[    1.937544]  blk_mq_sched_dispatch_requests+0x35/0x60
[    1.937546]  __blk_mq_run_hw_queue+0x34/0xb0
[    1.937546]  blk_mq_run_work_fn+0x1b/0x20
[    1.937547]  process_one_work+0x228/0x3d0
[    1.937548]  worker_thread+0x53/0x410
[    1.937549]  ? process_one_work+0x3d0/0x3d0
[    1.937550]  kthread+0x127/0x150
[    1.937552]  ? set_kthread_struct+0x50/0x50
[    1.937553]  ret_from_fork+0x1f/0x30
[    1.937555]  </TASK>
[    1.937555] ================================================================================
Before the updates, there was no such fault.
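
For anyone checking whether their own system logs the same warning, filtering the kernel log for the current boot is enough (trivial sketch):

Code:
dmesg | grep -i -A3 ubsan
journalctl -k -b | grep -i ubsan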
 

Attachments: bootFail.png
I thank you for allowing me to give my very old home server a few more years of life.
Not me doing that - that's the Linux kernel and the developers' commitment not to break stuff on purpose - but very glad this worked for you! :)
 
I have the same problem with Dell servers and a PERC MegaRAID card (LSI-based):
"UBSAN: array-index-out-of-bounds in drivers/scsi/megaraid/megaraid_sas_fp.c:125:9"

https://bugzilla.kernel.org/show_bug.cgi?id=215943
Thanks for digging this up.

Does this issue affect the system in any way apart from the log messages?
(Maybe I'm missing something, but I think UBSAN - https://docs.kernel.org/dev-tools/ubsan.html - generates these reports when it detects undefined behaviour, here an out-of-bounds array index, and the kernel.org bugzilla entry also suggests this has been present for about 10 years.)

in any case - the 5.13 kernel series did not have UBSAN enabled:
Code:
grep -i ubsan config-5.13.19-6-pve
CONFIG_ARCH_HAS_UBSAN_SANITIZE_ALL=y
# CONFIG_UBSAN is not set

the 5.15 does:
Code:
 grep -i ubsan config-5.15.35-1-pve
CONFIG_ARCH_HAS_UBSAN_SANITIZE_ALL=y
CONFIG_UBSAN=y
# CONFIG_UBSAN_TRAP is not set
CONFIG_CC_HAS_UBSAN_BOUNDS=y
CONFIG_UBSAN_BOUNDS=y
CONFIG_UBSAN_ONLY_BOUNDS=y
CONFIG_UBSAN_SHIFT=y
# CONFIG_UBSAN_DIV_ZERO is not set
CONFIG_UBSAN_BOOL=y
CONFIG_UBSAN_ENUM=y
# CONFIG_UBSAN_ALIGNMENT is not set
CONFIG_UBSAN_SANITIZE_ALL=y
# CONFIG_TEST_UBSAN is not set

So this should explain why it's showing up with the 5.15 series and not with 5.13.
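
For reference, the same check can be run against whichever kernel is currently booted (assuming the usual Debian/PVE layout with the config shipped under /boot):

Code:
grep -i ubsan /boot/config-$(uname -r)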
 
Thanks for digging this up.

Does this issue affect the system in any way apart from the log messages?
I just saw this message when installing a new Dell server for a customer with the 7.2 ISO installer.
I didn't have time to test, so I aborted the installation as a precaution. Instead I installed with the 7.1 ISO, updated to 7.2, and pinned the kernel to 5.13.

This was with 8 non-RAID SSDs on a PERC H755.
I'll try to test on a spare Dell server tomorrow.
 

Attachments: FTBnXSYXoAAJPyQ.png